Look out, Hadoop, there is a new/old kid in town who promises to handle the big data problem better than you can. HPCC (High Performance Computing Cluster) Systems from LexisNexis has been evolving and growing for over 10 years in the company's pressure-cooker environment. Handling terabytes and petabytes of data, HPCC has been honed to handle the biggest data needs. Now the engine that runs one of the biggest data jobs in the world is being open sourced by LexisNexis and made available to everyone. I had a chance to sit down today with Armando Escalante, Senior Vice President and Chief Technology Officer of LexisNexis Risk Solutions, to discuss this further.
Living and working in Boca Raton, Florida, it is not often that I get to meet with the subjects of my articles in person, unless I am at a tech conference. But lo and behold, LexisNexis has a major data center and office right here in Boca, and Armando is based here. So I got the full tour and actually met several key members of the HPCC team. Besides Armando, I met with David Bayliss, Chief Data Scientist and "father" of ECL (Enterprise Control Language), the language HPCC runs on, which he co-developed with Gavin Halliday. I also met Stu Ort, Director of Software Engineering; David Hof, Director of Business Development; and Kristina Grammatico, Director of Public Relations.
I had a full tour of the LexisNexis data center here, where HPCC has been handling the demanding big data needs of the company for years. So after all these years of using HPCC for their own supercomputer data handling needs, why has LexisNexis decided to release their engine? Simple: they realized what many in the open source community already knew. By opening up the code, the continued development and evolution of HPCC will be accelerated.
Armando and his team have watched for the past 3 or 4 years as Hadoop has continued to make progress. At first they didn't think it would amount to much. But with the support of the community, Hadoop has made tremendous progress. It is not nearly as mature as HPCC yet, according to Escalante, but David Bayliss and Stu Ort saw that if LexisNexis didn't do something, Hadoop could surpass HPCC in a few years. With all of the years of work and the millions of dollars in resources sunk into HPCC, Armando knew that he had to open HPCC up to compete and keep its edge. Plus, Escalante says, competition is the American way. With a choice in the market between Hadoop and HPCC, each solution will have to evolve and grow to be successful. Armando welcomes the competition. He has been riding his horse for a long time and he knows he has a winner. So does the rest of the HPCC team.
This is not some new venture-funded start-up. LexisNexis is a major company with some of the biggest data needs in the world. The team has developed and continued to refine HPCC to exceed those demands, and they are confident it already handles, and will keep handling, the biggest data jobs. HPCC is actually made up of several components that the team has developed over the years. The two main parts are Thor and Roxie: Thor is the engine and the direct Hadoop equivalent, while Roxie delivers the data. The entire platform is programmed in the ECL language. There are various other modules that Ort and Richard Chapman have developed as well. Overall, HPCC is a rich environment that is battle tested. There is much more to the technology, which you can read about at the web site. For instance, there is an ECL IDE graphical user interface, and there are Roxie Pipes, which enable inter-cloud communication. A good comparison of HPCC to Hadoop is also available on the site.
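To give a flavor of ECL's declarative, data-centric style, here is a minimal sketch; the record layout, names and sample data are my own illustrative assumptions, not anything from LexisNexis:

```ecl
// Minimal ECL sketch -- PersonRec and the inline data are illustrative assumptions
PersonRec := RECORD
    STRING30  name;
    UNSIGNED4 age;
END;

// An inline dataset standing in for a real file distributed across a Thor cluster
people := DATASET([{'Alice', 34}, {'Bob', 17}, {'Carol', 52}], PersonRec);

// A declarative filter: you state what you want, and the
// compiler decides how to distribute the work across nodes
adults := people(age >= 18);

OUTPUT(adults);
```

The point of the style is that ECL expresses what result is wanted rather than how to compute it, which is how HPCC parallelizes work across Thor and Roxie clusters without the programmer writing explicit map/reduce steps.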
You can tell HPCC has been put through its paces. As Armando told me, "we may be new to the open source software world, but we are not new to big data." HPCC is responsible for 90% of the billions of dollars in revenue that LexisNexis generates.
While offering the open source community version of the product, HPCC Systems also offers a commercially licensed version that includes services, support, hosting options and other modules not available in the free version. The company thinks that HPCC will become a major product in its own right, with both private and public sector customers.
It already handles half a trillion records every 8 hours or so. With that kind of documented performance, who is going to argue? So now the gauntlet is thrown down; may the best big data solution win.