The 2nd annual Hadoop World conference is taking place in New York on October 12, 2010. To date, more than 28 presentations are planned (see agenda), including talks from eBay, Twitter, GE, Bank of America, Yahoo!, Facebook, Digg, HP, and more. Tim O'Reilly, Founder and CEO of O'Reilly Media, will be providing the keynote.
I sat down with my colleague Rod Cope, CTO and Founder of OpenLogic, to chat about Hadoop, Hadoop World, and his planned presentation at this year's event. By way of context, OpenLogic has been using Hadoop in full 24x7 production mode for almost a year and has processed billions of requests against its Hadoop clusters.
Q: Describe how you are using Hadoop.
Cope: It's easy to understand why large social and consumer applications, such as Facebook and eBay, have to manage big data. However, many people don't realize that enterprise applications can also have big data challenges.
Hadoop and HBase power OpenLogic's scanning technology, which finds open source software that has been included in a software product or application. To do this, we use Hadoop and HBase to store and analyze every line of code of every version of every open source package in the world (hundreds of thousands of packages and billions of lines of code). We do a lot of batch processing to load new packages and calculate metrics, but we also respond to on-line random read requests across the 150TB+ production Hadoop cluster. In the on-line use case, we require response times measured in milliseconds and often process thousands of requests per second.
Q: What will you be covering in your presentation at Hadoop World?
Cope: I will be sharing a case study of how OpenLogic uses Hadoop and HBase to power our OLEX open source scanning solution. OLEX is used by large enterprises to identify the open source they are using and ensure compliance with open source policies and licenses.
Hadoop and HBase are fast, accurate, and highly scalable, but it does take some tuning to get the most from your implementation. I'll share lessons learned from OpenLogic's experiences. I'll also talk about why public clouds aren't a panacea for every problem and how open source technologies are being used to power "the cloud".
Q: What benefits do you see from Hadoop?
Cope: OpenLogic's open source scanning solution relies on "Big Data" to accurately identify the open source being used, down to even a small snippet of copied code. This requires a huge repository of open source code that simply won't fit on a DVD, a hard drive, or even a single server.
OLEX also requires both large scale batch processing and nearly instantaneous results when looking up data from hundreds of concurrent clients. The open source community has given us MySQL, Redis, Memcached, and many more excellent and complementary data stores (several of which we also use in production at OpenLogic), but Hadoop and HBase are the right solution for problems of this scale.
Disclosure: Kim Weins serves as Senior Vice President of Marketing at OpenLogic.