Network World - This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note it will likely favor the submitter's approach.
"Big data" is the buzzword of the day, and learning to manage it and extract value from it is top of mind for executives across industries. A contributing factor to the surge in big data is a shift in prevailing corporate data management philosophies.
Previously, companies collected only the data that was necessary to ask and answer specific questions. Today, organizations operate under the mindset that all data should be kept, no matter where it comes from, as you never know when you might need it or what questions will arise. This trend in data hoarding is fueled by the dramatic drop in prices for computer hardware and network equipment, which has allowed companies to hold on to more data at lower cost.
So how does a company begin to cope with such extraordinary amounts of data? Hadoop. Architect Doug Cutting, who named Hadoop after his son's yellow toy elephant, created this software framework, which is released under the Apache license. In a nutshell, it is open source software designed to allow organizations to store and process massive amounts of data.
And, in a world where 571 new websites are created, more than 100,000 tweets are generated and more than 2 million Google queries are made every minute of the day, big data management is imperative. Hadoop addresses that need, providing companies with the ability to store and make sense of the massive amounts of data necessary to address business concerns.
Hadoop was designed to run on a cluster of computers, making it possible to use commodity hardware and distribute work across machines to achieve massive scalability. This distributed nature is what makes it easy for Hadoop to process and store such large quantities of data -- and makes it cheap and easy to expand as needs increase.
Most companies are using their Hadoop systems as a data refinery -- taking in massive amounts of data, processing it into manageable and more meaningful chunks, and then asking the data questions to gather useful insights. Once you have a Hadoop cluster, processing is done through MapReduce: a map step converts the data into a common format (key/value tuples), and a reduce step combines the reformatted data into a smaller set that can be more easily consumed, further processed and analyzed.
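The MapReduce flow described above, converting records into tuples and then combining them into a smaller set, can be sketched in a few lines of plain Python. This is only an illustration of the idea on a single machine, not Hadoop's own API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for the example, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(records):
    """Map: turn every raw record into (key, value) tuples."""
    for line in records:
        for word in line.lower().split():
            yield (word, 1)


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine each key's values into one smaller result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(records):
    """Run the three steps in sequence on an in-memory list of records."""
    return reduce_phase(shuffle(map_phase(records)))


if __name__ == "__main__":
    # Hypothetical sample input standing in for raw data in the "refinery".
    logs = ["big data big insights", "Big clusters"]
    print(word_count(logs))
```

The point of the design is that each record is mapped independently, which is why the work can be spread across commodity machines, and only the grouped tuples need to be brought together for the reduce step.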
Unfortunately, like all things in life, Hadoop isn't perfect. One of the primary problems companies run into is integrating Hadoop into their existing infrastructure.
For example, how do enterprises access data from their Hadoop refinery when most of the open source drivers are written without full ODBC spec support? Without full support for the ODBC core functions, companies are having a hard time reconciling their BI suites with Hadoop and are being forced to undertake special projects specifically to analyze Hadoop data. The underlying problem, and Hadoop's biggest limitation at the moment, is that it doesn't fit with the existing corporate ecosystem of data analytics and visualization tools.