This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note it will likely favor the submitter's approach.
"Big data" is the buzzword of the day, and learning to manage it and extract value from it is top of mind for executives across industries. A contributing factor to the surge in big data is a shift in prevailing corporate data management philosophies.
Previously, companies collected only the data that was necessary to ask and answer specific questions. Today, organizations operate under the mindset that all data should be kept, no matter where it comes from, as you never know when you might need it or what questions will arise. This trend in data hoarding is fueled by the dramatic drop in prices for computer hardware and network equipment, which has allowed companies to hold on to more data at a lower cost.
So how does a company begin to cope with such extraordinary amounts of data? Hadoop. Architect Doug Cutting, who named Hadoop after his son's yellow toy elephant, created this software framework under the Apache license. In a nutshell, it is open source software designed to allow organizations to store and process massive amounts of data.
And, in a world where 571 new websites are created, more than 100,000 tweets are generated and more than 2 million Google queries are made every minute of the day, big data management is imperative. Hadoop addresses that need, providing companies with the ability to store and make sense of the massive amounts of data necessary to address business concerns.
Hadoop was designed to run on a cluster of computers, making it possible to use commodity hardware and distribute work across machines to achieve massive scalability. This distributed nature is what makes it easy for Hadoop to process and store such large quantities of data -- and makes it cheap and easy to expand as needs increase.
Most companies are using their Hadoop systems as a data refinery -- taking in massive amounts of data, processing it into manageable and more meaningful chunks, and then asking the data questions to gather useful insights. Once you have a Hadoop cluster, the processing itself is done through MapReduce: a map step converts the raw data into a common format (key/value tuples), and a reduce step combines those tuples into a smaller set that can be more easily consumed, further processed and analyzed.
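To make the map and reduce steps concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API: the function names (map_words, shuffle, reduce_counts, run_job) and the word-count task are assumptions chosen purely for illustration, and a real cluster would spread the map and reduce work across many machines.

```python
from collections import defaultdict


def map_words(line):
    """Map step: turn one raw text line into (word, 1) tuples."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: combine every value for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    grouped = shuffle(pairs)
    return dict(reduce_counts(key, values) for key, values in grouped.items())
```

Calling run_job on a list of text lines returns one count per distinct word, which is the "smaller set" the refinery hands on for further analysis.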
Unfortunately, like all things in life, Hadoop isn't perfect. One of the primary problems companies run into is adopting Hadoop within their current infrastructures.
For example, how do enterprises access data from their Hadoop refinery when most of the open source drivers are written without full ODBC spec support? Without full support for the ODBC core functions, companies are having a hard time reconciling their BI suites with Hadoop and are being forced to undertake special projects specifically to analyze Hadoop data. This occurs because Hadoop's biggest limitation at the moment is that it doesn't fit with the existing corporate ecosystem of data analytics and visualization tools.
In short, companies don't have the technology they need to bridge that gap to connect to and analyze data on the Hadoop platform, which presents a giant hurdle that must be overcome before organizations can truly reap the full benefits Hadoop offers.
To achieve this, companies need fast, ODBC-compliant connectivity they can utilize with their BI suites. As ODBC is the standard of choice for nearly all of the major BI suites, it is the key to unlocking Hadoop data visualization and analysis for these applications. The development of Hadoop drivers is already in the works, with several big-name vendors offering tools and approaches to take the reduced data and move it into traditional warehouses to connect to the ecosystem of analytics tools that currently reside there.
The advent of Hadoop, which was born out of the need to house more and more data, created a whole new market for connecting and analyzing data. It's an area that has yet to be fully claimed in relation to Hadoop but holds huge potential. How do you access data and get it into a form that's easily read by analysis tools? That's the question that companies in the space are confronted with and working to answer.
But there is another "elephant in the room" that you need to consider when looking to move to Hadoop-style data refineries: talent. Once companies have connectivity and their Hadoop clusters are integrated into existing business applications, they need the talent to make sense of it all. Hadoop cluster setup 101 and MapReduce jobs 101 aren't taught as part of the computer science curriculum at most major universities.
As the technology sector has begun to realize this lack of talent, it is responding accordingly: by breeding the data scientist. These scientists are typically some combination of computer scientist and mathematician -- and are seen as "data whisperers," always knowing what to ask the data in order to gain insights into the decisions that affect their companies.
As demand for these jobs grows, more focus at the university level will increase proficiency in this area and we will see a surge in data insights and innovation across all industries. Until then, the complete benefits of Hadoop simply cannot be realized.
The business value of Hadoop
Despite its shortcomings, Hadoop does offer huge potential for business value. As insights are gleaned from mining big data, more companies will seek to integrate Hadoop with existing applications. In fact, whether you know it or not, Hadoop has likely touched your daily life already.
In June this year, an article in The Wall Street Journal caused a huge stir by reporting that Mac users are more likely to spend more money on a hotel room than PC users. Given that, Orbitz searches pushed Mac users toward higher priced rooms.
This finding is based on giant data sets collected about the behavior of Mac users (750 terabytes of unstructured data, according to Tnooz). It is a great example of how companies are using Hadoop to store large volumes of unstructured data and then analyze that data through a data refinery to make better business decisions.
In this case, Orbitz stands to win if it can sell the right hotel to the right customer for a positive travel experience. By presenting pricier hotels to Mac users, who are stereotypically thought to have higher incomes, it hopes to boost its business based on huge amounts of data it has been collecting and analyzing for some time.
In another example, Chevron is utilizing big data analytics to get more barrels of oil out of its drilling operations. Sensor technology has advanced a great deal, and exploration ships are now capable of getting much higher resolution scans of the ocean floor.
The newest opportunity for big data in big oil is compiling and reducing the scan data from all the different ships to get a more complete view of the best places to start drilling. Data scientists are able to use systems like Hadoop to store and reduce this data quickly, which means that Chevron and other oil companies using this technique can get more barrels of oil into the market sooner than was previously possible. In addition to finding new oil sites, Chevron is also analyzing data taken from existing oil platforms to get the most effective and efficient production, saving millions of dollars in operating costs.
As we forge deeper into the world of big data, the trends that develop from big data use also impact Hadoop. These include the ability to handle a huge amount of information from a mixed bag of data sources, juggle multiple features in one space and consolidate them into something manageable.
As previously mentioned, Hadoop allows companies to capture all available data and save it to answer questions that may surface in the future. When these questions are identified, companies can then reduce existing data to both ask these questions and answer them effectively. The big players in the space are pushing an approach in which you can take your reduced data, move it to the data warehouse and ask questions there. This will allow you to utilize your existing data warehouse and store all of your refined data from your Hadoop systems in one central place, easily accessible to members of your organization. Given that, it's likely that we'll see trends around pairing Hadoop with existing data warehousing and analytics infrastructure in this pipeline format.
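The pipeline described above, in which reduced data is moved into a warehouse and queried there, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the corporate data warehouse, and the table name (page_hits), its columns and both helper functions are hypothetical.

```python
import sqlite3


def load_reduced_data(conn, rows):
    """Load already-reduced (page, hits) rows into the stand-in warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page_hits (page TEXT PRIMARY KEY, hits INTEGER)"
    )
    # INSERT OR REPLACE lets a later refinery run overwrite an earlier total.
    conn.executemany(
        "INSERT OR REPLACE INTO page_hits (page, hits) VALUES (?, ?)", rows
    )
    conn.commit()


def top_pages(conn, limit):
    """Ask the warehouse a question: which pages received the most hits?"""
    cursor = conn.execute(
        "SELECT page, hits FROM page_hits ORDER BY hits DESC, page ASC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()
```

With a connection from sqlite3.connect(":memory:"), load_reduced_data stores the refinery output and top_pages queries it with ordinary SQL, which is the kind of access existing analytics tools already expect.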
But the future of Hadoop extends beyond the data warehouse. The three V's of big data -- volume, velocity and variety -- are last year's problems. They are all issues that we have been working to solve, but they are only the starting place for big data, not the end.
As technology evolves, we will continue to see rapid adoption, and pairing of disparate data will give us conclusions we never thought possible. This will pave the way for the three I's of big data: intelligence, insight and innovation.
As we build better systems, these systems themselves will possess intelligence, as machine learning will help us correlate seemingly unrelated data to achieve new insights and conclusions that we can use to make better decisions about our businesses, our lives and our planet. These insights will then lead to additional innovation in technology that will start the process over again. We are in the infancy of the big data era, and the future is indeed bright.
So where does this leave us? Hadoop, in its present form, offers huge promise but still lacks a few components, which keeps it from fully delivering on its potential. The promise is a framework that allows companies not only to store massive amounts of data, but also to process it, access it and analyze it, all at affordable prices. Once more effective connectivity is available and measures have been taken to mitigate the talent shortage, there's no telling how companies will use this yellow elephant to make smarter business decisions.
Jesse Davis is the director of research and development for Progress DataDirect.