If you've got a lot of data, then Hadoop either is or should be on your radar.
Once reserved for Internet empires like Google and Yahoo, the most popular and well-known big data management system is now creeping into the enterprise. There are two big reasons for that: 1) businesses have a lot more data to manage, and Hadoop is a great platform for it, especially for combining legacy data with new, unstructured data; and 2) a lot of vendors are jumping into the game of offering support and services around Hadoop, making it more palatable for enterprises.
Most firms estimate that they are only analyzing 12% of the data that they already have, leaving 88% of it on the cutting-room floor.
— According to Forrester's Software Survey Q4, 2013
“Hadoop is unstoppable as its open source roots grow wildly and deeply into enterprise data management architectures,” Forrester analysts Mike Gualtieri and Noel Yuhanna wrote recently in the company’s Wave Report on the Hadoop marketplace. “Forrester believes that Hadoop is a must-have data platform for large enterprises, forming the cornerstone of any flexible future data management platform. If you have lots of structured, unstructured, and/or binary data, there is a sweet spot for Hadoop in your organization.”
So where do you start? Forrester says there are a variety of places to go, and it evaluated nine vendors offering Hadoop services to find the pros and cons of each. Forrester concluded that there is no clear market leader at this point, with relatively young companies in this market offering compelling services alongside the tech titans.
First, some background: Hadoop is an open source Apache project whose core components anyone can freely download. These include Hadoop Common, Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce. Many companies, from IBM and Amazon Web Services to Microsoft and Teradata, have packaged Hadoop into more easily consumable distributions or services. Each company takes a slightly different strategy, but what they all build on is Hadoop’s ability to distribute workloads across potentially thousands of servers, making big data manageable data.
Note: This list is based on vendors listed in Forrester’s Wave report and is not meant to cover every Hadoop and big data management platform. It is listed in alphabetical order.
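To make the idea of distributing a workload a little more concrete, here is a minimal sketch of the MapReduce pattern in plain Python. It runs on a single machine and does not use Hadoop's own APIs; the function names and the sample input are illustrative assumptions only. Each inner list stands in for a chunk of data that a real cluster would hand to a different server.

```python
from collections import defaultdict
from itertools import chain


def map_phase(split):
    """Emit (word, 1) pairs for one input split, as a single worker would."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key so each key is reduced in one place."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the grouped values to get one total per word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(splits):
    """Run map over every split, then shuffle and reduce the combined output."""
    mapped = chain.from_iterable(map_phase(split) for split in splits)
    return reduce_phase(shuffle(mapped))


# Hypothetical input: two splits, as if stored on two different servers.
splits = [
    ["big data is big"],
    ["data is everywhere"],
]
counts = word_count(splits)
print(counts)
```

Because the map step only ever looks at its own split, the splits could in principle be processed on separate machines at the same time, which is the property the Hadoop distributions below are built around.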
Amazon Web Services
Customers looking for a public cloud-hosted Hadoop platform needn’t look much further than the company Forrester calls the “King of the cloud” - Amazon Web Services. The company’s Hadoop product is named Elastic MapReduce (EMR), which AWS says uses Hadoop to offer big data management services. It is not pure open source Hadoop, though; it has been modified to run specifically on AWS’s cloud.
Forrester says that EMR has the largest adoption of the Hadoop platforms in the market. It already has a wide variety of partners that offer services on top of EMR, such as ones that specialize in query, modeling, integration and management. And AWS is innovating; on the roadmap, according to Forrester, is the ability for EMR to automatically scale and resize based on workload needs. The company plans to roll out more robust support for EMR with its other products and services, including its Redshift data warehouse and its newly announced Kinesis real-time processing engine, and it plans to offer support for additional NoSQL databases and business intelligence tools. The one thing AWS does not have is a Hadoop distribution that users can run on their own premises, but the next two companies specialize in that.
Cloudera
Cloudera has a distribution of the open source Hadoop that uses many aspects of the Apache project but adds a number of advancements on top. Cloudera has developed a number of features for its product, from a management and monitoring tool named Cloudera Manager to a SQL engine named Impala for querying relational data on Hadoop. Cloudera uses open source Hadoop as the basis of its distribution, but it is not a pure open source product. When Cloudera’s customers need something that open source Hadoop doesn’t have, Cloudera builds it or finds a partner who has it. “Cloudera’s approach to innovation is to be loyal to core Hadoop but to innovate quickly and aggressively to meet customer demands and differentiate its solution from those of other vendors,” Forrester says. The result has been steady adoption of Cloudera’s platform, with more than 200 paying customers, Forrester says, some of whom have more than 1 petabyte under management across more than 1,000 nodes.
Hortonworks
Like Cloudera, Hortonworks is a pure-play Hadoop company. Unlike Cloudera, Hortonworks sticks to the open source Hadoop code more closely than perhaps any other vendor. Hortonworks’ goal is to build up the Hadoop ecosystem and its user base, and to advance the open source code. Company officials say this benefits users because it prevents vendor lock-in: if a Hortonworks customer ever did need to leave the platform, it could easily port applications onto the open source code. That’s not to say Hortonworks does not innovate on top of the open source code, though. The company gives all of its work developing the platform back to the open source community. An example of this is Ambari, a tool developed by Hortonworks to fill a hole in the project around cluster management. This approach has earned Hortonworks strong partnerships with vendors like Teradata, Microsoft, Red Hat and SAP.
IBM
When enterprises think of big IT projects, many think of IBM, and rightly so. Because of that, IBM has become a major player in the world of Hadoop projects. Forrester says IBM already has more than 100 Hadoop deployments, and many customers with petabytes’ worth of data. The company brings its vast experience in grid computing, global data centers and enterprise implementations to its big data projects. “IBM’s road map includes continuing to integrate the BigInsights Hadoop solution with related IBM assets like SPSS advanced analytics, workload management for high performance computing, BI tools, and data management and modeling tools,” Forrester says.
Intel
Like Amazon Web Services, Intel has optimized its version of Hadoop to run on its own platform, in this case its hardware, specifically its Xeon chips. Customers who want to push the limits of their Hadoop system, and who want the closest affinity between the software and the hardware, may find that Intel’s distribution of Hadoop is the one for them. Forrester notes that Intel only recently rolled this product out, though, so the company is expected to innovate quite a bit on top of the version it has in the market now. Intel and Microsoft were listed as “strong performers” in the Hadoop marketplace, while the other seven companies on this list were rated “leaders.”
MapR Technologies
MapR Technologies is perhaps the best Hadoop distribution company that many people haven’t heard of. In the survey of Hadoop users that Forrester used to compile its Wave report, MapR rated the highest for its current offering, with the highest scores for its distribution’s architecture and data processing capabilities. The company’s secret sauce is a set of unique capabilities MapR has managed to work into its version of Hadoop. For example, MapR’s distribution supports Network File System (NFS), and MapR has built disaster recovery and high availability features into its distribution. Forrester says MapR just doesn’t have the brand-name recognition that Cloudera and Hortonworks have in the Hadoop market. Increased partnerships and marketing could turn MapR into a major Hadoop company, though, Forrester suggests.
Microsoft
Microsoft isn’t historically known as a company that embraces open source software, but in this case it is making strides not only to enable Hadoop to run on Windows, but also to contribute code to the open source project to advance the Hadoop ecosystem more broadly. The fruits of that labor are seen in HDInsight, a product on Windows Azure, Microsoft’s public cloud. It’s a Hadoop-as-a-service offering based on Hortonworks’ distribution of the platform but specifically designed to run on Azure.
Microsoft has some other nifty projects too, including a production-ready feature named PolyBase that allows information in SQL Server to also be searched during Hadoop queries. “Microsoft’s significant presence in the database, data warehouse, cloud, OLAP, BI, spreadsheet (PowerPivot), collaboration, and development tools markets offers an advantage when it comes to delivering a growing Hadoop stack to Microsoft customers,” Forrester says. Like Intel, Microsoft was listed as a “strong performer,” but not yet a leader in this industry.
Pivotal
Last year EMC and VMware combined a handful of assets from each company to form Pivotal, which is basically a spin-out from the two companies. One of Pivotal’s main projects is a Hadoop distribution, along with the Cloud Foundry PaaS. In building it, Pivotal has added some tooling on top of the open source code, specifically a SQL engine named HAWQ and a Hadoop appliance made specifically for running the big data platform. Forrester says the leading advantage of Pivotal’s Hadoop platform is the integration between its distro and other Pivotal, EMC and VMware products. Pivotal will benefit from its EMC and VMware backing as well. Thus far, however, the company has fewer than 100 installations, mostly at small to midsized customers, according to Forrester.
Teradata
A company like Teradata could see Hadoop as a threat or an opportunity. The company specializes in data management, particularly on the SQL and relational database side, so the rise of a NoSQL platform like Hadoop could threaten it. Instead, Teradata has embraced Hadoop. By partnering with Hortonworks, Teradata now offers customers a Hadoop platform that’s integrated with its SQL offerings, giving existing Teradata customers a plug-and-play Hadoop platform that will work seamlessly with data already stored in Teradata warehouses.