The latest release of Apache Hadoop code includes a new workload management tool that backers of the project say will make it easier for developers to build applications for the big data platform.
Hadoop has proven itself as a powerful way for some of the world's leading technology companies, like Yahoo and Google, to manage large amounts of data. Hadoop systems have thus far relied on MapReduce to process data, but the latest iteration of the open source code includes Yarn, a platform for running other applications within Hadoop alongside MapReduce. Yarn monitors the resources applications need and then provisions the capacity within the distributed computing system.
Hadoop enthusiasts say this is an important feature to let more applications run within the big data open system and could lead to a wave of new analytics apps for Hadoop. "Yarn is on the critical path to Hadoop having better resource management and supporting mixed workloads and usages," says Gartner information management analyst Merv Adrian, who tracks Hadoop. "It fixes some major gaps and will enable some exciting developments in the years ahead."
The 2.0 version adds a number of features, including a high-availability architecture and greater scale for individual clusters, which can now grow to 4,000 machines (a Hadoop deployment can consist of multiple clusters). The biggest change, though, is the addition of Yarn, which has been in planning for four years and under development for two, and which some have described as a next-generation MapReduce architecture.
Yarn splits up two major functions that MapReduce currently combines into one: job scheduling/monitoring and resource management. It works by monitoring what resources applications need, then creating containers of CPU and RAM on the cluster's nodes to serve those apps. "Yarn is fundamentally simple, but extremely scalable," says Arun Murthy, co-founder of Hadoop distribution company Hortonworks, who has been in charge of developing Yarn within the Apache open source community. Blogger Brian Proffitt at ReadWrite notes that Yarn removes the "one-at-a-time" limitations of apps running on Hadoop and allows Hadoop systems to run multiple applications at once.
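To make the container idea concrete, here is a minimal Python sketch of a resource manager handing out slices of CPU and RAM to competing applications. It is a toy model for illustration only, not Yarn's actual API or scheduling algorithm: the names `Node`, `Container` and `ToyResourceManager`, the first-fit placement rule and the sample cluster sizes are all assumptions invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu: int      # free CPU cores on this machine
    free_ram_mb: int   # free memory on this machine, in MB


@dataclass
class Container:
    app: str
    node: str
    cpu: int
    ram_mb: int


@dataclass
class ToyResourceManager:
    """Hypothetical stand-in for a cluster-wide resource manager."""
    nodes: list
    containers: list = field(default_factory=list)

    def allocate(self, app, cpu, ram_mb):
        """Grant a container on the first node with enough free CPU and RAM."""
        for node in self.nodes:
            if node.free_cpu >= cpu and node.free_ram_mb >= ram_mb:
                node.free_cpu -= cpu
                node.free_ram_mb -= ram_mb
                container = Container(app, node.name, cpu, ram_mb)
                self.containers.append(container)
                return container
        return None  # no node has capacity right now

    def release(self, container):
        """Return a finished container's resources to its node."""
        for node in self.nodes:
            if node.name == container.node:
                node.free_cpu += container.cpu
                node.free_ram_mb += container.ram_mb
        self.containers.remove(container)


# Two different applications share one small, made-up cluster.
rm = ToyResourceManager([Node("node-1", 4, 8192), Node("node-2", 4, 8192)])
batch = rm.allocate("mapreduce-job", cpu=2, ram_mb=4096)
stream = rm.allocate("streaming-app", cpu=3, ram_mb=2048)
print(batch.node, stream.node)
```

The point of the sketch is the separation of concerns: the resource manager only tracks capacity and grants containers, while each application decides what to run inside the containers it receives. That is how a batch job and a streaming app can coexist on the same machines.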
The advantages are multifold. For one, Hadoop is adding functionality to run multiple applications at once. Second, developers can now write apps to Yarn specifications and be assured that they'll work in a Hadoop system. MapReduce can also now focus on its core functionality instead of managing resources for bolt-on apps.
Hadoop backers expect that the advent of Yarn could open the floodgates for new applications built to run on Hadoop. Already some projects, like Apache Tez, have been created to do more advanced data processing than what MapReduce specializes in. Tez uses real-time analytics and in-memory processing for higher-speed queries, for example. Many more applications are expected for streaming analytics. Twitter Storm is one, while ETL (extract, transform and load) apps could be integrated as well.
Technically, engineers could already architect a Hadoop system to support additional analysis functionality on top of MapReduce, but Yarn now acts as a platform for hosting apps for that specific purpose. Some believe Yarn could be the base-level framework for a platform as a service (PaaS) running on Hadoop that could compete with the likes of VMware's open source Cloud Foundry PaaS.
Apache Hadoop 2.0 is expected to be declared stable enough for a beta release at some point this week, with a general availability release expected in the coming weeks after that, Murthy says. Some of Hadoop's earliest adopters, like Yahoo, have already tested Yarn, and companies that create commercial distributions of the code are expected to integrate Yarn into their offerings as well. Hortonworks, for example, hopes to have Yarn functionality in its Hadoop distribution by mid to late summer.
So does 2.0, and specifically Yarn, represent Hadoop growing up? "Absolutely," says Adrian, the Gartner analyst. "But mainstream organizations need to rely on the commercial distributor for anything they expect to put into serious production use." Companies like Hortonworks, Cloudera, MapR and even IBM all have commercial distributions of the code. While the project may be growing up, Adrian notes it's still in its "early adolescence." The addition of Yarn could go a long way toward supporting a budding industry of creating applications that run on Hadoop, though.