Hadoop on Windows Azure: Hive vs. JavaScript for processing big data

By Sergey Klimov and Andrei Paleyes, senior R&D engineers at Altoros Systems Inc., special to Network World
December 06, 2012 01:48 PM ET

Network World - For some time Microsoft didn't offer a solution for processing big data in cloud environments. SQL Server is good for storage, but its ability to analyze terabytes of data is limited. Hadoop, which was designed for this purpose, is written in Java and was not available to .NET developers. So, Microsoft launched the Hadoop on Windows Azure service to make it possible to distribute the load and speed up big data computations.

But it is hard to find guides explaining how to work with Hadoop on Windows Azure, so here we present an overview of two out-of-the-box ways of processing big data with Hadoop on Windows Azure and compare their performance.

When the R&D department at Altoros Systems Inc. started this research, we only had access to a community technology preview (CTP) release of Apache Hadoop-based Service on Windows Azure. To connect to the service, Microsoft provides a Web panel and Remote Desktop Connection. We analyzed two ways of querying with Hadoop that were available from the Web panel: HiveQL querying and a JavaScript implementation of MapReduce jobs.

We created eight types of queries in both languages and measured how fast they were processed.

A data set was generated based on US Air Carrier Flight Delays information downloaded from Windows Azure Marketplace. It was used to test how the system would handle big data. Here, we present the results of the following four queries:

  • Count the number of flight delays by year
  • Count the number of flight delays and display information by year, month, and day of month
  • Calculate the average flight delay time by year
  • Calculate the average flight delay time and display information by year, month, and day of month
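To make the first query concrete, here is a minimal sketch of how "count the number of flight delays by year" might look as a JavaScript MapReduce job. It is not the code used in the research. It assumes the map(key, value, context) and reduce(key, values, context) shape of the word-count style samples for the service's JavaScript console; the column positions (YEAR_COL, DELAY_COL), the "delay greater than zero" rule, and the runLocally helper are hypothetical and exist only so the example can be run outside a cluster.

```javascript
// Sketch only: count delayed flights per year with a map/reduce pair.
// Assumptions (not from the article): the job API shape (context.write,
// values.hasNext/next), the CSV column positions, and the rule that a
// flight counts as delayed when its delay in minutes is greater than zero.
var YEAR_COL = 0;   // hypothetical position of the year column
var DELAY_COL = 3;  // hypothetical position of the delay column (minutes)

var map = function (key, value, context) {
    var fields = value.split(",");
    var delay = parseFloat(fields[DELAY_COL]);
    // Header rows and malformed rows parse to NaN and are skipped.
    if (!isNaN(delay) && delay > 0) {
        context.write(fields[YEAR_COL], 1);
    }
};

var reduce = function (key, values, context) {
    var count = 0;
    while (values.hasNext()) {
        count += parseInt(values.next(), 10);
    }
    context.write(key, count);
};

// Hypothetical local stand-in for the cluster: runs map over every line,
// groups the emitted values by key, then runs reduce once per key.
function runLocally(lines) {
    var groups = {};
    var mapContext = {
        write: function (k, v) {
            if (!groups[k]) { groups[k] = []; }
            groups[k].push(v);
        }
    };
    lines.forEach(function (line, offset) { map(offset, line, mapContext); });

    var results = {};
    var reduceContext = { write: function (k, v) { results[k] = v; } };
    Object.keys(groups).forEach(function (k) {
        var i = 0;
        var vals = groups[k];
        var iterator = {
            hasNext: function () { return i < vals.length; },
            next: function () { return vals[i++]; }
        };
        reduce(k, iterator, reduceContext);
    });
    return results;
}

// Made-up sample rows: year, month, day of month, delay in minutes.
var sample = [
    "2011,1,15,12",
    "2011,1,16,-3",
    "2012,2,1,45",
    "2012,2,2,0",
    "2012,2,3,7"
];
console.log(runLocally(sample));
```

Grouping by year, month, and day of month would only change the key written in map, for example by joining the three fields into one string. An average would need the reducer to track both a sum and a count.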

This analysis presents the results of the performance tests and shows how throughput varies with the block size. The research contains a table and three diagrams that demonstrate the findings.

Testing environment

As a testing environment we used a small Windows Azure cluster of three nodes. The nodes were virtual machines that shared the capacity of the cluster's physical CPU, which could introduce errors into performance measurements. Therefore, we launched each query several times and used the average value for our benchmark. The test data consisted of five CSV files of 1.83GB each, 9.15GB in total. The replication factor was three, which means that every block of data had a replica on each node in the cluster.

The speed of data processing varied depending on the block size -- therefore, we compared results achieved with 8MB, 64MB and 256MB blocks.
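One reason block size matters is that it determines how many pieces the data set is split into. The sketch below estimates the number of HDFS blocks for each tested block size. It assumes that 1GB equals 1,024MB and that each file is split independently; the link between block count and the number of map tasks is Hadoop's usual default for text input and was not confirmed for this particular cluster.

```javascript
// Sketch only: estimate how many HDFS blocks the test data occupies.
// Assumptions (not from the article): 1GB = 1024MB, and every file is
// split into blocks on its own, so a partial last block still counts.
var FILE_COUNT = 5;        // five CSV files, as described in the article
var FILE_SIZE_GB = 1.83;   // size of each file, as described in the article
var BLOCK_SIZES_MB = [8, 64, 256];

function blocksForDataSet(blockSizeMb) {
    var fileSizeMb = FILE_SIZE_GB * 1024;
    var blocksPerFile = Math.ceil(fileSizeMb / blockSizeMb);
    return blocksPerFile * FILE_COUNT;
}

BLOCK_SIZES_MB.forEach(function (size) {
    console.log(size + "MB blocks: " + blocksForDataSet(size) + " blocks in total");
});
// Prints 1175, 150 and 40 blocks for 8MB, 64MB and 256MB respectively.
```

Under these assumptions, 8MB blocks produce roughly 30 times as many blocks as 256MB blocks, so per-task scheduling overhead weighs far more heavily on the small-block runs.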

The results of the research

The table below contains test results for the four queries. (The information on processing other queries depending on the size of HDFS block is available in the full version of the research.)


Brief summary

It took seven minutes to process the first query written in Hive, while the same query implemented in JavaScript took 50 minutes and 29 seconds, roughly seven times longer. The rest of the Hive queries were also processed several times faster than their JavaScript counterparts.
