- Best iPhone, iPad Business Apps for 2014
- 14 Tech Conventions You Should Attend in 2014
- 10 Desktop Apps to Power Your Windows PC
- How to Add New Job Skills Without Going Back to School
Network World - Hadoop, an open source framework that enables distributed computing, has changed the way we deal with big data. Parallel processing with this set of tools can improve performance several times over. The question is, can we make it work even faster? What about offloading calculations from a CPU to a graphics processing unit (GPU) designed to perform complex 3D and mathematical tasks? In theory, if the process is optimized for parallel computing, a GPU could perform calculations 50-100 times faster than a CPU.
[ 2.0 RELEASE: Get ready for a flood of new Hadoop apps ]
The idea itself isn't new. For years scientific projects have tried to combine the Hadoop or MapReduce approach with the capacities of a GPU. Mars seems to be the first successful MapReduce framework for graphics processors. The project achieved a 1.5x-16x increase in performance when analyzing Web data (search/logs) and processing Web documents (including matrix multiplication).
Following Mars' groundwork, other scientific institutions developed similar tools to accelerate their data-intensive systems. Use cases included molecular dynamics, mathematical modeling (e.g., the Monte Carlo method), block-based matrix multiplication, financial analysis, image processing, etc.
On top of that, there is BOINC, a fast-evolving, volunteer-driven middleware system for grid computing. Although it does not use Hadoop, BOINC has already become a foundation for accelerating many scientific projects. For instance, GPUGRID relies on BOINC's GPU and distributed computing to perform molecular simulations that help "to understand the function of proteins in health and disease." Most of other BOINC projects related to medicine, physics, mathematics, biology, etc. could be implemented using Hadoop+GPU, too.
So, the demand for accelerating parallel computing systems with GPUs does exist. Institutions invest in supercomputers with GPUs or develop their own solutions. Hardware vendors, such as Cray, have released machines equipped with GPUs and pre-configured with Hadoop. Amazon has also launched Elastic MapReduce (Amazon EMR), which enables Hadoop on its cloud servers with GPUs.
But does one size fit all? Supercomputers provide the highest possible performance, yet cost millions of dollars. Using Amazon EMR is feasible only in projects that last for several months. For larger scientific projects (two to three years), investing in your own hardware may be more cost-effective. Even if you increase the speed of calculations using GPU within your Hadoop cluster, what about performance bottlenecks related to data transfer? Let's explore this in detail.
Data processing implies data exchange between HDD, DRAM, CPU and GPU. Figure 1 shows how data is transferred when a commodity machine performs computations with a CPU and a GPU.
As a result, the total amount of time we need to complete any task includes:
According to Tom's Hardware (CPU Charts 2012), performance of an average CPU ranges from 15 to 130 GFLOPS. At the same time, performance of Nvidia GPUs, for instance, varies within a range of 100-3,000+ GFLOPS (2012 comparison). All of these measurements are approximate and largely depend on the type of task and the algorithm. Anyway, in some scenarios, a GPU can speed up computations by nearly five to 25 times per node. Some developers claim that if your cluster consists of several nodes, performance can be accelerated by 50x-200x. For example, the creators of the MITHRA project achieved a 254x increase.