Last August, Greenplum and Aster Data made a very appealing case for enterprise use of DBMS-integrated MapReduce. Despite slow adoption, I still think the case has merit. Monday night, however, was a bad one for MapReduce advocates. First, famed MapReduce skeptics Michael Stonebraker and David DeWitt released a series of benchmarks suggesting that MPP database management systems far outperform MapReduce. Computerworld should be posting a related story soon. Then I piled on by posting some thoughts from even-more-skeptical eBay, which thinks MapReduce is 6-8X slower than MPP database managers for comparable tasks.
That doesn't mean MapReduce advocates need to jump off a ledge. Much of what these benchmarks show is the should-have-been-obvious point that MapReduce shouldn't be used to replace a DBMS for tasks DBMSs are good at. MapReduce applications tend to be concentrated in four areas:
- Text tokenization, indexing, and search
- Creation of other kinds of data structures (e.g., graphs)
- Data mining and machine learning
- Data transformation
and the benchmarks didn't really speak to any of those. But some of those areas may equally fall victim to the "Don't reinvent the wheel" argument. For example, LinkedIn is one of the more famous users of MapReduce for text processing, but LinkedIn's text processing is ghastly.
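For readers who haven't seen the paradigm up close, here is a rough single-process sketch of the first application area on that list, tokenizing text and building a search index. It is a toy in Python, not how any of the vendors or sites mentioned here actually do it; the function names (`map_tokens`, `shuffle`, `reduce_postings`, `build_index`) and the sample documents are made up for illustration, and a real framework would run the map and reduce steps in parallel across many machines.

```python
import re
from collections import defaultdict


def map_tokens(doc_id, text):
    """Map step: emit one (token, doc_id) pair per word occurrence."""
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        yield token, doc_id


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_postings(token, doc_ids):
    """Reduce step: collapse a token's document ids into a sorted posting list."""
    return token, sorted(set(doc_ids))


def build_index(docs):
    """Run map, shuffle, and reduce in sequence over a dict of {doc_id: text}."""
    pairs = (
        pair
        for doc_id, text in docs.items()
        for pair in map_tokens(doc_id, text)
    )
    return dict(
        reduce_postings(token, ids) for token, ids in shuffle(pairs).items()
    )


# Hypothetical sample documents, purely for illustration.
docs = {1: "MapReduce for text search", 2: "Text indexing with MapReduce"}
index = build_index(docs)
```

Looking up `index["search"]` then gives the ids of the documents containing that word. The appeal is that the map and reduce functions carry no shared state, so the framework can split the work however it likes; the "Don't reinvent the wheel" worry is that a mature text search engine already does all of this, and more.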
MapReduce is surely an appealing paradigm for lightweight, reliably-parallel programming. At least for research into parallel algorithms, it has much to recommend it. But whether MapReduce will play a major role in production use going forward still seems to be an open question. (Facebook and Cloudera certainly think it will.) Stay tuned for further research.