A vendor-independent comparison of NoSQL databases: Cassandra, HBase, MongoDB, Riak

By Sergey Bushik, senior R&D engineer at Altoros Systems Inc., special to Network World
October 22, 2012 04:26 PM ET

Cassandra's SSTable (sorted strings table) can be described as a file of key-value string pairs sorted by key. To achieve maximum performance during range scans, we had to use an order-preserving partitioner. Scanning over ranges of rows stored in key order is very fast; it is similar to moving a cursor through a contiguous index. However, with an order-preserving partitioner the database cannot distribute individual keys and their rows evenly over the cluster, so a random partitioner is normally used to ensure even data distribution. This is the default partitioning strategy in Cassandra. Random partitioning ensures good load balancing, while an order-preserving partitioner provides additional speed in range scans.
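
The trade-off between the two partitioners can be illustrated with a small sketch. This is not Cassandra's implementation; it is a simplified model, assuming a hypothetical four-node cluster, made-up keys of the form user0000, and hand-picked split points. It shows that a scan over consecutive keys stays on one node when placement follows key order, while hashed placement spreads the same keys (and the load) across nodes.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

NODES = 4  # hypothetical cluster size, chosen for illustration

# Hypothetical split points for the order-preserving scheme: keys below
# "user0250" go to node 0, keys below "user0500" to node 1, and so on.
SPLIT_POINTS = ["user0250", "user0500", "user0750"]


def ordered_partition(key: str) -> int:
    """Place a key by its sorted position, so neighbouring keys stay together."""
    return bisect_right(SPLIT_POINTS, key)


def random_partition(key: str) -> int:
    """Place a key by hashing it (MD5 here), so neighbouring keys scatter."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NODES


def nodes_touched(keys, partition) -> set:
    """Return the set of nodes a scan over `keys` would have to visit."""
    return {partition(k) for k in keys}


# A range scan over 100 consecutive keys.
scan = [f"user{i:04d}" for i in range(100, 200)]
ordered_nodes = nodes_touched(scan, ordered_partition)
random_nodes = nodes_touched(scan, random_partition)

# Skewed data: every stored key falls below the first split point.
skewed = [f"user{i:04d}" for i in range(240)]
ordered_load = Counter(ordered_partition(k) for k in skewed)
random_load = Counter(random_partition(k) for k in skewed)

print("nodes visited, order-preserving:", sorted(ordered_nodes))
print("nodes visited, random:", sorted(random_nodes))
print("rows per node, order-preserving:", dict(ordered_load))
print("rows per node, random:", dict(random_load))
```

In this model the order-preserving scheme serves the whole scan from a single node but also stores all of the skewed rows on that node, which is the load-balancing problem the random partitioner avoids.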

In MongoDB 2.5, the table scan triggered by the { "_id":{"$gte": startKey}} query showed a maximum throughput of 20 ops/sec with a latency of ≈ 1 sec.

The performance of MySQL Cluster was under 10 ops/sec with a latency of 400 ms. Its data is partitioned over the nodes in the cluster, so the system uses an optimizer to translate SQL commands into a query plan whose execution is divided among multiple nodes. For range scans, a B-tree index is used to make column comparisons in expressions such as >, <, or BETWEEN.

Sharded MySQL is based on key hashing on the connector side and does not support true range scans over a cluster. While a single shard did about 10 ops/sec, the whole sharded setup showed about 40 ops/sec with a latency of up to 400 ms. MyISAM caches index blocks but not data blocks, so there can be an overhead due to re-reading data blocks from the OS buffer cache.
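
To see why hashing on the connector side rules out a true cluster-wide range scan, consider the following sketch. It is an illustration only, not the connector used in the benchmark: the shard count, the CRC32 routing function, the in-memory lists standing in for shards, and the function names are all assumptions. Because consecutive keys land on different shards, a client that wants an ordered range has to query every shard and merge the partial results itself.

```python
import heapq
import zlib
from itertools import islice

SHARDS = 4  # hypothetical number of MySQL shards


def shard_for(key: str) -> int:
    """Connector-side routing: hash the key and pick a shard."""
    return zlib.crc32(key.encode("utf-8")) % SHARDS


# In-memory stand-ins for the shards; each keeps its own rows sorted by key.
shards = [[] for _ in range(SHARDS)]
for i in range(1000):
    key = f"user{i:04d}"
    shards[shard_for(key)].append(key)
for rows in shards:
    rows.sort()


def local_scan(rows, start_key, limit):
    """What one shard can do alone: an ordered scan of its local rows."""
    return [k for k in rows if k >= start_key][:limit]


def cluster_scan(start_key, limit):
    """Emulate a range scan: ask every shard, then merge the sorted partials."""
    partials = [local_scan(rows, start_key, limit) for rows in shards]
    return list(islice(heapq.merge(*partials), limit))


result = cluster_scan("user0500", 10)
print(result)
```

Each shard has to return up to `limit` rows even though most of them are discarded after the merge, which is one reason this kind of emulated scan costs more than a scan over a single ordered index.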

The Riak bitcask storage engine does not support range scans. They can be done through secondary indexes with eleveldb and the special $key index, which refers to the primary key. However, eleveldb's performance was insufficient and started to degrade after 50,000,000 records had been imported, so we fell back to bitcask.

* Workload G: Insert-mostly mode. Settings for the workload: 
1) Insert/Read: 90/10 
2) Latest request distribution
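
The shape of this workload can be sketched as a simple operation generator. This is not YCSB code: the function name, the preloaded record count, and the seed are hypothetical, and the exponential draw is a simplified stand-in for YCSB's "latest" distribution, used here only to make recently inserted records the most likely to be read.

```python
import random

INSERT_PROPORTION = 0.9  # from the workload settings above
READ_PROPORTION = 0.1    # the remaining operations are reads


def run_workload(operations: int, seed: int = 42):
    """Generate (operation, record_index) pairs for an insert-mostly mix."""
    rng = random.Random(seed)
    inserted = 1000  # hypothetical number of preloaded records
    ops = []
    for _ in range(operations):
        if rng.random() < INSERT_PROPORTION:
            # Inserts always append a new record at the end of the key space.
            ops.append(("insert", inserted))
            inserted += 1
        else:
            # Assumed approximation of "latest": pick a small offset back
            # from the newest record, so recent records are read most often.
            offset = min(int(rng.expovariate(0.1)), inserted - 1)
            ops.append(("read", inserted - 1 - offset))
    return ops


ops = run_workload(10_000)
insert_count = sum(1 for op, _ in ops if op == "insert")
print("insert share:", insert_count / len(ops))
```

With this mix, nearly all reads hit records that were written moments earlier, so the workload mainly stresses the write path.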

HBase showed the best results under a workload that included large volumes of writes. Cassandra was second. The NDB engine of MySQL Cluster also managed intensive writes perfectly well.

Conclusion

As you can see, there is no perfect NoSQL database. Every database has its advantages and disadvantages that become more or less important depending on your preferences and the type of tasks.

For example, a database can demonstrate excellent performance, but once the number of records exceeds a certain limit, the speed falls dramatically. This means that the particular solution can be good for moderate data loads and extremely fast computations, but it would not be suitable for jobs that require a lot of reads and writes. In addition, database performance also depends on the capacity of your hardware.

It was hardly possible to include all of the performance diagrams and describe everything in one article. You can download the full version of the research, which contains separate chapters dedicated to every database, YCSB and Amazon EC2 configuration details, and an appendix with other performance diagrams, at http://altoros.com/nosql-research.

We hope this research will be useful to both developers working with NoSQL solutions and customers trying to choose a database. Altoros's R&D team will regularly revise and update this research to cover new databases and releases of the most popular products.

About the author: Sergey Bushik is a senior R&D engineer at Altoros. He has more than seven years of experience in implementation of Java-based projects that include big data processing, data mining and Hadoop computations. Sergey has a number of certificates in Java and is a Sun Certified Enterprise Architect for the Java Platform. He is a regular speaker at international conferences -- most recently, he delivered sessions at Big Data Meetup (Sunnyvale, Calif.), GOTO Copenhagen 2012, Hadoop Evening (Eastern Europe), etc.

