
A vendor-independent comparison of NoSQL databases: Cassandra, HBase, MongoDB, Riak

By Sergey Bushik, senior R&D engineer at Altoros Systems Inc., special to Network World
October 22, 2012 04:26 PM ET


Tools, libraries and methods

For benchmarking, we used the Yahoo! Cloud Serving Benchmark (YCSB), which consists of the following components:

• a framework with a workload generator
• a set of workload scenarios

We measured database performance under certain types of workloads. A workload was defined by the distributions assigned to two main choices:

• which operation to perform 
• which record to read or write

Operations against a data store were randomly selected and could be of the following types (a selection sketch follows the list):

Insert: Inserts a new record. 
Update: Updates a record by replacing the value of one field. 
Read: Reads a record, either one randomly selected field, or all fields. 
Scan: Scans records in order, starting at a randomly selected record key. The number of records to scan is also selected randomly from the range between 1 and 100.
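
As an illustration, the following is a minimal Java sketch of how a workload generator might choose the next operation from configured proportions. It is not YCSB's actual code; the class name and proportion parameters are hypothetical, and the scan length (1 to 100 records) would be drawn separately.

    import java.util.Random;

    // A minimal sketch (not YCSB's actual code) of how a workload generator
    // might pick the next operation from configured proportions.
    public class OperationChooser {
        enum Op { INSERT, UPDATE, READ, SCAN }

        private final double insertP, updateP, readP; // scan gets the remainder
        private final Random rnd = new Random();

        OperationChooser(double insertP, double updateP, double readP) {
            this.insertP = insertP;
            this.updateP = updateP;
            this.readP = readP;
        }

        Op next() {
            double r = rnd.nextDouble();
            if (r < insertP) return Op.INSERT;
            if (r < insertP + updateP) return Op.UPDATE;
            if (r < insertP + updateP + readP) return Op.READ;
            return Op.SCAN; // scan length would be drawn from 1..100 separately
        }
    }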

Each workload was targeted at a table of 100,000,000 records; each record was 1,000 bytes in size and contained 10 fields. Each record was identified by a primary key, a string such as "user234123." The fields were named field0, field1, and so on. The value of each field was a random string of ASCII characters, 100 bytes long.
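
The following Java sketch illustrates what one such record looks like; the field names and sizes come from the description above, while the helper class itself is hypothetical.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Random;

    // A rough illustration of one 1,000-byte record: 10 fields of
    // 100 random ASCII bytes each, keyed by a string like "user234123".
    public class RecordBuilder {
        private static final Random RND = new Random();

        static Map<String, String> buildRecord() {
            Map<String, String> record = new HashMap<>();
            for (int i = 0; i < 10; i++) {                 // 10 fields per record
                record.put("field" + i, randomAscii(100)); // 100 random ASCII bytes each
            }
            return record;
        }

        static String randomAscii(int length) {
            StringBuilder sb = new StringBuilder(length);
            for (int i = 0; i < length; i++) {
                sb.append((char) (' ' + RND.nextInt(95))); // printable ASCII
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            String key = "user" + RND.nextInt(100_000_000); // e.g. "user234123"
            Map<String, String> rec = buildRecord();
            System.out.println(key + " -> " + rec.size() + " fields");
        }
    }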

Database performance was defined by the speed at which a database performed basic operations. A basic operation is an action performed by the workload executor, which drives multiple client threads. Each thread executes a sequential series of operations by making calls to the database interface layer, both to load the database (the load phase) and to execute the workload (the transaction phase). The threads throttle the rate at which they generate requests, so that the offered load against the database can be controlled directly. In addition, the threads measure the latency and achieved throughput of their operations and report these measurements to the statistics module.
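
Below is a simplified, hypothetical Java sketch of what one such client thread does: it throttles its request rate to the target load, times each call to the database interface layer, and reports the measured latency to a statistics module. The interfaces and parameter names are stand-ins, not YCSB's actual classes.

    import java.util.concurrent.TimeUnit;

    // A simplified sketch of one client thread in the workload executor.
    public class ClientThread implements Runnable {
        interface DbLayer { void doOperation(); }          // stand-in for the DB interface layer
        interface Stats   { void report(long latencyNs); } // stand-in for the statistics module

        private final DbLayer db;
        private final Stats stats;
        private final double targetOpsPerSec;
        private final long operations;

        ClientThread(DbLayer db, Stats stats, double targetOpsPerSec, long operations) {
            this.db = db;
            this.stats = stats;
            this.targetOpsPerSec = targetOpsPerSec;
            this.operations = operations;
        }

        @Override
        public void run() {
            long intervalNs = (long) (1_000_000_000L / targetOpsPerSec);
            for (long i = 0; i < operations; i++) {
                long start = System.nanoTime();
                db.doOperation();                      // one basic operation
                long latency = System.nanoTime() - start;
                stats.report(latency);                 // throughput is derived from these reports
                long sleep = intervalNs - latency;     // throttle to the offered load
                if (sleep > 0) {
                    try { TimeUnit.NANOSECONDS.sleep(sleep); }
                    catch (InterruptedException e) { Thread.currentThread().interrupt(); return; }
                }
            }
        }
    }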

The performance of the system was evaluated under different workloads:

Workload A: Update heavy
Workload B: Read mostly 
Workload C: Read only 
Workload D: Read latest 
Workload E: Scan short ranges 
Workload F: Read-modify-write 
Workload G: Write heavy

Each workload was defined by:

1) The number of records manipulated (read or written) 
2) The number of fields per record
3) The total size of a record or the size of each column 
4) The number of threads used to load the system

This research also specifies configuration settings for each type of workload. We used the following default settings (expressed as YCSB-style properties in the sketch after the list):

1) 100,000,000 records manipulated 
2) The total size of a record equal to 1KB
3) 10 fields of 100 bytes each per record 
4) Multithreaded communications with the system (100 threads)
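
For reference, the defaults above can be approximated as YCSB-style properties. The property names below follow the public YCSB conventions; the exact names and configuration file used in this study are an assumption.

    import java.util.Properties;

    // An approximation of the default settings, expressed as YCSB-style properties.
    public class DefaultWorkloadSettings {
        static Properties defaults() {
            Properties p = new Properties();
            p.setProperty("recordcount", "100000000"); // 100,000,000 records
            p.setProperty("fieldcount", "10");         // 10 fields per record
            p.setProperty("fieldlength", "100");       // 100 bytes per field -> ~1KB per record
            p.setProperty("threadcount", "100");       // 100 client threads
            return p;
        }
    }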

Testing environment

To provide verifiable results, benchmarking was performed on Amazon Elastic Compute Cloud (EC2) instances. The YCSB client was deployed on a single Amazon Large Instance:

• 7.5GB of memory 
• four EC2 Compute Units (two virtual cores with two EC2 Compute Units each) 
• 850GB of instance storage 
• 64-bit platform 
• high I/O performance 
• EBS-Optimized (500Mbps) 
• API name: m1.large

Each of the NoSQL databases was deployed on a four-node cluster in the same geographical region on Amazon Extra Large Instances:

• 15GB of memory 
• eight EC2 Compute Units (four virtual cores with two EC2 Compute Units each) 
• 1690GB of instance storage 
• 64-bit platform 
• high I/O performance 
• EBS-Optimized (1000Mbps) 
• API name: m1.xlarge

Amazon is often criticized for its high I/O wait times and comparatively slow EBS performance. To mitigate these drawbacks, the EBS disks were assembled into a striped RAID0 array, which provided up to two times higher performance.

The results

When we started our research into NoSQL databases, we wanted to get unbiased results that would show which solution is best suited to each particular task. That is why we decided to test the performance of each database under different types of load and let users decide which product best suits their needs.

We started by measuring the load phase, during which 100 million records, each containing 10 fields of 100 randomly generated bytes, were imported into a four-node cluster.

HBase demonstrated by far the best write speed. With pre-created regions and deferred log flush enabled, it reached 40K ops/sec. Cassandra also showed great performance during the loading phase, with around 15K ops/sec. In Cassandra, data is first appended to the commit log, which is a fast operation, and then written to a per-column-family in-memory store called a Memtable. Once the Memtable becomes full, the data is flushed to disk as an SSTable. Incidentally, MySQL Cluster could show much better results in its purely in-memory mode.
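
As a rough illustration of the HBase setup mentioned above, the sketch below pre-creates regions and enables deferred log flush using the HBase 0.92/0.94-era Java API. The table name, column family, and split points are hypothetical; the study does not publish its exact settings.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    // A sketch of a pre-split table with deferred log flush (HBase 0.92/0.94-era API).
    public class PreSplitTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            HTableDescriptor desc = new HTableDescriptor("usertable"); // hypothetical table name
            desc.addFamily(new HColumnDescriptor("family"));           // hypothetical column family
            desc.setDeferredLogFlush(true); // defer WAL flushes to speed up bulk writes

            // Pre-create regions so the load is spread across all region servers
            // from the start instead of hammering a single region.
            byte[][] splits = new byte[][] {
                Bytes.toBytes("user2"), Bytes.toBytes("user4"),
                Bytes.toBytes("user6"), Bytes.toBytes("user8")
            };
            admin.createTable(desc, splits);
            admin.close();
        }
    }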

• Workload A: Update-heavy mode. Workload A is an update-heavy scenario that simulates the work of a database recording the typical actions of an e-commerce solution's user. Settings for the workload (see the sketch after the list):
1) Read/update ratio: 50/50
2) Zipfian request distribution
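
Workload A closely matches the standard YCSB update-heavy workload, and its parameters can be approximated with the following YCSB-style properties. The property names follow public YCSB conventions; the study's exact configuration file may differ.

    import java.util.Properties;

    // Workload A parameters expressed as YCSB-style properties (a sketch).
    public class WorkloadA {
        static Properties settings() {
            Properties p = new Properties();
            p.setProperty("readproportion", "0.5");          // 50% reads
            p.setProperty("updateproportion", "0.5");        // 50% updates
            p.setProperty("requestdistribution", "zipfian"); // Zipfian key popularity
            return p;
        }
    }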
