Review: Cloud storage

Storage on a budget: GlusterFS shines in open source storage test

But beware of tradeoffs when it comes to documentation, management tools and failover


Monitoring and extending a Ceph cluster is performed at the prompt using a variety of "ceph" commands. For example, to check on the health of a cluster you can use commands like "ceph health" or "ceph status", which print the cluster's current state to the screen. This is helpful on the fly, but for the bigger picture we would have liked to see some sort of Web GUI. Ceph does not currently include any GUI tools, but there is a management API for C and C++ that provides interaction with Ceph monitors and daemons. Third-party vendor Inktank offers a commercial product, Inktank Ceph Enterprise, that pairs a management GUI (Calamari) with 24/7 support options.
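For reference, the basic checks look like this at the prompt; ceph osd tree and ceph -w are related commands that show the OSD layout and a live stream of cluster events (exact output varies by release):

ceph health      # one-line summary: HEALTH_OK, HEALTH_WARN or HEALTH_ERR
ceph status      # monitor quorum, OSD count and storage utilization
ceph osd tree    # OSD layout, organized by host
ceph -w          # watch cluster events as they happen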

For the most part we found the Ceph online instructions to be adequate, although some additional information on troubleshooting and setup configurations would have been helpful.

Apache Hadoop

Apache Hadoop is an open-source data management framework that provides distributed storage and processing of large data sets across clusters of computers. It can scale from a single machine to thousands of servers. The basic components of the Hadoop project are the Hadoop Distributed File System (HDFS) and Hadoop MapReduce. MapReduce is a framework that works with HDFS to perform high volume data processing. Since our main focus for this review was software-defined storage, we did not test the MapReduce feature.

We tested Hadoop version 2.2.0, the first 2.x general availability release. As for hardware requirements, we did not locate any specific baseline recommended by Apache, but commercial users recommend quad-core CPUs, 8GB to 16GB or more of RAM and 2TB or larger hard drives as a good starting point. Hadoop is built to scale out well on budget-friendly equipment and to tolerate hardware failures, so there is not necessarily a need to spend resources on features like redundant power supplies and RAID hardware.


With the newly released version 2.2.0 in hand we set off to install Apache Hadoop as a simple cluster on a pair of CentOS 6.4 servers. Since Hadoop is written in Java and our servers did not already have Java installed, we went ahead and installed Java 1.7 (OpenJDK). Hadoop communication between nodes requires SSH, so we installed OpenSSH and generated a private key. Following the online instructions, we then edited a couple of configuration files prior to starting the actual installation (the .bashrc and hadoop-env.sh files, which set the JAVA_HOME variable that tells Hadoop where to find Java).
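The preparation boils down to a handful of commands; the package names and JAVA_HOME path shown here are what we'd expect on a stock CentOS 6.4 system and may differ on other distributions:

# Install Java 1.7 (OpenJDK) and the SSH server and client
yum install java-1.7.0-openjdk-devel openssh-server openssh-clients

# Generate a passphrase-less key so the Hadoop scripts can reach each node
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id localhost

# Point Hadoop at the Java install (added to ~/.bashrc and etc/hadoop/hadoop-env.sh)
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk.x86_64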

After downloading the latest Hadoop files we were ready to do the actual install. The software installation is fairly straightforward and involves unpacking the software on each of the nodes in the cluster or just installing the RPMs. Once the install is completed, two types of configuration files (read-only defaults and site-specific configurations) need attention in order to get a cluster set up properly, especially for large, complex clusters. These configurations are a series of XML files, and although many of the default values can be left as-is, installation-specific parameters such as host names and ports most likely need to be set.
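As a rough sketch, a minimal single-node setup amounts to unpacking the tarball, filling in the site-specific XML and starting the daemons; the host name and paths below are placeholders, not values from our lab:

tar -xzf hadoop-2.2.0.tar.gz -C /opt && cd /opt/hadoop-2.2.0

# etc/hadoop/core-site.xml -- tell clients where the namenode listens
cat > etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode-host:8020</value>
  </property>
</configuration>
EOF

bin/hdfs namenode -format    # initialize the namenode metadata
sbin/start-dfs.sh            # start the namenode and datanode daemons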

A Hadoop cluster consists of a namenode, which manages the file system metadata, and one or more datanodes that store the data. The namenode uses a block map to keep track of where each block is stored on each datanode. The metadata on the namenode is organized into directories and files, with files divided into uniform blocks (64MB by default in earlier Hadoop releases, 128MB in the 2.x line); the block size can be changed in the configuration files to suit specific storage requirements. It should be noted that HDFS is "rack aware," meaning that the namenode not only knows which datanode the data is stored on, it can also know which rack that datanode resides in and where it sits relative to other datanodes. The datanodes report to the namenode through regular heartbeats, and if a datanode fails, the namenode will automatically replicate its data to a different datanode.
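Both the block size and the replication factor are ordinary hdfs-site.xml settings; the values below are illustrative rather than recommendations:

# etc/hadoop/hdfs-site.xml -- block size and replication overrides
cat > etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>  <!-- 128MB blocks -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>          <!-- keep three copies of every block -->
  </property>
</configuration>
EOF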

Client access to data stored in HDFS is done through a JVM (Java Virtual Machine) requesting access to a file or directory from the namenode. If the namenode approves the request, it provides information about the datanodes and blocks where the requested data is stored. The application then performs data operations (such as a read, write or delete) directly on the datanodes without going back through the namenode, which reduces the namenode's workload. Storage integrity in Hadoop is achieved through replication across multiple hosts, which removes the need for RAID. HDFS is not currently fully POSIX compliant, but by relaxing some of the POSIX requirements to enable streaming access to file system data, Apache claims HDFS offers better performance.
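From any machine with the Hadoop binaries installed, those read, write and delete operations are exposed through the hdfs dfs command; the paths and file names here are placeholders:

hdfs dfs -mkdir -p /user/testdata
hdfs dfs -put local-file.csv /user/testdata/    # write: blocks stream straight to the datanodes
hdfs dfs -cat /user/testdata/local-file.csv     # read: fetched from the datanodes holding the blocks
hdfs dfs -rm /user/testdata/local-file.csv      # delete: the namenode schedules the blocks for removal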

Various fuse-dfs projects allow HDFS to be mounted as a standard file system on most Linux/Unix flavors using the mount command. Most traditional operations such as mkdir, rmdir, cp, rm, cat and mv are supported when HDFS is mounted through FUSE; however, certain permissions-related operations (chown and chmod among them) are not.
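A FUSE mount looks roughly like the following; the hadoop-fuse-dfs wrapper is shipped by some distributions, while a build from the Apache source tree produces a fuse_dfs binary instead, so the exact command varies with packaging:

# Mount HDFS at /mnt/hdfs through FUSE
mkdir -p /mnt/hdfs
hadoop-fuse-dfs dfs://namenode-host:8020 /mnt/hdfs

# Ordinary file operations work as expected...
mkdir /mnt/hdfs/tmp/demo
cp /etc/hosts /mnt/hdfs/tmp/demo/

# ...but permission operations such as chown and chmod are not supported here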

For our review we first installed the NFS utilities on one of our test servers, then enabled NFS in the HDFS XML configuration files. This can be done using Ambari or from the command prompt. For the HDFS-enabled NFS gateway to work, we first needed to stop the native Linux NFS services and then launch the HDFS services. Once these services are started, HDFS can be mounted on different clients (Linux, Windows, Mac), provided a user with HDFS permissions is created on the client machine.
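The sequence maps roughly to the commands below (service and command names can differ slightly between Hadoop and Linux releases; the host name is a placeholder):

# Stop the native Linux NFS services so the gateway can claim the ports
service nfs stop
service rpcbind stop

# Start the HDFS NFS gateway daemons
hdfs portmap
hdfs nfs3

# Mount HDFS over NFSv3 from a client machine
mount -t nfs -o vers=3,proto=tcp,nolock namenode-host:/ /mnt/hdfs-nfs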

We mounted HDFS as a file system on a separate Linux client machine and were able to perform file tasks such as copying data off of HDFS onto the local file system. Hadoop can be managed from the command line using hadoop commands; for example, hadoop fsck / -locations prints out the location of every block. There are also various Web interfaces, including an internal Web server on each namenode and datanode, that display basic statistics about the cluster and can be used to browse the file system and view the logs.
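A few examples of the command-line checks and the built-in web pages (the ports shown are the 2.x defaults):

hadoop fsck / -files -blocks -locations    # block map and datanode placement for the whole file system
hdfs dfsadmin -report                      # per-datanode capacity, usage and health

# Built-in web interfaces
#   http://namenode-host:50070    namenode status, file system browser, logs
#   http://datanode-host:50075    individual datanode status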

Then there is Apache Ambari, which can be used to provision, manage and monitor Hadoop clusters and is a good alternative for those who prefer a graphical management tool. We decided to try Ambari for ourselves using instructions found on the Apache website. Although Ambari is still in the incubating stage at Apache (we tested version 0.9), we found the documentation to be good, and after some minor configuration wrinkles we had the Ambari dashboard up and running in our browser. Although Ubuntu had been our go-to OS for this review, it turned out we needed CentOS to run Ambari, since Ubuntu is not currently supported. The Ambari dashboard provides an at-a-glance overview of the Hadoop cluster status and metrics.

Values such as HDFS capacity, memory load, network and cluster load are displayed with the ability to drill down to view additional detail. Ambari has a cool heat map feature that allows you to view usage data such as "disk space used by host" and "HDFS garbage collection time". There is also a tab for services where you can monitor, start and stop specific services, and a host tab from which hosts can be added/removed as well as managed and monitored.
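For the curious, getting the server side of Ambari running on CentOS follows the pattern below; the repository setup is omitted, and the incubating 0.9 build we tested differed from current releases in some of the details:

# Install and set up the Ambari server (CentOS)
yum install ambari-server
ambari-server setup    # configures the bundled database and JDK
ambari-server start

# The dashboard is then reachable at http://ambari-host:8080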

Compared to GlusterFS and Ceph, we found Hadoop a tad more cumbersome to install and configure. There are decent instructions on the Apache website along with several third-party sites that provide bits and pieces that are useful, but we would have liked to see a more comprehensive set of step-by-step instructions from the vendor.

Hadoop is widely used. Yahoo claims to be the largest Hadoop user and currently has more than 40,000 nodes in its clusters handling close to 400 petabytes (PB) of data. Facebook is reportedly storing hundreds of petabytes in its Hadoop cluster. Several big-name vendors provide commercial implementations of Hadoop (EMC, Dell, Microsoft and Cloudera, to name a few). Other third parties such as Hortonworks offer various levels of support, while monitoring tools such as Nagios and Ganglia provide extensions that plug into Hadoop.

Perschke is a web and database developer with 15+ years of industry experience. You can reach her at susan@arcseven.com.

Copyright © 2014 IDG Communications, Inc.
