Many companies are anxious to take advantage of big data cloud services to crunch vast amounts of data for analysis. However, the lack of inherent data security can be a deal-breaker. Now there's a new service that provides integrated data encryption throughout the processes and infrastructure of Amazon EMR. Service subscribers maintain complete control of their encryption keys, thus bolstering the security of their data.
More and more companies are beginning to use big data cloud services to process vast amounts of data. Such services make it possible to parse and analyze unstructured data in order to identify patterns and ultimately improve business strategies.
For example, a security analytics application can take huge amounts of data from log files or a security information and event management (SIEM) system and crunch it to find indications of interest, attack or compromise. Armed with the resulting information, an organization can fine-tune its IT security defenses to stop future attacks on its IT resources.
Security analytics is just one up-and-coming use for big data cloud services. Other common uses include financial analysis, scientific simulation, and seismic data analysis. Once companies realize they can reduce long computation processes from days or weeks down to mere hours, they will come up with many more use cases.
It's easy to see the business benefits of accelerating these data-intensive processes, but there's a gray lining to this silver cloud service: poor controls for data security.
It's the same old story about securing data that goes into the cloud. The best way to protect data is to encrypt it. However, the applications that need to process the data must be able to read it in clear text. That means that the cloud service needs access to the keys to decrypt the data, and companies don't want to share their encryption keys with a third party like a cloud service provider.
It's a damned-if-we-do, damned-if-we-don't situation that creates a host of business problems. For one thing, the data owner is liable for any data breaches, even if they result from actions taken by a third party service provider. In addition, the company that owns the data may be prohibited by government regulations or corporate policy from sharing access to data in the clear with any third party.
One of the more popular big data cloud services is Amazon Elastic MapReduce (EMR). It uses a hosted Apache Hadoop framework running on Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). The underlying Hadoop framework uses a computational paradigm called MapReduce that divides work into small fragments and sends them out to numerous nodes for simultaneous processing. This allows Amazon EMR to spread a job across thousands of computers, each working independently on its own fragment, and to handle terabytes of data.
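To make the MapReduce idea concrete, here is a minimal sketch in plain Python. It is not Hadoop or Amazon EMR code; the log format, the function names and the use of a local thread pool in place of cluster nodes are all assumptions made for illustration. It splits a small log into fragments, maps each fragment to (source IP, 1) pairs for failed logins, groups the pairs by IP, and reduces each group to a count.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_fragment(lines):
    """Map step: emit (source_ip, 1) for each failed-login line in one fragment."""
    pairs = []
    for line in lines:
        fields = line.split()
        # Assumed log format: "<source_ip> <event_name>"
        if len(fields) >= 2 and fields[1] == "LOGIN_FAILED":
            pairs.append((fields[0], 1))
    return pairs


def shuffle(mapped_fragments):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_fragments:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: collapse one key's values into a single total."""
    return key, sum(values)


def run_job(lines, fragment_size=2, workers=4):
    """Split the input into fragments, map them in parallel, then reduce."""
    fragments = [
        lines[i:i + fragment_size] for i in range(0, len(lines), fragment_size)
    ]
    # The thread pool stands in for the cluster nodes a real service would use.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_fragment, fragments))
    groups = shuffle(mapped)
    return dict(reduce_group(key, values) for key, values in groups.items())


# Hypothetical sample data, invented for this sketch.
SAMPLE_LOG = [
    "10.0.0.5 LOGIN_FAILED",
    "10.0.0.7 LOGIN_OK",
    "10.0.0.5 LOGIN_FAILED",
    "10.0.0.9 LOGIN_FAILED",
    "10.0.0.5 LOGIN_OK",
]

if __name__ == "__main__":
    print(run_job(SAMPLE_LOG))
```

Because each fragment is mapped without reference to any other, the map step can be handed to as many workers as are available, which is what lets the approach scale out.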
Most subscribers of this service upload their data into Amazon S3 with the data in clear text. Amazon EMR requires the data in clear text format, and unfortunately there is no native approach to allow the subscriber to manage the encryption of their data while it is in EMR. While the S3 storage component of EMR does offer encryption, the keys are managed by Amazon, not by the data owner. As a result, data ends up sitting in clear text in various components of the Amazon EMR solution. Needless to say, this is a deal-breaker for many organizations that just can't take the risk of a data breach.
Amazon recognized this and approached Gazzang to provide a solution. The result is Gazzang CloudEncrypt, an integrated data encryption solution for Amazon EMR. CloudEncrypt is a transparent data encryption solution that is purpose-built to protect data as it moves from a company's own data center and into the Amazon EMR infrastructure. The solution is built on Gazzang's data encryption and key management technologies.
Here's a high-level view of how the CloudEncrypt solution works:
A subscriber to Amazon EMR is given a client-side uploader that allows them to encrypt their data going into Amazon EMR and to maintain the encryption keys in a Gazzang zTrustee Key Server that can be either on-premises or in the cloud, depending on subscriber requirements. Thus, the service subscriber and not Amazon maintains possession of the keys.
When the EMR nodes ingest the data, the nodes get the encryption key from the zTrustee server to decrypt the data into memory. This allows the data to be processed as dictated by the application, but only a process, never a human, accesses the encryption keys. Any datasets that result from all the number crunching are automatically encrypted. What's more, any time that data is written to disk, for example in the Hadoop distributed file system (HDFS) stores on the Amazon EMR nodes, it is encrypted once again via an automated process and the keys are stored in the key server.
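The flow described above can be sketched in a few lines of Python. This is not Gazzang's or Amazon's code: the KeyServer class, the process names and the helper functions are hypothetical, and the cipher is a toy SHA-256 keystream used only so the example runs with the standard library. A real deployment would use a vetted cipher such as AES and a real key management service.

```python
import hashlib
import secrets


def _keystream_xor(key, nonce, data):
    """Toy stream cipher (illustration only, not for production use)."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + nonce + counter).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ k for b, k in zip(chunk, block))
    return bytes(out)


def encrypt(key, plaintext):
    """Prefix a fresh random nonce so that each message uses a new keystream."""
    nonce = secrets.token_bytes(16)
    return nonce + _keystream_xor(key, nonce, plaintext)


def decrypt(key, blob):
    nonce, body = blob[:16], blob[16:]
    return _keystream_xor(key, nonce, body)


class KeyServer:
    """Hypothetical stand-in for a key server controlled by the subscriber."""

    def __init__(self):
        self._keys = {}
        self._authorized = set()

    def create_key(self, key_id):
        self._keys[key_id] = secrets.token_bytes(32)

    def authorize(self, process_id):
        self._authorized.add(process_id)

    def fetch_key(self, key_id, process_id):
        # Only processes the subscriber has authorized may obtain a key.
        if process_id not in self._authorized:
            raise PermissionError(f"{process_id} may not read {key_id}")
        return self._keys[key_id]


def client_upload(server, key_id, plaintext):
    """Client side: encrypt before the data leaves the company's data center."""
    key = server.fetch_key(key_id, "uploader")
    return encrypt(key, plaintext)


def process_on_node(server, key_id, node_id, blob):
    """Node side: decrypt into memory, compute, and re-encrypt the result."""
    key = server.fetch_key(key_id, node_id)
    records = decrypt(key, blob).splitlines()
    result = str(len(records)).encode()  # stand-in for the real analysis
    return encrypt(key, result)


if __name__ == "__main__":
    server = KeyServer()
    server.create_key("job-1")
    server.authorize("uploader")
    server.authorize("emr-node-1")
    blob = client_upload(server, "job-1", b"rec1\nrec2\nrec3")
    encrypted_result = process_on_node(server, "job-1", "emr-node-1", blob)
    print(decrypt(server.fetch_key("job-1", "uploader"), encrypted_result))
```

The point of the sketch is the division of responsibility: whoever runs the key server decides which processes can obtain a key, and everything outside the node's memory stays encrypted.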
As a result, data is encrypted at each step of the process and throughout the Amazon EMR infrastructure, and all keys are centrally managed by the key server that is totally controlled by the service subscriber. The keys are inaccessible to the cloud provider as well as any unauthorized process or person. This allows a company to take complete advantage of Amazon EMR for big data analytics without compromising data security. The solution is simple and elegant because it is fully integrated with Amazon EMR.
Linda Musthaler is a principal analyst with Essential Solutions Corporation. You can write to her at LMusthaler@essential-iws.com.
About Essential Solutions Corp:
Essential Solutions researches the practical value of information technology, and how it can make individual workers and entire organizations more productive. Essential Solutions offers consulting services to computer industry and corporate clients to help define and fulfill the potential of IT.