The biggest news to come out of Amazon Web Services' recent first user conference was the launch of the company's newest service, Redshift, a cloud-based data warehouse tool. And it prompts the question: Is the cloud the right place for data warehousing?
AWS officials say for businesses struggling to manage their data, the cloud can provide a low-cost alternative to investing in infrastructure to manage it all on their own sites. Perhaps the biggest issues holding back Redshift are the same concerns that come along with using the public cloud in general, though. Some just don't feel comfortable putting sensitive financial or personally identifiable data in anyone's public cloud. And then there's the issue of how all that data is actually transferred into the cloud.
These issues -- potential benefits in cost and manageability, weighed against concerns about security and data transfer -- will likely mean that Redshift follows the path of many of AWS's other enterprise-geared services, says Jeff Kelly, a big-data researcher at The Wikibon Project. Forward-looking businesses that have already embraced Amazon's cloud may move more quickly to the cloud for services like data warehousing, whereas larger enterprises that have been slow to jump into the public cloud may test the service on a use-case basis to see if it's the right fit for them.
Data warehouses have traditionally been defined as customized data storage services that aggregate data from multiple sources and collect it in a central location where reports and queries can be run against it. Many companies use data warehouses to compile regular financial reports or business metric analyses. Redshift is a columnar, SQL-based tool designed to scale from a terabyte to multiple petabytes.
Along with announcing Redshift, AWS also released two new virtual machine instance types meant to work with Redshift: an XL instance with 2TB of local storage and an 8XL instance with 16TB of storage. AWS partnered with database analysis company ParAccel to architect Redshift after Amazon.com, AWS's parent company, invested in ParAccel last year. Like traditional on-premise data warehouses, Redshift can be architected to, for example, integrate data from Amazon's DynamoDB NoSQL database, Simple Storage Service (S3), or existing applications on customers' own premises. Redshift serves as a repository that exposes the data to business analytics tools, which run reports on it.
"I think there will definitely be some interest" for Redshift, says Kelly, the Wikibon researcher. "One issue with data warehousing is many times this is highly critical, proprietary information that some may be reluctant to ship off to a cloud provider." For organizations with data that is siloed, has variable demands, or for companies that don't have the on-premise infrastructure to manage data warehousing, it could be an attractive option, though. "If you're already doing data management in the cloud, and particularly Amazon's cloud, this seems like an opportunity to take advantage of a new service," he says.
One of the biggest challenges with data warehousing in the cloud is getting the data into AWS's cloud in the first place. Pumping terabytes, or even petabytes, of data into AWS's cloud over the public Internet not only raises security concerns but also eats up bandwidth. AWS does offer direct connections to its cloud through third-party provider sites, like Equinix. And AWS officials say sending data on physical disks via a shipping service is a common way customers get data into and out of AWS's cloud.
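To put the bandwidth concern in rough numbers, the sketch below estimates how long an upload would take at a given link speed. The link speeds, the data sizes and the 80% utilization figure are illustrative assumptions, not figures from AWS, and `transfer_days` is a hypothetical helper written for this example.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(terabytes, link_mbps, utilization=0.8):
    """Estimate days needed to move `terabytes` over a `link_mbps` megabit/s link.

    `utilization` is the assumed share of the link actually available
    for the upload (an illustrative default, not a measured value).
    """
    bits = terabytes * 1e12 * 8                  # decimal terabytes to bits
    effective_bps = link_mbps * 1e6 * utilization
    return bits / effective_bps / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical scenarios: (data size in TB, link speed in Mbit/s).
    for size_tb, mbps in [(1, 100), (100, 1_000), (1_000, 10_000)]:
        days = transfer_days(size_tb, mbps)
        print(f"{size_tb:>5} TB over {mbps:>6} Mbit/s: about {days:.1f} days")
```

Under these assumptions, even a fast dedicated link needs days to move a petabyte, which helps explain why shipping physical disks remains a common option.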
Of course, data migration to the cloud is not as much of a problem if the data is already in AWS's cloud, which is the case for many startups that have gone all in on AWS's services thus far. AWS released Data Pipeline on the second day of the conference to help manage the transfer of data all around AWS's cloud using 10 gigabit connections. But many businesses with a lot of data already have a data warehouse. An enterprise may test out Redshift for new data warehousing, but sensitive information about the company, such as financial reports or personally identifiable information of customers, may not make it up there any time soon, Kelly suggests.
One of the biggest advantages of Redshift, AWS says, is the cost. Based on Amazon.com's own use of Redshift, AWS claims it can manage data for around $1,000 per terabyte per year, compared with $19,000 to $25,000 per terabyte per year for an on-premise data warehouse.
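The arithmetic behind that comparison can be sketched in a few lines. The per-terabyte rates are the ones AWS quotes; the 50TB warehouse size and the assumption that cost scales linearly with size are illustrative only, and the function names are hypothetical.

```python
# Per-terabyte annual costs quoted by AWS (US dollars).
REDSHIFT_PER_TB_YEAR = 1_000
ON_PREMISE_PER_TB_YEAR = (19_000, 25_000)  # low and high ends of the quoted range


def annual_cost(terabytes, per_tb_year):
    """Annual cost at a flat per-terabyte rate (linear scaling is an assumption)."""
    return terabytes * per_tb_year


def savings_range(terabytes):
    """Return (low, high) annual savings of Redshift versus on-premise."""
    redshift = annual_cost(terabytes, REDSHIFT_PER_TB_YEAR)
    low, high = (annual_cost(terabytes, rate) - redshift
                 for rate in ON_PREMISE_PER_TB_YEAR)
    return low, high


if __name__ == "__main__":
    size_tb = 50  # hypothetical warehouse size, not from AWS
    low, high = savings_range(size_tb)
    print(f"{size_tb} TB on Redshift: ${annual_cost(size_tb, REDSHIFT_PER_TB_YEAR):,} per year")
    print(f"Estimated savings: ${low:,} to ${high:,} per year")
```

At the quoted rates, on-premise storage costs 19 to 25 times as much per terabyte, so the gap grows in direct proportion to the size of the warehouse.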
That's a potential cost savings for big companies, and it removes a cost barrier that has held data warehousing back from small and mid-sized businesses, says AWS Chief Data Scientist Matt Wood. Equally important, he says, is that Redshift and other AWS services allow companies to focus on their own businesses, instead of managing infrastructure. Redshift is "designed to take away the undifferentiated heavy lifting of running infrastructure at heavy scale," Wood says. "This allows you to focus on your core competencies."
So if AWS believes data warehousing is such a great fit for cloud computing, why haven't any other vendors done it? Kognitio, a European data management and BI platform, has made some rumblings about cloud-based data warehousing and is attempting to push into the U.S. enterprise market, but has not gained a large amount of traction since making the push more than two years ago. The likes of Oracle, Microsoft, IBM and other data warehousing stalwarts can enable cloud-based data warehousing, but have not been overtly advertising the capability.
Then there are the new players in this space. EMC and VMware made somewhat of a splash recently when the companies announced their Pivotal Initiative, a combination of big data and cloud-based technologies from each of the companies. Google, with its BigQuery service, is another player to watch in this space, Kelly says.
Redshift seemed like a natural move for AWS, though. The company has been looking to beef up its products, services and general appeal to the enterprise market recently, as evidenced by the announcement of new services like Redshift and Glacier. AWS executives spoke quite a bit about the enterprise market at the user conference as well, clearly making a pitch to big businesses. Redshift is still in the early stages, though; AWS only announced a limited beta of the product and has been mum thus far on when a full-featured Redshift will be available.
Even if most enterprises are not ready for large use cases of data warehousing in the cloud right now, Philip Russom, research director for data management at The Data Warehousing Institute, says Redshift could be an attempt by AWS to be a first mover in this market. "If you're a vendor, you want to be out ahead of the demand before it really cranks up," he says. "Amazon has a good track record in the cloud world, so if someone is looking to offload data warehousing to the cloud, they seem like a natural place to look."