Network World - Amazon's infamous cloud outage in April brought down a number of popular Web sites, including foursquare and Reddit, but many of Amazon's enterprise cloud customers weathered the storm without any downtime.
Those customers had architected their systems for resiliency: they used multiple availability zones, kept hot backups in traditional data centers, or had a backup cloud provider set up and ready to go in case of a problem.
Silicon Valley-based photo-sharing company SmugMug stayed up through the outage even as its peers failed. That was partly because it avoided Amazon's Elastic Block Store (EBS), the particular service component that went down.
But the company also spread its systems across several Amazon data centers, which Amazon calls "availability zones."
Other companies would have stayed up as well if they had distributed their applications the same way, says SmugMug CEO Chris MacAskill. He also recommends that companies use multiple Amazon regions, which are more isolated from one another than availability zones. Of course, Amazon charges extra for using multiple zones, so that cost needs to be taken into account.
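The benefit of spreading systems across availability zones can be illustrated with a minimal Python sketch. It assumes a fleet of interchangeable instances and uses hypothetical zone names (`zone-a` and so on, not real Amazon identifiers); it places instances round-robin across zones and counts how many keep running if the most heavily loaded zone fails.

```python
from collections import Counter
from typing import Dict, List


def spread_across_zones(n_instances: int, zones: List[str]) -> Dict[str, int]:
    """Place instances round-robin so per-zone counts differ by at most one."""
    if not zones:
        raise ValueError("need at least one zone")
    counts: Counter = Counter()
    for i in range(n_instances):
        counts[zones[i % len(zones)]] += 1
    return dict(counts)


def capacity_after_zone_loss(placement: Dict[str, int]) -> int:
    """Instances still running if the most heavily loaded zone goes down."""
    if not placement:
        return 0
    return sum(placement.values()) - max(placement.values())


if __name__ == "__main__":
    # Hypothetical zone names, for illustration only.
    single = spread_across_zones(10, ["zone-a"])
    multi = spread_across_zones(10, ["zone-a", "zone-b", "zone-c"])
    print(single, capacity_after_zone_loss(single))  # {'zone-a': 10} 0
    print(multi, capacity_after_zone_loss(multi))    # {'zone-a': 4, 'zone-b': 3, 'zone-c': 3} 6
```

With everything in one zone, losing that zone leaves nothing running; spread across three, six of ten instances survive. The same reasoning applies to regions, and the sketch deliberately ignores the extra cost of running in several places.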
SmugMug relies heavily on Amazon, using its Simple Storage Service (S3) to store customer photos and videos, and it runs many Elastic Compute Cloud (EC2) instances. But instead of Elastic Block Store, which attaches volumes to individual EC2 instances and is often used for operational data, the company still relies on traditional data centers.
That approach has its own downsides: the week of Amazon's outage, for example, the company lost a core router, its backup, and a core master database server. "I wish I didn't have to deal with routers or database hardware failures anymore, which is why we're still marching towards the cloud," MacAskill says.
Despite the outage, he adds, the cloud services SmugMug gets from Amazon are still better than what it could build on its own, and better than what other cloud providers offer. "We're very committed to them," he says.
Israel-based startup Kitely Ltd. used only one of Amazon's availability zones, fortunately not the one that went down.
Even so, the company plans to learn from the experience. "We intend to split all of our services across multiple availability zones," says Oren Hurvitz, Kitely's vice president of R&D.
Kitely, which runs cloud-based virtual meeting and collaboration environments based on the OpenSim platform, also performs continuous checks to ensure that all of its services are up and running.
"Our system is designed with the assumption that any service might stop working at any time," he says. "If we discover that a server is not responding then we terminate it and start a new server instead."
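Kitely's terminate-and-replace approach can be sketched in a few lines of Python. This is not Kitely's code: `Server`, `is_responding`, `terminate` and `launch` are hypothetical stand-ins for whatever health probe and cloud API a real deployment would use, and a production version would run the pass repeatedly on a timer rather than once.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Server:
    server_id: str
    role: str


def check_and_replace(
    servers: List[Server],
    is_responding: Callable[[Server], bool],
    terminate: Callable[[Server], None],
    launch: Callable[[str], Server],
) -> List[Server]:
    """Run one health-check pass and return the fleet after replacements."""
    fleet: List[Server] = []
    for server in servers:
        if is_responding(server):
            fleet.append(server)
        else:
            # Assume any service can fail at any time: rather than repairing
            # the unresponsive server, terminate it and start a fresh one
            # with the same role.
            terminate(server)
            fleet.append(launch(server.role))
    return fleet


if __name__ == "__main__":
    next_id = iter(range(100, 200))
    down = {"db-1"}  # hypothetical: pretend this server stopped answering
    fleet = [Server("web-1", "web"), Server("db-1", "db")]
    fleet = check_and_replace(
        fleet,
        is_responding=lambda s: s.server_id not in down,
        terminate=lambda s: print("terminating", s.server_id),
        launch=lambda role: Server(f"{role}-{next(next_id)}", role),
    )
    print([s.server_id for s in fleet])  # ['web-1', 'db-100']
```

Passing the probe and cloud calls in as functions keeps the replacement policy separate from any particular provider's API.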
Mashery, which provides API services to more than 100 companies such as Best Buy, Hoover's and The New York Times, also came through the outage unaffected because it used multiple availability zones. But Mashery has another backup plan as well: a traditional data center.
"We very early on realized that there could be a service problem where Amazon would be entirely unavailable, and we decided that we needed fail-over infrastructure," Mashery CEO Oren Michels says. "We have dedicated hardware with Internap."
Atlanta-based Internap Network Services Corp. provides Mashery not only with a hot backup site but also with a production environment for customers that need lower latency than a cloud can deliver, or that need services in geographic areas where Amazon is not available.
"We maintain plenty of infrastructure on both sides to handle peak load," he says.
When Mashery first built its cloud infrastructure two years ago, Amazon was the only real player in town, so backing up to another cloud was not an option. Now it might be.
"We're definitely keeping our eye on it," he says. "But if it ain't broke, don't fix it. Amazon has worked amazingly well for us. Likewise, Internap has been a great partner and continues to provide us the services we need."
Internap has even lowered its prices to stay competitive, he adds, though price isn't the major factor in his decision-making.
"We have a hundred huge brands as customers," he says. "It's more expensive to lose customers in case their stuff goes down. Our customers pay us to solve their API problems, and that includes that we stay up if there's an outage."
Companies that are just making the transition to the cloud often use traditional data centers as backups at the start of the process, says Rob Enderle, an analyst at research firm Enderle Group.
"You can have a set of lesser resources that are on stand-by that you can failover to," he says. "Often, that's whatever you had before you moved to the cloud. You can fail-over to a lower-performing technology and still hold your customers."
Companies that run some applications in a traditional data center and some in the cloud may be able to double up, he says, and use the same disaster recovery site for both, since the odds are low that Amazon would go down at exactly the same time as the traditional data center.
But he warns against relying too heavily on one set of services in a cloud as the backup for another set of services in the same cloud.
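Enderle's point about the odds can be made concrete with a short calculation. The sketch below assumes the two sites fail independently, an assumption that does not hold when they share underlying resources, and uses made-up availability figures rather than published numbers for Amazon or any data center.

```python
HOURS_PER_YEAR = 24 * 365


def joint_unavailability(*availabilities: float) -> float:
    """Chance that every site is down at once, ASSUMING independent failures."""
    p = 1.0
    for a in availabilities:
        p *= 1.0 - a
    return p


def expected_joint_downtime_hours(*availabilities: float) -> float:
    """Expected hours per year with all sites down simultaneously."""
    return joint_unavailability(*availabilities) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical availabilities: 99.9% for the cloud, 99% for the DR site.
    cloud, dr_site = 0.999, 0.99
    print(f"cloud alone: {(1 - cloud) * HOURS_PER_YEAR:.2f} h/yr down")
    print(f"both at once: {expected_joint_downtime_hours(cloud, dr_site):.4f} h/yr")
```

Under the independence assumption, roughly nine hours a year of cloud downtime shrinks to about five minutes of joint downtime; any shared dependency between the two sites makes the real figure higher.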
"A redundant service might use some of the same resources as the primary service," he says. "Care should be taken to ensure that redundancies are, in fact, redundant and not simply a different name for overlapping hardware and software."
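One way to act on that advice is to map each service's dependencies and look for components the primary and the backup have in common. The sketch below walks a hypothetical dependency graph (all component names are invented for illustration) and reports the shared ones. An empty result means only that no overlap appears in the graph as written; the check is only as good as the inventory behind it.

```python
from typing import Dict, List, Set


def all_dependencies(component: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Everything `component` relies on, directly or indirectly (including itself)."""
    seen: Set[str] = set()
    stack = [component]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # already visited; also guards against cycles
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen


def shared_failure_points(primary: str, backup: str, graph: Dict[str, List[str]]) -> Set[str]:
    """Components whose failure could take down both primary and backup."""
    return all_dependencies(primary, graph) & all_dependencies(backup, graph)


if __name__ == "__main__":
    # Hypothetical graph: two databases on separate volumes that
    # nevertheless depend on the same storage control service.
    graph = {
        "primary-db": ["volume-1"],
        "backup-db": ["volume-2"],
        "volume-1": ["storage-control"],
        "volume-2": ["storage-control"],
    }
    print(shared_failure_points("primary-db", "backup-db", graph))  # {'storage-control'}
```

Here the two databases look independent at the volume level, yet both would fail together if the shared control service went down.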
Using a cloud service provider as a backup for a traditional data center is typically more cost-effective than the other way around.
That's because with a cloud service provider, you pay only for the computing cycles you use. While the backup sits idle, customers need to keep only minimal computing power running to enable a quick switch-over, then add server capacity as needed.
With a traditional data center, enough servers must be available to handle the peak workload, even if they are rarely used. That means hardware costs as well as power and staffing requirements: a traditional backup center typically doubles total computing costs, while a cloud backup adds only a fraction.
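The cost argument can be sketched as simple arithmetic. Every figure below is hypothetical (server counts, per-server prices, and hours spent in failover are placeholders, not real Amazon or colocation pricing); what matters is the shape of the comparison: an owned backup is sized for peak load all year, while a cloud backup keeps a small warm standby and scales up only during a failover.

```python
HOURS_PER_YEAR = 24 * 365


def owned_backup_cost(peak_servers: int, cost_per_server_year: float) -> float:
    """Traditional backup site: peak capacity is bought and run all year."""
    return peak_servers * cost_per_server_year


def cloud_backup_cost(
    standby_servers: int,
    peak_servers: int,
    price_per_server_hour: float,
    failover_hours: float,
) -> float:
    """Cloud backup: a small standby most of the year, full capacity only during failover."""
    idle_hours = HOURS_PER_YEAR - failover_hours
    return price_per_server_hour * (
        standby_servers * idle_hours + peak_servers * failover_hours
    )


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    owned = owned_backup_cost(peak_servers=100, cost_per_server_year=2_000)
    cloud = cloud_backup_cost(
        standby_servers=5, peak_servers=100,
        price_per_server_hour=0.50, failover_hours=48,
    )
    print(f"owned backup: ${owned:,.0f}/yr  cloud backup: ${cloud:,.0f}/yr")
```

With these placeholder inputs the cloud backup comes to about 12 percent of the owned one, even though each cloud server-hour is priced higher; real results depend heavily on actual pricing and on how long failovers last.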