Amazon protects cloud applications by running them across multiple availability zones, but what happens when more than one zone fails?
For cloud customers willing to pony up a little extra cash, Amazon has an enticing proposition: Spread your application across multiple availability zones for a near-guarantee that it won't suffer from downtime.
Customers who build applications in just one availability zone are more likely to suffer outages. But what happens when multiple availability zones go dark at the same time? We found out today when an outage forced websites such as Foursquare, Reddit, Quora and Hootsuite offline.
"We can confirm connectivity errors impacting EC2 instances and increased latencies impacting EBS [Elastic Block Store] volumes in multiple availability zones in the US-EAST-1 region," Amazon said Thursday on its service health dashboard.
The US-EAST-1 region, based in northern Virginia, is one of several Amazon regions around the world. There's another one in northern California. Amazon started reporting troubles at 4:41 a.m. Eastern time. By 1:26 p.m., Amazon said it was "now seeing significantly reduced failures and latencies," but that problems were still ongoing. Amazon blamed a "networking event" that "triggered a large amount of re-mirroring" of storage volumes, creating a capacity shortage.
Each region contains multiple availability zones -- but little is known about each one, according to Gartner analyst Drue Reeves. There are four availability zones within the Virginia region, Reeves says. But are they in different data centers? How far apart are they? How is data replicated across zones? Reeves says Amazon hasn't been transparent about these questions. Without the answers, customers can't easily tell which methods of building high availability into applications will be most effective.
"Amazon has said for years that they run multiple availability zones within a region to prevent the outage of an entire region," Reeves said. "But yet here we are, and we have an outage inside EC2 for an entire region."
An Amazon spokesperson hasn't yet responded to a request for comment.
Perhaps tellingly, Amazon's service-level commitment provides 99.95% availability for each region -- but not for each availability zone. This is good enough for many customers but well below the "five nines" standard of high availability.
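To put those percentages in concrete terms, here is a rough sketch of the arithmetic. It assumes availability is measured over a 365-day year, which is an assumption for illustration and not a detail taken from Amazon's agreement; the function name is likewise made up for this example.

```python
# Convert an availability percentage into the downtime it permits per year.
# Assumption: availability is measured over a 365-day year (leap years ignored).
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours of downtime per year that a given availability permits."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


# The regional commitment cited above, and the "five nines" benchmark.
sla_hours = allowed_downtime_hours(99.95)
five_nines_minutes = allowed_downtime_hours(99.999) * 60

print(f"99.95%  -> about {sla_hours:.2f} hours of downtime per year")
print(f"99.999% -> about {five_nines_minutes:.2f} minutes of downtime per year")
```

Under that assumption, 99.95% leaves room for roughly 4.4 hours of downtime a year, while five nines allows only about 5 minutes.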
In describing the availability zones on the EC2 website, Amazon says they are "distinct locations that are engineered to be insulated from failures in other Availability Zones and provide inexpensive, low latency network connectivity to other Availability Zones in the same Region."
This all raises the question: Can you build applications that span multiple regions, failing over from Virginia to California if necessary?
Reuven Cohen, founder and CTO of Enomaly, a cloud software provider, goes even further. Customers should build applications to run simultaneously across multiple cloud platforms from different vendors, he said.
The fact that major websites "known to be running across multiple availability zones are down" is a sign that the zones aren't foolproof, Cohen said.
"Things go down. It's the nature of the Internet itself," Cohen said. "There's this idea that because you're Amazon you can achieve 100% uptime, and that's the wrong way to look at it."
If Amazon can go down, anyone can. Even Google has had problems with Gmail.
"Vendors may provide redundancy ... but it doesn't address the problem that what if the overall access to that vendor goes down," Cohen said.
Customers should contract with "multiple providers with multiple locations" to survive problems caused by a single vendor, he said.
But is that realistic? Reeves says no, at least for most customers. Cloud computing is supposed to simplify deployment and management of applications. Building an application to work across multiple vendors requires a lot of extra work.
"The reason we can't architect applications across multiple cloud providers is the lack of standards and interoperability," Reeves said. "If you're an application builder and you want to increase your capacity for storage or compute, how you allocate, charge and use that capacity is different for every provider. It's not that it can't be done, it's just very, very difficult."
The simpler idea of sticking just with Amazon and balancing applications across multiple regions isn't so simple either. Amazon doesn't provide the necessary tools to load-balance between regions, so customers have to use additional software on top of their Amazon instances, Reeves says. Amazon's load-balancing service works across availability zones -- the same ones that failed Thursday -- but not across regions.
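As a very rough illustration of the kind of logic that additional software has to supply, the sketch below picks the first region whose health check passes. Everything in it is hypothetical: the hostnames, the function names and the simulated health check are stand-ins, and none of it is an Amazon API. A real deployment would also need an actual network probe, a way to redirect traffic (DNS updates, for example) and data replicated between regions.

```python
from typing import Callable, Optional, Sequence


def pick_region(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint that passes its health check, or None if all fail."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that raises is treated as a failure; try the next region.
            continue
    return None


# Hypothetical endpoints, listed in order of preference (Virginia, then California).
REGIONS = ["app.us-east-1.example.com", "app.us-west-1.example.com"]


def simulated_check(endpoint: str) -> bool:
    """Stand-in for a real probe: pretend the primary (Virginia) region is down."""
    return "us-east-1" not in endpoint


active = pick_region(REGIONS, simulated_check)
print(active)  # the California endpoint, since the simulated primary is down
```

The selection itself is the easy part. Keeping the standby region's data current is where most of the extra work lies.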
Anytime there is a cloud outage, some will call into question all cloud computing. That shouldn't be the case, Reeves said, noting "everybody has downtime." The difference with cloud computing is that we're aggregating risk -- many companies run their sites on one platform, and when that platform goes down it's a lot more noticeable than when a single business's internal data center fails.
While a single cloud failure shouldn't be seen as an indictment of all cloud computing, Reeves says it does add a new wrinkle to the economic analysis that must be done before enterprises move services to the cloud. If companies run major businesses on top of Amazon, and suffer millions of dollars in lost revenue when there are outages, was the money saved by not building IT services internally worth the risk? Can customers buy insurance to recover lost dollars?
Service-level agreements may provide payments or credits, but if an outage "costs people tens of millions of dollars [in lost revenue] Amazon's not going to pay that back," Reeves said.