
Amazon outage one year later: Are we safer?

Experts warn that many businesses still have no measures in place to insulate themselves from major service provider outages

By , Network World
April 27, 2012 06:03 AM ET

Network World - Amazon Web Services last April suffered what many consider to be the worst cloud service outage to date - an event that knocked big-name customers such as Reddit, Foursquare, HootSuite, Quora and others offline, some for as long as four days.

So, a year after AWS's major outage, has the leading Infrastructure-as-a-Service and cloud provider made the changes necessary to prevent another meltdown? And if there is a huge repeat, are enterprises prepared to cope? The answers are not cut and dried, experts say.

RELATED: Five tips for surviving a cloud outage

REMEMBER WHEN? Amazon EC2 outage calls 'availability zones' into question 

In part, it's difficult to answer these questions because AWS is notoriously close-lipped about the inner workings of its massive cloud operations, which not only had the outage last April but also suffered a shorter-lived disruption in August. What's more, it's hard to get a read on individual cloud customers' private plans, although industry watchers such as IDC analyst Stephen Hendrick say many enterprises still have a long way to go before they are fully insulated from provider failures.

"Some folks had their bases covered, for others, it hit them pretty hard," says IDC analyst Stephen Hendrick, recalling last year's AWS outage. "There are certainly lessons to be learned, the question is whether customers want to do what it really takes to protect themselves."

First, a recap of what happened last year: A few weeks after the outage, AWS released a post-mortem report detailing what caused the disruption and the steps the company took immediately afterward. Basically, human error started the chain reaction. In the wee hours of the morning of April 21, 2011, during a network upgrade in the company's US East region, part of the traffic for the Elastic Block Storage (EBS) service - the storage service that attaches to instances in the company's Elastic Compute Cloud (EC2) offering - was shifted onto a lower-capacity network that wasn't built to handle EBS traffic. Cut off from their replicas, the EBS nodes tried to re-mirror their data on their own, creating a traffic jam that soon spilled over into another AWS offering, the Relational Database Service (RDS), which relies on EBS for its storage. In all, about 13% of the EBS volumes in the affected availability zone were impacted by the outage, and after the four-day event roughly 0.07% of those volumes could not be fully recovered.
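The chain reaction AWS described is, at bottom, a retry storm: once the nodes lost their replicas, they all tried to recover at once, and the recovery traffic itself overwhelmed the network. A common client-side defense against that kind of amplification is to retry with exponential backoff and random jitter. The short Python sketch below illustrates the idea in general terms only; fetch_volume_status and its behavior are hypothetical placeholders, not part of any AWS API.

    import random
    import time

    class TransientError(Exception):
        """Raised when a request fails in a way that is worth retrying."""

    def fetch_volume_status(volume_id):
        # Hypothetical stand-in for a network call that fails transiently.
        if random.random() < 0.7:
            raise TransientError("backend unavailable")
        return {"volume_id": volume_id, "status": "available"}

    def call_with_backoff(volume_id, max_attempts=6, base_delay=0.5, max_delay=30.0):
        """Retry a flaky call with exponential backoff and full jitter.

        Spreading retries out randomly keeps a fleet of clients from
        hammering a degraded service in lockstep, which is the kind of
        amplification that turns a partial failure into a storm.
        """
        for attempt in range(max_attempts):
            try:
                return fetch_volume_status(volume_id)
            except TransientError:
                if attempt == max_attempts - 1:
                    raise
                # Wait between 0 and min(cap, base * 2^attempt) seconds.
                delay = random.uniform(0, min(max_delay, base_delay * (2 ** attempt)))
                time.sleep(delay)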

Experts say AWS has made improvements to its systems since then, but it's unclear just how substantial they are. For example, in the post-mortem report the company says it audited its change process and increased its use of automation tools when making updates, to reduce the chance of human error. Drue Reeves, a Gartner analyst who tracks the cloud industry and AWS, says the company has also boosted the capacity of its primary and secondary EBS networks so they can handle heavier traffic. "It's made EBS more resilient," he says. "They've taken some steps to rectify the situation to make sure this instance doesn't happen again, but that doesn't mean we won't have other outages."
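That warning echoes the advice analysts have repeated since the outage: customers who want to survive a regional failure have to arrange the redundancy themselves, typically by running the same application in more than one region and redirecting traffic when a health check fails. The sketch below, plain Python standard library only, shows the general shape of such a check; the endpoint URLs are hypothetical, and a real deployment would usually wire this logic into DNS or a load balancer rather than choosing a URL in application code.

    import urllib.request

    # Hypothetical health-check endpoints for the same app in two regions.
    ENDPOINTS = [
        "https://app.us-east.example.com/healthz",
        "https://app.us-west.example.com/healthz",
    ]

    def is_healthy(url, timeout=3):
        """Return True if the endpoint answers its health check with HTTP 200."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:  # covers URLError, HTTPError and socket timeouts
            return False

    def pick_endpoint():
        """Return the first healthy endpoint, preferring the primary region."""
        for url in ENDPOINTS:
            if is_healthy(url):
                return url
        raise RuntimeError("no healthy region is currently available")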
