
Amazon: Bad execution during planned upgrade caused outage

Amazon apologizes, explains cloud outage, will expand high availability services

By , Network World
April 29, 2011 12:37 PM ET

Network World - It took about a week, but Amazon has fully recovered from the most serious outage in the five-year history of the Elastic Compute Cloud, offered an explanation of what went wrong and revealed a new roadmap for preventing future problems.

The 5,700-word explanation starts with a discussion of storage volumes in an East Coast data center "that became unable to service read and write operations."


This caused virtual machines trying to use storage volumes to go offline. Amazon had to disable various APIs while it got a handle on the problem, and high error rates and latencies resulted. A small percentage of customers also suffered a permanent loss of data.

But it all started just after midnight on April 21 when a planned upgrade went wrong. One of the operations was "executed incorrectly."

The goal "was to upgrade the capacity of the primary network," Amazon says. "During the change one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS [Elastic Block Store] network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network."

Ultimately, this meant a portion of the storage cluster "did not have a functioning primary or secondary network because traffic was purposely shifted away from the primary network and the secondary network couldn't handle the traffic level it was receiving."
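The failure Amazon describes comes down to a capacity mismatch: traffic was moved onto a network that could not carry it. The sketch below is a toy model of that reasoning, not Amazon's tooling. The `Network` class, the function names and every number in it are hypothetical, chosen only to show an unchecked shift next to one that verifies the target's headroom first.

```python
from dataclasses import dataclass


@dataclass
class Network:
    """Toy model of a network path; capacity and load use arbitrary units."""
    name: str
    capacity: float
    load: float = 0.0

    def overloaded(self) -> bool:
        return self.load > self.capacity


def shift_traffic(source: Network, target: Network, amount: float) -> None:
    """Move up to `amount` of traffic from source to target with no capacity check."""
    moved = min(amount, source.load)
    source.load -= moved
    target.load += moved


def safe_shift_traffic(source: Network, target: Network, amount: float) -> bool:
    """Perform the shift only if the target can absorb it; return whether it happened."""
    moved = min(amount, source.load)
    if target.load + moved > target.capacity:
        return False  # refuse: the target would be pushed past its capacity
    shift_traffic(source, target, moved)
    return True


# Hypothetical numbers: a busy primary router and a much smaller secondary network.
primary_router = Network("primary-router-a", capacity=100.0, load=80.0)
secondary = Network("secondary-network", capacity=10.0, load=2.0)

# The guarded shift refuses to move the traffic, so nothing changes.
print(safe_shift_traffic(primary_router, secondary, 80.0))  # False

# The unguarded shift goes through and saturates the secondary network.
shift_traffic(primary_router, secondary, 80.0)
print(secondary.overloaded())  # True
```

In this model the unguarded shift leaves the smaller network carrying far more than its capacity, which mirrors Amazon's account of a secondary network that "couldn't handle the traffic level it was receiving."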

While some have wondered why Amazon hadn't apologized for the outage, Amazon has now done so, saying: "Last, but certainly not least, we want to apologize. We know how critical our services are to our customers' businesses and we will do everything we can to learn from this event and use it to drive improvement across our services." Amazon also pledged to improve communication during outages.

Perhaps more important, Amazon says it will automatically provide service credits to customers running either Elastic Block Store volumes or Relational Database Service instances in the affected Availability Zone "whether their resources were impacted or not." It will be a 10-day credit "equal to 100% of their usage of EBS volumes, EC2 instances and RDS database instances," Amazon says.

The outage, which took popular websites such as Foursquare and Reddit offline, showed the limitations of the high availability services available to Amazon EC2 customers. Amazon splits its data centers into isolated regions and availability zones. Customers are able to spread applications and data across multiple availability zones to prevent downtime, but the zones are not far apart geographically and multiple zones went down last week.

The regions -- on the East Coast and West Coast -- provide much more isolation, but it is difficult at best to use them simultaneously in a way that would keep applications running without downtime.

"If you want to move data between Regions, you need to do it via your applications as we don't replicate any data between Regions on our users' behalf," Amazon says. "You also need to use a separate set of APIs to manage each Region."
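What application-level replication between Regions might look like can be sketched in a few lines. This is an illustration under stated assumptions, not Amazon's API: `RegionClient` is a hypothetical in-memory stand-in for a per-Region endpoint, `replicate` is an invented helper, and the Region labels are placeholders. The point it shows is that the application holds one client per Region and copies the data itself.

```python
from typing import Dict, List


class RegionClient:
    """Hypothetical stand-in for one Region's API endpoint, backed by memory."""

    def __init__(self, region: str) -> None:
        self.region = region
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def keys(self) -> List[str]:
        return sorted(self._objects)


def replicate(source: RegionClient, target: RegionClient) -> int:
    """Copy every object that is missing or different in the target Region.

    Returns the number of objects copied. Nothing is replicated automatically;
    the application has to call this itself.
    """
    existing = set(target.keys())
    copied = 0
    for key in source.keys():
        data = source.get(key)
        if key not in existing or target.get(key) != data:
            target.put(key, data)
            copied += 1
    return copied


# One client per Region, each managed separately (labels are illustrative).
east = RegionClient("east-coast")
west = RegionClient("west-coast")

east.put("orders/1001", b"pending")
east.put("orders/1002", b"shipped")

print(replicate(east, west))  # 2
print(replicate(east, west))  # 0
```

A real deployment would also have to handle deletions, conflicting writes and the latency between Regions, which is part of why running across Regions without downtime is hard.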
