Incorrect execution during an upgrade led to Amazon's big cloud outage.
It took about a week, but Amazon has fully recovered from the most serious outage in the five-year history of the Elastic Compute Cloud, offered an explanation of what went wrong and revealed a new roadmap for preventing future problems.
The outage caused virtual machines trying to use storage volumes to go offline. Amazon had to disable various APIs while it got a handle on the problem, and high error rates and latencies resulted. A small percentage of customers also suffered a permanent loss of data.
But it all started just after midnight on April 21 when a planned upgrade went wrong. One of the operations was "executed incorrectly."
The goal "was to upgrade the capacity of the primary network," Amazon says. "During the change one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS [Elastic Block Store] network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network."
Ultimately, this meant a portion of the storage cluster "did not have a functioning primary or secondary network because traffic was purposely shifted away from the primary network and the secondary network couldn't handle the traffic level it was receiving."
While some have wondered why Amazon hadn't apologized for the outage, Amazon has now done so, saying: "Last, but certainly not least, we want to apologize. We know how critical our services are to our customers' businesses and we will do everything we can to learn from this event and use it to drive improvement across our services." Amazon also pledged to improve communication during outages.
Perhaps more important, Amazon says it will automatically provide service credits to customers running either Elastic Block Store volumes or Relational Database Service instances in the affected Availability Zone "whether their resources were impacted or not." It will be a 10-day credit "equal to 100% of their usage of EBS volumes, EC2 instances and RDS database instances," Amazon says.
The outage, which took popular websites such as Foursquare and Reddit offline, showed the limitations of the high-availability services offered to Amazon EC2 customers. Amazon splits its data centers into isolated regions and availability zones. Customers are able to spread applications and data across multiple availability zones to prevent downtime, but the zones are not far apart geographically and multiple zones went down last week.
The regions -- on the East Coast and West Coast -- provide much more isolation but it is difficult at best to use them simultaneously in a way that would keep applications running without downtime.
"If you want to move data between Regions, you need to do it via your applications as we don't replicate any data between Regions on our users' behalf," Amazon says. "You also need to use a separate set of APIs to manage each Region."
Despite the benefits spreading applications across regions might provide, Amazon's proposed fixes focus on the availability zones.
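The application-level copying between regions that Amazon describes can be pictured with a short sketch. This is only an illustration under stated assumptions: the RegionClient class is a hypothetical in-memory stand-in for a per-region storage API, not a real AWS interface, and the region labels are placeholders.

```python
class RegionClient:
    """Hypothetical in-memory stand-in for one region's storage API."""

    def __init__(self, region):
        self.region = region
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]

    def keys(self):
        return list(self._objects)


def replicate(source, destination):
    """Copy every object from one region to another; return the count copied.

    Nothing replicates automatically, so the application has to read from
    one region's client and write to the other's.
    """
    copied = 0
    for key in source.keys():
        destination.put(key, source.get(key))
        copied += 1
    return copied


# Placeholder region labels; each region gets its own client object,
# mirroring the "separate set of APIs" per region.
east = RegionClient("east-coast")
west = RegionClient("west-coast")

east.put("orders.db", b"order data")
east.put("users.db", b"user data")

copied = replicate(east, west)
```

A one-off copy like this does nothing to keep the two regions in sync afterward, which is part of why running in both at once is so hard.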
Many users who took advantage of multiple availability zones survived the outage without "significant availability impact," but that wasn't the case for all. Amazon says the outage has "taught us that we must make further investments" to ensure that failures in single availability zones won't impact storage access across multiple zones.
In addition to making several back-end technical improvements, Amazon says it intends to make it easier to take advantage of availability zone redundancy. For example, the Virtual Private Cloud service will be upgraded to allow customers access to multiple zones "as soon as possible." Applications using VPC are more secure than those that do not, yet currently cannot be built across multiple availability zones.
Amazon will also host a series of free webinars on designing fault-tolerant applications in the cloud, and says it will "look to provide customers with better tools" for building applications that span multiple zones.
Amazon's post-mortem indicated that the outage could have been prevented, saying "the trigger for this event was a network configuration change. We will audit our change process and increase the automation to prevent this mistake from happening in the future." Amazon will also make further changes to prevent storage cluster problems. In last week's outage, simply adding capacity to clusters in advance could have allowed the systems to recover from the major problems more quickly.
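What an automated safeguard on a traffic shift might look like can be sketched in a few lines. This is not Amazon's tooling: the Network type, the validate_traffic_shift function, the network names and the capacity figures are all hypothetical, chosen only to show a check that refuses to move traffic onto a path that cannot carry it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Network:
    """A hypothetical network path with a fixed capacity (arbitrary units)."""
    name: str
    capacity: int
    role: str  # "primary" or "secondary"


class UnsafeShiftError(Exception):
    """Raised when a planned traffic shift would land on an unsuitable target."""


def validate_traffic_shift(traffic, target):
    """Reject a shift unless the target is a primary path with enough capacity."""
    if target.role != "primary":
        raise UnsafeShiftError(f"{target.name} is not on the primary network")
    if traffic > target.capacity:
        raise UnsafeShiftError(
            f"{target.name} cannot carry {traffic} units "
            f"(capacity {target.capacity})"
        )


# Illustrative numbers only; none of these come from Amazon's report.
other_primary_router = Network("primary-router-b", capacity=100, role="primary")
redundant_network = Network("redundant-ebs-network", capacity=10, role="secondary")

# Shifting to the other primary router is accepted.
validate_traffic_shift(80, other_primary_router)

# Shifting to the lower-capacity redundant network is refused.
try:
    validate_traffic_shift(80, redundant_network)
    shift_allowed = True
except UnsafeShiftError:
    shift_allowed = False
```

A check of this kind catches only the specific mistake it was written for, which fits Amazon's own point that several causes interacted.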
But there are many factors at play that Amazon still seems to be examining.
"As with any complicated operational issue, this one was caused by several root causes interacting with one another and therefore gives us many opportunities to protect the service against any similar event reoccurring," Amazon says.