Amazon Web Services has almost fully recovered from an outage lasting more than 12 hours that appears to have started by affecting only a small number of customers but quickly snowballed into a larger problem that took down major sites, including Reddit and Imgur, yesterday.
AWS has not yet said what caused the failure, but the company posted frequent updates throughout the day. It noted a number of times that customers who have architected their systems according to AWS's best practices of spreading workloads across multiple availability zones were less likely to have experienced issues.
AWS first reported an issue shortly before 11 a.m. PT on Monday, when it said a "small number" of Elastic Block Store (EBS) volumes in a single availability zone in the US-East-1 region were experiencing degraded performance. EBS is a block storage service used in conjunction with Elastic Compute Cloud (EC2).
About an hour later, AWS dropped the "small number" qualifier from its description of the problem. By 2:20 p.m. PT, AWS said it had restored about half of the impacted volumes and noted that customers who used multiple availability zones, a practice AWS has preached in the past, should not have been affected.
While AWS continued to restore impacted EBS volumes throughout the afternoon, a second issue appears to have arisen around 6:30 p.m., when AWS reported elevated error rates for associating IP addresses from Elastic Load Balancers (ELBs). That issue was resolved about an hour later. ELBs distribute incoming traffic across instances within a single availability zone (AZ) or across multiple AZs.
As of early today, the latest status updates say AWS has emailed certain customers who are still affected by the event and may have to take action. Other customers may experience increased volume input/output (I/O) latency as the EBS volumes continue a re-mirroring process throughout the day.
EBS wasn't the only service impacted during yesterday's outage, though. The Relational Database Service (Amazon RDS) also went down for a "small number" of customers shortly after 11 a.m. PT on Monday, and had mostly recovered about two hours later. As of 4 a.m. PT on Tuesday, AWS reported that it was still working to restore full functionality to RDS.
As with the EBS issue, AWS reminded customers that if they had enabled the Point-in-Time Restore option, they could launch a new database instance in another availability zone from a backup of the impacted database.
AWS Elastic Beanstalk, an application development and deployment platform, also experienced delays in launching, updating and deleting environments. That problem was resolved around the same time as the EBS issue.
Yesterday's outage has renewed discussion of spreading AWS workloads across multiple availability zones as a way to increase the fault tolerance of a cloud deployment. AWS offers multiple availability zones within the regions where it operates data centers, such as the US-East region in Northern Virginia. The company says the availability zones are isolated from one another to improve tolerance of such failures.
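As a rough illustration of that multi-zone idea, the sketch below spreads a handful of hypothetical instances round-robin across zones and counts how many remain if one zone fails. The zone names, instance IDs and the helper functions `place_round_robin` and `surviving` are assumptions made for illustration, not AWS APIs, and the sketch assumes zones fail independently of one another.

```python
from collections import Counter
from itertools import cycle


def place_round_robin(instance_ids, zones):
    """Assign each instance to a zone in turn so load is spread evenly."""
    zone_cycle = cycle(zones)
    return {instance_id: next(zone_cycle) for instance_id in instance_ids}


def surviving(placement, failed_zone):
    """Return the instances that are not located in the failed zone."""
    return [i for i, zone in placement.items() if zone != failed_zone]


# Hypothetical zone names and instance IDs, for illustration only.
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
instances = [f"web-{n}" for n in range(6)]

multi_az = place_round_robin(instances, zones)       # spread over three zones
single_az = place_round_robin(instances, zones[:1])  # everything in one zone

print(Counter(multi_az.values()))
print(len(surviving(multi_az, "us-east-1a")), "of 6 survive with three zones")
print(len(surviving(single_az, "us-east-1a")), "of 6 survive with one zone")
```

In this toy model, losing one of three zones leaves two-thirds of the instances running, while a single-zone deployment loses everything. It says nothing about failures that span zones.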
But Network World reader Biju Chacko commented that he experienced a multiple-AZ failure. "This is clearly an AWS screwup - their recommended redundancy strategies are not working," he wrote in a comment.
This is the third significant outage AWS has experienced in the past two years. In late June, powerful storms that caused power outages in the mid-Atlantic region were partly to blame for an outage that was then worsened by bugs and bottlenecks within AWS's system. The company issued a detailed postmortem report after that event.
In April 2011, AWS experienced another major outage that took down Reddit, Foursquare, HootSuite, Quora and others, some for as many as four days.