Updated: Amazon says it's getting a handle on EC2 outage

Updated (7): Disruption in cloud service rains on Reddit, Foursquare, others

Amazon reports this morning that it is making progress in restoring full service to customers of its Elastic Compute Cloud (EC2) and Relational Database Service (RDS) in the eastern portion of the country after a rocky stretch of trouble that, according to the company's own dashboard, began shortly before 1 a.m. PDT.

(Amazon EC2 debacle drags into Day 2)

(And Day 3)

(Note, 2:30 p.m.: Updates at the bottom of this post show at least some aspects of the problem getting worse and Amazon's explanations getting longer. Fix time estimate: "Our high-level ballpark right now is that the ETA is a few hours.")

Among those customers feeling Amazon's pain have been Foursquare, Reddit (I'm unable to log in at the moment), Quora and Hootsuite.  

Here's the latest from Amazon: "5:02 AM PDT Latency has recovered for a portion of the impacted (Elastic Block Store) volumes. We are continuing to work to resolve the remaining issues with EBS volume latency and error rates in a single Availability"

The company has done a good job of keeping those feeling the impact informed. From its Service Health Dashboard:

1:41 AM PDT We are currently investigating latency and error rates with EBS volumes and connectivity issues reaching EC2 instances in the US-EAST-1 region.

2:18 AM PDT We can confirm connectivity errors impacting EC2 instances and increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region. Increased error rates are affecting EBS CreateVolume API calls. We continue to work towards resolution.

2:49 AM PDT We are continuing to see connectivity errors impacting EC2 instances, increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region, and increased error rates affecting EBS CreateVolume API calls. We are also experiencing delayed launches for EBS backed EC2 instances in affected availability zones in the US-EAST-1 region. We continue to work towards resolution.

3:20 AM PDT Delayed EC2 instance launches and EBS API error rates are recovering. We're continuing to work towards full resolution.

4:09 AM PDT EBS volume latency and API errors have recovered in one of the two impacted Availability Zones in US-EAST-1. We are continuing to work to resolve the issues in the second impacted Availability Zone. The errors, which started at 12:55 AM PDT, began recovering at 2:55 AM PDT.

Quora has this message on its Web site: "We're currently having an unexpected outage, and are working to get the site back up as soon as possible. Thanks for your patience." According to a number of reports, that perfunctory note replaced this more candid one posted earlier: "We'd point fingers, but we wouldn't be where we are today without EC2."

As would be expected, Twitter is all abuzz about Amazon's trouble, with one wag stating the obvious: "You can just see the wave of cloud computing headlines that will follow today's Amazon EC2 problems."

(Latest from Amazon just after 9 a.m. here on the East Coast: "6:09 AM PDT EBS API errors and volume latencies in the affected availability zone remain. We are continuing to work towards resolution.")

(Update 2: Regarding the database service issue: "6:29 AM PDT We continue to work on restoring access to the affected Multi AZ instances and resolving the IO latency issues impacting RDS instances in the single availability zone.")

(Update 3: Latest reports on EC2 would seem to indicate that at least some aspects of the situation are worsening:

"6:59 AM PDT There has been a moderate increase in error rates for CreateVolume. This may impact the launch of new EBS-backed EC2 instances in multiple availability zones in the US-EAST-1 region. Launches of instance store AMIs are currently unaffected. We are continuing to work on resolving this issue.

"7:40 AM PDT In addition to the EBS volume latencies, EBS-backed instances in the US-EAST-1 region are failing at a high rate. This is due to a high error rate for creating new volumes in this region.")

(Update 4, 11:15 a.m.: Just noticed that the icon for EC2 on the Service Health Dashboard has changed from "performance issues" to "service disruption," which would seem to indicate that matters are getting worse.)

(Update 5, just before noon here: "8:54 AM PDT We'd like to provide additional color on what we're working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it's difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We're starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.")

(Update 6: "10:26 AM PDT We have made significant progress in stabilizing the affected EBS control plane service. EC2 API calls that do not involve EBS resources in the affected Availability Zone are now seeing significantly reduced failures and latency and are continuing to recover. We have also brought additional capacity online in the affected Availability Zone and stuck EBS volumes (those that were being remirrored) are beginning to recover. We cannot yet estimate when these volumes will be completely recovered, but we will provide an estimate as soon as we have sufficient data to estimate the recovery. We have all available resources working to restore full service functionality as soon as possible. We will continue to provide updates when we have them.

"11:09 AM PDT A number of people have asked us for an ETA on when we'll be fully recovered. We deeply understand why this is important and promise to share this information as soon as we have an estimate that we believe is close to accurate. Our high-level ballpark right now is that the ETA is a few hours. We can assure you that all-hands are on deck to recover as quickly as possible. We will update the community as we have more information.")

(Update 7, 4 p.m.: "12:30 PM PDT We have observed successful new launches of EBS backed instances for the past 15 minutes in all but one of the availability zones in the US-EAST-1 Region. The team is continuing to work to recover the unavailable EBS volumes as quickly as possible.")

(Update 8, 5 p.m.: Amazon reporting progress: "1:48 PM PDT A single Availability Zone in the US-EAST-1 Region continues to experience problems launching EBS backed instances or creating volumes. All other Availability Zones are operating normally. Customers with snapshots of their affected volumes can re-launch their volumes and instances in another zone. We recommend customers do not target a specific Availability Zone when launching instances. We have updated our service to avoid placing any instances in the impaired zone for untargeted requests."

Please check Amazon's Service Health Dashboard directly for additional updates.) 
