Amazon's painful slog to fully fix EC2 now in Day 3

Company says work continues on 'unblocking the bottleneck'

The number of Amazon customer sites still offline or suffering degraded performance may be unknown, but there is no doubting the frustration being felt as the company's efforts to fully restore its Elastic Compute Cloud (EC2) and Relational Database Service stretch into a third day.

And only time will reveal what, if any, impact this debacle will have on cloud computing in general.

From Amazon's Service Health Dashboard just before 5 a.m. today:

"We are continuing to work on unblocking the bottleneck that is limiting the speed with which we can re-establish connections between volumes and their instances. We will continue to keep everyone updated as we have additional information."

The company had posted this message just after midnight:

"We wanted to give a more detailed update on the state of our recovery. At this point, we have recovered a large number of the stuck volumes and are in the process of recovering the remainder. We have added significant storage capacity to the cluster, and storage capacity is no longer a bottleneck to recovery. Some portion of these volumes have lost the connection to their instance, and are waiting to be connected before normal operations can resume. In order to re-establish this connection, we need to allow the instances in the affected Availability Zone to access the EC2 control plane service. There are a large number of control plane requests being generated by the system as we re-introduce instances and volumes. The load on our control plane is higher than we anticipated. We are re-introducing these instances slowly in order to moderate the load on the control plane and prevent it from becoming overloaded and affecting other functions. We are currently investigating several avenues to unblock this bottleneck and significantly increase the rate at which we can restore control plane access to volumes and instances-- and move toward a full recovery."

In addition, Amazon promised that it will offer a detailed explanation of exactly what went wrong once the dust settles.

"The team has been completely focused on restoring access to all customers, and as such has not yet been able to focus on performing a complete post mortem. Once our customers have been taken care of and are fully back up and running, we will post a detailed account of what happened, along with the corrective actions we are undertaking to ensure this doesn't happen again. Once we have additional information on the progress that is being made, we will post additional updates."

As for reactions from those who might have been considering a move to the cloud, a Computerworld story on our site quotes a couple:

"We don't use Amazon or any other public cloud services and we won't, perhaps ever, or at least until there is much more transparency about where the data lives, who controls where it lives and when/where it moves, and lots of other things," said Jay Leader, the senior vice president and CIO of iRobot, whose products include the Roomba vacuum cleaner. Amazon's outage "just highlights why these are issues - just try to ask them what happened and what the impact was on your data, and even if they tell you, how do you know it's true and/or accurate?"

Paul Haugan, CTO of Lynnwood, Wash., said his city has been looking at Amazon's cloud offerings, but "the recent outage confirmed, for us, that cloud services are not yet ready for prime time."

Changing such perceptions is sure to take a long time.

Copyright © 2011 IDG Communications, Inc.
