Amazon's painful slog to fully fix EC2 now in Day 3

Company says work continues on 'unblocking the bottleneck'

While the number of Amazon customer sites still offline or suffering performance degradation may be unknown, there can be no doubting the level of frustration being felt as the company's efforts to fully restore its Elastic Compute Cloud (EC2) and Relational Database Service stretch into a third day.

(April 25: Where is Amazon's apology?)

And only time will reveal what, if any, impact this debacle will have on cloud computing in general.

From Amazon's Service Health Dashboard just before 5 a.m. today:

"We are continuing to work on unblocking the bottleneck that is limiting the speed with which we can re-establish connections between volumes and their instances. We will continue to keep everyone updated as we have additional information."

The company had posted this message just after midnight:

"We wanted to give a more detailed update on the state of our recovery. At this point, we have recovered a large number of the stuck volumes and are in the process of recovering the remainder. We have added significant storage capacity to the cluster, and storage capacity is no longer a bottleneck to recovery. Some portion of these volumes have lost the connection to their instance, and are waiting to be connected before normal operations can resume. In order to re-establish this connection, we need to allow the instances in the affected Availability Zone to access the EC2 control plane service. There are a large number of control plane requests being generated by the system as we re-introduce instances and volumes. The load on our control plane is higher than we anticipated. We are re-introducing these instances slowly in order to moderate the load on the control plane and prevent it from becoming overloaded and affecting other functions. We are currently investigating several avenues to unblock this bottleneck and significantly increase the rate at which we can restore control plane access to volumes and instances-- and move toward a full recovery."

In addition, Amazon promised that it will offer a detailed explanation of exactly what went wrong once the dust settles.

"The team has been completely focused on restoring access to all customers, and as such has not yet been able to focus on performing a complete post mortem. Once our customers have been taken care of and are fully back up and running, we will post a detailed account of what happened, along with the corrective actions we are undertaking to ensure this doesn't happen again. Once we have additional information on the progress that is being made, we will post additional updates."

As for reactions from those who might have been considering a move to the cloud, a Computerworld story on our site quotes a couple:

"We don't use Amazon or any other public cloud services and we won't, perhaps ever, or at least until there is much more transparency about where the data lives, who controls where it lives and when/where it moves, and lots of other things," said Jay Leader, the senior vice president and CIO of iRobot, whose products include the Roomba vacuum cleaner. Amazon's outage "just highlights why these are issues - just try to ask them what happened and what the impact was on your data, and even if they tell you, how do you know it's true and/or accurate?"

Paul Haugan, CTO of Lynnwood, Wash., said his city has been looking at Amazon's cloud offerings, but "the recent outage confirmed, for us, that cloud services are not yet ready for prime time."

Changing such perceptions is sure to take a long time.

Copyright © 2011 IDG Communications, Inc.