3 Big takeaways from Amazon’s latest cloud outage

Amazon's latest cloud outage stirs up a debate in the cloud


In the wee hours of Sunday morning something went very wrong in an Amazon Web Services data center.

At 6 AM ET, error rates for DynamoDB, the company’s massive NoSQL database, began skyrocketing in AWS’s US-East Virginia region, the oldest and largest of its nine global regions. By 7:52 AM ET, AWS had determined the cause of the problems: the way the database manages metadata had gone awry, impacting the service’s partitions and tables.


Amazon Web Services' Service Health Dashboard shows the timeline of events from Sunday's outage, including the root cause.

Because of the intricate interconnectivity of AWS’s services, the issue snowballed to impact 34 of the 117 services that the company’s Service Health Dashboard monitors. Everything from Elastic Compute Cloud (EC2) virtual machines to the Glacier storage service to the Relational Database Service was impacted. According to media reports, companies that rely on AWS experienced outages too, ranging from Netflix and IMDb to Tinder, Pocket and Buffer.

By noon on Sunday AWS reported the issue was resolved, but not without numerous complaints and musings on Twitter and elsewhere.

What can we take away from this event? Below are some thoughts.

Even the big boys fail

Amazon Web Services is the kingpin of the public IaaS cloud market – although Microsoft seems to be giving the company a run for its money. Sunday’s events remind us that even big, established cloud vendors are still vulnerable to outages.

Prepare for outages

Given that even the most mature cloud offering on the market can still have a six-hour-plus service disruption, customers should prepare for this stuff. AWS has long advised customers to architect their systems to handle virtual machines and other services going down.



DownDetector.com showed higher-than-normal error reports for Netflix on Sunday morning. A company spokesperson denied that the service was significantly impacted.  

Netflix, perhaps one of Amazon’s biggest brand-name cloud customers, said via a spokesperson that the impact of the outage on the company’s services was minimal because it migrated workloads automatically from the troublesome US-East region to another healthy region upon learning of the disruption. Anyone who uses AWS for mission-critical apps should architect their system with the expectation that the services that run it could fail at any time. Netflix has developed open source tools to help test how its system holds up against random crashes. Although Netflix did not acknowledge a major issue for customers, third-party outage tracking sites showed a higher-than-normal number of service disruption reports from Netflix users Sunday morning. Even the well-prepared can be impacted by these issues.
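For readers who want a concrete picture of what "migrating to a healthy region" can look like, here is a minimal sketch in Python. It is not Netflix's or AWS's actual mechanism; the function name `pick_healthy_region`, the health-check callback and the simulated status table are all hypothetical and stand in for real monitoring. The only idea it illustrates is trying regions in order of preference and skipping any that fail a health check.

```python
from typing import Callable, Optional, Sequence


def pick_healthy_region(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region (in preference order) that passes its health check.

    Returns None when no region is healthy. Both this function and the
    health-check callback are illustrative, not part of any AWS SDK.
    """
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A health check that itself errors out is treated as "unhealthy".
            continue
    return None


# Hypothetical status data standing in for real monitoring during the outage:
# the primary region is down, the two fallbacks are up.
simulated_status = {"us-east-1": False, "us-west-2": True, "eu-west-1": True}

active_region = pick_healthy_region(
    ["us-east-1", "us-west-2", "eu-west-1"],
    lambda region: simulated_status[region],
)
print(active_region)  # us-west-2
```

In a real deployment the hard part is not this loop but everything around it: keeping data replicated across regions, redirecting traffic, and making sure the health checks themselves do not depend on the region that is failing.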

“I told you so”

A blogger at Forbes argues that this outage changes nothing. I basically agree with this. If you’re an AWS fanboy then you will say that these outages are less frequent than they used to be and that if you heed AWS’s best practices then these situations will not impact you.

On the other side of the coin, outages like the one that happened Sunday will only be further fodder for folks who are wary of sending workloads to the public cloud.

The fact is outages happen. They happen in the public cloud, across any and all providers, and they happen in internal data centers that companies run too. They’re just a fact of life for IT.
