Recent Amazon outage highlights need for cloud automation

How did a handful of businesses go undisturbed by the service interruption? With robots, of course.

As most internet users are aware, last week Amazon faced one of its largest service outages since the launch of Amazon Web Services (AWS). The list of disrupted businesses read like a dire who's who of the internet, from Netflix to Pinterest to Airbnb. The cause of the AWS S3 outage appears to be a fat-finger typo by an authorized Amazon system administrator who was troubleshooting an unrelated problem.

It happens, and it happens often.

According to 2016 research from the Ponemon Institute, at least 22 percent of data center outages each year are caused by human error. Outages have far-ranging impacts, from business disruption and lost revenue to reduced end-user productivity. The same research reports that the average cost of an outage rose 38 percent, from $505,502 in 2010 to $740,357 in 2016.

The fact that Amazon has not experienced many more outages like this so far is a testament to just how good its processes truly are. Apparently, though, the public cloud is not going to save us from human error. We should all have a contingency plan for these inevitable outages. One of the most striking features of this outage was just how few businesses had such a plan in place.

Many just waited for Amazon to fix the problem and took the cue to take a break, go outside and see the sunshine. Let's call that "service-provider-induced learned helplessness," and it can happen when your service provider is excellent, even superb. It means laboring luxuriously under the delusion that your service provider will always be there to mitigate your disaster and that your operational responsibility ends with their SLA. Nice work if you can get it.

Others, as frantic Twitter and forum chronologies show, worked furiously to restore their sites as fast as possible. A few just flipped a switch to their backup and quietly went on with their day, and a handful flipped no switch at all. How did they do it?

Kubernetes detected the outage automatically

Rob Scott, vice president of software at the engagement company Spire, described a "sense of awe watching the automatic mitigation as it happened" using Kubernetes. Kubernetes, an open-source project originally from Google, can orchestrate complex multi-tier applications in near real time. In Spire's case, Kubernetes detected the outage immediately with active monitoring, automatically replacing failed servers with new ones in another availability zone.
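The detect-and-replace behavior described above can be illustrated with a toy model. The sketch below is not the Kubernetes API and does not reflect Spire's actual setup; the `Server` and `Cluster` classes, the zone names and the `failure_threshold` parameter are all hypothetical, chosen only to show the idea of counting consecutive failed health probes and relaunching in a zone that is still up.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Server:
    name: str
    zone: str
    failed_probes: int = 0  # consecutive failed health probes


class Cluster:
    """Toy stand-in for an orchestrator (hypothetical, not Kubernetes)."""

    def __init__(self, zones, failure_threshold=3):
        self.zones = list(zones)
        self.failure_threshold = failure_threshold
        self.servers = []
        self._ids = count(1)

    def launch(self, zone):
        server = Server(name=f"web-{next(self._ids)}", zone=zone)
        self.servers.append(server)
        return server

    def probe_all(self, zone_is_up):
        """Run one probe round; return (old_name, new_name) for each replacement."""
        replaced = []
        for server in list(self.servers):  # copy: we mutate the list below
            if zone_is_up(server.zone):
                server.failed_probes = 0
                continue
            server.failed_probes += 1
            if server.failed_probes < self.failure_threshold:
                continue  # tolerate a few failures before acting
            healthy_zones = [z for z in self.zones if zone_is_up(z)]
            if not healthy_zones:
                continue  # nowhere to fail over to; retry next round
            self.servers.remove(server)
            replacement = self.launch(healthy_zones[0])
            replaced.append((server.name, replacement.name))
        return replaced


if __name__ == "__main__":
    cluster = Cluster(zones=["zone-a", "zone-b"])
    cluster.launch("zone-a")
    down = {"zone-a"}  # simulate an outage in one zone
    for round_number in range(1, 4):
        print(round_number, cluster.probe_all(lambda z: z not in down))
    print([(s.name, s.zone) for s in cluster.servers])
```

A real orchestrator also has to decide what counts as a failed probe and how quickly to react; the threshold here is simply one way to avoid replacing servers over a single transient error.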

Kubernetes has seen a lot of activity recently, with dozens of vendors piling on as partners and contributors. Although the system is maturing rapidly, Kubernetes is known for its complexity, and getting the system running can still be a real challenge. A recent release, version 1.4, attempts to simplify Kubernetes deployment with a new tool called kubeadm.

Other open-source projects, such as OASIS TOSCA, HashiCorp's Terraform and Docker's Compose, take a different approach. In this model, system administrators predefine the desired state using a high-level programming or configuration language. There are many advantages to this method. Changes are implemented in code and placed into revision control systems like Git. System administrators rely upon the orchestrator to converge the cloud environment to the target state automatically. Upgrading an entire environment to new versions of application servers can be as easy as running a single command.
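To make the desired-state idea concrete, here is a minimal sketch of the diff-then-apply pattern. It is not how Terraform, TOSCA or Compose are implemented; the `plan` and `apply` functions, the service names and the version strings are hypothetical, and state is reduced to a plain mapping from service name to version.

```python
def plan(desired, actual):
    """List the actions that would move `actual` to `desired`.

    Both arguments map a service name to a version string.
    """
    actions = []
    for name, version in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name, version))
        elif actual[name] != version:
            actions.append(("upgrade", name, version))
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", name, None))
    return actions


def apply(actions, actual):
    """Return the state that results from carrying out `actions`."""
    new_state = dict(actual)
    for verb, name, version in actions:
        if verb == "delete":
            new_state.pop(name)
        else:  # "create" and "upgrade" both set the target version
            new_state[name] = version
    return new_state


if __name__ == "__main__":
    desired = {"web": "2.0", "db": "9.6"}    # what the config file declares
    actual = {"web": "1.9", "cache": "3.2"}  # what is currently running
    steps = plan(desired, actual)
    print(steps)
    print(apply(steps, actual) == desired)
```

Because the desired state lives in a file under revision control, an upgrade amounts to editing a version string and rerunning the tool; once the environment matches the declaration, a second run produces an empty plan.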

Despite the availability of so many excellent tools, the real-world difficulty of running failover and replication in the cloud was still a common complaint in postmortem discussions around the internet. The complexity of even a single cloud service provider like Amazon is not easily conquered by a single tool. Numerous vendors and open-source projects have battled over cloud orchestration for years, and there is still no clear winner. This situation leaves development and IT teams in the precarious position of needing to make a rather risky bet on the future of cloud automation.

The pillar of cloud automation: containers

At this point, the safest bet is still on containerization being a pillar of the automated future. It is now almost a foregone conclusion that containers will be the de facto packaging for microservices (and everything else) going forward, so the work of containerizing will surely pay dividends for IT and development teams. Just take care to avoid overinvesting in a solution that strays too far from the mindset of the underlying containerization layer.

This article is published as part of the IDG Contributor Network.