5 lessons from Amazon’s S3 cloud blunder – and how to prepare for the next one

Don’t put all your eggs in one cloud basket


According to internet monitoring platform Catchpoint, Amazon Web Services’ Simple Storage Service (S3) experienced a three-hour, 39-minute disruption on Tuesday that had cascading effects across other Amazon cloud services and many internet sites that rely on the popular cloud platform.

“S3 is like air in the cloud,” says Forrester analyst Dave Bartoletti; when it goes down, many websites can’t breathe. But disruptions, errors and outages are a fact of life in the cloud. Bartoletti says there’s no reason to panic: “This is not a trend,” he notes. “S3 has been so reliable, so secure, it’s been the sort of crown jewel of Amazon’s cloud.”


What this week should be, though, is a wake-up call to make sure your cloud-based applications are ready for the next time the cloud hiccups. Here are five tips for preparing yourself for a cloud outage:

Don’t keep all your eggs in one basket

This advice will mean different things for different users, but the basic idea is that if you deploy an application or piece of data to one point in the cloud, it will not be very fault tolerant. How highly available you want your application to be will determine how many baskets you spread your workloads across. There are multiple options:

  • AWS recommends, at a minimum, spreading workloads across multiple Availability Zones (AZs). Each of the 16 regions that make up AWS is broken down into at least two, sometimes as many as five, AZs. Each AZ is meant to be isolated from other AZs in the same region. AWS provides low-latency connections between its AZs in the same region, creating the most basic way to distribute your workloads.
  • For increased protection, users can spread their applications across multiple regions.
  • The ultimate protection would be to deploy the application across multiple providers, for example using Microsoft Azure, Google Cloud Platform or some internal or hosted infrastructure resource as a backup.
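The first option, spreading across AZs, can be pictured with a minimal Python sketch. It is an illustration only, not an AWS API call: the instance names and zone labels are hypothetical, and a real deployment would let an auto-scaling or orchestration service do the placement.

```python
from itertools import cycle

def spread(instances, zones):
    """Assign instances to zones round-robin so no single zone holds them all."""
    placement = {zone: [] for zone in zones}
    for instance, zone in zip(instances, cycle(zones)):
        placement[zone].append(instance)
    return placement

# Hypothetical instance names and zone labels, for illustration only.
instances = ["web-1", "web-2", "web-3", "web-4", "web-5"]
zones = ["zone-a", "zone-b", "zone-c"]

placement = spread(instances, zones)
for zone, members in placement.items():
    print(zone, members)
```

With five instances over three zones, losing any one zone takes out at most two instances. The same idea applies one level up when the "zones" are regions or providers.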

Bartoletti says different customers will have different levels of urgency for doing this. If you rely on the cloud to make money for your business or it’s integral for productivity, you’d better make sure it’s fault tolerant and highly available. If you use it to back up files that aren’t accessed frequently, then you may be able to live with the occasional service disruption.

ID failures ASAP

One key to responding to a cloud failure is knowing when one happens. AWS has a series of ways to do this. One of the most basic is to use what it calls Health Checks, which provide a customized view of the status of AWS resources used by each account. Amazon CloudWatch can be configured to automatically track service availability, monitor log files, create alarms and react to failures. One important precursor to this working is having a thorough analysis of what “normal” behavior is so that the AWS cloud tools can detect “abnormal” behavior.

Once an error is identified, a range of domino-effect reactions needs to be preconfigured to respond to the situation (see above on multi-AZ, multi-region, or multi-cloud). Load balancers can be in place to redirect traffic, and backup systems can kick in if they’ve been set up to do so (see below).
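To make the “normal” versus “abnormal” idea concrete, here is a minimal sketch in plain Python. It does not use the CloudWatch API. It assumes latency is the signal being watched, that "abnormal" means more than three standard deviations above the baseline mean, and that the endpoint names and numbers are invented for illustration.

```python
from statistics import mean, stdev

def is_abnormal(baseline, sample, threshold=3.0):
    """True when `sample` sits more than `threshold` standard deviations above the baseline mean."""
    return sample > mean(baseline) + threshold * stdev(baseline)

def pick_endpoint(endpoints, latest_ms, baseline_ms):
    """Return the first endpoint, in priority order, whose latest latency looks normal."""
    for name in endpoints:
        latency = latest_ms.get(name)
        if latency is not None and not is_abnormal(baseline_ms[name], latency):
            return name
    return None  # nothing healthy: time to page a human

# Hypothetical latency history (milliseconds) recorded during normal operation.
baseline = {
    "primary": [100, 110, 95, 105, 102],
    "backup": [120, 125, 118, 122, 121],
}
latest = {"primary": 900, "backup": 123}  # the primary has slowed to a crawl

chosen = pick_endpoint(["primary", "backup"], latest, baseline)
print(chosen)  # prints "backup"
```

The baseline is what makes the check meaningful. Without a record of normal behavior, a 900 ms response is just a number.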

Build redundant systems from the start

It will not be very useful to try to respond to an outage in real-time. Preparation before the outage will save you when it inevitably comes. There are two basic ways to build redundancy into cloud systems:

-Standby: When a failure occurs, the application automatically detects it and fails over to a backup, redundant system. In this scenario, the backup system can be off but ready to spin up when an error is detected. Alternatively, the standby backup can run idly in the background the entire time (this costs more but will reduce failover time). The downside to these standby approaches is there could be a lag between when an error is detected and when the failover system kicks in.

-Active redundancy: To (theoretically) avoid downtime, users can architect their application to have active redundancy. In this scenario, the application is distributed across multiple redundant resources: When one fails, the rest of the resources absorb a larger share of the workload. A sharding technique can be used in which services are broken up into components. Say, for example, an application runs across eight virtual machine instances – those eight instances can be broken up into four groups of two each and traffic can be load balanced between them. If one shard goes down, the other three can pick up the traffic.
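The eight-instance example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production load balancer: the instance names are hypothetical, requests are routed by hashing a key, and a real system would more likely use a managed load balancer or consistent hashing.

```python
import zlib

def make_shards(instances, shard_size):
    """Split the instance list into consecutive groups of `shard_size`."""
    return [instances[i:i + shard_size] for i in range(0, len(instances), shard_size)]

def route(key, shards, down=frozenset()):
    """Return the index of a healthy shard for `key`, skipping shards listed in `down`."""
    healthy = [i for i in range(len(shards)) if i not in down]
    if not healthy:
        raise RuntimeError("no healthy shards left")
    # crc32 is a stable hash, so a key keeps mapping to the same shard
    # for as long as the set of healthy shards does not change.
    return healthy[zlib.crc32(key.encode()) % len(healthy)]

instances = [f"vm-{n}" for n in range(8)]   # hypothetical instance names
shards = make_shards(instances, 2)          # four shards of two instances each

print(shards)
print(route("user-42", shards))             # all four shards available
print(route("user-42", shards, down={2}))   # shard 2 is out; 0, 1 or 3 takes over
```

With all four shards up, each carries about 25% of the traffic. With one down, each survivor carries about 33%, so every shard needs that headroom before the failure happens.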

Back data up

It’s one thing to have redundant systems; it’s another thing to back your data up. This would have been especially important in this week’s disruption because it first impacted Amazon’s most popular storage service, S3. AWS has multiple ways to natively back data up:

-Synchronous replication: A process in which an application only acknowledges a transaction (such as uploading a file to the cloud, or inputting information into a database) if that transaction has been replicated in a secondary location. The downside of this approach is the latency it can introduce while the primary system waits for the secondary replication to occur and be confirmed. When low latency is not a priority, though, this is fine.

-Asynchronous replication: This process decouples the primary node from the replicas, which is good for systems that need low-latency write capabilities. Users should be willing to accept some loss of recent transactions during a failure in this scenario.

-Quorum-based replication: A combination of synchronous and asynchronous replication that sets a minimum number of replicas that must confirm a write before the transaction is considered successful.
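One way to compare the three approaches is by how many replica confirmations each waits for before acknowledging a write. The Python sketch below is a simplified model, not how any AWS service implements replication. It assumes "synchronous" means every replica must confirm, "asynchronous" means none must, and "quorum" means a simple majority.

```python
def required_acks(mode, n_replicas):
    """How many replica confirmations a write waits for under each mode."""
    if mode == "sync":
        return n_replicas            # every replica must confirm
    if mode == "async":
        return 0                     # acknowledge at once, replicate later
    if mode == "quorum":
        return n_replicas // 2 + 1   # simple majority (an assumption of this sketch)
    raise ValueError(f"unknown mode: {mode}")

def write_acknowledged(mode, replica_up):
    """True if enough replicas are reachable for the write to be acknowledged."""
    confirmed = sum(replica_up)      # True counts as 1, False as 0
    return confirmed >= required_acks(mode, len(replica_up))

replica_up = [True, True, False]     # one of three replicas is unreachable
for mode in ("sync", "async", "quorum"):
    print(mode, write_acknowledged(mode, replica_up))
# sync False
# async True
# quorum True
```

With one replica down, the synchronous write stalls, the asynchronous write succeeds but may be lost if the primary fails before catching up, and the quorum write succeeds with two confirmed copies.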

To determine how best to build redundant systems and back data up, customers should consider their desired recovery point objective (RPO) and recovery time objective (RTO).
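RPO is the maximum amount of recent data a business can afford to lose, and RTO is the maximum time it can afford to be down. A rough way to check a design against both is sketched below in Python. It assumes periodic backups, so the worst-case data loss equals the interval between backups, and all the figures are invented for illustration.

```python
from datetime import timedelta

def meets_objectives(backup_interval, failover_time, rpo, rto):
    """Compare a design's worst-case data loss and recovery time with its targets."""
    return {
        "rpo_ok": backup_interval <= rpo,   # worst case: failure just before the next backup
        "rto_ok": failover_time <= rto,
    }

# Hypothetical design: hourly snapshots and a 10-minute failover,
# checked against a 15-minute RPO and a 30-minute RTO.
result = meets_objectives(
    backup_interval=timedelta(hours=1),
    failover_time=timedelta(minutes=10),
    rpo=timedelta(minutes=15),
    rto=timedelta(minutes=30),
)
print(result)  # {'rpo_ok': False, 'rto_ok': True}
```

Here the failover is fast enough, but hourly snapshots could lose up to an hour of data, so this design would need more frequent backups or replication to meet a 15-minute RPO.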

Test your system

Why wait for an outage to occur to see if your system is resilient to failure? Test it beforehand. It may sound crazy, but the best cloud architects are willing to kill whole nodes, services, AZs and even regions to see if their application can withstand it. “You should constantly be beating up your own site,” Bartoletti says. Netflix has open-source tools named Chaos Monkey and Chaos Gorilla, part of its Simian Army, that can automatically kill certain internal systems to test their tolerance to errors. Do they work? This week, Netflix didn’t report any issues with its service being down.
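The idea behind this kind of testing can be shown with a toy simulation in Python. It is not Netflix’s tooling and does not touch real infrastructure: the Cluster class, the node names and the rule that the service needs at least two healthy nodes are all assumptions made for illustration.

```python
import random

class Cluster:
    """A toy cluster that keeps serving while enough nodes are alive."""

    def __init__(self, nodes, min_healthy=1):
        self.alive = set(nodes)
        self.min_healthy = min_healthy

    def kill(self, node):
        self.alive.discard(node)

    def is_serving(self):
        return len(self.alive) >= self.min_healthy

def chaos_round(cluster, rng):
    """Kill one randomly chosen live node and report whether the service survived."""
    victim = rng.choice(sorted(cluster.alive))
    cluster.kill(victim)
    return victim, cluster.is_serving()

cluster = Cluster(["node-a", "node-b", "node-c"], min_healthy=2)
rng = random.Random(7)  # seeded so a failing run can be replayed

victim, ok = chaos_round(cluster, rng)
print(f"killed {victim}; still serving: {ok}")
```

One kill leaves two of three nodes, so the service survives. A second round would drop it below the minimum, which is exactly the kind of limit worth finding in a test rather than in an outage.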

For more information related to AWS best practices on architecting for fault tolerance, check out this AWS Whitepaper.

Copyright © 2017 IDG Communications, Inc.
