Why Netflix didn’t sink when Amazon S3 went down

When a typo took down part of Amazon Simple Storage Service (S3) on February 28, many customers' websites went down with it, but Netflix emerged unscathed.


On Tuesday, February 28, 2017, a typo brought many websites and online applications to their knees. A simple human error at Amazon S3, the backend for 150,714 websites, caused an outage.

The S3 billing system had been running sluggishly, and engineers were trying to debug it. One of them executed a command intended to remove a few servers from one of the subsystems the S3 billing process used. With an errant keystroke, the engineer caused collateral damage, taking out far more servers than intended.

Servers are slow to recover

Unfortunately, while the servers rapidly obeyed the command to go down, they were not as ready to wake up and come back to work. Perhaps you can relate to this. You know how long it takes to boot up your computer at the beginning of the day. Well, apparently a full restart of Amazon’s servers is even more taxing. As the servers struggled to get back into action, the outage dragged on for more than four hours.

The value of preparation

Many small businesses' online operations were vulnerable to Amazon's downtime, as were companies with household names: Netflix, Pinterest, Spotify, and Buzzfeed. But while some Amazon S3 customers fell victim to the outage, others didn't miss a beat.

Why the dichotomy in performance? Some had relied on the 99.99 percent availability that Amazon's S3 Service Level Agreement promises. Others either decided 0.01 percent unavailability was too much to bear or, perhaps, believed in Murphy's Law: "Anything that can go wrong, will go wrong." They shaped their cloud IT infrastructure strategy accordingly.

It seems that IFTTT is part of the “99.99 percent availability is good enough for us” group. This cloud-based service automates the little things in people’s lives. As IFTTT took a long afternoon siesta, users complained that they could not, for instance, turn on their lights when they got home. Conversely, at Netflix, a company that appears to put their faith in Murphy’s Law, it was smooth sailing.

Perhaps the wisdom to prepare for a worst-case scenario came from experiencing it in the past. After a 2012 storm-related power outage at Amazon during which Netflix suffered through three hours of downtime, a Netflix engineer noted that the company had begun to work with Amazon to eliminate “single points of failure that cause region-wide outages.” They understood it was the company’s responsibility to ensure Netflix was available to entertain their customers no matter what. It would not suffice to blame their cloud provider when someone could not relax and watch a movie at the end of a long day.

Plus, they had experienced the full impact of a business interruption on their bottom line. A 2014 report shows that the cost of one hour of downtime for Netflix is $200,000. It’s likely that amount has grown since then. Thus, more than four hours of downtime during the recent event would have cost them over $800,000. Also, it would have tarnished their brand reputation.
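The arithmetic behind that estimate is straightforward. A quick sketch, using only the $200,000-per-hour figure from the 2014 report:

```python
# Back-of-the-envelope downtime cost, based on the 2014 report's figure.
COST_PER_HOUR = 200_000  # dollars of revenue lost per hour of downtime

def downtime_cost(hours: float) -> int:
    """Estimated revenue lost for an outage of the given duration."""
    return round(hours * COST_PER_HOUR)

print(downtime_cost(4))  # the S3 outage lasted more than four hours
```

At four-plus hours, that comes out to more than $800,000, before counting any damage to the brand.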

A reliable cloud architecture

So Netflix made their move to the cloud slowly and carefully. The result is that their network is in 12 Amazon Web Services regions globally, each of which has multiple “Availability Zones.” Each zone has at least one data center along with associated power, networking and connectivity. Because these zones are connected to each other, Netflix has been able to design their cloud infrastructure in such a way that their applications switch between zones automatically when failures occur, avoiding service disruptions.

Not all cloud providers offer multiple zones. That doesn't mean you have to switch providers to achieve this level of redundancy, though. Design your cloud environment according to your requirements, and select at least two cloud providers in separate geographic regions to provide the IT infrastructure you need. Connect them for automatic failovers. By doing so, you create a virtual Noah's Ark that ensures availability.
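At its core, that kind of automatic failover is just a health-checked preference list: try the primary, and if it doesn't answer, route to the next provider. A minimal sketch, with invented endpoint URLs standing in for whatever health checks your providers expose (this is not any vendor's actual API):

```python
import urllib.request

# Hypothetical health-check endpoints at two independent providers.
ENDPOINTS = [
    "https://us-east.provider-a.example.com/health",
    "https://eu-west.provider-b.example.com/health",
]

def http_check(url: str, timeout: float = 2.0) -> bool:
    """True if the endpoint answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False  # unreachable counts as unhealthy

def first_healthy(endpoints, is_healthy=http_check):
    """Return the first endpoint whose check passes, or None if all are down."""
    for url in endpoints:
        if is_healthy(url):
            return url
    return None
```

Real failover systems run these checks continuously and shift traffic at the DNS or load-balancer layer, but the decision logic is the same: never depend on a single provider answering.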

Of course, for an IT architecture like this to work, you cannot max out capacity in either location. Perhaps you use 40 percent of one cloud provider’s maximum load and 40 percent of another’s. Yes, it costs more. But as with every business decision, you have to weigh the costs and benefits. Calculate how much downtime would cost your business in revenues, reputation, and productivity. How does this expense compare to paying for the redundancy? You might find it yields a positive ROI.
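That weighing of costs and benefits reduces to simple expected-value math. A sketch with placeholder numbers (the figures below are illustrative, not benchmarks):

```python
def redundancy_roi(extra_cost_per_year: float,
                   expected_downtime_hours: float,
                   cost_per_downtime_hour: float) -> float:
    """Expected annual loss avoided by redundancy, minus what it costs to run."""
    avoided_loss = expected_downtime_hours * cost_per_downtime_hour
    return avoided_loss - extra_cost_per_year

# Illustrative only: $120k/yr of spare capacity at a second provider vs.
# an expected 4 hours/yr of downtime at $200k/hr of lost revenue.
print(redundancy_roi(120_000, 4, 200_000))  # positive means the redundancy pays off
```

Plug in your own revenue, reputation, and productivity estimates; if the result is positive, the extra capacity is cheaper than the outage it prevents.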

Stay vigilant

Remember, however, your cloud infrastructure is not something you set and forget. You must remain vigilant. Stay aware of your changing technology requirements. As your business grows, make sure you have enough capacity at each vendor or within each zone at a single vendor to switch over without a hitch when a service failure occurs.

The best way to keep a handle on your IT performance and capacity is to monitor your environment 24/7/365. The IT infrastructure monitoring tool you use to do this should provide warnings that let you know when your business is bumping up against critical performance thresholds. Such information allows you to make changes as necessary to ensure you’re always prepared for Murphy’s Law to strike. After all, even for a few hours, you don’t want a typo to lay waste to your business.
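Under the hood, that kind of threshold warning is just a comparison of current metrics against configured limits. A toy sketch (the metric names and thresholds here are invented, not any particular tool's configuration):

```python
# Hypothetical critical thresholds; real monitoring tools let you tune these.
THRESHOLDS = {"cpu_percent": 80, "storage_percent": 85}

def check_thresholds(metrics: dict, thresholds: dict = THRESHOLDS) -> list:
    """Return a warning for every metric at or above its critical threshold."""
    return [
        f"WARNING: {name} at {value} (threshold {thresholds[name]})"
        for name, value in metrics.items()
        if name in thresholds and value >= thresholds[name]
    ]

print(check_thresholds({"cpu_percent": 91, "storage_percent": 60}))
```

A production setup would sample these metrics continuously and page someone, but the principle is the same: know you are near capacity before a failover forces the question.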

This article is published as part of the IDG Contributor Network.
