How did Facebook go down despite multiple data centers?

You build in redundancy for a reason, but in some cases it can backfire.


The Mercury retrograde kicked in big time on Wednesday as Facebook suffered an eight-hour outage that also affected Instagram and Facebook Messenger.

No one was believed to be harmed; a few might have even had offline interactions with other human beings.

Facebook said the outage was not the result of an attack, such as a denial-of-service attack, and has since issued a statement attributing it to a configuration error.

“Yesterday, we made a server configuration change that triggered a cascading series of issues. As a result, many people had difficulty accessing our apps and services,” said Travis Reed, a Facebook spokesman. “We have resolved the issues, and our systems have been recovering over the last few hours. We are very sorry for the inconvenience and we appreciate everyone’s patience.”

The question for me is how a company with redundant data centers around the U.S., not to mention internationally, could be taken down like this. All told, it has seven data centers in the U.S. Redundancy is supposed to help prevent this kind of problem.

Well, not exactly. In the case of a bug or operating problem, redundancy doesn’t help. In fact, it can spread the problem quickly, notes analyst Rob Enderle.

“Redundancy can help with certain things like a complete system failure, but it doesn’t help with a virus or software bug because it can replicate it, so redundancy can’t help here,” he said.
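
To make Enderle's point concrete, here is a toy sketch in Python. It is not a model of Facebook's actual infrastructure: the `DataCenter` class, the version labels, and the helper functions are all hypothetical names invented for illustration. It assumes only that a service stays up as long as at least one site is healthy, and that a configuration change is copied to every site.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    """Hypothetical stand-in for one redundant site."""
    name: str
    config_version: str = "v1"
    hardware_ok: bool = True

    def is_serving(self, bad_versions: set) -> bool:
        # A site serves traffic only if its hardware works
        # and it is not running a known-faulty configuration.
        return self.hardware_ok and self.config_version not in bad_versions


def service_available(sites, bad_versions) -> bool:
    # Redundancy: the service is up if any single site is serving.
    return any(dc.is_serving(bad_versions) for dc in sites)


def replicate_config(sites, version) -> None:
    # The same change is pushed to every site, as replication is designed to do.
    for dc in sites:
        dc.config_version = version


# Assumption for illustration: "v2" is the change that contains the fault.
BAD_VERSIONS = {"v2"}

# Seven sites, matching the U.S. data center count cited in the article.
sites = [DataCenter(f"dc{i}") for i in range(1, 8)]

# Scenario 1: one site loses its hardware. The other six keep serving.
sites[0].hardware_ok = False
hardware_failure_survived = service_available(sites, BAD_VERSIONS)

# Scenario 2: the faulty configuration is replicated to all sites.
replicate_config(sites, "v2")
bad_config_survived = service_available(sites, BAD_VERSIONS)

print("Up after a single hardware failure:", hardware_failure_survived)
print("Up after replicating a bad config:", bad_config_survived)
```

In this sketch the service survives the loss of a site but not the replicated change, because every copy now carries the same fault.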

A software bug shouldn’t have affected Instagram and Messenger, but Enderle figures that the problem was related to a shared-code issue, and whatever it was that failed used the same code or a derivative of that code, so it replicated across all services. “At the very least they should have firewalled the services to avoid something like this,” he said.

Still, Enderle thinks something else is going on here because in this day and age, an outage shouldn’t last eight hours unless you are under attack, and Facebook said it was not under attack. “They should have rolled back whatever it was in minutes. It’s not like they are a novice company. This should not have happened,” he said.

And given the trust issues Facebook has had, it’s in the best interests of the company to come clean, if only for the sake of its advertisers. Most of us just went on with our day.
