British Airways’ outage, like most data center outages, was caused by humans

The technology fails only because the people behind it failed in some capacity


An IT outage on May 27 that forced British Airways (BA) to cancel more than 400 flights and strand 75,000 passengers in one day was the result of human error, and a simple error at that.

An engineer had disconnected a power supply at a data center near London’s Heathrow airport, and when it was reconnected, it caused a surge of power that resulted in major damage, according to Willie Walsh, CEO of BA’s parent company IAG SA. Walsh made the comment to reporters in Mexico, and it was picked up by Bloomberg and other news outlets.


The engineer in question had been authorized to be on site and was part of a team working at the Heathrow data center hit by the power outage. The facility is managed by CBRE Works Solutions, a U.S. property services company.

A BA spokesperson told the U.K. publication IT PRO, "There was a loss of power to the U.K. data center, which was compounded by the uncontrolled return of power, which caused a power surge taking out our IT systems. So we know what happened; we just need to find out why. It was not an IT failure and had nothing to do with outsourcing of IT; it was an electrical power supply which was interrupted."

An internal email sent by the head of group IT at IAG, which was leaked to the Press Association, a U.K. news group similar to the Associated Press in the U.S., said, "This resulted in the total immediate loss of power to the facility, bypassing the backup generators and batteries. ... It was turned back on in an unplanned and uncontrolled fashion, which created physical damage to the system."

A spokesperson for CBRE said, however, that the cause of the outage has yet to be determined.

“We are the manager of the facility for our client BA and fully support its investigation. No determination has been made yet regarding the causes of the incident on May 27,” the spokesperson said.

This was no small accident. It’s estimated to have cost BA as much as 100 million euros (U.S. $112 million), to say nothing of the black eye BA got for the outage.

This isn’t an isolated incident. Most recently, in late February, Amazon Web Services suffered a massive outage when an employee debugging an issue with the billing system accidentally took more servers offline than intended. That error started a cascade effect that took down other systems, resulting in the outage.

Humans the cause for most data center failures

To err is human, and we err a lot.

A 2016 study by the Ponemon Institute found human error was one of the leading causes of failure, accounting for 22 percent of data center outages, while water, heat or air conditioning failure accounted for 11 percent, weather for 10 percent and generator failures for 6 percent. IT equipment malfunction accounted for only 4 percent of all outages.

That’s because the IT industry doesn’t do a good job of educating its workers on proper processes for managing this equipment. Two-thirds of data center outages are related to processes, not infrastructure systems, according to David Boston, director of facility operations solutions for TiePoint-bkm Engineering.

“Most are quite aware that processes cause most of the downtime, but few have taken the initiative to comprehensively address them. This is somewhat unique to our industry,” he told Data Center Knowledge.

This matches another long-standing problem in corporate America: failure to educate users on acceptable practices for their laptops and smartphones. It’s well established that most corporate breaches come not from external hackers or even internal malcontents, but from employees making stupid mistakes, such as opening phishing emails.

Management needs to stop assuming people are computer literate, know the technology as well as their kids do, and are mind readers when it comes to policy. Taking some time to train and educate people on what is expected of them should not be a special consideration; it should be standard operating procedure. Clearly it is not at the moment.
