This summer, multiple high-profile organizations have experienced embarrassing and financially costly business disruptions.
The explanations and excuses for these service interruptions—delivered by company executives and Monday Morning Quarterbacks alike—fail to address the underlying cause of these issues: lack of rigorous senior management oversight.
Over the last month, Southwest Airlines and Delta both experienced business outages and widespread consumer dissatisfaction, which executives have blamed on equipment failures. Pundits blame the meltdowns on cobbled-together legacy infrastructure.
Both miss the point.
On July 20, 2016, Southwest Airlines IT systems went haywire due to a malfunctioning router, forcing the cancellation of 700 flights and stranding thousands of passengers. Southwest Airlines CEO Gary Kelly characterized the outage as a “once-in-a-thousand-year flood.”
The difference between a thousand-year flood and a single IT equipment failure taking down a business is that the latter is entirely preventable.
Companies with complex IT systems employ safeguards against failure and multiple layers of protection and backup. Thus, when they fail it is due to much more than a single element or mistake. Most often a catastrophic, cascading failure is not due to a lack of standards or backup systems, but rather a failure of management.
Examining the Southwest Airlines and Delta outages
Let’s examine the recent outages and corporate responses.
Southwest estimated that the financial hit will be tens of millions of dollars. Both the pilots’ union and the mechanics’ union at Southwest are calling for Kelly’s resignation over the incident, as it touches upon long-simmering tensions about top-down cost-cutting.
One might assume the airline industry would quickly learn a lesson from this outage with its high-profile repercussions: millions of dollars in revenue lost, stock price negatively impacted, customers angered, and top executives called out in the media for poor management.
Yet less than a month after the Southwest outage, a similar system failure struck Delta Airlines on Aug. 8.
According to the airline, “… a critical power control module at our Technology Command Center malfunctioned, causing a surge to the transformer and a loss of power. The universal power was stabilized and power was restored quickly. But when this happened, critical systems and network equipment didn’t switch over to backups. Other systems did. And now we’re seeing instability in these systems.”
Some of the first reports blamed switchgear failure or a generator fire for the outage. Later reports suggested that critical services were housed on single-corded servers or that both cords of dual-corded servers were plugged into the same feed, which would explain why backup power failed to keep some critical services on line.
In Uptime Institute’s field experience and in our advisory role with clients, we find IT staff far too often deploy single-corded IT equipment or mistakenly install equipment with dual power supplies into a single power path, defeating millions of dollars spent on facility systems redundancy through carelessness or ignorance.
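The cabling mistake described above is the kind of thing a simple inventory audit can surface. The following is a minimal sketch, not a description of any airline's actual tooling: the `Server` record, the feed labels "A" and "B", and the server names are all hypothetical. It assumes an inventory that records which power feed each power supply cord is plugged into, and it flags any server that would go dark if a single feed were lost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    """Hypothetical inventory record: one feed label per power cord."""
    name: str
    feeds: tuple  # e.g. ("A", "B") for a correctly dual-corded server


def lacks_path_redundancy(server):
    """True if losing one power feed would take this server down."""
    # A single-corded server has one feed; a mis-cabled dual-corded
    # server lists the same feed twice. Both collapse to one distinct feed.
    return len(set(server.feeds)) < 2


def audit(servers):
    """Return the names of servers that depend on a single power path."""
    return [s.name for s in servers if lacks_path_redundancy(s)]


# Illustrative inventory (names and feeds are made up).
inventory = [
    Server("booking-db-01", ("A", "B")),  # dual-corded, two paths
    Server("crew-sched-02", ("A", "A")),  # dual-corded, same feed twice
    Server("gate-disp-03", ("B",)),       # single-corded
]

at_risk = audit(inventory)
print(at_risk)  # ['crew-sched-02', 'gate-disp-03']
```

The check itself is trivial. The management question is whether anyone is required to run something like it, and to act on the result, every time equipment is installed or moved.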
In this instance, a small percentage of servers lost power, starting a cascade of outages in dependent systems, resulting in hundreds of canceled and delayed flights.
Delta’s IT problems stretched for days with hundreds of thousands of passengers stranded in airports around the globe. Airline analyst Helane Becker estimated the airline will suffer a $120 million operating income loss from the outage.
According to the Associated Press, “Delta Air Lines CEO Ed Bastian apologized for the meltdown and said that while he knew the airline needed to make technological investments—an updated mobile app for instance—‘we did not believe, by any means, that we had this type of vulnerability.’”
The CEO of Delta Airlines doesn’t need to be an expert in predicting the lifecycles of data center infrastructure or in tracing every server cord to its outlet, but he or she does need transparency and accountability in the reporting chain to ensure that processes and management structures are in place and followed, so that these issues are prevented or mitigated.
At the end of the day, the power outage happened for practical and predictable reasons that aren’t sexy and aren’t attended to. A few hundred servers weren’t plugged into the right outlets—these are basic power distribution management principles.
Delta invested in multiple power paths for its data center—the system was designed to survive failure. They had everything in place to sustain customer service, but a lack of processes or enforcement of processes defeated the investment.
Addressing complex systems failures
Large industrial and engineered systems are risky by their very nature. The greater the number of components, the greater the skill and teamwork required to plan, manage and operate the systems safely. Between mechanical components and human actions, there are thousands of possible points where an error can occur and potentially trigger a chain of failures.
Complex system breakdowns usually begin when one component or element of the system fails, requiring nearby “nodes” (or other components in the system) to take up the workload or service obligation of the failed component. If this increased load is too great, it can cause other nodes to overload and fail as well, creating a waterfall effect as every component failure increases the load on the other, already stressed components.
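The load-shifting mechanism described above can be illustrated with a toy model. This is a simplified sketch under stated assumptions, not a model of any real airline system: the node names, loads and capacities are invented, and the model assumes a failed node's load is split evenly among the surviving nodes, with any node pushed past its capacity failing in turn.

```python
def simulate_cascade(loads, capacities, first_failure):
    """Return the order in which nodes fail after one initial failure.

    loads and capacities map node name -> number. Assumption: a failed
    node's load is shared equally by all nodes still running.
    """
    loads = dict(loads)          # work on a copy
    failed = []
    pending = [first_failure]    # nodes that must fail next

    while pending:
        node = pending.pop(0)
        if node not in loads:
            continue             # already failed
        shed = loads.pop(node)
        failed.append(node)
        if not loads:
            break                # nothing left to absorb the load
        share = shed / len(loads)
        for other in loads:
            loads[other] += share
        # Any survivor now above its capacity is queued to fail.
        pending.extend(
            n for n in loads
            if loads[n] > capacities[n] and n not in pending
        )
    return failed


nodes = {"a": 60, "b": 60, "c": 60, "d": 60}

# With headroom, the other nodes absorb the lost node's load.
roomy = {n: 100 for n in nodes}
print(simulate_cascade(nodes, roomy, "a"))  # ['a']

# With thin margins, the same single failure takes down everything.
tight = {n: 75 for n in nodes}
print(simulate_cascade(nodes, tight, "a"))  # ['a', 'b', 'c', 'd']
```

The initial fault is identical in both runs. What differs is the margin the system was operated with, which is a planning and budgeting decision rather than an equipment defect.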
Although operator error or single equipment failure may sometimes appear to cause an incident, a single incident is not sufficient to bring down a robust system unless conditions are such that the system is already teetering on the edge of critical failure and has multiple underlying risk factors left untended by management.
Complete failure requires multiple faults, thus attribution of blame to a single isolated element is myopic and, arguably, scapegoating.
Most often a catastrophic failure is not due to a lack of standards, but a breakdown or circumvention of established procedures that compounded into a disastrous outcome.
Multilayer complex systems outages signify management failure to drive change and improvement.
The responsibility for cascading failures flows from the top down. Leadership decisions and priorities manifest themselves at the most critical levels: inadequate staffing and training, an organizational culture that becomes dominated by a reactive mentality, or budget cutting that reduces preventive/proactive maintenance.
Uptime Institute has assessed the world’s elite IT and data center operations to validate that organizations have the procedures, accountability and transparency in place to ensure long-term performance of data center assets.
These evaluations ensure that management equips frontline operators with resources they need to mitigate risk and respond appropriately when small failures occur to avoid having them cascade into large critical failures.
If executive leadership, operators and oversight agencies adhered to their own policies and requirements and did not cut corners for economics or expediency, many disasters could be avoided.
Legacy blame game and the fallacy of modernity
According to the Wall Street Journal, over the past three years, Delta has spent “hundreds of millions” in IT infrastructure upgrades and systems, including $150 million this year alone.
“And earlier this year [Delta] named a new chief information officer and has brought in new leaders for its information technology and infrastructure team,” the Wall Street Journal writes.
Yet the conventional wisdom from the media is that the airline systems are retrograde and fragile.
According to a column in The Economist, “Airlines’ systems are so fragile because of their age and complexity. … As airlines merged and more new functions were added they have come to resemble technological hairballs in which one small problem suddenly spins into bigger ones that even experts struggle to disentangle.”
The column goes on to claim that the problem is fundamentally unsolvable—too costly and complex for even the largest and most sophisticated IT firms to address.
With the capabilities, technologies and funding available to the IT architects, the idea that the airlines are forever trapped in a legacy technology death spiral fails to convince. But it also misses the point.
These systems would have failed in the 1980s for the exact same reason they failed today. Nearly all IT systems are fragile when the power crashes.
By that standard, are today’s cloud computing systems fragile?
While cloud providers have worked to architect applications that are resilient and instantly transferable in the event of hardware failures, the overwhelming evidence suggests that when the power drops, customers suffer.
Report after report in the news documents data center facility incidents that translate into cloud service interruptions. The cloud sounds modern and flexible, but ultimately there’s a data center somewhere.
In recent years, industry pundits have claimed data center designs with redundant power paths are on the decline. Based on our extensive field experience certifying 1,000 data center designs around the globe, the evidence continues to support infrastructure resiliency and redundancy.
Ask the companies delivering data center capacity to the largest cloud vendors what level of infrastructure resiliency they are building to, given the stakes. Concurrently maintainable, dual-path infrastructure is the norm. The focus on Fault Tolerance (System + System) is tapering, but “single thread” infrastructure is a risk only very few are willing to take.
Yet the site infrastructure is only as good as the management team empowered to run it—whether that’s in a cloud or an airline IT department.
The lesson to be learned from this recent spate of IT outages is that you can’t buy a culture of transparency and continuous improvement from a vendor catalog. You don’t address risk just by throwing more infrastructure at a problem. Rather, IT organizations need to ensure that their people are adequately trained and resourced, that procedures are documented and followed, and that critical assets are maintained and tested.
Now the U.S. Congress is getting involved.
“In a letter sent to executives at 13 airlines, Democratic Sens. Edward Markey and Richard Blumenthal outlined 10 questions regarding recent disruptions, the state of airlines' technology systems and how airlines accommodate passengers during an outage,” reported the Dallas Morning News.
As the executives and politicians try to analyze and recommend how to prevent or mitigate future airline IT outages, we hope they will look at the management principles behind these failures rather than the single points of failure currently being cited for the losses.