Network issues are causing more data-center outages

As enterprise computing environments grow more complex, IT system failures and network errors are bringing down data centers in greater numbers, causing more unplanned downtime.

Power failures are a common cause of data-center outages, but they’re not the only culprit. As enterprise computing environments grow more complex, IT system and network failures are bringing down data centers in greater numbers.

The Uptime Institute has been studying publicly reported outages to track what’s causing unplanned downtime. Over the past three years, it has culled information from 163 outages reported in traditional media or on social media. During that time, the amount of data available has steadily climbed; researchers collected data from 27 outages in 2016, 57 outages in 2017, and 78 outages in 2018.

“Public outages make the news with ever increasing regularity,” said Andy Lawrence, executive director of research at Uptime Institute, which offers resiliency services, advice on building and running data centers, and certification services.

The industry is now recording “significant outages on a near daily basis somewhere around the world,” Lawrence said as the group unveiled its research findings. That doesn’t necessarily mean that the number of outages is spiking, but downtime is gaining more attention and “it’s clear to us that the impact of outages is certainly increasing,” he said.

One key finding from Uptime Institute’s research: Power is less implicated in overall failures, while the network and IT systems are more implicated.

One reason for the shift is that power systems are performing more reliably than they have in the past, which is reducing the number of on-premises data-center power failures.

Over the past two decades, the tech industry has focused on how to design power systems in a way that allows IT assets to continue to operate even if there’s a fault or failure somewhere in the power system, said Chris Brown, CTO of Uptime Institute. “The advent of 2N electrical distribution systems feeding dual-corded IT equipment allows systems in IT to continue to operate through a number of single incidents and events,” Brown said.

Meanwhile, the increasing complexity of IT environments is leading to greater numbers of IT- and network-related problems. “Data now is spread across multiple places with some critical dependencies upon the network, the way that applications [are architected], and the way that databases replicate. It’s a very complex system, and it takes less today to perturb that system than perhaps in years past,” said Todd Traver, Uptime Institute’s vice president of IT optimization and strategy.

Rating the severity of data-center outages

To distinguish between an outage that threatens to bring down the business and one that is merely an inconvenience, Uptime Institute has come up with a scale. The rating system allows researchers to see how patterns change over time, Lawrence said. Uptime Institute’s scale has five tiers:

  • Level 1 is a negligible outage. The outage is recordable, but there’s little or no obvious impact on services and no service disruptions.
  • Level 2 is characterized as a minimal service outage. Services are disrupted, but there’s minimal effect on users, customers or reputation.
  • Level 3 is a business-significant service outage. It involves customer or user service interruptions, mostly of limited scope, duration or effect. There’s minimal to no financial impact. Some reputational or compliance impact is incurred.
  • Level 4 is a serious business or service outage. Disruption of service and/or operations is involved. Ramifications include some financial losses, compliance breaches, reputation damage and possibly safety concerns. Customer losses are possible.
  • Level 5 is a business- or mission-critical outage involving major and damaging disruption of services and/or operations. There are possible large financial losses, safety issues, compliance breaches, customer losses and reputational damage.
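For teams that want to tag their own incident records with a comparable scale, the five levels above could be encoded roughly as follows. This is a minimal sketch, not Uptime Institute tooling: the enum member names, the tally_by_year helper, and the sample records are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from enum import IntEnum


class OutageSeverity(IntEnum):
    """Five-level scale as summarized above; member names are illustrative."""
    NEGLIGIBLE = 1            # recordable, little or no obvious impact
    MINIMAL = 2               # services disrupted, minimal effect on users
    BUSINESS_SIGNIFICANT = 3  # limited-scope customer or user interruptions
    SERIOUS = 4               # some financial loss, compliance or reputation damage
    MISSION_CRITICAL = 5      # major and damaging disruption


def tally_by_year(outages):
    """Count outages per (year, severity) pair to see how patterns shift."""
    return Counter((year, OutageSeverity(level)) for year, level in outages)


# Hypothetical sample records (year, level); NOT Uptime Institute data.
sample = [(2017, 2), (2017, 4), (2018, 2), (2018, 2), (2018, 5)]
counts = tally_by_year(sample)
print(counts[(2018, OutageSeverity.MINIMAL)])
```

Using an integer-backed enum keeps the levels ordered, so records can be filtered with simple comparisons such as level >= OutageSeverity.SERIOUS.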

When Uptime Institute examined all publicly reported data-center outages (Levels 1 to 5) over the three-year period, IT system and network problems outstripped power as the primary cause (see graphic).

[Graphic: data-center outages pie chart]
