Data-center outages: Causes are changing, report says

Power issues are less likely to cause a major IT service outage, while IT configuration and network problems are becoming more common, according to the Uptime Institute.

A new survey by the Uptime Institute found that power issues are becoming less of a problem for data center operators, while networking and software issues are a growing one.

The Uptime Institute's third Annual Outage Analysis notes that while improvements have been made with technology and availability, outages remain a major industry, customer, and regulatory concern. 

The report also shows that the overall impact and direct and indirect cost of outages continue to grow. When asked about their most recent significant outage, more than half of respondents reported an outage in the past three years and estimated its cost at more than $100,000; among those respondents, almost one-third reported costs of $1 million or above.

The trend is only natural. In the past, your data center was your IT infrastructure. Now add cloud service providers and SaaS. If Microsoft 365 has an outage, you have an outage. If AWS has an outage, you have an outage.

“Resiliency remains near the top of management priorities when delivering business services,” said Andy Lawrence, executive director of research for the Uptime Institute, in a statement. “Overall, the causes of outages are changing; software and IT configuration issues are becoming more common, while power issues are now less likely to cause a major IT service outage.”

Uptime notes that although there were significant disruptions affecting financial trading, government services, internet and telecom, the outages that made headlines in 2020 were often about the impact to consumers and workers at home, with interruptions to applications such as Microsoft Exchange and Teams, Zoom, fitness trackers and the like.

Some of the findings from Uptime’s 2020 survey include:

  • Almost half (44%) of data center operators surveyed think that concern about resiliency of data-center/mission-critical IT has increased in the past twelve months.
  • Serious and severe outages are less common (one in six reported having one in the past three years) but can have catastrophic results for stakeholders. Vigilance and investment are necessary.
  • More than half (56%) of all organizations using a third-party data service have experienced a moderate or serious IT service outage in the last three years that was caused by the provider.
  • Networking and configuration issues are emerging as two of the more common causes of service degradation, while power outages are becoming somewhat less of an issue. Power issues have historically been caused by failures in UPSs, transfer switches and generators.

While tech gets much of the blame for failures, the human element must be taken into account as well. Just how large a role human error plays is difficult to measure. In Uptime’s 2021 data center resiliency survey, 42% of respondents said they had experienced an outage in the last three years due to human error.

Among those, 57% cited data center staff execution (failure to follow procedure) and 44% cited incorrect staff processes/procedures as root causes. The research makes clear that a stronger focus on management and training will improve service delivery.

Copyright © 2021 IDG Communications, Inc.
