The severity of data-center outages appears to be falling, while the cost of outages continues to climb. Power failures are "the biggest cause of significant site outages." Network failures and IT system glitches also bring down data centers, and human error often contributes.

Those are some of the problems pinpointed in the most recent Uptime Institute data-center outage report, which analyzes types of outages, their frequency, and what they cost both in money and consequences.

Unreliable data is an ongoing problem

Uptime cautions that data relating to outages should be treated skeptically, given the lack of transparency of some outage victims and the quality of reporting mechanisms. "Outage information is opaque and unreliable," said Andy Lawrence, executive director of research at Uptime, during a briefing about Uptime's Annual Outages Analysis 2023.

While some industries, such as airlines, have mandatory reporting requirements, there's limited reporting in other industries, Lawrence said. "So we have to rely on our own means and methods to get the data. And as we all know, not everybody wants to share details about outages for a whole variety of reasons. Sometimes you get a very detailed root-cause analysis, and other times you get pretty well nothing," he said.

The Uptime report culled data from three main sources: Uptime's Abnormal Incident Reports (AIRs) database; its own surveys; and public reports, which include news stories, social media, outage trackers, and company statements. The accuracy of each varies. Public reports may lack details, and sources might not be trustworthy, for example. Uptime rates its own surveys as producing fair/good data, since the respondents are anonymous and their job roles vary. AIRs quality is deemed very good, since the database comprises detailed, facility-level data voluntarily shared by data-center owners and operators among their peers.

Outage rates are shrinking slightly

There's evidence that outage rates have been gradually falling in recent years, according to Uptime.

That doesn't mean the total number of outages is shrinking; in fact, the number of outages globally increases each year as the data-center industry expands. "This can give the false impression that the rate of outages relative to IT load is growing, whereas the opposite is the case," Uptime reported. "The frequency of outages is not growing as fast as the expansion of IT or the global data-center footprint."

Overall, Uptime has observed a steady decline in the outage rate per site, as tracked through four of its own surveys of data-center managers and operators conducted from 2020 to 2022. In 2022, 60% of survey respondents said they had an outage in the past three years, down from 69% in 2021 and 78% in 2020.

"There seems to be a gently, gently improving picture of the outage rate," Lawrence said.

Outage severity appears to be decreasing

While 60% of data-center sites have experienced an outage in the past three years, only a small proportion of those outages are rated serious or severe.

Uptime measures the severity of outages on a scale of one to five, with five being the most severe. Level 1 outages are negligible and cause no service disruptions. Level 5 mission-critical outages involve major and damaging disruption of services and/or operations and often include large financial losses, safety issues, compliance breaches, customer losses, and reputational damage.
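For teams that log incidents against a rubric like this, the scale maps naturally onto a small enumeration. The following is a minimal illustrative sketch, not Uptime's tooling; the report excerpt above describes only levels 1, 4, and 5, so the labels for levels 2 and 3 are assumptions.

```python
from enum import IntEnum

class OutageSeverity(IntEnum):
    """Illustrative mapping of Uptime's one-to-five outage severity scale.

    Levels 1, 4, and 5 follow the report's descriptions; the labels
    for levels 2 and 3 are assumptions made for this sketch.
    """
    NEGLIGIBLE = 1   # recordable, but no service disruption
    MINIMAL = 2      # assumed label: minor, easily absorbed disruption
    SIGNIFICANT = 3  # assumed label: noticeable, customer-facing disruption
    SERIOUS = 4      # serious disruption of services and/or operations
    SEVERE = 5       # mission-critical: major damage and financial loss likely

def is_serious_or_severe(outage: OutageSeverity) -> bool:
    """The 'serious/severe' band the report tracks is levels 4 and 5."""
    return outage >= OutageSeverity.SERIOUS
```

With a scheme like this, the statistic in the next paragraph is simply the share of logged incidents for which is_serious_or_severe() returns True.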
Level 5 (severe) and Level 4 (serious) outages historically account for about 20% of all outages. In 2022, outages in the serious/severe categories fell to 14%.

A key reason is that data-center operators are better equipped to handle unexpected events, according to Chris Brown, chief technical officer at Uptime. "We've become much better at designing systems and managing operations to a point where a single fault or failure does not necessarily result in a severe or serious outage," he said.

Today's systems are built with redundancy, and operators are more disciplined about creating systems that are capable of responding to abnormal incidents and averting outages, Brown said.

The financial toll is rising

When outages do occur, they are becoming more expensive, a trend that is likely to continue as dependency on digital services grows.

Looking at the last four years of Uptime's own survey data, the proportion of major outages that cost more than $100,000 in direct and indirect costs is increasing. In 2019, 60% of outages cost less than $100,000 to recover from. In 2022, just 39% of outages cost less than $100,000.

Also in 2022, 25% of respondents said their most recent outage cost more than $1 million, and 45% said their most recent outage cost between $100,000 and $1 million.

Inflation is part of the reason, Brown said; the costs of replacement equipment and labor are higher.

More significant is the degree to which companies depend on digital services to run their businesses. The loss of a critical IT service can be tied directly to disrupted business and lost revenue. "Any of these outages, especially the serious and severe outages, have the ability to impact multiple organizations, and a larger swath of people," Brown said, "and the cost of having to mitigate that is ever increasing."

Third-party providers are behind most high-profile, public outages

As more workloads are outsourced to external service providers, the reliability of third-party digital-infrastructure companies is increasingly important to enterprise customers, and these providers tend to suffer the most public outages.

Third-party commercial operators of IT and data centers (cloud providers, digital service providers, telecommunications providers) accounted for 66% of all the public outages tracked since 2016, Uptime reported. Looked at year by year, the percentage has been creeping up: in 2021, the proportion of outages caused by cloud, colocation, telecommunications, and hosting companies was 70%, and in 2022 it was up to 81%.

"The more that companies push their IT services into other people's domain, they're going to have to do their due diligence, and also continue to do their due diligence" even after the deal is struck, Brown said.

Human error is a frequent contributor to outages and a relatively simple factor to address

While it's rarely the single or root cause of an outage, human error plays some role in 66% to 80% of all outages, according to Uptime's estimate based on 25 years of data. But the institute acknowledges that analyzing human error is challenging: shortcomings such as improper training, operator fatigue, and a lack of resources can be difficult to pinpoint.

Uptime found that human error-related outages are mostly caused either by staff failing to follow procedures (cited by 47% of respondents) or by the procedures themselves being faulty (40%).
Other common causes include in-service issues (27%), installation issues (20%), insufficient staff (14%), preventative maintenance frequency issues (12%), and data-center design or omissions (12%).

On the positive side, investing in good training and management processes can go a long way toward reducing outages without costing too much.

"You don't need to go to a banker and get a bunch of capital money to solve these problems," Brown said. "People need to make the effort to create the procedures, test them, make sure they're correct, train their staff to follow them, and then have the oversight to ensure that they truly are following them."

"This is the low-hanging fruit to prevent outages, because human error is implicated in so many," Lawrence said.

Power problems continue to hamper data-center reliability

Uptime said its current survey findings are consistent with previous years' and show that on-site power problems remain the biggest cause of significant site outages by a large margin. That holds despite the fact that most outages have several causes and that the quality of reporting about them varies.

In 2022, 44% of respondents said power was the primary cause of their most recent impactful incident or outage. Power was also the leading cause of significant outages in 2021 (cited by 43%) and 2020 (37%).

Network issues, IT system errors, and cooling failures also stand out as troubling causes, Uptime said.

Network complexity leads to more outages

Uptime used its own data, from its 2023 Uptime resiliency survey, to dig into network outage trends. Among survey respondents, 44% said their organization had experienced a major outage caused by network or connectivity issues over the past three years. Another 45% said no, and 12% didn't know.

The two most common causes of networking- and connectivity-related outages are configuration or change-management failure (cited by 45% of respondents) and a third-party network provider's failure (39%).

Uptime attributed the trend to today's network complexity. "In modern, dynamically switched and software-defined environments, programs to manage and optimize networks are constantly revised or reconfigured. Errors become inevitable, and in such a complex and high-throughput environment, frequent small errors can propagate across networks, resulting in cascading failures that can be difficult to stop, diagnose, and fix," Uptime reported. (A sketch of one common mitigation for exactly this failure mode follows the cause lists below.)

Other common causes of major network-related outages include:

Hardware failure: 37%
Line breakages: 27%
Firmware/software error: 23%
Cyberattack: 14%
Network/congestion failure: 12%
Weather-related incident: 7%
Corrupted firewall/routing table issues: 6%

Common causes of IT system and software outages

When Uptime asked respondents to its resiliency survey whether their organization had experienced a major outage caused by an IT systems or software failure over the past three years, 36% said yes, 50% said no, and 15% didn't know. The most common causes of outages related to IT systems and software are:

Configuration/change management issue: cited by 64%
Firmware/software fault: 40%
Hardware failure: 36%
Capacity/congestion issue: 22%
Data synchronization/corruption: 14%
Cyberattack/security issue: 10%
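Configuration and change-management failure tops both lists above. One widely used mitigation, consistent with but not prescribed by the report, is to gate every change behind automated validation and a canary-first rollout so that a single bad revision cannot propagate fleet-wide. The sketch below illustrates the idea only; ChangeRequest, validate, and apply_to are hypothetical placeholders, not any vendor's API.

```python
from dataclasses import dataclass

@dataclass
class ChangeRequest:
    description: str
    config_diff: str
    approved_by: str | None = None  # human sign-off on the change

def validate(change: ChangeRequest) -> list[str]:
    """Return a list of problems; an empty list means the change passes the gates."""
    problems = []
    if not change.config_diff.strip():
        problems.append("empty configuration diff")
    if change.approved_by is None:
        # The report stresses procedure-following, not just procedures.
        problems.append("missing second-person review")
    return problems

def apply_to(device: str, change: ChangeRequest) -> None:
    """Placeholder for pushing a configuration to a single device."""
    print(f"applying '{change.description}' to {device}")

def staged_rollout(change: ChangeRequest, devices: list[str],
                   canary_fraction: float = 0.05) -> None:
    """Apply the change to a small canary slice first, so one bad
    change cannot cascade across the whole network at once."""
    problems = validate(change)
    if problems:
        raise ValueError(f"change rejected: {problems}")
    canary_count = max(1, int(len(devices) * canary_fraction))
    canary, remainder = devices[:canary_count], devices[canary_count:]
    for device in canary:
        apply_to(device, change)
    # In a real system: observe canary health here before continuing.
    for device in remainder:
        apply_to(device, change)
```

The design choice that matters is the canary slice: even a change that passes validation is exposed to only a small fraction of devices first, limiting the blast radius of the cascading failures Uptime describes.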
Fires aren't common but can be devastating

Publicly recorded outages, which include outages that are reported in the media, reveal a wide range of causes. The causes can differ from what data-center operators and IT teams report, since media sources' knowledge and understanding of outages depend on their perspective. "What's really interesting is the sheer variety of causes, and that's partly because this is how the public and the media perceive them," Lawrence said.

Fire is one cause that showed up among publicly reported outages but didn't rank highly among IT-related sources. Specifically, Uptime found that 7% of publicly reported data-center outages were caused by fires. In the web briefing, Uptime researchers related the incidence of data-center fires to the increasing use of lithium-ion (Li-ion) batteries.

Li-ion batteries have a smaller footprint, simpler maintenance, and a longer lifespan compared with lead-acid batteries. However, they present a greater fire risk. A Maxnod data center in France suffered a devastating fire on March 28, 2023, and "we believe it's caused by lithium-ion battery fire," Lawrence said. A lithium-ion battery fire is also the reported cause of a major fire on Oct. 15, 2022, at a South Korean colocation facility owned by SK Group and operated by its C&C subsidiary.

"We find, every time we do these surveys, fire doesn't go away," Lawrence said.