Anatomy of a service outage: How did we get here?

The right analytics system can help IT prevent problems before they actually crop up


Although vendor-written, this contributed piece does not promote a product or service and has been edited and approved by Network World editors.

As euphemisms go, it's hard to beat the term “service outage” as used by IT departments. While it sounds benign -- something stopped working but tech teams will soon restore order -- anyone familiar with the reality knows the term really means “Huge hit to bottom line.”

A quick perusal of the tech news will confirm this. Delta Air Lines' global fleet was just grounded by a data center problem. A recent one-day service outage at cost the company $20 million. Hundreds of thousands of customers were inconvenienced in May when they couldn't reach due to a “glitch.” And a service outage at HSBC earlier this year prompted one of the Bank of England's top regulators to lament, “Every few months we have yet another IT failure at a major bank... We can’t carry on like this.”

Nearly half a century into the computer era, however, we do still carry on like this. According to research by IDC, infrastructure failure can cost large enterprises $100,000 per hour, while failure of critical applications could cost as much as a million dollars an hour. Regardless of the solutions that are thrown at the problem, service outages are as common – and as lethal – as ever.
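To put those hourly figures in perspective, here is a minimal sketch of the arithmetic. The two rates are the IDC numbers quoted above; the function name and the outage durations are assumptions chosen purely for illustration, not data from any real incident.

```python
# Hourly cost figures quoted from the IDC research cited above (USD).
INFRASTRUCTURE_COST_PER_HOUR = 100_000
CRITICAL_APP_COST_PER_HOUR = 1_000_000  # upper bound ("as much as")


def outage_cost(hours: float, cost_per_hour: int) -> float:
    """Return the estimated cost of an outage lasting `hours`."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return hours * cost_per_hour


if __name__ == "__main__":
    # Hypothetical durations, chosen only for illustration.
    for hours in (0.5, 4, 24):
        infra = outage_cost(hours, INFRASTRUCTURE_COST_PER_HOUR)
        app = outage_cost(hours, CRITICAL_APP_COST_PER_HOUR)
        print(f"{hours:>4} h: infrastructure ${infra:,.0f}, "
              f"critical app up to ${app:,.0f}")
```

Even a half-day infrastructure failure runs well into seven figures at these rates, which is why the euphemism rings hollow.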

So where do outages come from? An interesting study by a University of Chicago team lists the 13 leading causes of service outages at online services companies, but the lessons are just as valuable for IT departments. The researchers parsed over 1,000 web articles and papers that discussed the causes of 516 unplanned outages, hoping to determine what happened, why it happened, and how it was fixed.

Upgrades, for example, were responsible for 15% of service outages. One could presume that every upgrade “had been tested thoroughly in an offline environment.” Apparently not; otherwise upgrades would not be such a major factor in service outages. And even if an upgrade was tested on a server, “upgrades pushed to the full ecosystem can be fragile,” the study said, meaning that the new upgrade had not been tested thoroughly enough.

Misconfiguration is another important factor -- responsible for 10% of service outages. While IT workers are often responsible for misconfiguration, the study says, it’s not always their fault. Often new software or upgrades to existing applications make changes to configuration files, with the application satisfying its own “needs” while throwing things out of whack elsewhere. “A configuration change in one subsystem might need to be followed with changes in the other subsystems, otherwise the full ecosystem will have conflicting views of what is correct,” the study says.
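The “conflicting views” problem the study describes can be illustrated with a small sketch. It assumes each subsystem's settings have already been collected into a dictionary; the subsystem names, setting names, and the find_conflicts helper are all hypothetical and do not come from the study or from any particular product.

```python
from collections import defaultdict


def find_conflicts(
    subsystem_configs: dict[str, dict[str, str]],
) -> dict[str, dict[str, str]]:
    """Return settings whose values differ between subsystems.

    The result maps each conflicting key to {subsystem: value}.
    """
    seen = defaultdict(dict)
    for subsystem, config in subsystem_configs.items():
        for key, value in config.items():
            seen[key][subsystem] = value
    # A key is in conflict when subsystems disagree on its value.
    return {
        key: by_subsystem
        for key, by_subsystem in seen.items()
        if len(set(by_subsystem.values())) > 1
    }


# Hypothetical inventory: the app server was upgraded and changed its port,
# but the load balancer and monitoring still expect the old one.
CONFIGS = {
    "load_balancer": {"backend_port": "8443", "tls_min_version": "1.2"},
    "app_server": {"backend_port": "8080", "tls_min_version": "1.2"},
    "monitoring": {"backend_port": "8443"},
}

if __name__ == "__main__":
    for key, values in find_conflicts(CONFIGS).items():
        print(f"conflict on {key}: {values}")
```

A real environment has far more settings, and the same value often goes by different names in different subsystems, which is part of what makes manual checking impractical.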

Other causes of service outages include undue stress on an ecosystem due to traffic issues, power outages, security issues and, of course, human error. But perhaps the biggest issue, and the most common reason for service outages according to the University of Chicago study, is “unknown.” Of the 516 outages studied, the team could not determine the root cause of 294 (48%). Once an IT department gets into unknown territory, it is in big trouble. If you can’t figure out what the problem is, how can you fix it?

One way is to use automated big data analytics to identify potential outages. These systems evaluate network elements on an ongoing basis, analyzing the relationships between hardware, software, configuration files, network connections, and everything else that makes up an IT system. IT staff can't do this work by hand because there is simply too much information to keep track of.

These systems can do what humans can't: identify risky deviations from industry best practices and vendor recommendations while providing early warning capabilities to help administrators understand the impact of any change. So, when the time comes to install new software, for example, analytics systems can send out alerts about the implications of the installation, what services and functions will be affected, and what steps should be taken to reduce the risk of an outage.
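At its simplest, spotting a deviation from a recommendation means comparing each host's actual settings with a list of recommended values. The sketch below shows only that core comparison. The setting names, host names, and recommended values are invented for illustration and are not real vendor guidance, and a production analytics system would do far more than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deviation:
    host: str
    setting: str
    recommended: str
    actual: Optional[str]  # None means the setting is missing on the host


def find_deviations(
    recommended: dict[str, str],
    hosts: dict[str, dict[str, str]],
) -> list[Deviation]:
    """Compare every host's settings with the recommended values."""
    deviations = []
    for host, settings in sorted(hosts.items()):
        for setting, expected in sorted(recommended.items()):
            actual = settings.get(setting)
            if actual != expected:
                deviations.append(Deviation(host, setting, expected, actual))
    return deviations


# Hypothetical recommendations and inventory, for illustration only.
RECOMMENDED = {"multipath_policy": "round_robin", "ntp_enabled": "true"}
HOSTS = {
    "host-a": {"multipath_policy": "round_robin", "ntp_enabled": "true"},
    "host-b": {"multipath_policy": "fixed"},
}

if __name__ == "__main__":
    for d in find_deviations(RECOMMENDED, HOSTS):
        print(f"{d.host}: {d.setting} is {d.actual!r}, "
              f"recommended {d.recommended!r}")
```

Running a check like this before and after a planned change gives administrators a list of what the change would put out of line, which is the kind of early warning described above.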

Organizations upgrading from vSphere 5.5 to 6.x, for example, are on their own when trying to fine-tune their systems. There are many issues to consider – and it's almost impossible for IT workers to ensure that all the bases have been properly covered. A single missed step could significantly hamper operations and even cause yet another dreaded outage. With proper operations analytics in place, users can complete the job much more quickly and reliably, leveraging the power of automated configuration validation.

Big data analysis of this sort is different from (and complementary to) log analysis and other approaches that evaluate historical data leading to outages. While it’s not quite prophecy, the right analytics system can help IT teams prevent problems before they crop up. Given the complicated environment IT teams operate in nowadays, any help – divine or otherwise – is likely to be welcome.

Gil Hecht is the CEO of Continuity Software, a provider of IT Operations Analytics for infrastructure outage prevention.

Copyright © 2016 IDG Communications, Inc.
