The ‘chaos factor’ of unplanned downtime

Nov 25, 2003
3 mins
Data Center

* The problems of unplanned downtime

If we drew a pie chart representing the total downtime at almost any IT shop, the slice for planned downtime – when maintenance, upgrades and so forth take place – would invariably be the largest. But if we were to build a companion chart showing the stress that accompanies downtime, unplanned downtime would surely take up the most space.

Unplanned downtime may be the result of anything from a true disaster to a simple operator error. Although this category represents only about a tenth of total downtime, its unexpected and intrusive nature means its results can be catastrophic. Whether it’s a complete shutdown of a company’s critical systems or a temporary degradation of application performance, a system failure can cost an organization much more than you might imagine. Such downtime can occur for any number of reasons, usually some disruption in distinct but interrelated components such as networks, storage and applications. How will the IT support team respond?

Consider the panic that sets in at any business if it cannot take orders or support its sales team in the field, if its e-business Web site goes down, or if it loses engineering data on a crucial development system. Calls go out, the necessary team members are brought in, and resources are quickly deployed to diagnose the cause of the disruption. Whenever the necessary human, software or hardware resources are not available, however, time is wasted before the appropriate senior personnel can respond to the problem and provide the necessary skill levels (or authority levels) to make and implement correct decisions. Consider this the “chaos factor.”

The potential for an elevated chaos factor grows as IT environments get more complex. With the increase in “super applications” – business processes composed of multiple apps running on different hardware and operating systems, and often distributed across the network – the problem is becoming ever more acute. The symptom of a problem – a system crash or process degradation – is usually fairly easy to identify, but identifying a symptom in this case may do nothing to pinpoint the underlying cause.

Why? Because in such cases fault isolation can be extremely complex. When a problem may lie anywhere within the supporting infrastructure, monitoring and managing single applications within the overall process clearly does not provide sufficient protection. Any tool that looks only at discrete parts of a system will inevitably miss the problems that can result when those parts interact.
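To make the point concrete, here is a minimal Python sketch of fault isolation across interrelated components. Everything in it is an assumption made for illustration: the component names, the `DEPENDS_ON` map and the `likely_root_causes` helper are hypothetical and do not describe any particular monitoring product. The idea is simply that a tool which knows how components depend on one another can trace a symptom in one application down to a shared component beneath it.

```python
from typing import Dict, List, Set

# Hypothetical dependency map: each component lists what it relies on directly.
DEPENDS_ON: Dict[str, List[str]] = {
    "order_entry": ["app_server", "database"],
    "sales_portal": ["web_server", "database"],
    "app_server": ["network"],
    "web_server": ["network"],
    "database": ["storage", "network"],
    "storage": [],
    "network": [],
}


def transitive_deps(component: str) -> Set[str]:
    """Return the component plus everything it depends on, directly or indirectly."""
    seen: Set[str] = set()
    stack = [component]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(DEPENDS_ON.get(node, []))
    return seen


def likely_root_causes(symptom: str, unhealthy: Set[str]) -> Set[str]:
    """Return unhealthy components under `symptom` whose own dependencies are all healthy."""
    candidates = transitive_deps(symptom) & unhealthy
    return {
        c for c in candidates
        if not any(dep in unhealthy for dep in DEPENDS_ON.get(c, []))
    }


# Two applications and the database all report trouble, but the trail ends at storage.
unhealthy_now = {"order_entry", "sales_portal", "database", "storage"}
print(likely_root_causes("order_entry", unhealthy_now))
print(likely_root_causes("sales_portal", unhealthy_now))
```

A monitor that watched only `order_entry` or only `sales_portal` would report two separate application problems; the dependency view suggests one shared cause. Real environments are far messier than this toy map, which is exactly why the fault isolation described above is so hard.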

More on this next time.

EDITOR’S NOTE: Due to the U.S. Thanksgiving holiday, we will be sending just one newsletter this week. Regular service will resume next week. We wish you and your family a happy Thanksgiving.