
Network World

Matthew Nickasch

"IT" Far From Problem-Free

By Matthew Nickasch on Fri, 08/15/08 - 2:32pm.

If you've been following general IT industry news for the last week or so, you've noticed some major stories hit the wire. From the VMware ESX/ESXi bug that brought some virtualization infrastructures to a halt to the multiple Google Apps / Gmail outages, it has been quite a busy stretch.

On a lighter note, and since it is Friday afternoon, let's reflect on what these outages and problems mean to IT managers, corporate leaders, and the end users.

No matter how much we invest in redundancy, quality control procedures, and error-correcting systems, there is still room for failure. This industry is simply not a perfect science. The number of possible permutations and outcomes of a sequence of events seems infinite. My basic point? Problems have occurred, are occurring, and will continue to occur.

There has been great gnashing of teeth and frustration over the recent VMware ESX/ESXi software bug. Similar responses have been noted in regard to the Google Apps / Gmail outages. Honestly, I was frustrated too. I personally had to delay a major physical-to-virtual migration project because of the VMware bug. Similarly, I know people and organizations impacted by the Google outages. However, while such failures are generally unacceptable, we must still realize that issues of this magnitude do occur. Even organizations that clearly demonstrate a quality control and release management process involving multiple levels of checks and balances can be significantly impacted by that "single line of pesky code."

As long as humans are at the controls - developing, maintaining, and deploying systems - we will continue to be prone to problems from time to time. Both VMware and Google quickly addressed their issues, apologized to their customer bases, and successfully regained control of their systems or code. People and organizations do make mistakes. Remember the last time that "critical system" went down under your watch? Remember the rash of angry callers and emailers? I certainly do.

Accepting periodic failure is one thing, but multiple, recurring, or "carefree" problems or errors are simply unacceptable. For example, a repeating occurrence of similar VMware bugs is not acceptable, nor is a rash of frequent Google Apps outages. In IT, it's my perception that successful companies learn from their mistakes and work hard to regain customer trust. Period.

This industry is not a perfect science. As a co-worker put it, there's a lot of potential for failure between the hard drive platters spinning for years at 10k rpm and the bits at the end of the 1000-mile link. It's amazing that the whole process works as well as it does.

Technology problems have been around for donkey's years!


I recently blogged about the exact same topic! Here is the extract:

Murphy’s Law states: "if anything can go wrong, it will." The first reported use of the term Murphy’s Law is in 1952 in a book by Anne Roe, quoting an unnamed physicist. The observation inherent in Murphy's Law, with which so many IT professionals have an affinity, has great relevance to the field of problem management. There is a close correlation between Murphy’s Law and Heinrich’s Incident Pyramid (described below). In complex technological systems such as those found in IT, it is inevitable that incidents will happen. Both "Murphy" and Heinrich point to the inevitability of an incident; one is an adage and the other a piece of research, but both reach a similar conclusion. The means to combat "go wrong" lies in IT Safety. The purpose of IT Safety is to reduce the rate at which shit happens ("go wrong"). It is possible to reduce shit happening, from once a day to once a week, by using safer processes that lengthen the time between near misses. This improves safety in IT.
The Incident Pyramid originated in 1931 when H.W. Heinrich described it in his book, Industrial Accident Prevention: A Scientific Approach. The Incident Pyramid proposes that for every 300 unsafe acts there are 29 minor injuries and one major injury. The Incident Pyramid is corroborating evidence for Murphy's Law, which was published 21 years later.


Besides the Incident Pyramid, the book also illustrates Heinrich's theory of incident causation. Unsafe acts lead to minor injuries and, over time, to major injury. All incidents occur as a result of many factors or multiple causes. Root Cause Analysis based on this theory is used in incident investigations: the obvious physical circumstance of the incident is investigated to determine its cause, then what led to that cause, and so forth, until no further factors can be identified. To avoid highlighting functional inadequacies, many organizations simply identify the cause of most incidents as human error or failure to follow safety rules. This dishonesty is often labelled scapegoating. The habit of blaming major incidents on humans damages IT Safety.
In 1969, the Insurance Company of North America conducted a subsequent study using more than 1.7 million incidents reported by nearly 300 companies in 21 industrial groups. That study revealed a similar pattern to Heinrich’s but with slight deviations in the ratios. For each serious injury, there were 10 minor injuries, 30 property-damage incidents and 600 near-miss incidents that resulted in no injury or property damage.
The incident pyramid from Dresser-Rand.
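The two sets of ratios quoted above can be compared with a little arithmetic. The following is only a sketch: the dictionary names, the shortened tier labels, and the two helper functions are made up for illustration, and the numbers are simply the ratios as stated in this comment.

```python
from fractions import Fraction

# Ratios exactly as quoted in the comment; tier labels are shortened for illustration.
HEINRICH_1931 = {"major injury": 1, "minor injury": 29, "unsafe act": 300}
INA_1969 = {
    "serious injury": 1,
    "minor injury": 10,
    "property damage": 30,
    "near miss": 600,
}


def tier_shares(pyramid):
    """Return each tier's share of all events in the pyramid as an exact fraction."""
    total = sum(pyramid.values())
    return {tier: Fraction(count, total) for tier, count in pyramid.items()}


def events_per_top_event(pyramid):
    """Return how many events of any kind occur per event in the most severe tier."""
    top = min(pyramid.values())
    return Fraction(sum(pyramid.values()), top)


if __name__ == "__main__":
    for name, pyramid in (("Heinrich 1931", HEINRICH_1931), ("INA 1969", INA_1969)):
        print(name)
        for tier, share in tier_shares(pyramid).items():
            print(f"  {tier}: {float(share):.2%}")
        print(f"  events per top-tier event: {events_per_top_event(pyramid)}")
```

Read this way, both pyramids make the same point: the serious events are a tiny fraction of everything that goes wrong, so the near misses are where most of the warning signs sit.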

Problem free?


Correct - nothing (in this world?) can be "problem free". I just got a bottle of expensive wine that had gone bad, and IT quality control is nothing compared to what the winemakers use!

Now - I have seen some alarming issues in IT QA. Too often the products / systems are tested only for correct behavior under normal use, but not against misuse, mistakes, corrupted data, etc. And keep testing - even if the production system seems to be working great, keep testing!
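To make "test against misuse and corrupted data" concrete, here is a minimal sketch. The `parse_port` function and the `rejects` helper are hypothetical, invented only for this example; the point is that the list of bad inputs is longer than the list of good ones.

```python
def parse_port(text):
    """Hypothetical helper: parse a TCP port number from user-supplied text."""
    if not isinstance(text, str):
        raise TypeError("port must be given as text")
    cleaned = text.strip()
    # Reject empty input, signs, separators, and non-ASCII digit look-alikes.
    if not cleaned.isascii() or not cleaned.isdigit():
        raise ValueError(f"not a port number: {text!r}")
    port = int(cleaned)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def rejects(value):
    """Return True if parse_port refuses the value with a clean error."""
    try:
        parse_port(value)
    except (TypeError, ValueError):
        return True
    return False


# The "happy path" checks most test plans already contain.
assert parse_port("443") == 443
assert parse_port(" 8080 ") == 8080

# The misuse, mistake, and corrupted-data cases that often get skipped.
BAD_INPUTS = ["", "  ", "-1", "0", "65536", "1e3", "80; rm -rf /", "4\uff143", None, 443]
for bad in BAD_INPUTS:
    assert rejects(bad), bad
```

Running the same checks on a schedule against production-like data, not just once before release, is one way to act on the "keep testing" advice.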

The financial business (often) does continuous testing; they have to. The technical side doesn't - is it ignorance, arrogance, lack of experience, or just successful marketing? I don't know. But the reality is, as the article says, that nothing can be 100% correct, secure, working all the time, etc. It's a risk management issue, but it has to be acknowledged.

And - really, I would recommend that VMware and other companies get real CM/SC systems (and expertise) so that these kinds of bugs are mostly eliminated. Google and Apple are a little different - they just didn't do the capacity planning right - you know, of course, capacity is "a little more" than performance. But to the user, these problems often look the same: the system is not working.

About Considering Convergence
Matthew Nickasch is an independent consultant and analyst in the IP communication and convergence fields. His current and previous consulting experience includes systems architecture, virtualization, telecommunications, and converged networks for the financial, education, and healthcare industries. In addition to his consulting responsibilities, he has been active in the research realm, recently publishing and presenting on topics including routing protocol security and ERP and transactional database auditing. While his interests include directory services and corporate compliance, Nickasch's focus is on converged networks and IP communications.