Network World
Saturday, November 22, 2008

Considering Convergence

"IT" Far From Problem-Free

If you've been following general IT industry news for the last week or so, you've noticed some major stories hitting the wire. From the VMware ESX/ESXi bug that brought some virtualization infrastructures to a halt to the multiple Google Apps / Gmail outages, it has been quite a busy stretch.

On a lighter note, and since it is Friday afternoon, let's reflect on what these outages and problems mean to IT managers, corporate leaders, and the end users.

No matter how much we invest in redundancy, quality control procedures, and error-correcting systems, there is still room for failure. This industry is simply not a perfect science. The number of possible permutations and outcomes of a sequence of events seems infinite. My basic point? Problems have occurred, are occurring, and will continue to occur.

There has been great gnashing of teeth and frustration over the recent VMware ESX/ESXi software bug, and similar responses to the Google Apps / Gmail outages. Honestly, I was frustrated too. I personally had to delay a major physical-to-virtual migration project because of the VMware bug, and I know people and organizations affected by the Google outages. However, while such failures are generally unacceptable, we still must realize that issues of this magnitude do occur. Even organizations that clearly demonstrate a quality control and release management process involving multiple levels of checks and balances can be significantly impacted by that "single line of pesky code."

As long as humans are at the controls - developing, maintaining, and deploying systems - we will continue to be prone to problems from time to time. Both VMware and Google quickly addressed their issues, apologized to their customers, and successfully regained control of their systems or code. People and organizations do make mistakes. Remember the last time that "critical system" went down under your watch? Remember the rash of angry callers and emailers? I certainly do.

Periodic failure is one thing to accept, but multiple, recurring, or "carefree" problems and errors are simply unacceptable. For example, a repeat of similar VMware bugs is not acceptable, nor is a rash of frequent Google Apps outages. In IT, it's my perception that successful companies learn from their mistakes and work hard to regain customer trust. Period.

This industry is not a perfect science. As a co-worker put it, there's a lot of potential for failure between the hard drive platters spinning for years at 10k rpm and the bits at the end of the 1000-mile link. It's amazing that the whole process works as well as it does.

Technology problems have been around for donkey's years!


I recently blogged about the exact same topic! Here is the extract:

Murphy’s Law states: "if anything can go wrong, it will." The first reported use of the term Murphy’s Law was in 1952, in a book by Anne Roe quoting an unnamed physicist. The observation inherent in Murphy's Law, for which so many IT professionals have an affinity, has great relevance to the field of problem management. There is a close correlation between Murphy’s Law and Heinrich’s Incident Pyramid (described below). In complex technological systems such as those found in IT, it is inevitable that incidents will happen. Both "Murphy" and Heinrich point to the inevitability of an incident; one is an adage and the other is research, but both reach a similar conclusion. The means to combat "go wrong" lies in IT Safety. The purpose of IT Safety is to reduce the rate at which shit happens ("go wrong"). It is possible to reduce that rate, say from once a day to once a week, by using safer processes that lengthen the time between near misses. This improves safety in IT.
The Incident Pyramid originated in 1931, when H.W. Heinrich described it in his book, Industrial Accident Prevention: A Scientific Approach. The Incident Pyramid proposes that for every 300 unsafe acts there are 29 minor injuries and one major injury. The Incident Pyramid is corroborating evidence for Murphy's Law, which was published 21 years later.


Besides the Incident Pyramid, the book also illustrates Heinrich's theory of incident causation. Unsafe acts lead to minor injuries and, over time, to major injury. All incidents occur as a result of many factors or multiple causes. Root Cause Analysis based on this theory is used in incident investigations: the obvious physical circumstance of the incident is investigated to determine its cause, then what led to that cause, and so forth, until no further factors can be identified. To avoid highlighting functional inadequacies, many organizations simply identify the cause of most incidents as human error or failure to follow safety rules. This dishonesty is often labelled scapegoating. The habit of blaming major incidents on humans damages IT Safety.
In 1969, the Insurance Company of North America conducted a subsequent study using more than 1.7 million incidents reported by nearly 300 companies in 21 industrial groups. That study revealed a similar pattern to Heinrich’s but with slight deviations in the ratios. For each serious injury, there were 10 minor injuries, 30 property-damage incidents and 600 near-miss incidents that resulted in no injury or property damage.
The incident pyramid from Dresser-Rand.
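
To make the arithmetic behind the two pyramids concrete, here is a minimal Python sketch. The ratios are the ones quoted above; the function name `scale_pyramid` and the sample counts are hypothetical, and treating the ratios as a rough rule of thumb for IT incidents is an assumption by analogy, not something either study measured.

```python
# Ratios as quoted above, expressed per one major/serious injury.
HEINRICH_1931 = {"major injury": 1, "minor injury": 29, "unsafe act": 300}
ICNA_1969 = {
    "serious injury": 1,
    "minor injury": 10,
    "property damage": 30,
    "near miss": 600,
}


def scale_pyramid(ratios, observed_category, observed_count):
    """Scale every tier so that observed_category equals observed_count.

    Hypothetical helper: it only applies the quoted ratios proportionally
    and says nothing about whether they hold for a given organization.
    """
    factor = observed_count / ratios[observed_category]
    return {category: count * factor for category, count in ratios.items()}


# Illustrative input only: 1200 logged near misses.
estimate = scale_pyramid(ICNA_1969, "near miss", 1200)
for category, expected in estimate.items():
    print(f"{category}: {expected:.1f}")
```

Read this way, the base of the pyramid is the part you can actually measure and shrink: fewer near misses at the bottom implies, proportionally, fewer serious incidents at the top.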

Problem free?


Correct - nothing (in this world?) can be "problem free". I just got a bottle of expensive wine which had gone bad and the IT quality control is nothing compared to what the winemakers use!

Now - I have seen some alarming issues in IT QA. Too often products and systems are tested against the correct working set but not against misuse, mistakes, corrupted data, etc. And keep testing - even if the production system seems to be working great, keep testing!
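
As a rough illustration of testing against misuse and corrupted data rather than only the happy path, here is a small Python sketch. Everything in it is hypothetical: `parse_port` stands in for any input-handling routine, and the list of bad inputs is just a sample, not a complete test plan.

```python
def parse_port(text):
    """Hypothetical input handler: parse a TCP/UDP port number from text."""
    if not isinstance(text, str):
        raise TypeError("port must be given as a string")
    cleaned = text.strip()
    if not cleaned.isdigit():
        raise ValueError(f"not a port number: {text!r}")
    port = int(cleaned)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def rejects(value):
    """Return True if parse_port refuses the value with a clean error."""
    try:
        parse_port(value)
    except (TypeError, ValueError):
        return True
    return False


# The "correct working set": what most test plans stop at.
assert parse_port("443") == 443

# Misuse, mistakes, and corrupted data: every one of these must be rejected.
for bad in ["", "abc", "-1", "0", "65536", "44 3", None, "443\x00"]:
    assert rejects(bad), f"accepted bad input: {bad!r}"
```

The second half is the part that tends to get skipped, and it is cheap to keep running after the system goes into production.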

The financial business (often) does continuous testing; they have to. But the technical side doesn't - is it ignorance, arrogance, lack of experience or just successful marketing? I don't know. But the reality is, as the article says, nothing can be 100% correct, secure, working all the time, etc. - it's a risk management issue, but it has to be acknowledged.

And really, I would recommend that VMware and other companies get real CM/SC systems (and expertise), with which these kinds of bugs are mostly eliminated. Google and Apple are a little different - they just didn't do the capacity planning right - you know, of course, capacity is "a little more" than performance. But for the user, these problems often seem the same - the system is not working.

About Matthew Nickasch

Nickasch has been very involved in IT since he was just 13. His current and previous consulting experience includes systems architecture, virtualization, and converged networks for the financial, education, and healthcare industries. Matthew currently attends the University of Wisconsin-Platteville, where he also works as a network management assistant. While his interests include directory services and routing protocols, Nickasch's focus is on converged networks and voice over IP.

The opinions expressed in this Weblog are those of the writer and may not represent the opinions of Network World.
