If you've been following general IT industry news for the last week or so, you've noticed some major stories hit the wire. From the VMware ESX/ESXi bug that brought some virtualization infrastructures to a halt to the multiple Google Apps / Gmail outages, it has been quite a busy stretch.
On a lighter note, and since it is Friday afternoon, let's reflect on what these outages and problems mean to IT managers, corporate leaders, and the end users.
No matter how much we invest in redundancy, quality control procedures, and error-correcting systems, there is still room for failure. This industry is simply not a perfect science. The number of possible permutations and outcomes of a sequence of events seems infinite. My basic point? Problems have occurred, are occurring, and will continue to occur.
There has been great gnashing of teeth and frustration over the recent VMware ESX/ESXi software bug. Similar responses have been noted in regard to the Google Apps / Gmail outages. Honestly, I was frustrated too. I personally had to delay a major physical-to-virtual migration project because of the VMware bug. Similarly, I know people and organizations impacted by the Google outages. However, while such failures are generally unacceptable, we must still realize that issues of this magnitude do occur. Even organizations that clearly demonstrate a quality control and release management process involving multiple levels of checks and balances can be significantly impacted by that "single line of pesky code."
As long as humans are at the controls - developing, maintaining, and deploying systems - we will continue to be prone to problems from time to time. Both VMware and Google quickly addressed their issues, apologized to their customer base, and successfully regained control of their systems or code. People and organizations do make mistakes. Remember the last time that "critical system" went down under your watch? Remember the rash of angry callers and emailers? I certainly do.
Accepting the occasional failure is one thing, but multiple, recurring, or "carefree" problems or errors are simply unacceptable. For example, a repeating occurrence of similar VMware bugs is not acceptable, nor is a rash of frequent Google Apps outages. In IT, it's my perception that successful companies learn from their mistakes and work hard to regain customer trust. Period.
This industry is not a perfect science. As a co-worker put it, there's a lot of potential for failure between the hard drive platters spinning for years at 10k rpm and the bits at the end of the 1000-mile link. It's amazing that the whole process works as well as it does.
Technology problems have been around for donkey's years!
I recently blogged about the exact same topic! Here is the extract:
The Incident Pyramid originated in 1931 when H.W. Heinrich described it in his book, Industrial Accident Prevention: A Scientific Approach. The Incident Pyramid proposes that for every 300 unsafe acts there are 29 minor injuries and one major injury. The Incident Pyramid is corroborating evidence for Murphy's Law, which was published 21 years later.
Problem free?
Correct - nothing (in this world?) can be "problem free". I just got a bottle of expensive wine which had gone bad and the IT quality control is nothing compared to what the winemakers use!
Now - I have seen some alarming issues in IT QA. Too often products and systems are tested only for correct behavior under normal use, but not against misuse, mistakes, corrupted data, etc. And keep testing - even if the production system seems to be working great, keep testing!
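To make the point about misuse and corrupted data concrete, here is a minimal Python sketch. The parse_port helper and the list of bad inputs are hypothetical, made up purely for illustration; they are not taken from any product mentioned here. The idea is only that the test covers the inputs nobody expects, not just the one that works.

```python
def parse_port(text):
    """Hypothetical helper: parse a TCP port number from user-supplied text."""
    if not isinstance(text, str):
        raise ValueError("port must be given as text")
    cleaned = text.strip()
    # Accept plain ASCII digits only; anything else is treated as bad input.
    if not (cleaned.isascii() and cleaned.isdigit()):
        raise ValueError(f"not a number: {text!r}")
    port = int(cleaned)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def rejects(bad_input):
    """Return True if parse_port refuses the input with a ValueError."""
    try:
        parse_port(bad_input)
    except ValueError:
        return True
    return False


# The "correct working" case that most test plans already cover.
assert parse_port(" 8080 ") == 8080

# The cases that tend to get skipped: misuse, mistakes, corrupted data.
bad_inputs = ["", "abc", "-1", "0", "65536", "80\x00", None, "8 0"]
assert all(rejects(bad) for bad in bad_inputs)
```

The second assertion is the one that matters: if someone later loosens the validation, the list of bad inputs catches it, even though the happy-path test still passes.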
The financial business (often) does continuous testing because it has to, but the technical side doesn't. Is it ignorance, arrogance, lack of experience, or just successful marketing? I don't know. But the reality is, as the article says, nothing can be 100% correct, secure, working all the time, etc. It's a risk management issue, but it has to be acknowledged.
And - really, I would recommend that VMware and other companies get real CM/SC systems (and expertise) so that these kinds of bugs are mostly eliminated. Google and Apple are a little different - they just didn't do the capacity planning right - you know, of course, capacity is "a little more" than performance. But for the user, these problems often seem the same - the system is not working.