If you've been following the general IT industry news for the last week or so, you've noticed some major stories that have hit the wire. From the VMware ESX/ESXi bug that brought some virtualization infrastructures to a halt, to the multiple Google Apps / Gmail outages, it has been quite busy recently.
On a lighter note, and since it is Friday afternoon, let's reflect on what these outages and problems mean to IT managers, corporate leaders, and the end users.
No matter how much we invest in redundancy, quality control procedures, and error-correcting systems, there is still room for failure. This industry is simply not a perfect science. The number of possible permutations and outcomes of a sequence of events seems infinite. My basic point? Problems have occurred, are occurring, and will continue to occur.
There has been great gnashing of teeth and frustration over the recent VMware ESX/ESXi software bug. Similar responses have been noted in regard to the Google Apps / Gmail outages. Honestly, I was frustrated too. I personally had to delay a major physical-to-virtual migration project because of the VMware bug. Similarly, I know people and organizations impacted by the Google outages. However, while such failures are generally unacceptable, we must still realize that issues of this magnitude do occur. Even organizations that clearly demonstrate a quality control and release management process involving multiple levels of checks and balances can be significantly impacted by that "single line of pesky code."
As long as humans are at the controls - developing, maintaining, and deploying systems - we will continue to be prone to problems from time to time. Both VMware and Google quickly addressed their issues, apologized to their customer bases, and successfully regained control of their systems or code. People and organizations do make mistakes. Remember the last time that "critical system" went down under your watch? Remember the rash of angry callers and emailers? I certainly do.
Accepting periodic failure is one thing, but multiple, recurring, or "carefree" problems or errors are simply unacceptable. For example, a repeating occurrence of similar VMware bugs is not acceptable, nor is a rash of frequent Google Apps outages. In IT, it's my perception that successful companies learn from their mistakes and work hard to regain customer trust. Period.
This industry is not a perfect science. As a co-worker put it, there's a lot of potential for failure between the hard drive platters spinning for years at 10k rpm and the bits at the end of the 1000-mile link. It's amazing that the whole process works as well as it does.