Network World

Matthew Nickasch

"IT" Far From Problem-Free

By Matthew Nickasch on Fri, 08/15/08 - 2:32pm.

If you've been following general IT industry news for the last week or so, you've noticed some major stories hitting the wire. From the VMware ESX/ESXi bug that brought some virtualization infrastructures to a halt to the multiple Google Apps / Gmail outages, it has been quite a busy stretch.

On a lighter note, and since it is Friday afternoon, let's reflect on what these outages and problems mean to IT managers, corporate leaders, and the end users.

No matter how much we invest in redundancy, quality control procedures, and error-correcting systems, there is still room for failure. This industry is simply not a perfect science. The number of possible permutations and outcomes of a sequence of events seems infinite. My basic point? Problems have occurred, are occurring, and will continue to occur.

There has been great gnashing of teeth and frustration over the recent VMware ESX/ESXi software bug. Similar responses have been noted in regard to the Google Apps / Gmail outages. Honestly, I was frustrated too. I personally had to delay a major physical-to-virtual migration project because of the VMware bug. Similarly, I know people and organizations impacted by the Google outages. However, while such failures are generally unacceptable, we still must realize that issues of this magnitude do occur. Even organizations that clearly demonstrate a quality control and release management process involving multiple levels of checks and balances can be significantly impacted by that "single line of pesky code."

As long as humans are at the controls - developing, maintaining, and deploying systems - we will continue to be prone to problems from time to time. Both VMware and Google quickly addressed their issues, apologized to their customer bases, and successfully regained control of their systems or code. People and organizations do make mistakes. Remember the last time that "critical system" went down under your watch? Remember the rash of angry callers and emailers? I certainly do.

Accepting periodic failure is one thing, but multiple, recurring, or "carefree" problems or errors are simply unacceptable. For example, a repeating occurrence of similar VMware bugs is not acceptable, nor is a rash of frequent Google Apps outages. In IT, it's my perception that successful companies learn from their mistakes and work hard to regain customer trust. Period.

This industry is not a perfect science. As a co-worker put it, there's a lot of potential for failure between the hard drive platters spinning for years at 10k rpm and the bits at the end of the 1000-mile link. It's amazing that the whole process works as well as it does.

About Considering Convergence
Matthew Nickasch is an independent consultant and analyst in the IP communication and convergence fields. His current and previous consulting experience includes systems architecture, virtualization, telecommunications, and converged networks for the financial, education, and healthcare industries. In addition to his consulting responsibilities, he has been active in the research realm, recently publishing and presenting on topics including routing protocol security and ERP and transactional database auditing. While his interests include directory services and corporate compliance, Nickasch's focus is on converged networks and IP communications.
 
