Root cause analysis lets you fix the problem and get back to your coffee quicker

Mar 07, 2006 | 3 mins
Data Center, Network Management Software

* Root cause analysis is better than an alert storm

When a storage element becomes inaccessible in the course of the business day, the order of events typically involves getting an alert, identifying the problem, fixing the problem, getting the system back up, getting the services going again, and going back to drinking coffee.

What? You say in your world things are not so simple? OK then, let’s try again, this time with just a bit more granularity.

It all begins with getting an alert, which may be an e-mail or beeper message (if you are lucky), or a phone call from an annoyed senior manager (if you are not). Likely as not, however, even if you are one of the lucky ones and get an alert from the system, your luck turns out to be not so good after all. Why? Because instead of receiving a single alert telling you what the problem is, you get a flurry of alerts, each telling you about an individual symptom of the problem. Some sites refer to this as an “alert storm,” some as a “blizzard of trouble tickets”; several sites have names that are a touch more colorful.

Steps two and three, find the problem and – assuming it’s something within your area of assumed competence – fix it, often take the most time. Step two is increasingly challenging if your system just identifies symptoms rather than problems, telling you, for example, that the system “cannot write to disk XYZ; an I/O problem exists.” Such messages have little value beyond the fact that they are often punctuated correctly. Clearly their lack of actionable information tends to delay step three.

At this point color the user community an increasingly annoyed shade of something approaching magenta, unless of course the people whose work was interrupted are in sales: in this case, color them ultraviolet.

Eventually (the time involved here is pretty hard to define, but let’s assume it is somewhere between a half hour and a half day) you get the problem fixed, get the systems back up, get all the dependent processes back in operation, let the various lines of business know that they can get back to work, and at last get back to your now-congealing coffee.

How much of your company’s business was interrupted? How much money did your competitors make while your company’s systems were inaccessible? Who cares! You’re in IT, and that other stuff is some other department’s problem!

It’s been another day in paradise.

I write about root cause analysis frequently in this newsletter because of that technology’s potential to provide significant insight into why systems fail and the way in which they do so. Properly designed, such software can send you right to the problem so you can fix it as rapidly as possible. Although no packages do a complete job of it, at least two that I know of provide focused knowledge: CentrePath’s Magellan and EMC’s Smarts take quite different approaches, but both get a lot of it right, focusing on identifying problems rather than just listing symptoms.
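For readers who like to see the idea in miniature, here is a rough sketch of turning an alert storm into a single probable cause. It is not how Magellan or Smarts works; it simply assumes you have a map of which element depends on which, and it treats any alerting element that sits downstream of another alerting element as a symptom. The element names, the `DEPENDS_ON` map, and both functions are made up for illustration.

```python
# Hypothetical topology: each element maps to the elements it depends on.
DEPENDS_ON = {
    "order-db": ["volume-7"],
    "mail-store": ["volume-7"],
    "volume-7": ["array-A"],
    "array-A": ["fc-switch-2"],
    "fc-switch-2": [],
}


def upstream(element, depends_on):
    """Return every element that `element` depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(element, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(depends_on.get(current, []))
    return seen


def probable_root_causes(alerting, depends_on):
    """Keep only alerting elements that have no alerting element upstream."""
    alerting = set(alerting)
    return sorted(
        element
        for element in alerting
        if not (upstream(element, depends_on) & alerting)
    )


# Four alerts arrive, but three of them are symptoms of the fourth.
storm = ["order-db", "mail-store", "volume-7", "array-A"]
print(probable_root_causes(storm, DEPENDS_ON))  # prints ['array-A']
```

A real product has to cope with incomplete topology, timing, and elements that fail silently, so take this as the shape of the idea rather than a recipe.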

Even these two don’t tell you all that you really need to know, though. Wouldn’t it be useful to understand not just which storage or networking elements have caused an outage, but also what other elements in the system are going to be affected as well? Even better, how about a real-time system that lets you know ahead of time what business processes are going to be affected if you pull a board?
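The "what happens if I pull this board?" question can be sketched the same way, by walking the dependency map in the other direction. Again, this is only an illustration under generous assumptions: the topology, the `BUSINESS_PROCESSES` set, and the `affected_by` function are all hypothetical, and a real system would have to keep such a map current in real time.

```python
from collections import deque

# Hypothetical topology: each item maps to the items it depends on.
DEPENDS_ON = {
    "billing": ["order-db"],
    "email": ["mail-store"],
    "order-db": ["volume-7"],
    "mail-store": ["volume-8"],
    "volume-7": ["board-3"],
    "volume-8": ["board-4"],
}
# Hypothetical list of which items are business processes.
BUSINESS_PROCESSES = {"billing", "email"}


def affected_by(element, depends_on):
    """Return everything that depends on `element`, directly or indirectly."""
    # Invert the map so we can walk from an element to its dependents.
    dependents = {}
    for item, requirements in depends_on.items():
        for requirement in requirements:
            dependents.setdefault(requirement, set()).add(item)

    affected = set()
    queue = deque([element])
    while queue:
        current = queue.popleft()
        for item in dependents.get(current, ()):
            if item not in affected:
                affected.add(item)
                queue.append(item)
    return affected


# Which business processes go down if board-3 is pulled?
impacted = affected_by("board-3", DEPENDS_ON)
print(sorted(impacted & BUSINESS_PROCESSES))  # prints ['billing']
```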

Next time we’ll speculate on where this might go.