* Root cause analysis is better than an alert storm

When a storage element becomes inaccessible in the course of the business day, the order of events typically involves getting an alert, identifying the problem, fixing the problem, getting the system back up, getting the services going again, and going back to drinking coffee.

What? You say in your world things are not so simple? OK then, let’s try again, this time with just a bit more granularity.

It all begins with getting an alert, which may be an e-mail or beeper message (if you are lucky), or a phone call from an annoyed senior manager (if you are not). Likely as not, however, even if you are one of the lucky ones and get an alert from the system, it turns out that your luck is not so good after all. Why? Because instead of receiving a single alert telling you what the problem is, you get a flurry of alerts, each telling you about an individual symptom of the problem. Some sites refer to this as an “alert storm,” some as a “blizzard of trouble tickets”; several sites have names that are a touch more colorful.

Steps two and three, find the problem and, assuming it’s something within your area of assumed competence, fix it, often take the most time. Step two is all the more challenging if your system just identifies symptoms rather than problems, telling you, for example, that the system “cannot write to disk XYZ; an I/O problem exists.” Such messages have little value beyond the fact that they are often punctuated correctly. Clearly, their lack of actionable information tends to delay step three.
At this point color the user community an increasingly annoyed shade of something approaching magenta, unless of course the people whose work was interrupted are in sales: in this case, color them ultraviolet.

Eventually (the time involved here is pretty hard to define, but let’s assume it is somewhere between a half-hour and a half-day) you get the problem fixed, get the systems back up, get all the dependent processes back in operation, let the various lines of business know that they can get back to work, and at last get back to your now-congealing coffee. How much of your company’s business was interrupted? How much money did your competitors make while your company’s systems were inaccessible? Who cares! You’re in IT, and that other stuff is some other department’s problem!

It’s been another day in paradise.

I write about root cause analysis frequently in this newsletter because of that technology’s potential to provide significant insight into why systems fail and the way in which they do so. Properly designed, such software can send you right to the problem so you can fix it as rapidly as possible. Although no packages do a complete job of it, at least two that I know of provide focused knowledge: CentrePath’s Magellan and EMC’s Smarts take quite different approaches, but both get a lot of it right, focusing on identifying problems rather than just listing symptoms.

Even these two don’t tell you all that you really need to know, though. Wouldn’t it be useful to understand not just which storage or networking elements have caused an outage, but also what other elements in the system are going to be affected as well? Even better, how about a real-time system that lets you know ahead of time what business processes are going to be affected if you pull a board?

Next time we’ll speculate on where this might go.