If you've read my bio, you will have noticed that I wrote a book (actually I was the lead author) - MOM 2005 Unleashed. We're in the final stages of writing the follow-up to that, System Center Operations Manager 2007 Unleashed. I mention this because it would probably be worthwhile to post some articles related to operations management - what that means, what it's all about; and potentially to get specific about the Microsoft product (without stealing too much thunder from the book, of course!).
I'd like to talk a bit here about why unscheduled downtime is a bad thing. Obviously it's not a good thing if you can't get to your email or run your business applications when you were expecting to, but it is always interesting to try to quantify why it's a bad thing - i.e., to put some hard numbers to it.
We can start with a simplified example of the impact of temporarily disrupting an e-commerce site normally available 7x24. The site generates an average of $4,000 per hour in revenue from customer orders for an annual value in sales revenue of $35,040,000 US. If the website were unavailable for six hours due to a security vulnerability, the directly attributable losses for the outage would be $24,000 US.
This number is only an average cost; most e-commerce sites generate revenue at a wide range of rates based on time of day, day of week, time of year, marketing campaigns, and so on. Typically the outage occurs during peak times when the system is already stressed, which greatly increases the cost of a 6-hour loss.
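The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names and the `peak_multiplier` parameter are my own assumptions, and the multiplier of 3 in the last line is a made-up figure to show how a peak-time outage would scale the loss, not a measured value.

```python
HOURS_PER_YEAR = 24 * 365  # a site available 7x24, ignoring leap years


def annual_revenue(avg_hourly_revenue: float) -> float:
    """Annual sales revenue implied by an average hourly rate."""
    return avg_hourly_revenue * HOURS_PER_YEAR


def direct_outage_loss(avg_hourly_revenue: float,
                       outage_hours: float,
                       peak_multiplier: float = 1.0) -> float:
    """Directly attributable revenue loss for an outage.

    peak_multiplier is a hypothetical factor (assumption) for outages
    that land in a busier-than-average period; 1.0 means average traffic.
    """
    return avg_hourly_revenue * peak_multiplier * outage_hours


print(f"Annual revenue:        ${annual_revenue(4000):,.0f}")
print(f"6-hour outage (avg):   ${direct_outage_loss(4000, 6):,.0f}")
# Illustrative only: assumes peak traffic is three times the average.
print(f"6-hour outage (peak):  ${direct_outage_loss(4000, 6, peak_multiplier=3.0):,.0f}")
```

Even this simple model leaves out the indirect costs, such as lost customers and extra advertising, so it gives a floor rather than a full estimate.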
There are other costs incurred from an outage. Some customers may decide to find alternative vendors, resulting in a permanent loss of users and making the revenue loss even higher than the direct loss of sales. The company may decide to spend additional money on advertising to counter the ill will created when customers could not reach the site. The costs from our example 6-hour outage can thus be far higher than its simple hourly proportion of time applied to an average revenue stream.
Another case in point would be a large credit-card processing company that estimates it would stand to lose nearly $400,000 in direct revenue if it experienced a one-hour operational outage affecting its ability to process credit-card transactions. This number assumes an estimated cost of just over $1.00 per missed transaction, and does not include the inevitable decline in revenues due to a loss of confidence from clients were such an outage to happen.
Does this actually happen? Let's look at some real cases. Consider what happened on Black Friday in 2006. Black Friday is the day after Thanksgiving in the United States and is the busiest day of the year in the retail sector. On that particular Black Friday, the websites for two very large U.S. retailers (Wal-Mart and Macy's) were unavailable starting around 4:00 a.m. for approximately 10 hours, presumably from overload. While it is possible that the potential customers tried the sites again at a later time, it is also possible that they took their business to competitors.
There are two types of downtime, of course - scheduled (which usually is a very small window in the wee hours of the morning on a weekend), and unscheduled - the type you can't plan around, and the type that happened on Black Friday last year. Managing IT operations means we want to take actions to mitigate the possibility of the unscheduled variety.
You may have heard of something called "5 9's of availability". This means scheduled uptime is 99.999%, which works out to about 5 minutes of unscheduled downtime in a year. That would be high availability - sounds like nirvana! 5 9's takes work to attain and maintain. Most companies are happy to get 99.9% uptime. This doesn't mean you don't take systems down for maintenance - but that you manage your systems to reduce the unplanned outages.
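To see where the "about 5 minutes" figure comes from, here is a minimal sketch that converts an availability percentage into allowed unscheduled downtime per year. The function name is hypothetical, and it assumes a 365-day year with the system scheduled to be up the whole time.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Unscheduled downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(pct)
    print(f"{pct}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

By this calculation, 99.9% uptime allows roughly 526 minutes (a little under nine hours) of unscheduled downtime a year, compared with about 5 minutes for 5 9's.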
And in a nutshell, that's what operations management is all about.
Kerrie Meyler, MVP, MCSE, MCTS, CNA, MA, BA, is an independent consultant and trainer with over fifteen years of experience in IT. While at Microsoft in Field Technical Sales for four years, she focused on infrastructure and management, presenting at numerous product launches. Kerrie has presented Operations Manager 2007 at TechEd 2007 and MMS 2009 and at internal Microsoft conferences, receiving company recognition and awards including a SPAR MGS award. Kerrie worked with Microsoft Learning to develop functional specifications for the original Operations Manager Microsoft courseware, 2550: Implementing Microsoft Operations Manager 2000, and did the beta teach for that course. She also participated in the alpha walkthrough for the 70-400: Configuring Microsoft System Center Operations Manager certification exam.
She is the lead author of Microsoft Operations Manager 2005 Unleashed, Microsoft System Center Operations Manager 2007 Unleashed, and Microsoft System Center Configuration Manager (SCCM) 2007 Unleashed. Kerrie is currently developing an eBook on Operations Manager 2007 R2.
Check out an excerpt from System Center Operations Manager 2007 Unleashed, Chapter 3: Looking Inside OpsMgr.
Kerrie's latest book, System Center Configuration Manager (SCCM) 2007 Unleashed by Kerrie Meyler, Byron Holt, and Greg Ramsey, has been selected as the August 2009 Microsoft Subnet book giveaway (a $59.99 value). Check out an excerpt from System Center Configuration Manager (SCCM) 2007 Unleashed, Chapter 3: Looking Inside ConfigMgr.
Change and Downtime
Agree with your thoughts on downtime, Kerrie. We recently conducted some research on the top causes of downtime, and we found that most organizations view changes as the leading cause. In a world where changes are increasing while IT operations teams continue their quest for "5 9's of availability," it is going to be important to better understand the impact of change on production. Numerous high-profile companies have experienced downtime because they didn't properly test changes.
The research can be found at www.stacksafe.com/research
Change management
Joe, I would certainly agree that unmanaged changes have a lot to do with downtime. (Of course, even "managed changes" can cause that as well when their full repercussions are not well understood in a development / testing environment!)
Generally when there is a problem, the first question that comes up is "what changed?" Often the answer is a shrug of the shoulders and a corresponding "dunno." That happens when not all changes are documented. The unfortunate thing is that there may be changes that were documented and perceived as having no possible impact on the new problem ... and sometimes those changes weren't made yesterday but a month ago, and it just took that long for the impact to be felt (or to build up) across the network.
My own experience has been that the vast majority of downtime comes from software-level errors and user errors. It is surprising to realize that hardware accounts for only a small percentage of problems; to minimize system downtime, the focus needs to be on the software and user components. That's much harder than hardware, unfortunately!
Kerrie Meyler