Network World

Kerrie Meyler

Google’s email outage and the fallacy of 3 nine’s

Controlling outages through high availability and monitoring

By Kerrie Meyler on Mon, 09/07/09 - 10:59am.

Last Tuesday, September 1, Google's web-based email went offline. This follows another outage in May. The September outage was caused by engineers taking some servers offline and inadvertently overloading other servers as a result. May's outage was caused by a traffic routing error. There were previous outages earlier in May, one in mid-April, and another in March.

What does this mean?

For one thing, both Google and Hotmail promise 99.9% availability (also known as "3 nines"). As good as that sounds, 3 nines doesn't matter during that other one-tenth of one percent. 3 nines also equates to nearly 44 minutes of unscheduled downtime per month, or 8 hours and 45 minutes per year. The more nines, the more uptime you have: 4 nines allows a little over 52 minutes of downtime per year, and the vaunted 5 nines allows roughly 5 minutes 16 seconds of unscheduled downtime in a year's time. To be considered truly available, 3 nines really doesn't cut it. (And this is only regarding unscheduled downtime, not maintenance windows!)
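The downtime figures above follow from simple arithmetic. Here is a minimal Python sketch of it, assuming a 365-day year; the function name `allowed_downtime` is only illustrative.

```python
from datetime import timedelta

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def allowed_downtime(nines: int, period_minutes: float = MINUTES_PER_YEAR) -> timedelta:
    """Return the downtime budget for an availability of `nines` nines."""
    unavailability = 10 ** -nines  # 3 nines -> 0.001, i.e. 0.1%
    return timedelta(minutes=period_minutes * unavailability)


for n in range(2, 7):
    per_year = allowed_downtime(n)
    per_month = allowed_downtime(n, MINUTES_PER_YEAR / 12)
    print(f"{n} nines: {per_year} per year, {per_month} per month")
```

With these assumptions, 3 nines works out to about 8 hours 45 minutes per year and 5 nines to a little over 5 minutes 15 seconds. Small differences from quoted figures come down to rounding and the length of year used.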

While I don't know how Google structures its operations, there are some other areas to look at here. How does one control unscheduled downtime? Building redundancy, a.k.a. high availability, into one's operations certainly can help. If Google had backup servers it could use, or had clustered its servers, that might have prevented the last outage.

Another approach that can be helpful is monitoring your production environment to know what's going on. If you could see that other servers were starting to get overloaded, you could proactively bring on other servers or reassign the workload, perhaps using virtualization to do this quickly.
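As a rough illustration of that kind of check, here is a small Python sketch that asks whether the remaining servers could absorb the load before some are taken offline. The `Server` class, the `safe_to_remove` function, the 80% utilization ceiling, and all the numbers are hypothetical; this is not a description of how Google or any particular monitoring product works.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    capacity: float  # requests/sec the server can sustain (hypothetical unit)
    load: float      # requests/sec it is currently handling


def safe_to_remove(pool: list[Server], to_remove: set[str],
                   max_utilization: float = 0.8) -> bool:
    """Return True if the servers left in the pool can carry the whole
    current load while staying at or below max_utilization."""
    remaining = [s for s in pool if s.name not in to_remove]
    if not remaining:
        return False  # nothing left to serve traffic
    total_load = sum(s.load for s in pool)
    remaining_capacity = sum(s.capacity for s in remaining)
    return total_load <= max_utilization * remaining_capacity


# Four identical servers, each half loaded (made-up numbers).
pool = [Server(f"web{i}", capacity=100.0, load=50.0) for i in range(4)]
print(safe_to_remove(pool, {"web0"}))          # load 200 vs. budget 0.8 * 300
print(safe_to_remove(pool, {"web0", "web1"}))  # load 200 vs. budget 0.8 * 200
```

In this made-up pool, removing one server leaves enough headroom, while removing two would push the survivors past the ceiling, which is the moment to bring on more capacity first.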

Monitoring tools and virtualization are not new technologies. Back in the days of "big iron" and mainframes, IBM had VM as an operating system and both they and third parties had monitoring tools. Today we also have virtualization technologies and monitoring tools. This may or may not get you to 6 nines of availability, but being able to proactively monitor your production environment and architect high availability can make a big difference.

prime-time vs. after-hours

Just as important as "how many nines" is the question, "what time is it?" Google needs to formally identify prime-time hours of service, that is, times of the day when its service is considered critical to most of its customer base and during which they will not make changes to the service, intentional or otherwise.

I think Google would agree that its North American user community is the largest customer of Gmail, by far, and that 6am-midnight (eastern time) is the heaviest usage period. A hands-off policy during that time frame would go a long way to restore public confidence in the service.

There probably does not exist a software or hardware engineer who isn't susceptible to the "this change can't possibly have a negative effect" syndrome. The management of large-scale IT services has to recognize this and take the decision out of the hands of the engineers by requiring that ANY change be made outside the prime-time period. If that means that said engineers have to be functional in the wee hours of the morning, well, welcome to real life. That's how most of us did software development for most of our careers.
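A policy like this is easy to enforce mechanically. The following Python sketch assumes the 6am-to-midnight Eastern window suggested above; the function name `change_allowed` is hypothetical, and the caller is assumed to have already converted the time to Eastern.

```python
from datetime import datetime, time

PRIME_TIME_START = time(6, 0)  # 6am Eastern; prime time then runs to midnight


def change_allowed(local_eastern: datetime) -> bool:
    """Return True if the given Eastern-time moment is outside prime time.

    The caller is assumed to have converted to Eastern time already
    (for example with the standard-library zoneinfo module).
    """
    # Only midnight up to (but not including) 6am is open for changes.
    return local_eastern.time() < PRIME_TIME_START


print(change_allowed(datetime(2009, 9, 1, 3, 30)))  # 3:30am, off-hours
print(change_allowed(datetime(2009, 9, 1, 15, 0)))  # 3:00pm, prime time
```

A deployment tool could call a check like this before applying any change and refuse to proceed during prime time.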

Cost vs Benefit

How much does each extra 9 cost?

Each digit is an extra order of magnitude in both benefit (as explained in the article) and cost, which was totally ignored. We explained the additional cost of _guaranteed_ being up 24/7 vs _guaranteed_ being up 20/7 to our bosses and they went with the far, far cheaper option (and we're mostly up 24/7).

Gmail is free.

(Read the recent article on mp3 being good enough not for sound quality but for sound AND the ability to carry a wad of music around with earbuds).

What You Address is ROI

Yes, even a "cloud" system has ROI. In the case of free Google Apps and similar, part of the Investment side of "ROI" for a business is the company's perception and reputation in the eyes of its customer base. Regardless of how "inexpensive" the "cloud" service appears, that Investment is basically hanging the business's butt out on the line. Reputation and Trust are very-hard-won assets, taking years to build up and only a very short time to destroy.

If a business has a service model where 9:00am to 5:00pm is a completely satisfactory expectation for its customers, then a mere 2 nines of uptime may even be acceptable, and very cheaply purchased. But as pointed out, that is only applicable if the 1% of downtime falls completely outside of the customer's expected service hours. 99% uptime allows for about 14 minutes of unexpected downtime per 24-hour day. Statistically, that could be nearly 5 minutes during an 8-hour business day. Will your customers wait for you to wrangle with an expected random 5-minute delay per 8-hour business day? Or, with 3 nines, an expected random delay of about half a minute per 8-hour day?
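The per-day numbers can be checked with a short Python sketch. It assumes outages are spread evenly over time, which is a simplification, and the function name is only illustrative.

```python
def expected_downtime_minutes(availability: float, window_hours: float) -> float:
    """Expected downtime in minutes inside a window, assuming outages
    are spread evenly over time (a simplifying assumption)."""
    return (1.0 - availability) * window_hours * 60


for label, availability in [("2 nines", 0.99), ("3 nines", 0.999)]:
    per_day = expected_downtime_minutes(availability, 24)
    per_business_day = expected_downtime_minutes(availability, 8)
    print(f"{label}: {per_day:.2f} min per 24 hours, "
          f"{per_business_day:.2f} min per 8-hour day")
```

Under that assumption, 2 nines comes to about 14.4 minutes per 24 hours and 4.8 minutes per 8-hour day, while 3 nines comes to about 1.44 and 0.48 minutes respectively.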

For businesses which are national, or perhaps even global in scope, the 'safe' time period in which the downtime might randomly occur rapidly vaporizes. Roughly a minute and a half at a random time of day, every day (the 3 nines budget), for a company that operates 24/7 may or may not be acceptable, depending upon the type of business. If an important client tried to contact you during that specific outage window, how much is their trust in your company diminished? Or your reputation as a business partner? If the only "customers" of your cloud app subscriptions are your internal office workers, that may still have little adverse effect, unless it grates on their nerves and they start to jump ship due to missed-deadline pressures. If that cloud app happens to be an online shopping cart and you run a web-based retail business, that outage may potentially cost you thousands of dollars per day when customers quickly browse to a more "trustworthy" site.

Carefully address the ROI when considering 'cloud' applications. There can be serious lost 'R' for skimping on the 'I'.

I have to say it is pretty

I have to say it is pretty arrogant of a blogger to suggest they know better than Google how to provide high availability to a worldwide audience. To make statements about what they could have or should have done is pretty bold. In their explanation of the outage, they freely admitted it was a mistake and that it was caused by not properly anticipating the needed capacity. That alone tells me they are on top of the situation and fully understand what needs to happen to mitigate the chances of it happening again.

Further, you clearly have no idea what you are talking about. In one sentence you state it was a network routing issue and in the next you state that maybe if they had 'redundant servers' there wouldn't have been an outage. So what you are saying is a network routing issue could have been prevented by more servers. How do you form this conclusion? That is like saying, a detour on an interstate could have been avoided by having more cars getting onto the road.

It is piss-poor reporting like this that made me stop reading your magazine. I should have known better than to read an article on your site.

I'm sorry you didn't read the entire posting

If you had, you would have realized I was referring to more than one outage. One was a server capacity issue, another was due to a network routing issue. This information was provided by Google, I didn't make it up.

I actually was not presumptuous enough to tell Google what to do. I suggested that perhaps if an organization has some sort of operations management technology in place, they might be able to take some proactive actions. I have no idea what Google is using to manage their data centers.

The posting was a discussion of why maintaining high availability is important, and how tools such as virtualization can help get there. HA is a large issue and takes more than a short blog entry. This was not an attack on Google (or Hotmail) and it is unfortunate that you jumped to conclusions without reading the entire posting thoroughly.

Kerrie Meyler

About Managing Microsoft

Kerrie Meyler, MVP, MCSE, MCTS, CNA, MA, BA, is an independent consultant and trainer with over fifteen years of experience in IT. While at Microsoft in Field Technical Sales for four years she focused on infrastructure and management, presenting at numerous product launches. Kerrie has presented Operations Manager 2007 at TechEd 2007 and MMS 2009 and at internal Microsoft conferences, receiving company recognition and awards including a SPAR MGS award. Kerrie worked with Microsoft Learning to develop functional specifications for the original Operations Manager Microsoft courseware, 2550: Implementing Microsoft Operations Manager 2000, and did the beta teach for that course. She also participated in the alpha walkthrough for the 70-400: Configuring Microsoft System Center Operations Manager certification exam.

She is the lead author of Microsoft Operations Manager 2005 Unleashed, Microsoft System Center Operations Manager 2007 Unleashed, and Microsoft System Center Configuration Manager (SCCM) 2007 Unleashed. Kerrie is currently developing an eBook on Operations Manager 2007 R2.

Check out an excerpt from System Center Operations Manager 2007 Unleashed, Chapter 3: Looking Inside OpsMgr.

Kerrie's latest book, System Center Configuration Manager (SCCM) 2007 Unleashed by Kerrie Meyler, Byron Holt, and Greg Ramsey, has been selected as the August 2009 Microsoft Subnet book giveaway (a $59.99 value). Check out an excerpt from System Center Configuration Manager (SCCM) 2007 Unleashed, Chapter 3: Looking Inside ConfigMgr.
