It’s hard to stay on top of everything all the time, so it’s understandable that something like renewing a security certificate could fall through the cracks, as it did at Microsoft last week, grinding its Azure cloud service to a halt.
But if you provide a critical service to corporate customers, routine updates, like renewing certificates before they expire, ought to be just another part of doing business, details that get taken care of as a matter of course.
Apparently, if there was such a routine, it somehow broke down. Microsoft says it is still sorting out what went wrong in order to prevent something similar from happening in the future.
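As a rough illustration of what a routine expiry check could look like, here is a minimal Python sketch. It is not Microsoft's process; the function names, the 30-day warning window, and the sample expiry date are all hypothetical. It relies only on the standard library's `ssl.cert_time_to_seconds`, which parses the `notAfter` timestamp format found in certificates.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the certificate text format, e.g. "Mar  1 00:00:00 2013 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """Flag a certificate once it is inside the (assumed) warning window."""
    return days_until_expiry(not_after, now) <= warn_days


# Hypothetical expiry date, used only to exercise the check.
SAMPLE_NOT_AFTER = "Mar  1 00:00:00 2013 GMT"
```

Run on a schedule against every certificate a service depends on, a check like this would turn renewal into an alert rather than something to remember.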
Meanwhile, businesses using Azure Cloud Service should reevaluate how much they entrust to it. They should have done this in the first place before buying the service, but even if they did, it doesn’t hurt to review that decision in light of the outage.
Business-critical data that must be accessible all the time clearly does not belong in the Azure cloud unless it’s also available someplace else.
“All the time” is a tall order, something that even private storage could fail to achieve. The standard for most service providers – established by phone companies – is 99.999% uptime. That means downtime of just 25.9 seconds per month.
Microsoft’s SLA for Azure Storage Service kicks in when the monthly uptime percentage drops to 99.9%, which means downtime of 43.8 minutes per month. At that point customers are eligible for a 10% service credit.
If uptime drops to 99%, which translates to 7.2 hours of downtime per month, customers are entitled to a 25% credit. Friday's outage was so bad that Microsoft says it will waive the requirement that customers report service failures within 5 business days. The company is automatically crediting affected customers, according to a Microsoft blog post written by Steven Martin, the general manager of Windows Azure Business & Operations.
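The relationship between an uptime percentage and the downtime it permits is simple arithmetic, sketched below in Python. The function names are made up for illustration, and the month length is a parameter because the result depends on it: a 30-day month and an average-length month (365.25 / 12 days) give slightly different figures. The credit tiers follow the two thresholds described above; treating each as a strict "below" cutoff is an assumption, not a quotation from the SLA.

```python
def allowed_downtime_seconds(uptime_percent: float, days_in_month: float = 30) -> float:
    """Seconds of downtime per month permitted by a given uptime percentage."""
    seconds_in_month = days_in_month * 24 * 3600
    return (100 - uptime_percent) / 100 * seconds_in_month


def service_credit_percent(uptime_percent: float) -> int:
    """Credit tier for a monthly uptime figure (strict cutoffs are an assumption)."""
    if uptime_percent < 99.0:
        return 25
    if uptime_percent < 99.9:
        return 10
    return 0


if __name__ == "__main__":
    for pct in (99.999, 99.9, 99.0):
        seconds = allowed_downtime_seconds(pct)
        print(f"{pct}% uptime allows {seconds:.1f} s ({seconds / 60:.1f} min) per 30-day month")
```

Passing `days_in_month=365.25 / 12` instead of the default shows how much of the difference between commonly quoted figures comes down to the month length chosen.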
According to Microsoft’s timeline, the outage lasted from 3:44 p.m. Eastern Friday to 4 a.m. Eastern Saturday, when more than 99% of customers had service restored. That’s about 12 hours, 16 minutes of downtime, which pushes monthly uptime below the 99% threshold for awarding a 25% service credit.
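The elapsed time and the monthly uptime it implies can be checked with a few lines of Python. This is only a sketch: the helper names are hypothetical, the clock times are the ones reported above, and the 30-day month is an assumption (a shorter month would make the uptime figure slightly worse).

```python
from datetime import timedelta


def overnight_duration(start_h: int, start_m: int, end_h: int, end_m: int) -> timedelta:
    """Elapsed time between two clock times, rolling past midnight if needed."""
    start = timedelta(hours=start_h, minutes=start_m)
    end = timedelta(hours=end_h, minutes=end_m)
    if end <= start:
        end += timedelta(days=1)  # the end time falls on the following day
    return end - start


def monthly_uptime_percent(downtime: timedelta, days_in_month: int = 30) -> float:
    """Uptime percentage for a month containing the given total downtime."""
    return 100 * (1 - downtime / timedelta(days=days_in_month))


outage = overnight_duration(15, 44, 4, 0)  # 3:44 p.m. to 4:00 a.m.
uptime = monthly_uptime_percent(outage)

if __name__ == "__main__":
    print(outage)            # 12:16:00
    print(round(uptime, 2))  # 98.3
```

An uptime of roughly 98.3% sits under the 99% mark, so a single outage of this length is enough to reach the higher credit tier.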
Getting a credit is great as far as it goes, but SLAs don’t prevent downtime. They just give providers an incentive to minimize it, and as this case shows, providers don’t always succeed. Azure had another outage just about a year ago, for different reasons and affecting only its management services.
These two events don’t condemn Azure services, but they should encourage customers to carefully consider what types of data these services are appropriate for and what types they are not.