A “unique” breakdown coupled with a previously unknown flaw in Exchange Online caused Tuesday’s extensive outage, and to make matters worse, the service disruption alert system also malfunctioned, leaving some affected customers in the dark.
So said Rajesh Jha, corporate vice president of Office 365 engineering, in an incident report posted to the Office 365 support forum, in which he also addressed a separate, prolonged Lync Online outage from Monday.
“I want to apologize on behalf of the Office 365 team for the impact and inconvenience this has caused. Email and real-time communications are critical to your business, and my team and I fully recognize our accountability and responsibility as your partner and service provider,” he wrote.
For customers on U.S. Eastern time, the Exchange Online outage covered virtually the entire workday.
The main selling point from Microsoft, Google, Amazon and other providers of cloud software and computing services is that their customers don’t need to worry about maintaining on-premises servers, patching applications and rebooting systems that crash.
While no one expects even these mighty technology companies to be perfect, an email outage that lasts for almost nine hours during a workday is sure to plant seeds of doubt among business managers about the wisdom of turning off their on-premises email servers and trusting this essential communications service to a cloud provider.
The second-guessing is bound to be even more intense when the email breakdown happens the day after a significant outage affecting Lync Online, which Office 365 customers use for instant messaging, presence, audio communications, video conferencing, Web meetings and, in some cases, IP telephony.
Many of those affected were IT professionals fielding complaints from their frazzled users, with no control over the problem and little information from Microsoft about its cause or estimated time of resolution.
Jha addressed this breakdown in communications, saying that during the Exchange Online incident “we also experienced a problem with our Service Health Dashboard (SHD) publishing process, meaning not all impacted customers were notified in a timely way which we realize was frustrating and this has since been addressed.”
For Microsoft, back-to-back outages of this magnitude are poisonous, embroiled as it is in a vicious fight with Google in the cloud email and collaboration suite market.
Jha said the outages affected Office 365 data centers in North America, but he didn’t say how many customers were hit, an omission that undercuts Microsoft’s attempts at transparency. Asked for this information twice this week by the IDG News Service, Microsoft declined to provide it. Customers will receive a formal, detailed report on the incidents later, which may include details about the scope of the outages.
For now, Jha shared that the Exchange Online outage, which lasted roughly from 9 a.m. to 6 p.m. U.S. Eastern time, was triggered by a directory partition that stopped responding to authentication requests. This failure brought down Exchange Online for a “small set” of customers, but due to its “unique nature” it took Microsoft engineers a long time to fix it.
“Unfortunately, the nature of this failure led to an unexpected issue in the broader mail delivery system due to a previously unknown code flaw leading to mail flow delays for a larger set of customers,” Jha wrote.
Meanwhile, Monday’s Lync Online outage was triggered by a brief loss of client connectivity in Microsoft’s data centers due to “external network failures.”
“Even though connectivity was restored in minutes, the ensuing traffic spike caused several network elements to get overloaded, resulting in some of our customers being unable to access Lync functionality for an extended duration,” Jha wrote.
Microsoft is taking steps to prevent these specific issues from affecting Exchange Online and Lync Online again, according to Jha.