Some Microsoft services, including Outlook, Office 365, and Microsoft Teams, experienced a multi-hour outage on Monday, but the issues have been resolved, according to the company.
“We’ve confirmed that the residual issue has been addressed and the incident has been resolved,” Microsoft tweeted at 12AM ET on Tuesday. “Any users still experiencing impact should be mitigated shortly.”
The company first acknowledged issues at 5:44PM ET via the Microsoft 365 Status Twitter account, and said at 6:36PM ET that it had rolled back a change thought to be the cause. But just 13 minutes later, the company tweeted again to say that it was “not observing an increase in successful connections after rolling back a recent change.” Microsoft tweeted that services were mostly back at 10:20PM ET.
Microsoft’s Azure Active Directory service was also experiencing issues on Monday, but the company said those were “now mitigated” as of 11:21PM ET. Microsoft said the problems were caused by a configuration change to a backend storage layer, which the company rolled back.
Update, September 29th, 11:20AM ET: Updated to confirm Microsoft has resolved the issues. The headline has also been updated to reflect this fact.
The core service affected was Azure Active Directory, which controls login to everything from Outlook email to Teams to the Azure portal, used for managing other cloud services. The five-hour outage also caused productivity-stopping annoyances: some installations of Microsoft Office and Visual Studio, even on the desktop, declared that they could not check their licensing and therefore would not run.
There are claims that the US emergency 911 service was affected, which is not implausible given that the RapidDeploy Nimbus Dispatch system describes itself as “a Microsoft Azure–based Computer Aided Dispatch platform”. If the problem is authentication, even resilient services with failover to other Azure regions may become inaccessible and therefore useless.
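The point about failover can be made with some rough arithmetic. The sketch below is illustrative only: the function names are hypothetical, the availability figures are invented rather than measured Azure numbers, and it assumes that regions fail independently of each other and of the login service.

```python
def any_region_up(per_region: float, regions: int) -> float:
    """Probability that at least one of `regions` independent regions is up."""
    return 1.0 - (1.0 - per_region) ** regions


def end_to_end(per_region: float, regions: int, auth: float) -> float:
    """Users need the shared login service AND at least one region."""
    return auth * any_region_up(per_region, regions)


if __name__ == "__main__":
    # Illustrative figures only, not measured Azure availability.
    for regions in (1, 2, 3):
        print(regions, round(end_to_end(0.99, regions, auth=0.999), 6))
    # While the shared login service is down, extra regions do not help.
    print(end_to_end(0.99, 3, auth=0.0))
```

Under these assumptions, adding regions pushes the result towards the availability of the login service but never above it, and while the login service is down the result is zero however many regions are healthy.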
The company has yet to provide full details, but a status report today said that “a recent configuration change impacted a backend storage layer, which caused latency to authentication requests”.
Microsoft seems to have more than its fair share of problems. Gartner noted recently that it “continues to have concerns related to the overall architecture and implementation of Azure, despite resilience-focused efforts and improved service availability metrics during the past year”. The analyst’s reservations were based in part on the low ratio of availability zones to regions, and on the fact that “a limited set of services support the availability zone model”.
Gartner’s concerns are valid, but they were not the cause of the recent disruption. Bill Witten, identity architect at Okta, was to the point, commenting: “So, does everyone get why the mono-directory is not a good idea?”
Microsoft has built so much on Azure Active Directory that it is a single point of failure. The company either needs to make it so resilient that failure is near-impossible (which is likely to be its intention), or consider gradually reducing the dependence of so many services on it.