During a routine periodic fire suppression system maintenance, an unexpected release of inert fire suppression agent occurred. When suppression was triggered, it initiated the automatic shutdown of Air Handler Units (AHU) as designed for containment and safety. While conditions in the data center were being reaffirmed and AHUs were being restarted, the ambient temperature in isolated areas of the impacted suppression zone rose above normal operational parameters. Some systems in the impacted zone performed auto shutdowns or reboots triggered by internal thermal health monitoring to prevent overheating of those systems.
However, some of the overheated servers and storage systems “did not shutdown in a controlled manner,” and it took a while to bring them back online.
As a result, virtual machines were axed to avoid any data corruption by keeping them alive. Azure Backup vaults were not available, and this caused backup and restore operation failures. Azure Site Recovery lost failover ability and HDInsight, Azure Scheduler and Functions dropped jobs as their storage systems went offline.
Azure Monitor and Data Factory showed serious latency and errors in pipelines, Azure Stream Analytics jobs stopped processing input and producing output, albeit only for a few minutes, and Azure Media Services saw failures and latency issues for streaming requests, uploads, and encoding.