Cloudflare explains hour-long outage which broke a lot of internets

The incident began at 0627 UTC (2327 Pacific Time) and it took until 0742 UTC (0042 Pacific) before the company managed to bring all its datacenters back online and verify they were working correctly. During this time, a variety of sites and services relying on Cloudflare went dark while engineers frantically worked to undo the damage they had wrought mere hours previously.

“The outage,” explained Cloudflare, “was caused by a change that was part of a long-running project to increase resilience in our busiest locations.”

Oh, the irony.

What had happened was a change to the company’s prefix advertisement policies, resulting in the withdrawal of a critical subset of prefixes. Cloudflare makes use of BGP (Border Gateway Protocol). As part of this protocol, operators define policies that control which prefixes (ranges of adjacent IP addresses) are advertised to, or accepted from, other networks (peers).

Changing a policy can result in IP addresses no longer being reachable on the Internet. One would therefore hope that extreme caution would be taken before doing such a thing…
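
To see how easily that can go wrong, here is a minimal, purely hypothetical sketch in Python (not real router configuration, and using documentation-range prefixes rather than anything Cloudflare actually announces) of an ordered export policy where the first matching term wins. Reordering the terms so that a catch-all reject sits above the accept term silently withdraws every prefix the policy was supposed to advertise:

```python
from ipaddress import ip_network

# Hypothetical, simplified model of an ordered BGP export policy:
# each term matches prefixes and either accepts (advertises) or rejects them.
# The first matching term wins, as on most real routers.
POLICY_OK = [
    ("advertise-anycast", ["198.51.100.0/24", "203.0.113.0/24"], "accept"),
    ("reject-everything-else", ["0.0.0.0/0"], "reject"),
]

# The same terms with the catch-all reject moved first: every prefix now
# matches the reject term before its accept term is ever evaluated,
# so nothing is advertised and those prefixes vanish from peers' tables.
POLICY_BROKEN = [
    ("reject-everything-else", ["0.0.0.0/0"], "reject"),
    ("advertise-anycast", ["198.51.100.0/24", "203.0.113.0/24"], "accept"),
]

def advertised(prefixes, policy):
    """Return the prefixes this router would announce under the given policy."""
    announced = []
    for prefix in prefixes:
        net = ip_network(prefix)
        for _name, matches, action in policy:
            if any(net.subnet_of(ip_network(m)) for m in matches):
                if action == "accept":
                    announced.append(prefix)
                break  # first matching term wins
    return announced

local_prefixes = ["198.51.100.0/24", "203.0.113.0/24"]
print(advertised(local_prefixes, POLICY_OK))      # both prefixes announced
print(advertised(local_prefixes, POLICY_BROKEN))  # [] -- withdrawn from the Internet
```

Real routers express this in their own policy languages rather than Python, but the failure mode is the same: the prefixes simply stop being announced to peers, and traffic to them has nowhere to go.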

Cloudflare’s mistakes actually began at 0356 UTC (2056 Pacific), when the change was made at the first location. There was no problem – the location used an older architecture rather than Cloudflare’s new “more flexible and resilient” version, known internally as MCP (Multi-Colo PoP). MCP differed from what had gone before by adding a layer of routing to create a mesh of connections, the theory being that bits and pieces of the internal network could be disabled for maintenance without taking the whole location offline. Cloudflare had already rolled out MCP to 19 of its datacenters.
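
For a sense of why that mesh layer is supposed to help, here is an illustrative toy model (the topology, the names like spine1, and the build_mesh and connected helpers are invented for this example, not Cloudflare's actual design): draining any single spine router still leaves a path from the edge to the servers, while a change that lands on every spine in the mesh at once removes them all.

```python
from collections import deque

# Toy model of an MCP-style location: an extra layer of spine routers meshed
# between the edge and the servers, so any single spine can be drained for
# maintenance while the remaining spines keep the location reachable.
def build_mesh(drained=()):
    spines = [s for s in ("spine1", "spine2", "spine3", "spine4") if s not in drained]
    links = {"edge": set(spines), "servers": set(spines)}
    for spine in spines:
        links[spine] = {"edge", "servers"}
    return links

def connected(links, src="edge", dst="servers"):
    """Breadth-first search: is there still a path from the edge to the servers?"""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in links.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Planned maintenance on one spine: traffic re-routes over the rest.
print(connected(build_mesh(drained=("spine2",))))
# A bad change applied to every spine at once: the whole location drops off.
print(connected(build_mesh(drained=("spine1", "spine2", "spine3", "spine4"))))
```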

Moving forward to 0617 UTC (2317 Pacific), the change was deployed to one of the company’s busiest locations, though not an MCP-enabled one. Things still seemed OK… However, by 0627 UTC (2327 Pacific), the change hit the MCP-enabled locations, rattled through the mesh layer and… took those locations off the Internet.

Five minutes later the company declared a major incident. Within half an hour the root cause had been found and engineers began to revert the change. Slightly worryingly, it took until 0742 UTC (0042 Pacific) before everything was complete. “This was delayed as network engineers walked over each other’s changes, reverting the previous reverts, causing the problem to re-appear sporadically.”

One can imagine the panic at Cloudflare towers, although we cannot imagine a controlled process that resulted in a scenario where “network engineers walked over each other’s changes.”

We’ve asked the company to clarify how this happened, and what testing was done before the configuration change was made, and will update should we receive a response.

Mark Boost, CEO of cloud-native outfit Civo (formerly of LCN.com), was scathing regarding the outage: “This morning was a wake-up call for the price we pay for over-reliance on big cloud providers. It is completely unsustainable for an outage with one provider being able to bring vast swathes of the internet offline.

“Users today rely on constant connectivity to access the online services that are part of the fabric of all our lives, making outages hugely damaging…

“We should remember that scale is no guarantee of uptime. Large cloud providers have to manage a vast degree of complexity and moving parts, significantly increasing the risk of an outage.”

Source: Cloudflare explains today’s mega-outage • The Register
