Cloudflare: Config change borked net access for all

There was a disturbance in the force on July 14 after Cloudflare borked a configuration change, resulting in an outage that hit internet services across the planet.

In a blog post, the content delivery network services biz detailed the unfortunate series of events that led to Monday’s disruption.

On the day itself, “Cloudflare’s 1.1.1.1 Resolver service became unavailable to the internet starting at 21:52 UTC and ending at 22:54 UTC. The majority of 1.1.1.1 users globally were affected. For many users, not being able to resolve names using the 1.1.1.1 Resolver meant that basically all Internet services were unavailable,” Cloudflare said.
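To see why a resolver outage cascades like that, consider what a client actually experiences: a UDP query sent to 1.1.1.1 that never gets an answer. The sketch below is a minimal standard-library Python example of querying the resolver directly; the hostname and timeout are illustrative choices of ours, not anything from Cloudflare's write-up.

```python
import socket
import struct

def query_a_record(hostname: str, resolver: str = "1.1.1.1", timeout: float = 2.0) -> bool:
    """Send one DNS A-record query directly to the resolver; True if any answer arrives."""
    # DNS header: fixed ID, recursion-desired flag, one question, no other records.
    header = struct.pack(">HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
    # Question section: QNAME as length-prefixed labels, QTYPE=A (1), QCLASS=IN (1).
    qname = b"".join(bytes([len(label)]) + label.encode() for label in hostname.split("."))
    question = qname + b"\x00" + struct.pack(">HH", 1, 1)

    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(header + question, (resolver, 53))
        try:
            reply, _ = sock.recvfrom(512)
        except socket.timeout:
            return False  # what clients saw during the outage window: silence
    # ANCOUNT (number of answer records) sits in bytes 6-7 of the reply header.
    return struct.unpack(">H", reply[6:8])[0] > 0

if __name__ == "__main__":
    ok = query_a_record("example.com")
    print("resolved via 1.1.1.1" if ok else "no answer from 1.1.1.1")
```

With the Resolver prefixes withdrawn from Cloudflare's datacenters, queries like this simply went unanswered, which is how the failure surfaced to end users.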

But the problem originated much earlier.

The outage was caused by a “misconfiguration of legacy systems” used to maintain the infrastructure that advertises Cloudflare’s IP addresses to the internet.

“The root cause was an internal configuration error and not the result of an attack or a BGP hijack,” the corp said.

Back on June 6 this year, as Cloudflare was preparing a service topology for a future Data Localization Suite (DLS) service, it introduced the config gremlin – prefixes connected to the 1.1.1.1 public DNS Resolver were “inadvertently included alongside the prefixes that were intended for the new DLS service.”

“This configuration error sat dormant in the production network as the new DLS service was not yet in use, but it set the stage for the outage on July 14. Since there was no immediate change to the production network there was no end-user impact, and because there was no impact, no alerts were fired.”
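By way of illustration only, a pre-deployment sanity check along the following lines could flag the class of error Cloudflare describes: a resolver prefix accidentally attached to a topology meant for a different service. This is a hypothetical Python sketch, not Cloudflare's actual tooling, and the prefix values are illustrative.

```python
# Hypothetical safeguard: refuse to attach prefixes to a new service topology
# if they overlap prefixes already serving production resolver traffic.
import ipaddress

PRODUCTION_RESOLVER_PREFIXES = [
    ipaddress.ip_network("1.1.1.0/24"),
    ipaddress.ip_network("1.0.0.0/24"),
]

def overlapping_prefixes(candidate_prefixes):
    """Return (candidate, production) pairs where a candidate overlaps production space."""
    overlaps = []
    for raw in candidate_prefixes:
        candidate = ipaddress.ip_network(raw)
        for prod in PRODUCTION_RESOLVER_PREFIXES:
            if candidate.overlaps(prod):
                overlaps.append((candidate, prod))
    return overlaps

# Example: a DLS topology that accidentally pulls in a resolver prefix.
dls_topology = ["203.0.113.0/24", "1.1.1.0/24"]
for cand, prod in overlapping_prefixes(dls_topology):
    print(f"refusing to attach {cand}: overlaps production prefix {prod}")
```

A guard of this kind runs at configuration time rather than deployment time, which matters here because, as Cloudflare notes, the dormant error changed no production traffic and so fired no alerts.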

On July 14, a second tweak to the service was made: Cloudflare added an offline datacenter location to the service topology for the pre-production DNS service in order “to allow for some internal testing.” But the change triggered a refresh of the global configuration of the associated routes, “and it was at this point that the impact from the earlier configuration error was felt.”

Things went awry at 21:48 UTC.

“Due to the earlier configuration error linking the 1.1.1.1 Resolver’s IP addresses to our non-production service, those 1.1.1.1 IPs were inadvertently included when we changed how the non-production service was set up… The 1.1.1.1 Resolver prefixes started to be withdrawn from production Cloudflare datacenters globally.”

Traffic began to drop four minutes later and internal health alerts started to emerge. An “incident” was declared at 22:01 UTC and a fix dispatched at 22:20 to restore the previous configuration.

“To accelerate full restoration of service, a manually triggered action is validated in testing locations before being executed,” Cloudflare said in its explanation of the outage. Resolver alerts were cleared by 22:54 UTC and DNS traffic on Resolver prefixes went back to typical levels, it added.

Data on DNSPerf shared with us by a reader indicates the disruption lasted around three hours, far longer than Cloudflare’s summary suggests.

As a Reg reader pointed out: “Remember this is a DNS service. Every person using the service would have had no ability to use the internet. Every business using Cloudflare had no internet for the length of the outage. NO DNS = NO INTERNET.” ®

Source: Cloudflare: Config change borked net access for all • The Register
