Facebook resolves day-long outages across Instagram, WhatsApp, and Messenger
Facebook had problems loading images, videos, and other data across its apps today, leaving some people unable to load photos in the Facebook News Feed, view stories on Instagram, or send messages in WhatsApp. Facebook said earlier today it was aware of the issues and was “working to get things back to normal as quickly as possible.” It blamed the outage on an error that was triggered during a “routine maintenance operation.”
As of 7:49PM ET, Facebook posted a message to its official Twitter account saying the “issue has since been resolved and we should be back at 100 percent for everyone. We’re sorry for any inconvenience.” Instagram similarly said its issues were more or less resolved, too.
Earlier today, some people and businesses experienced trouble uploading or sending images, videos and other files on our apps. The issue has since been resolved and we should be back at 100% for everyone. We’re sorry for any inconvenience.— Facebook Business (@FBBusiness) July 3, 2019
The issues started around 8AM ET and began slowly clearing up after a couple hours, according to DownDetector, which monitors website and app issues. The errors aren’t affecting all images; many pictures on Facebook and Instagram still load, but others are appearing blank. DownDetector has also received reports of people being unable to load messages in Facebook Messenger.
The outage persisted through mid-day, with Facebook releasing a second statement, where it apologized “for any inconvenience.” Facebook’s platform status website still lists a “partial outage,” with a note saying that the company is “working on a fix that will go out shortly.”
Apps and websites are always going to experience occasional disruptions due to the complexity of services they’re offering. But even when they’re brief, they can become a real problem due to the huge number of users many of these services have. A Facebook outage affects a suite of popular apps, and those apps collectively have billions of users who rely on them. That’s a big deal when those services have become critical for business and communications, and every hour they’re offline or acting strange can mean real inconveniences or lost money.
We’re aware that some people are having trouble uploading or sending images, videos and other files on our apps. We’re sorry for the trouble and are working to get things back to normal as quickly as possible. #facebookdown— Facebook (@facebook) July 3, 2019
The issue caused some images and features to break across all of Facebook’s apps
Well, folks, Facebook and its “family of apps” have experienced yet another crash. A nice respite moving into the long holiday weekend if you ask me.
Problems that appear to have started early Wednesday morning were still being reported as of the afternoon, with Instagram, Facebook, WhatsApp, Oculus, and Messenger all experiencing issues. According to DownDetector, issues first started cropping up on Facebook at around 8am ET.
“We’re aware that some people are having trouble uploading or sending images, videos and other files on our apps. We’re sorry for the trouble and are working to get things back to normal as quickly as possible,” Facebook tweeted just after noon on Wednesday. A similar statement was shared from Instagram’s Twitter account.
Oculus, Facebook’s VR property, separately tweeted that it was experiencing “issues around downloading software.”
Facebook’s crash was still well underway as of 1pm ET on Wednesday, primarily affecting images. Where users typically saw uploaded images, such as their profile pictures or photos in their albums, they instead saw a string of terms describing Facebook’s interpretation of the image.
TechCrunch’s Zack Whittaker noted on Twitter that all of those image tags you may have seen were Facebook’s machine learning at work.
This week’s crash is just the latest in what has become a semi-frequent string of outages. The first occurred back in March in an incident that Facebook later blamed on “a server configuration change.” Facebook and its subsidiaries went down again about a month later, though the March incident was much worse, with millions of reports on DownDetector.
Two weeks ago, Instagram was bricked and experienced ongoing issues with refreshing feeds, loading profiles, and liking images. While the feed refresh issue was quickly patched, it was hours before the company confirmed that Instagram had been fully restored.
We’ve reached out to Facebook for more information about the issues and will update this post if we hear back.
Code crash? Russian hackers? Nope. Good ol’ broken fiber cables borked Google Cloud’s networking today
Fiber-optic cables linking Google Cloud servers in its us-east1 region physically broke today, slowing down or effectively cutting off connectivity with the outside world.
For at least the past nine hours, and counting, netizens and applications have struggled to connect to systems and services hosted in the region, located on America’s East Coast. Developers and system admins have been forced to migrate workloads to other regions, or redirect traffic, in order to keep apps and websites ticking over amid mitigations deployed by the Silicon Valley giant.
Starting at 0755 PDT (1455 UTC) today, according to Google, the search giant was “experiencing external connectivity loss for all us-east1 zones and traffic between us-east1, and other regions has approximately 10% loss.”
By 0900 PDT, Google revealed the extent of the blunder: its cloud platform had “lost multiple independent fiber links within us-east1 zone.” The fiber provider, we’re told, “has been notified and are currently investigating the issue. In order to restore service, we have reduced our network usage and prioritised customer workloads.”
By that, we understand, Google means it redirected traffic destined for its Google.com services hosted in the data center region, to other locations, allowing the remaining connectivity to carry customer packets.
By midday, Pacific Time, Google updated its status pages to note: “Mitigation work is currently underway by our engineering team to address the issue with Google Cloud Networking and Load Balancing in us-east1. The rate of errors is decreasing, however some users may still notice elevated latency.”
However, at time of writing, the physically damaged cabling is not yet fully repaired, and us-east1 networking is thus still knackered. In fact, repairs could take as much as 24 hours to complete. The latest update, posted 1600 PDT, reads as follows:
The disruptions with Google Cloud Networking and Load Balancing have been root caused to physical damage to multiple concurrent fiber bundles serving network paths in us-east1, and we expect a full resolution within the next 24 hours.
In the meantime, we are electively rerouting traffic to ensure that customers’ services will continue to operate reliably until the affected fiber paths are repaired. Some customers may observe elevated latency during this period.
Customers using Google Cloud’s Load Balancing service will automatically fail over to other regions, if configured, minimizing impact on their workloads, it is claimed. They can also migrate to, say, us-east4, though they may have to rejig their code and scripts to reference the new region.
The Register asked Google for more details about the damaged fiber, such as how it happened. A spokesperson told us exactly what was already on the aforequoted status pages.
Meanwhile, a Google Cloud subscriber wrote a little ditty about the outage to the tune of Pink Floyd’s Another Brick in the Wall. It starts: “We don’t need no cloud computing…” ®
Source: The Register
This major internet routing blunder took A WEEK to fix. Why so long? It was IPv6 – and no one really noticed
Last week, an internet routing screw-up propagated by Verizon for three hours sparked havoc online, leading to significant press attention and industry calls for greater network security.
A few weeks before that, another packet routing blunder, this time pushed by China Telecom, lasted two hours, caused significant disruption in Europe and prompted some to wonder whether Beijing’s spies were abusing the internet’s trust-based structure to carry out surveillance.
In both cases, internet engineers were shocked at how long it took to fix traffic routing errors that normally only last minutes or even seconds. Well, that was nothing compared to what happened this week.
Cloudflare’s director of network engineering Jerome Fleury has revealed that the routing for a big block of IP addresses was wrongly announced for an ENTIRE WEEK and, just as amazingly, the company that caused it didn’t notice until the major blunder was pointed out by another engineer at Cloudflare. (This cock-up is completely separate to today’s Cloudflare outage.)
How is it even possible for network routes to remain completely wrong for several days? Because, folks, it was on IPv6.
“So Airtel AS9498 announced the entire IPv6 block 2400::/12 for a week and no-one notices until Tom Strickx finds out and they confirm it was a typo of /127,” Fleury tweeted over the weekend, complete with graphic showing the massive routing error.
That /12 represents 83 decillion IP addresses, or four quadrillion /64 networks. The /127 would be 2. Just 2 IP addresses. Slight difference. And while this demonstrates the expansiveness of IPv6’s address space, and perhaps even its robustness seeing as nothing seems to have actually broken during the routing screw-up, it also hints at just how sparse IPv6 is right now.
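The arithmetic behind those figures is easy to check. Below is a minimal sketch using Python's standard `ipaddress` module. The network written as `2400::/127` is an assumption for illustration only: the tweet gives the intended prefix length, not the intended address.

```python
import ipaddress

# The prefix Airtel announced, per Fleury's tweet.
announced = ipaddress.ip_network("2400::/12")
# Hypothetical stand-in for the intended route: only the /127 length is known.
intended = ipaddress.ip_network("2400::/127")

# An IPv6 prefix of length p covers 2**(128 - p) addresses.
print(f"/12 addresses:  {announced.num_addresses:.2e}")
print(f"/127 addresses: {intended.num_addresses}")

# Number of /64 networks that fit inside the /12: 2**(64 - 12).
slash64_count = 2 ** (64 - announced.prefixlen)
print(f"/64 networks in the /12: {slash64_count:.2e}")
```

The /12 works out to 2^116 addresses, which is where the 83 decillion figure comes from, against just two for a /127.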
To be fair to Airtel, it often takes someone else to notice a network route error – typically caused by simple typos like failing to add a “7” – because the organization that messes up the tables tends not to see or feel the impact directly.
But if ever there was a symbol of how miserably the transition from IPv4 to IPv6 is going, it’s in the fact that a fat IPv6 routing error went completely unnoticed for a week while an IPv4 error will usually result in phone calls, emails, and outcry on social media within minutes.
And sure, IPv4 space is much, much more dense than IPv6 so obviously people will spot errors much faster. But no one at all noticed the advertisement of a /12 for days? That may not bode well for the future, even though, yes, this particular /127 typo had no direct impact.
Source: The Register
I got 502 problems, and Cloudflare sure is one: Outage interrupts your El Reg-reading pleasure for almost half an hour
Cloudflare, the outfit noted for the slogan “helping build a better internet”, had another wobble today as “network performance issues” rendered websites around the globe inaccessible.
The US tech biz updated its status page at 1352 UTC to indicate that it was aware of issues, but things began tottering quite a bit earlier. Cloudflare handles services, including content delivery, DNS and DDoS protection, for a good portion of the world’s websites, El Reg among them, so when it sneezes, a chunk of the internet has to go and have a bit of a lie down. That means netizens were unable to access many top sites globally.
While Cloudflare implemented a fix by 1415 UTC and declared things resolved by 1457 UTC, a good portion of internet users noticed things had gone very south for many, many sites.
The company’s CEO took to Twitter to proffer an explanation for why things had fallen over, fingering a colossal spike in CPU usage as the cause while gently nudging the more wild conspiracy theories away from the whole DDoS thing.
However, the outage was a salutary reminder of the fragility of the internet as even Firefox fans found their beloved browser unable to resolve URLs.
Ever keen to share in the ups and downs of life, Cloudflare’s own site also reported the dread 502 error.
As with the last incident, users who endured the less-than-an-hour of disconnection would do well to remember that the internet is a brittle thing. And Cloudflare would do well to remember that its customers will be pondering if maybe they depend on its services just a little too much.
Updated to add at 1702 BST
Following publication of this article, Cloudflare released a blog post stating the “CPU spike was caused by a bad software deploy that was rolled back. Once rolled back the service returned to normal operation and all domains using Cloudflare returned to normal traffic levels.”
Naturally it then added….
“We are incredibly sorry that this incident occurred. Internal teams are meeting as I write performing a full post-mortem to understand how this occurred and how we prevent this from ever occurring again.” ®
Source: The Register
Cloudflare gave everyone a 30-minute break from a chunk of the internet yesterday: Here’s how they did it
Internet services outfit Cloudflare took careful aim and unloaded both barrels at its feet yesterday, taking out a large chunk of the internet as it did so.
In an impressive act of openness, the company posted a distressingly detailed post-mortem on the cockwomblery that led to the outage. The Register also spoke to a weary John Graham-Cumming, CTO of the embattled company, to understand how it all went down.
This time it wasn’t Verizon wot dunnit; Cloudflare engineered this outage all by itself.
In a nutshell, what happened was that Cloudflare deployed some rules to its Web Application Firewall (WAF). The gang deploys these rules to servers in a test mode – the rule gets fired but doesn’t take any action – in order to measure what happens when real customer traffic runs through it.
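To make the "test mode" idea concrete, here is a minimal sketch of a rule engine in which a rule can fire without acting. Everything in it is hypothetical: the `Rule` class, the `simulate` and `block` mode names, and the two sample patterns are illustrations, not Cloudflare's actual WAF design.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    pattern: re.Pattern
    mode: str  # "simulate" records a match only; "block" acts on it


def evaluate(rules: list[Rule], request_body: str) -> tuple[str, list[str]]:
    """Return the action for a request and the names of the rules that fired."""
    fired = []
    action = "allow"
    for rule in rules:
        # The pattern runs against the traffic in either mode.
        if rule.pattern.search(request_body):
            fired.append(rule.name)
            if rule.mode == "block":
                action = "block"
    return action, fired


# Hypothetical rules: one new rule under test, one established rule.
rules = [
    Rule("new-xss-rule", re.compile(r"<script\b", re.IGNORECASE), "simulate"),
    Rule("old-sqli-rule", re.compile(r"union\s+select", re.IGNORECASE), "block"),
]

print(evaluate(rules, "<script>alert(1)</script>"))
print(evaluate(rules, "1 UNION SELECT password FROM users"))
```

Note that in this sketch a rule in simulate mode still has its pattern evaluated against every request, so it consumes CPU even though it never blocks anything.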
We’d contend that an isolated test environment into which one could direct traffic would make sense, but Graham-Cumming told us: “We do this stuff all the time. We have a sequence of ways in which we deploy stuff. In this case, it didn’t happen.”
It all sounds a bit like the start of a Who, Me?
In a frank admission that should send all DevOps enthusiasts scurrying to look at their pipelines, Graham-Cumming told us: “We’re really working on understanding how the automated test suite which runs internally didn’t pick up the fact that this was going to blow up our service.”
The CTO elaborated: “We push something out, it gets approved by a human, and then it goes through a testing procedure, and then it gets pushed out to the world. And somehow in that testing procedure, we didn’t spot that this was going to blow things up.”
He went on to explain how things should happen. After some internal dog-fooding, the updates are pushed out to a small group of customers “who tend to be a little bit cheeky with us” and “do naughty things” before it is progressively rolled out to the wider world.
“And that didn’t happen in this instance. This should have been caught easily.”
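The progressive rollout Graham-Cumming describes can be sketched as a loop that widens the audience only while a health check passes. The stage names, the `staged_rollout` function and the `cpu_check` stand-in are all assumptions made for illustration, not Cloudflare's pipeline.

```python
from typing import Callable

# Hypothetical stage names, loosely following the sequence described above.
STAGES = ["internal dogfood", "early-adopter customers", "all customers"]


def staged_rollout(change: str, healthy: Callable[[str, str], bool]) -> list[str]:
    """Deploy stage by stage and return the stages the change reached."""
    reached = []
    for stage in STAGES:
        reached.append(stage)  # deploy to this stage
        if not healthy(change, stage):  # observe before widening further
            print(f"halting {change!r}: health check failed at {stage!r}")
            break
    return reached


def cpu_check(change: str, stage: str) -> bool:
    # Stand-in health check: pretend the bad rule pins CPU wherever it runs.
    return change != "bad-waf-rule"


print(staged_rollout("good-waf-rule", cpu_check))
print(staged_rollout("bad-waf-rule", cpu_check))
```

In this sketch the bad change never gets past the first stage, which is the whole point of the sequence: a fault is supposed to surface while only a small audience is exposed to it.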
The result? “One of these rules caused the CPU spike to 100 per cent, on all of our machines.” And because Cloudflare’s products are distributed over all its servers, every service was starved of CPU while the offending regular expression did its thing.
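The passage above does not say what the regular expression was or why it was so expensive. One common way a regex can pin a CPU is catastrophic backtracking, sketched below with a textbook pattern in Python's backtracking `re` engine. The pattern `(a+)+$` and the helper names are assumptions chosen for illustration, not the rule from the incident.

```python
import re
import time
from functools import lru_cache

# Textbook pathological pattern (an assumption, not the rule from the incident):
# nested quantifiers let the engine split a run of 'a's in many different ways.
PATTERN = re.compile(r"(a+)+$")


@lru_cache(maxsize=None)
def ways_to_split(n: int) -> int:
    """Count the ways (a+)+ can divide n 'a' characters into groups."""
    if n == 0:
        return 1
    return sum(ways_to_split(n - k) for k in range(1, n + 1))


def time_failed_match(n: int) -> float:
    """Seconds spent discovering that 'a' * n + 'b' does not match."""
    subject = "a" * n + "b"
    start = time.perf_counter()
    assert PATTERN.search(subject) is None
    return time.perf_counter() - start


for n in (10, 14, 18):
    print(f"n={n}: {ways_to_split(n)} splits, {time_failed_match(n):.4f}s")
```

The number of possible splits doubles with each extra character, so a backtracking engine that has to rule them all out before giving up can burn CPU on a surprisingly short input.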
Source: The Register