Facebook, Instagram, and WhatsApp weren’t offline; they simply couldn’t be found for six hours. But it comes to the same …
This Monday evening, Facebook’s various services suffered a massive outage. The main social network, along with Messenger, Instagram, WhatsApp and even Oculus games, remained inaccessible for nearly six hours. That is a record for a social network outage, and above all it shows the technical difficulties Facebook’s teams faced in solving the problem.
After the outage was resolved, Mark Zuckerberg’s group issued a short statement to explain what had happened and apologize to its users: “We have worked as hard as possible to restore access and our systems are now operating normally. The underlying cause of this failure also affected many of the internal systems and tools we use in our day-to-day operations, complicating our attempts to quickly diagnose and resolve the problem.”
A configuration change to blame
Facebook says the outage was caused by “configuration changes in routers that coordinate traffic between data centers”. These changes disrupted communications between the American group’s various servers. “We want to assure you that we believe the cause of this failure is a bad configuration change. We have no indication that user data was compromised during the outage”, Facebook adds by way of reassurance. Although the group was quick to comment on the causes of Monday’s outage, its explanations remain vague, and we must turn to Cloudflare to better understand what happened.
“Facebook can’t be down, can it?”, we thought, for a second. Well it can, and here’s how. https://t.co/V0fW2n0a4I
— Cloudflare (@Cloudflare) October 4, 2021
As a reminder, Cloudflare is a content delivery network (CDN) that acts as an intermediary between Internet users and servers, in particular to protect sites from DDoS attacks. Drawing on that technical expertise, Cloudflare published a detailed explanation of Facebook’s outage on Monday.
Before continuing, two concepts need explaining: DNS and BGP. Put simply, DNS gives the IP address of a website, its location, while BGP indicates the route to reach that destination. DNS, for Domain Name System, is a foundational Internet service that translates a domain name such as “Facebook.com” or “Frandroid.com” into an IP address, such as 188.8.131.52, so that your browser knows which address to connect to, anywhere in the world. BGP, for Border Gateway Protocol, tells the networks that carry your traffic which path to take to reach that target address.
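To make the two steps concrete, here is a minimal Python sketch of a lookup followed by a route choice. It is a toy model, not how real resolvers or routers are implemented: the tables, the prefixes and the next-hop names (`transit-provider-A`, `peer-B`) are hypothetical, and the IP address is simply the example quoted above, not necessarily an address Facebook uses.

```python
import ipaddress

# Toy DNS table: domain name -> IP address.
# The address is the illustrative one from the article, used here as an assumption.
DNS_RECORDS = {"facebook.com": "188.8.131.52"}

# Toy routing table: announced prefix -> next hop (all values hypothetical).
ROUTES = {
    ipaddress.ip_network("188.8.0.0/16"): "transit-provider-A",
    ipaddress.ip_network("188.8.131.0/24"): "peer-B",
}

def resolve(name):
    """DNS step: map a domain name to an IP address, or None if unknown."""
    return DNS_RECORDS.get(name.lower())

def find_route(ip):
    """Routing step: pick the most specific announced prefix containing the address."""
    addr = ipaddress.ip_address(ip)
    matches = [net for net in ROUTES if addr in net]
    if not matches:
        return None  # no known path to this address
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]

ip = resolve("Facebook.com")   # step 1: where is it?
print(ip, find_route(ip))      # step 2: how do we get there?
```

The point of the sketch is that the two steps are independent: knowing the address is useless if no route to it is known.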
Facebook was online, but not found
Concretely, a change in BGP configuration appears to be at the origin of the outage. “BGP allows a network, like Facebook, to advertise its presence to other networks that make up the Internet. At the time of writing, Facebook is no longer advertising its presence; access providers and other networks can no longer find Facebook’s network and it is therefore unavailable”, Cloudflare explains. In other words, Facebook remained online, but Internet operators no longer knew how to reach it.
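The difference between "offline" and "unfindable" can be illustrated with a short Python sketch. This is a deliberately simplified model under stated assumptions, not a BGP implementation: the `ToyInternet` class and its methods are hypothetical names invented for this example, and a real network announces IP prefixes rather than a name.

```python
class ToyInternet:
    """Tracks which networks are running and which are announced to others."""

    def __init__(self):
        self.running = set()     # networks whose servers are powered on
        self.announced = set()   # networks that other operators know how to reach

    def announce(self, network):
        # The network tells its neighbors "traffic for me comes this way".
        self.announced.add(network)

    def withdraw(self, network):
        # The network stops announcing itself; its servers are untouched.
        self.announced.discard(network)

    def can_reach(self, network):
        # Reaching a service needs both: servers up AND a route known to others.
        return network in self.running and network in self.announced

net = ToyInternet()
net.running.add("facebook")
net.announce("facebook")
before = net.can_reach("facebook")

net.withdraw("facebook")               # the faulty configuration change
after = net.can_reach("facebook")
still_running = "facebook" in net.running

print(before, after, still_running)    # True False True
```

After the withdrawal the servers are still running, yet nobody outside can reach them, which matches the situation Cloudflare describes.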
For Facebook employees, this change in BGP configuration had other consequences that explain why the repair took so long. First, Facebook employees use @facebook.com email addresses, so their messages had to reach Facebook’s servers in order to be sent. With those servers unreachable, communicating by email was impossible. Moreover, since the servers were cut off from the Internet, it was also impossible to reconfigure them remotely. Employees had to go to the data centers in person and work on the machines locally to change the configuration.
Only once they had gained access to the data centers were Facebook employees able to correct the BGP configuration and make the various services visible to the rest of the Internet again.