As the old saying goes: the bigger they are, the harder they fall. By almost any metric, there are few things in the world today on the scale of Facebook. On 4th October 2021, we watched the drama unfold as this global giant, together with its Messenger, Instagram, WhatsApp, Mapillary and Oculus subsidiaries, vanished from the internet. For six hours, around 2.8 billion active users worldwide were left in limbo, wondering when, or if, the service would come back online.
Of course, Facebook did (in the end) return, although at the cost of massive reputational damage and a heavy financial hit. According to The Times, the outage cost Facebook around $100 million in lost advertising sales, and its shares fell 5%, wiping an almost unimaginable $47 billion from its market value. As bad days at the office go, this will take some beating, and not just for Facebook: millions of people rely on its services for some, or all, of their livelihood.
Happily, most of us as IT leaders will never suffer an outage of this magnitude or have to deal with such enormous and far-reaching consequences. However, Facebook’s very public travails are a timely reminder of the importance of good governance as our adoption of, and dependence on, cloud-based services increases.
One of the cloud’s key attractions is that you don’t have servers and other on-site hardware to maintain. But as another old expression has it: out of sight, out of mind. It’s all too easy to forget that the cloud doesn’t eliminate infrastructure; it just moves it somewhere else. Your cloud-based services still depend on huge numbers of complex components, and on lots of very clever people to look after them. Crucially, neither element is perfect, which is why no SLA ever guarantees 100% availability. There’s always that 0.001% chance that something, or someone, will fall down on the job.
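To make those percentages concrete, here is a minimal sketch that converts an availability target into the downtime it still permits over a year. The function name and the list of targets are illustrative assumptions, not figures taken from any particular provider’s SLA.

```python
# Illustrative only: the helper name and the sample targets are assumptions.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime, in minutes, that an availability target still permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return period_minutes * (1 - availability_percent / 100)


# Print the yearly downtime budget for a few commonly quoted targets.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows {minutes:.1f} minutes of downtime per year")
```

By this arithmetic, a six-hour outage (360 minutes) is far beyond the roughly 53 minutes a year that a 99.99% target would allow.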
For me, there are several key lessons we can take from this unprecedented debacle. First and foremost, it’s probably the most potent demonstration we’ve ever had that cloud-based systems are not infallible. We need to remember that cloud computing involves a huge degree of trust in other people to know what they’re doing, and they may not get it right every time.
For hyper-critical systems, rather than being in ‘a’ cloud, it’s worth considering using more than one location or region to introduce greater resilience. This kind of redundancy may seem over the top, but our architecture should reflect the reality that things still can go wrong.
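As a rough illustration of why a second location helps, the sketch below estimates the combined availability of several redundant regions. It assumes each region fails independently of the others, which is an optimistic simplification: a single change that reaches every site at once would defeat it. The function name and sample figures are hypothetical.

```python
# Illustrative only: assumes regions fail independently, which real outages
# do not always respect. The helper name and sample figures are hypothetical.
def combined_availability(per_region_percent: float, regions: int) -> float:
    """Return the percentage of time at least one of the regions is available."""
    if regions < 1:
        raise ValueError("at least one region is required")
    if not 0 <= per_region_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    failure_probability = 1 - per_region_percent / 100
    # The service is down only when every region is down at the same time.
    return 100 * (1 - failure_probability ** regions)


for count in (1, 2, 3):
    print(f"{count} region(s) at 99.9% each: {combined_availability(99.9, count):.6f}%")
```

The gain depends entirely on the independence assumption, so shared tooling, shared networks and shared change processes deserve as much attention as the number of regions.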
Above all, Facebook’s experience underlines the importance of good governance and hygiene. Get the basics right. It’s essential to make sure we can maintain business continuity in the event, however unlikely, that systems and services we can’t see or directly influence are interrupted.
The outage was caused by human error: a mistake during routine network maintenance that disconnected all the company’s global data centres simultaneously. I take this as a stark reminder of our responsibility as IT leaders to manage and support our people effectively. It’s salutary that many reports suggest the Facebook engineers concerned were working from home, a management challenge we’re all having to navigate and adjust to.
Don’t get me wrong: the cloud is a wonderful thing and the benefits vastly outweigh any potential downsides. But if one of the world’s largest, richest and most powerful companies can be caught out, there’s no room for complacency.