
Customers of various UK broadband ISPs and mobile operators are today reporting that they’re unable to access major online newspapers, government and other websites, with many facing the “Error 503 Service Unavailable” message. Fingers are currently being pointed at Amazon’s cloud services and another cloud provider, Fastly.
At the time of writing, we’re noticing that some sites, such as those for the UK Government, are starting to become accessible again after being unreachable for – in some cases – almost an hour.
A quick look at the Service Status page for Fastly’s Content Delivery Network (CDN) service shows that they began investigating a “potential impact to performance with our CDN services” at 9:58am UTC (10:58am BST). The latest update at 10:44am UTC (11:44am BST) states that “the issue has been identified and a fix is being implemented.”
The Fastly CDN is used by lots of major websites, such as the New York Times, Vimeo, Twitch, Reddit, UK Government and many more.
We are aware of the issues with https://t.co/uLPSBt4jdQ which means that users may not be able to access the site. This is a wider issue affecting a number of other non-government sites. We are investigating this as a matter of urgency.
— GOV.UK (@GOVUK) June 8, 2021
UPDATE 12:00 midday
Fastly reports that their customers may “experience increased origin load as global services return.”
UPDATE 12:46pm
We’ve had a few comments come in on this.
Toby Stephenson, CTO at Cyber Security firm Neuways, said:
“This incident highlights the reliance of many of the world’s biggest websites on content delivery networks (CDNs) such as Fastly. As there are so few of these CDN services, these outages can occur from time-to-time. By using these CDNs to push content to readers, these websites are usually fast and responsive, but on this occasion they have been left with egg on their collective faces. The technical backends of these big websites are probably fine, but it is the frontends that can’t be accessed and content cannot be pushed as the network is down.”
Gaz Jones, Technical Director of Think3, added:
“Fastly CDN had major problems affecting Stack Overflow, Spotify, Stripe, Gov.uk and GitHub among others. This is what happens when half of the internet relies on Goliaths like Amazon, Google and Fastly for all of its servers and web services. The entire internet has become dangerously geared on just a few players.”
UPDATE 1:43pm
Fastly has confirmed that the issue was caused by an unspecified “service configuration” issue “that triggered disruption across our PoPs (Points of Presence) globally.” The configuration change has been disabled and services are now fully restored.
5 minutes ago, pretty much all images disappeared from amazon.co.uk too. I was browsing the site and all listings were coming up without images.
Cloud, always a problem.
Most failures occur in ‘Cloud’ because very little is running on single boxes these days. Cloud platforms are energy efficient, cost effective when used correctly and robust, so I’m sure we’d have more problems if it was done ‘the old way’.
This failure doesn’t appear to be an issue with the Cloud though, it’s a configuration issue so likely an engineer has deployed something through their CDN service which caused the outage (wouldn’t like to be that person!).
How is it that major organisations do not appear to have a multi-CDN strategy which reroutes traffic and thus maintains a service to their users?
How would you do that? By centralising your DNS with one provider instead, so that the detection and failover can work!
We do CDN load balancing at work. The CNAME to your CDN can do round robin between multiple CDN providers, or you can use a ratio and send X percent to one CDN and Y percent to another CDN, or even use it as a failover.
There are many tools to do it, like load balancers (e.g. F5 GTM), and many DNS providers allow load balancing of records with weighting/priority settings. Even our fairly small website has a backup CDN on CloudFront and the main one on Limelight.
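As a rough illustration of the weighting/priority idea described above, here is a minimal Python sketch of how a DNS-level balancer might choose which CDN hostname to hand out. It is not how F5 GTM or any particular DNS provider implements it: the `CdnTarget` class, the `pick_cdn` function and the `example.net` hostnames are all hypothetical, and the `healthy` flag is assumed to be set by some external health check.

```python
import random
from dataclasses import dataclass


@dataclass
class CdnTarget:
    hostname: str         # hypothetical CNAME target for one CDN
    weight: int           # share of traffic within its priority tier (must be > 0)
    priority: int = 1     # lower number = preferred tier; higher = backup
    healthy: bool = True  # assumed to be updated by an external health check


def pick_cdn(targets, rng=random):
    """Return one healthy target, or None if every CDN is down.

    Only the best (lowest-numbered) priority tier that still has a healthy
    target is used; within that tier the choice is weighted at random.
    """
    live = [t for t in targets if t.healthy]
    if not live:
        return None
    best = min(t.priority for t in live)
    tier = [t for t in live if t.priority == best]
    return rng.choices(tier, weights=[t.weight for t in tier], k=1)[0]


if __name__ == "__main__":
    targets = [
        CdnTarget("cdn-a.example.net", weight=80),
        CdnTarget("cdn-b.example.net", weight=20),
        CdnTarget("cdn-backup.example.net", weight=1, priority=2),
    ]
    print(pick_cdn(targets).hostname)  # usually cdn-a, sometimes cdn-b

    # Simulate both main CDNs failing: traffic falls over to the backup tier.
    targets[0].healthy = False
    targets[1].healthy = False
    print(pick_cdn(targets).hostname)
```

Putting two targets in the same tier gives the "X percent / Y percent" ratio, while a target in a higher-numbered tier only receives traffic once everything ahead of it is marked unhealthy. In practice the DNS record's TTL limits how quickly clients notice a switch.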
It entirely depends on whether the cost of an outage is greater than the cost of removing the risk of that outage. Removing the risk of losing your CDN for a couple of hours once in a decade would be a tough business case to get past a board.
The problem is that you don’t know if it’s going to be a couple of hours once in a decade. Cloudflare had failures in 2019 and 2020, and AWS had one in 2020 as well, with both of the 2020 failures taking out much of the web. It’s surely a major problem for organisations which are very reliant on the web, and it cannot be right for almost all traffic to be processed by just three CDN providers.
Openreach’s “when and where” fibre map was affected by this earlier today: it was getting 503 errors when pulling tiles from OpenStreetMap. Working now though.
Web3 and blockchain-based CDNs will help solve these kinds of outages, using IPFS and other approaches to content delivery and file storage on a decentralised topology. Someday soon we can move the technology along and not have the whole WWW wrapped up in a handful of providers.