
The American Content Delivery Network (CDN) and IT service company Cloudflare has committed to making several key changes in order to avoid breaking a significant chunk of the internet again, as they did on two occasions: first in November 2025 (here) and then, to a lesser extent, in early December 2025.
The bigger of the two events occurred on 18th November, when a huge chunk of the internet suddenly became sporadically inaccessible for several hours after Cloudflare pushed out a “wrong configuration” (i.e. a bug in the generation logic for their Bot Management feature file) that “took down our network in seconds”.
Part of the problem stems from the difference between how Cloudflare deploys different types of updates. For example, when the company releases software version updates, they do so in a controlled and monitored fashion. For each new binary release, the deployment must successfully pass multiple gates before it can serve worldwide traffic (e.g. deploying to staff traffic first, followed by a phased roll-out).
“If we detect an anomaly at any stage, we can revert the release without any human intervention,” said the company’s Chief Technology Officer, Dane Knecht, in a new blog post (here). But Cloudflare doesn’t apply the same methodology to configuration changes, which are deployed instantly. “We give this power to our customers too: If you make a change to a setting in Cloudflare, it will propagate globally in seconds,” added Dane.
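To make the contrast concrete, below is a minimal sketch, in Python pseudocode, of a gated release pipeline versus an instant global config push. This is only an illustration of the approach described above: the stage names and helper functions (deploy_to, is_healthy, rollback, instant_config_change) are assumptions for the example, not Cloudflare’s actual tooling.

```python
# Illustrative sketch only, not Cloudflare's real tooling: contrasting a gated,
# health-mediated binary rollout with an instant global configuration push.
# Stage names and helpers below are assumptions for the example.

import random

STAGES = ["staff_traffic", "canary_pop", "5_percent_of_traffic",
          "50_percent_of_traffic", "worldwide"]

def deploy_to(stage: str, build: str) -> None:
    print(f"deploying {build} to {stage}")

def is_healthy(stage: str) -> bool:
    # Stand-in for real health signals (error rates, latency, crash loops).
    return random.random() > 0.05

def rollback(build: str) -> None:
    print(f"anomaly detected, reverting {build} everywhere it was deployed")

def gated_release(build: str) -> bool:
    """Promote a binary through successive gates; revert automatically
    (no human intervention) if any stage looks unhealthy."""
    for stage in STAGES:
        deploy_to(stage, build)
        if not is_healthy(stage):
            rollback(build)
            return False
    return True

def instant_config_change(setting: str, value: str) -> None:
    """By contrast, a configuration change propagates globally in seconds,
    with no intermediate gates to catch a bad change before it reaches everyone."""
    print(f"pushing {setting}={value} to every data centre at once")

if __name__ == "__main__":
    gated_release("proxy-binary v2025.11.18")
    instant_config_change("bot_management.feature_file", "v123")
```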
Cloudflare now acknowledges that the past two incidents have demonstrated that they “need to treat any change that is applied to how we serve traffic in our network with the same level of tested caution that we apply to changes to the software itself”. As a result, the provider has proposed to gradually make a series of changes to address this and to generally improve resilience, so that if an outage does occur again then its impact should be much less significant. All of this will fall under a new plan called Code Orange: Fail Small.
Key Plans for Code Orange: Fail Small
➤ Require controlled rollouts for any configuration change that is propagated to the network, just like we do today for software binary releases.
➤ Review, improve, and test failure modes of all systems handling network traffic to ensure they exhibit well-defined behaviour under all conditions, including unexpected error states.
➤ Change our internal “break glass” procedures, and remove any circular dependencies so that we, and our customers, can act fast and access all systems without issue during an incident.
These projects aim to deliver iterative improvements as they proceed, rather than one “big bang” change at their conclusion. By the end of Q1 2026, Cloudflare expects to be in a position to ensure that all production systems are covered by Health Mediated Deployments (HMD) for configuration management (i.e. releasing config updates in the same way as software updates).
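As a rough illustration of what health-mediated deployments for configuration might look like in practice (an assumption based on the description above, not Cloudflare’s published design), the idea is that a generated configuration artefact would be validated and then promoted through the same kind of gates as a binary release, rather than pushed everywhere at once:

```python
# Rough illustration of health-mediated config deployment, assumed from the
# description above rather than taken from Cloudflare's actual design.
# Validation rules, stage names and thresholds are all hypothetical.

import json

STAGES = ["staff_traffic", "canary_pop", "5_percent_of_traffic", "worldwide"]
MAX_FEATURES = 200  # hypothetical sanity limit on a generated feature file

def validate_feature_file(raw: str) -> dict:
    """Reject a malformed or oversized generated file before it ships,
    so a bug in the generation logic fails the pipeline, not the network."""
    data = json.loads(raw)                      # must parse cleanly
    if len(data.get("features", [])) > MAX_FEATURES:
        raise ValueError("feature file larger than expected; refusing to deploy")
    return data

def stage_is_healthy(stage: str) -> bool:
    # Stand-in for real signals: 5xx rates, proxy panics, latency.
    return True

def deploy_config(raw: str) -> bool:
    config = validate_feature_file(raw)         # gate 0: static validation
    for stage in STAGES:                        # gates 1..n: phased propagation
        print(f"applying config ({len(config['features'])} features) to {stage}")
        if not stage_is_healthy(stage):
            print(f"health regression at {stage}; rolling config back")
            return False
    return True

if __name__ == "__main__":
    deploy_config('{"features": [{"name": "example_bot_score"}]}')
```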
By the same target date, the company also expects to have updated its systems to adhere to proper failure modes, as appropriate for each product set, and to have processes in place so that the right people have the right access to carry out remediation during an emergency.
“We understand that these incidents are painful for our customers and the Internet as a whole. We’re deeply embarrassed by them, which is why this work is the first priority for everyone here at Cloudflare,” said Dane Knecht.
Has Cloudflare got too big? When Cloudflare sneezes, the internet catches a cold…
Possibly.
But at least Cloudflare are brutally honest when they screw up, which is more than can be said for many of their competitors, who get their marketing departments to make vague statements on what went wrong and what they might do to prevent it in future.
Cloudflare do care more than most and I would think that most of the internet would give them the few months they need to sort this out.
Most of the other cloud solutions are in the same boat: config changes do not necessarily get a gradual cadence of deployments.
Yes, definitely. Much like the reliance on Amazon Web Services. Far too much control in such a small space.
This is the main reason why my Cerberus Net Connect Portal panel went to a blank page; they told me they have issues with it.
Why can’t they host it in a United Kingdom cloud database instead of a useless United States cloud database?
Spoken like someone who hasn’t the slightest clue about how technology works. Congrats.
“If builders built houses the way programmers build programs, the first woodpecker to come along would destroy civilization.” – Gerald Weinberg.
20 year old quote, still true to this day it seems.
I don’t think it has any relevance.
How many people could afford houses if each house had to be designed and built to survive each and every possible combination of natural disaster, attempted burglary, criminal damage, and owner’s terrible DIY? And all of those simultaneously?
Programmers build things; events and black hats try to break them. The last two have the advantage of infinite time, whereas the programmer usually has some kind of release target.
If you ask one builder to build a house and then give a thousand burglars unlimited time to try and break in and steal your sausage rolls, who do you think wins?
The point is the Cloudflare outages were NOT caused by hackers, they were minor accidents that crashed the entire global system – much in the way you wouldn’t expect a woodpecker to land on a house instead of a tree.
Leaving aside the house analogies, it is probably fair to say that for every website that actually needs DDoS protection, there are many more that are never going to be a target and whose use of CF would only have decreased their overall availability.
But the industry from which the phrase “no one ever got fired for buying IBM” originated will unnecessarily continue to place eggs in other people’s baskets.
Who’d have thought that “configuration as code” meant you needed to test configuration changes the same way you test code changes? 🙂
I’ll wait to see if this is true.
Which is isn’t