Fluidata ISP Boss Talks Recent UK Internet Outages and Redundancy

Saturday, July 23rd, 2016 (1:41 am)

The Managing Director of ISP Fluidata, Piers Daniell, has said that “lessons need to be learned about the importance of uptime and mitigating failure” after a big chunk of the UK suffered disruption to Internet services following power failures at two major London datacentres (here and here).

The affected datacentres (Telecity and Telehouse) happened to be two of the most popular sites in the UK, which meant that even a relatively brief power failure to key hardware was able to cause significant disruption across the country. Many ISPs were hit by the failure, although BT arguably suffered most of all due to the size of their network and customer base.

At this stage it’s unclear whether BT had sufficient alternative hardware in the datacentres so as to fully and quickly mitigate the problem, but Daniell suggests that “more attention to how these networks are built and what underpins the services” is perhaps just as important a buying decision as getting a great commercial deal.

Piers Daniell said:

“Precaution is key, and whether BT had or had not in place sufficient alternative hardware in that datacentre, it did have other working datacentres, so lessons need to be learned about the importance of uptime and mitigating failure where possible. Customers, and especially businesses and ISPs, need to understand the risks of not just their networks but also their upstream suppliers in order to mitigate total outages.

Datacentres, for example, run large UPS (uninterruptible power supply) systems to cope with the switch from mains power to generators, and while this provides a level of resilience, it needs to be checked and serviced regularly. Servicing and testing vary dramatically between datacentres, so investing in smaller UPS systems for individual racks may seem excessive, but from experience it can provide a useful buffer should the worst happen.

Furthermore, the number of backups and spares again goes some way to reinforce confidence. Gone are the days, in my mind at least, when datacentres can offer n+1 resilience (where ‘n’ is the required load and the +1 means an additional spare). So, for example, if a datacentre requires four generators to power the site then five would be installed. This is very different to a more resilient site that offers 2n where, for the same example, eight generators would be provided. The big issue with all of this, of course, is cost, and this has a knock-on effect on the customer.”
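
To make the arithmetic concrete, here is a minimal Python sketch of the two designs, using the example figures from the quote above (purely illustrative, not Fluidata’s own sizing method):

# Generator counts under N+1 versus 2N redundancy (illustrative sketch only).

def n_plus_1(required: int) -> int:
    """N+1: the required number of generators plus one spare."""
    return required + 1

def two_n(required: int) -> int:
    """2N: a fully duplicated set of generators."""
    return required * 2

required_generators = 4  # the quoted example: the site needs four generators
print(f"N+1 design: {n_plus_1(required_generators)} generators")  # prints 5
print(f"2N design:  {two_n(required_generators)} generators")     # prints 8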

Fluidata’s MD correctly notes that a lot of the hardware inside datacentres, and networks in general, will spend more or less its whole life turned on, and failures are bound to happen. “No datacentre is impervious to disaster, as BT and others experienced, and while a whole site outage is very rare, the importance of having multiple datacentres is very important,” said Daniell.

As always, it often comes down to a question of cost. Yes, you can have multiple datacentres and lots of redundancy, although this does become very expensive and complex. But as Daniell concludes, it also means the “likelihood of complete outage is seriously reduced.”

We can’t help but wonder whether Ofcom’s plan to introduce a new system of automatic compensation for consumers might also turn the desire for extra redundancy into a bigger requirement than it is now. Much will of course depend upon the detail of the regulator’s proposed approach and whether the outages of last week would even count.

By Mark Jackson
Mark is a professional technology writer, IT consultant and computer engineer from Dorset (England). He founded ISPreview in 1999 and enjoys analysing the latest telecoms and broadband developments. Find me on Twitter, Facebook and Linkedin.
25 Responses
  1. Steve Jones says:

    I’d echo this, even if my comments last time were criticised as unrealistic. With networks especially, there are ways of building in more redundancy and geographical resilience.

    As to why two different (but key) datacentres were hit by similar failures two days apart, my suspicion is that it’s related to the hot weather.

    NB: The talk about testing of auxiliary power generation switchover sounds like an obvious exercise, but I know of more than one instance where it went wrong and brought down a major datacentre. Those responsible for live systems are often very resistant to such exercises, as they aren’t risk-free.

    1. Gadget says:

      At least a planned test (with good forward notification) gives the users time to plan and maybe even arrange tests of their own systems, rather than getting caught when the “real thing” happens.

    2. dragoneast says:

      I don’t think anyone was suggesting it was technically unrealistic. Obviously, it isn’t. It’s essential. But you’re a datacentre and you want to increase your prices to pay for it. Try that one on with BT. They can move their business elsewhere if they don’t like it. Yeah. There are big profits in broadband. Not, I suspect, for the pure infrastructure providers without any control over content. BT (and Sky, even TT and Voda, perhaps) aren’t stupid.

    3. Steve Jones says:

      Easier said than done with a centre that is running thousands of servers and applications, with petabytes of storage and hideously complex interconnections between systems. Some systems take hours to bring back up and fully stabilise, as the first thing that tends to happen is a flood of transactions which have built up during the downtime. For many systems, there is no such thing as true downtime.

      Hence a huge amount of effort goes into trying to protect the environment, but sooner or later something will fail. In any event, opposition to tests and changes is often very high, and it can take many weeks to get agreement.

    4. Ignition says:

      Not sure where the idea that there are big profits in broadband comes from. The margins are pretty thin across the board in the UK, which is why ISPs like line rental and content income.

    5. Peter J says:

      One of the problems with designing and operating systems with redundancy and standby power etc. is ensuring that they will work when there is a real failure. Often there is a reluctance to test a system by deliberately causing a failure in order to ensure that the system really does continue to work under such conditions. I recall a situation where a large standby generator was tested on a weekly basis, but not by failing the incoming mains supply. There was a real mains failure and the generator failed to start, because the starter battery was no longer floating on the mains charger. There is often a reluctance to test the system failure response under true failure conditions, in case it is the cause of an actual system failure and the provider then gets the blame for actually causing the failure. However, causing a failure under controlled conditions is, I would argue, the only way to find out what really happens when you get a real fault.

    6. MikeW says:

      Of all the things needed for redundancy to work, power is the trickiest to be sure of.

      I recall one story of a data centre that was moving some of their UPSs, so any partial outage was thoroughly planned and monitored, with staff watching closely. And they still managed to get it wrong.

      There just seem to be so many ways for power to be linked oddly.

    7. MikeW says:

      @ignition, @dragon

      Broadband alone is not very profitable; it’s a commodity (though if you’re a rural MP, it might not feel so).

      The profit comes from the value-add: the high availability, high guarantees, low contention and fast repair demanded by business. Or selling them content services – audio and video collaboration, and cloud services. Or the ability to sell content services – video and audio entertainment – to residential. Or mobility.

      The question is how much of an ISP’s budget from plain residential services can be expected to go not just on “mitigating failure” but on ensuring zero downtime. Perhaps limiting the problem to 10% was actually a successful mitigation.

      I wonder how many subscribers who had bought robust service variants saw an impact?

    8. Ignition says:

      Transit and peering is the one place you would expect there to be resiliency. Compared to the rest of the links in the chain it’s pretty cheap.

      As for the interconnects with BT, I find it quite believable that some ISPs took non-resilient options.

      BT have no excuse for not having multiple routes everywhere. Worst case, it should fall back to transit, so either the configuration was wrong or something malfunctioned.

      ISPs have no excuse for not using the resilient options for their 21CN feeds.

      Ideally there should not be a single point of failure once the connections are at the exchange, and at the very least once they are at the first aggregation point.

      BTW screwed up. Some ISPs screwed up. There are 2 LINX LANs and rich interconnectivity between the physical POPs those LANs span for a reason.

    9. MikeW says:

      But what about when the main peering mechanism loses one of those 2 LANs? And 20% of their traffic?

      I’d love to hear the outcome of the forensics, for sure.

    10. Ignition says:

      If you lose one of the LANs you take another route: either the other LAN, another exchange or, worst case, IP transit.

      If this doesn’t happen, either the routes aren’t being advertised elsewhere, which is unlikely as it would mean the owner of the prefix has to peer with everyone, or they are being filtered by the recipient.

      Traffic might have to go through one or more transit providers, but it should have other paths. I get that the traffic would drop off LINX or shift to the other LAN, but it should have elsewhere to go unless all routers advertising the prefixes were also impacted by the outage, which seems unlikely.

      I suppose that, in the interests of saving money, people might not only have their networks set up to prefer peering, but also simply not accept, from transit partners, routes they think they should get via peering.

      It would be dumb but it’s not impossible.
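
As a rough illustration of the fallback Ignition describes above, here is a minimal Python sketch of route selection that prefers peering but falls back to transit when the peering route is withdrawn (hypothetical next-hop names; not any ISP’s actual routing policy):

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Route:
    prefix: str
    next_hop: str
    source: str      # "peering" or "transit"
    local_pref: int  # higher wins, loosely mirroring BGP LOCAL_PREF

def best_path(routes: List[Route], prefix: str) -> Optional[Route]:
    """Pick the highest-preference route for a prefix, or None if there is no path."""
    candidates = [r for r in routes if r.prefix == prefix]
    return max(candidates, key=lambda r: r.local_pref, default=None)

routes = [
    Route("203.0.113.0/24", "linx-peer-1", "peering", local_pref=200),
    Route("203.0.113.0/24", "transit-a", "transit", local_pref=100),
]

print(best_path(routes, "203.0.113.0/24").next_hop)  # prefers peering: linx-peer-1

# Simulate the peering LAN outage: the peering route is withdrawn.
routes = [r for r in routes if r.source != "peering"]
print(best_path(routes, "203.0.113.0/24").next_hop)  # falls back to transit-a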

  2. FibreFred says:

    I’m not sure what Piers is telling me here.

    He doesn’t know the facts around what happened; is he just saying Fluidata wasn’t affected? If so, I hope you are charging ad rates.

    1. Piers says:

      It may have got lost in translation, but we have feeds from BT in London AND Manchester, which is expensive to support but meant that when there were issues in London our services failed over to Manchester, minimising downtime. BT do offer multiple datacentres, but not everyone uses them, even those offering SLAs to their customers. The point on my blog was that outages can be avoided and that people shouldn’t assume a datacentre is impervious to outages.

  3. MikeW says:

    The question of automatic compensation is interesting here.

    As I understand things, the outage affected some people’s ability to reach some websites. Not a total outage for any individual subscriber, nor for any individual website.

    How can you write a meaningful compensation scheme for that?

    Or the fact that the outage resulted in almost immediate remediation steps. What kind of availability guarantee do we expect to be given for “best efforts” internet access?

    I’m not sure that we’ll find the compensation scheme we end up with will ever kick in for circumstances like these.

    1. FibreFred says:

      No chance

      As I’ve said before, if you want an SLA, buy a product with an SLA, simple.

    2. Ignition says:

      It’s essentially impossible to guarantee an end user access to resources on a network you don’t own. An issue like this would need blame apportioning, and it would be unfeasible to isolate which customers were impacted.

    1. GNewton says:

      @TheFacts: Again, your link is misleading, and it’s back to your usual spreading of half-truths. There were at least 3 total power failures in different datacentres of Equinix in recent months, and yet BT has not learnt its lesson about implementing more robust network redundancy.

    2. TheFacts says:

      It’s additional information; if you don’t like it, add a comment there.

  4. kds says:

    It would be good to know what actually happened. Isn’t the internet designed to work after a nuclear attack?

  5. Chris P says:

    @GNewton
    It’s important to note that the outages affected equipment belonging to BT’s third-party providers, not BT’s own equipment. It’s entirely possible BT’s network did fail over, but the servers BT’s customers were trying to reach were unavailable because they were only reachable via the failed systems impacted by the power outage – in which case it’s not an issue with BT but an issue with the robustness of the remote systems’ connectivity. It’s all too easy to blame BT, but you won’t know for sure unless you forensically dissect it.

    1. FibreFred says:

      That is of no interest to GNewton (aka JNeuhoff); if there’s a chance to blame BT for something he’ll be all over it, regardless of what the actual facts behind it are.

      The power failure lies at the feet of Equinix; everything else, such as why the failovers didn’t work, is unknown to us.

  6. a-poor-workman-allways-blames-his-suppliers says:

    Last week’s events merely confirmed what any sensible datacentre or network engineer knows: any component in a complex system can and will fail…

    It’s irrelevant which elements fail. If you consider that a number of networks of various sizes have equipment in the impacted locations but remained online, then it would appear fairly obvious that only one network has, for whatever reason, not built its network with even basic N+1 resiliency, never mind the 2N which is commonly used to support SLAs of five-nines and above…

  7. Chris P says:

    Anyone who wants to truly mitigate against ISP failure simply uses two ISPs and ensures they do not share infrastructure as far up the line as possible.
    For WANs you could use BT & Verizon and ensure full diversity.
    For Internet access you could use the same, but request that one connects to LINX and the other to the Amsterdam Internet Exchange.

    All this is extra cost in service, equipment and skilled staff, but what is the cost of not having that redundancy and resilience?

    All that doesn’t stop your potential customers/users from buying the cheapest possible ISPs, which may not have similar levels of redundancy to ensure they can still reach you.

    Globally, big internet businesses have private links into ISPs to ensure low latency and high availability to potential customers by circumventing internet exchanges.

    https://openconnect.netflix.com/en/

    1. GNewton says:

      @Chris P: I agree, good points. However, why would an end user have to go through all this trouble? Wouldn’t it be up to the ISP to take care of this? E.g. use a mix of different datacentre companies and pay for different third-party routes for better redundancy?

      BTW: Please ignore FibreFred; he suffers from the symptoms of prosopagnosia and also tends to call posters ‘trolls’ every week.
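
Picking up Chris P’s dual-provider suggestion above, a minimal Python sketch of a health check across two independent uplinks might look like this (hypothetical gateway addresses from the documentation IP ranges, Linux ping syntax assumed; a real deployment would drive routing changes rather than just print status):

import subprocess
from typing import Dict

# Hypothetical gateway addresses standing in for each provider's gateway.
UPLINKS: Dict[str, str] = {
    "isp-a": "192.0.2.1",
    "isp-b": "198.51.100.1",
}

def link_is_up(gateway: str) -> bool:
    """Send one ICMP probe via Linux ping; a real check would use several probes."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", gateway],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

if __name__ == "__main__":
    status = {name: link_is_up(gw) for name, gw in UPLINKS.items()}
    for name, up in status.items():
        print(f"{name}: {'up' if up else 'DOWN'}")
    if not any(status.values()):
        print("Both uplinks down - nothing left to fail over to.")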
