Entanet, which supplies broadband and telecoms services to a number of ISPs and businesses across the United Kingdom, appears to have been hit by a serious network outage last night that left a large number of customers without internet access for several hours.
The problem, which first struck at around 5pm or 6pm last night, was initially claimed to have only affected “portions of [Entanet’s] customer base”, although so many of Entanet’s ISP clients reported the incident that it appears to have hit a rather large portion.
The issue itself took roughly an hour to identify and Entanet then began to “gradually” restore their services, although some people were still reporting related connectivity problems this morning (probably because they needed to reboot their routers first).
Entanet’s Richard Partridge said this morning (9:35am):
“Between approximately 18.30hrs we began work to resolve the issue and by 20.45hrs all but two sites were reinstated successfully. The remaining sites were restored by approximately 21.30hrs and shortly after midnight. A detailed report is being prepared and will be published by 5:00pm today. We apologise for any inconvenience this issue may have caused.”
At the time of writing Entanet has still not said what caused the incident, although a recent service status update from London-based colocation and server specialist Coreix might shed some light on the matter. According to Coreix, an investigation was opened at around 5:45pm after the company noticed “an inability to contact certain locations”.
Coreix Service Status (Network – IP Transit)
“We have completed our initial analysis of the network issue and it is as follows:
1. At approximately 5:45pm we noticed an inability to contact certain locations and started to investigate.
2. We disconnected LINX, LONAP and ENTA within 5 minutes as it appeared that there was an issue with one of these providers.
3. This did not solve this issue as anticipated and issue became worse.
4. At approximately 6:30pm the ENTA edge router to which we connect went down and full connectivity was reinstated (we where not connected to it at this point per point 2). It is our belief at this point in time that ENTA suffered from a serious route leak on their systems and was falsely advertising other companies routes. This caused the internet in general to send traffic to ENTA rather than Coreix.”
Entanet has promised to release an official statement later today and we shall update accordingly. Thankfully services should now be back to normal for everybody.
UPDATE 12:32pm
A second “network incident” affecting broadband connectivity, which appears to have been caused by a faulty line card, hit Entanet this morning at approximately 10:00am and meant that internet traffic had to be rerouted across “alternative paths”. Services are said to have been “stabilised” at around midday and the operator is currently working on a proper fix.
It goes without saying that the past 12-24 hours have been something of a double whammy for Entanet and any of those affected.
UPDATE 1:48pm
It took a while but Entanet has now published a full Incident Report (PDF) for the primary failure that occurred last night. A short-ish summary of the event follows.
Incident Report – Short Summary Extract
“At approximately 18.30hrs on 21st March 2013 we became aware of loss of service to a large number of customers, which appeared to be spread across all sites and services. Our systems and network team was immediately engaged and senior members of the team attended our Telford HQ to aid with the diagnostics and resolution. Initial investigations indicated that whilst we were able to see that the network was available internally and we could contact individual devices on the core, any layer 3 traffic was not flowing into and out of our network at any of our access points.
Our DSL based customers also experienced issues as they were also unable to pass traffic, or after rebooting their CPE, were unable to authenticate with our services. … The cause of the issue was traced at approximately 19.15hrs when our engineers identified a TCAM memory overrun in one of our core router logs. This indicated that the core routers were seeing excessive BGP routes, hence the overrun.
…
The immediate action taken to rectify this problem was to correct the edge BGP filters between our lab equipment and our core. This is now in place and will remain indefinitely. We are currently reviewing whether a double level of filter can be introduced as an extra layer of protection and will implement this when possible.”
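For readers unfamiliar with what an edge BGP filter does, the idea can be illustrated with a minimal Python sketch. It is not Entanet's configuration: the allow-list prefix (a documentation range), the route cap and the function name are all hypothetical. It simply shows a filter that only accepts announcements inside an approved set of prefixes, plus a cap on accepted routes as a second line of defence against the kind of table overrun described above.

```python
import ipaddress

# Hypothetical values for illustration only; none of these come from the report.
ALLOWED_LAB_PREFIXES = [ipaddress.ip_network("192.0.2.0/24")]
MAX_ACCEPTED_ROUTES = 4


def filter_lab_routes(announced, allowed=ALLOWED_LAB_PREFIXES,
                      max_routes=MAX_ACCEPTED_ROUTES):
    """Return the announced prefixes that fall inside the allow-list.

    Raises RuntimeError if more than max_routes prefixes pass the filter,
    loosely mimicking a session being torn down when a route cap is hit.
    """
    accepted = []
    for text in announced:
        net = ipaddress.ip_network(text)
        # subnet_of() only compares networks of the same IP version,
        # so check the version first to avoid a TypeError.
        inside = any(
            net.version == parent.version and net.subnet_of(parent)
            for parent in allowed
        )
        if inside:
            accepted.append(net)
    if len(accepted) > max_routes:
        raise RuntimeError("route cap exceeded; refusing announcements")
    return accepted
```

In this sketch, calling `filter_lab_routes(["192.0.2.0/25", "198.51.100.0/24"])` keeps only the first prefix, because the second falls outside the hypothetical allow-list. Real routers express the same idea with prefix lists and maximum-prefix limits rather than application code.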
The communications provider has also apologised for the “major incident” and pledged to undertake a review of its incident management, which could result in “the introduction of a logically separate NOC service and the transfer of our support desk to an alternative phone solution”.
Meanwhile the second WBC-related outage that occurred this morning appears to have been resolved, although Entanet is still moving traffic around the related line card until it is satisfied that the fix is working.