Today we (A&A) had another brief outage impacting broadband, Ethernet and hosted customers, and VoIP. It was a bit complicated as it affected one side of an LACP, so probably half of things were working and half not, though it looks like pretty much all broadband went down.
It was an error on our part - the ops team have been working hard all week, and working with a consultant, to help us investigate last week's issues with the Cisco switches. They have done a number of changes (adding more logging, etc) and diagnostics during the week. At each stage they have to assess the risk and decide if they can go ahead or wait until evening or even overnight. A change today, bringing back one of the links between the London data centres (one shut down on Friday) so we could test it independently of normal operation, resulted in breaking the switch links. Even the consultant thought it would be OK.
I think I can elaborate a tad more on things we know. I am sure the ops team will shout if I have misunderstood. At this stage there are aspects of what happened that are still unclear. This means we are adding some "defensive" config to try and address possible causes for the future.
The main issue, it now seems, was that the BGP links to all of our carriers from all of our switches failed at the same time. Yeh, so much for redundancy! These are private links on a separate VRF and not connected to other BGP. The BGP is with routers on the end of locally connected single fibre links (of which we have many), not LACP or anything complicated. So the failure has to be entirely within the Cisco switches. We can almost certainly rule out hardware failure impacting all of them at once. Also, with these links being on a separate VRF and not seeing Internet traffic at all, some attack from outside seems unlikely. This leaves us with the possibility of some sort of unstable config on the switches, maybe something spanning tree related (I hate spanning tree), or maybe some BGP issue with routes received from carriers, which seems unlikely, but maybe not impossible. So there is a lot of careful review of things like BGP filters from carriers, and spanning tree config, and so on.
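To illustrate the reasoning above: if every switch loses its carrier BGP sessions within a few seconds of each other, independent hardware faults are an unlikely explanation and a shared cause (config, spanning tree, received routes) is more plausible. Here is a minimal Python sketch of that check. It is not the tooling actually used; the switch names, timestamps and the 30 second window are all made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical log extract: (switch name, time its carrier BGP session dropped).
# Names and timestamps are invented for illustration only.
drops = [
    ("switch-a", datetime(2024, 1, 1, 10, 0, 2)),
    ("switch-b", datetime(2024, 1, 1, 10, 0, 5)),
    ("switch-c", datetime(2024, 1, 1, 10, 0, 9)),
]

def is_simultaneous(events, window=timedelta(seconds=30)):
    """True if all drops fall within `window` of each other (assumed threshold)."""
    if not events:
        return False
    times = [when for _, when in events]
    return max(times) - min(times) <= window

def affected_switches(events):
    """Sorted list of distinct switches that logged at least one drop."""
    return sorted({name for name, _ in events})

if __name__ == "__main__":
    print("Switches affected:", affected_switches(drops))
    print("Looks like a common cause:", is_simultaneous(drops))
```

A tight cluster of drops across several switches does not prove a shared cause, but it is a good reason to look at config and protocol state before suspecting individual devices.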
The "fix" was rebooting half a dozen Cisco switches. On Thursday this worked, but it took some time to conclude that it was a sane thing to do, once other options were exhausted.
As I am sure you can appreciate, just "turning it off and back on again", or rebooting the switches, really is a last resort. We have highly skilled engineers who spent some time trying to diagnose the actual issue before taking such a step, and that is one reason these issues can take some time to fix. Sometimes a reboot can fail to solve anything but lose valuable clues.
On Friday that worked too. Again we tried to understand the issue first, and got a lot more information. The reboots seem to have triggered a second issue, with one of the switches being stupid (as per my other blog post) and coming up in a half-broken state. Rebooting that one switch again sorted it. It is almost unheard of to have two different issues like this, one after the other, and that really threw us as well.
A lot of this week has been spent understanding the way the Cisco switches are set up in much more detail, adding more logging, and updating processes so we have a better idea what to do if it ever happens again - both fixing things more quickly, and finding more clues as to the cause. It may be that the changes being made have already mitigated the risk of it happening again. We hope so.
Obviously this sort of thing is pretty devastating - I am really unhappy about this, and really sorry for the hassle it has caused customers.
As I say, it is not really like the "good old days" when BT would have a BRAS crash pretty much every day. These days we expect more, and our customers expect more.
So, please do accept my apologies for the ongoing issues, and my reassurance that they are being taken very seriously.