Thursday, 20 July 2017

It is almost like the good old days, err...

Today we (A&A) had another brief outage impacting broadband, Ethernet and hosted customers, and VoIP. It was a bit complicated as the fault was on one side of an LACP link, so probably half of things were working and half not, though it looks like pretty much all broadband went down.
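For readers unfamiliar with LACP: an aggregate bundles several physical ports into one logical link, with traffic hashed per flow across the members. A minimal Cisco-style sketch (the interface names and channel number are invented for illustration, not our actual config):

```
! Two physical ports bundled into one logical port-channel via LACP.
! Traffic is hashed per flow across the members, so if one member goes
! bad, the flows hashed onto it break while the rest keep working.
interface Ethernet1/10
  channel-group 10 mode active    ! active = initiate LACP negotiation
interface Ethernet1/11
  channel-group 10 mode active
interface port-channel10
  description Inter-site aggregate link
```

That per-flow hashing is why a fault on one side of an aggregate can leave roughly half of things working and half not.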

It was an error on our part. The ops team have been working hard all week, together with a consultant, to help us investigate last week's issues with the Cisco switches. They have made a number of changes (adding more logging, etc.) and run diagnostics during the week. At each stage they have to assess the risk and decide whether to go ahead or wait until the evening, or even overnight. Today's change, bringing back one of the links between the London data centres (shut down on Friday) so we could test it independently of normal operation, ended up breaking the switch links. Even the consultant thought it would be OK.

I think I can elaborate a tad more on things we know. I am sure the ops team will shout if I have misunderstood. At this stage there are aspects of what happened that are still unclear. This means we are adding some "defensive" config to try and address possible causes for the future.

The main issue, it now seems, was that the BGP links to all of our carriers from all of our switches failed at the same time. Yeh, so much for redundancy! These are private links on a separate VRF and not connected to other BGP. The BGP is with routers on the end of locally connected single fibre links (of which we have many), not LACP or anything complicated. So the failure has to be entirely within the Cisco switches. We can almost certainly rule out hardware failing in all of them at once. Also, being on a separate VRF and not seeing Internet traffic at all, it seems unlikely to be some attack from outside. This leaves us with the possibility of some sort of unstable config on the switches, maybe something spanning tree related (I hate spanning tree), or maybe some BGP issue with routes received from carriers, which seems unlikely, but maybe not impossible. So there is a lot of careful review of things like BGP filters from carriers, spanning tree config, and so on.
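To illustrate, the sort of "defensive" config under review might look something like this Cisco NX-OS-style sketch; the VRF name, interface, AS numbers, neighbour address and limits are all invented for the example, not our actual setup:

```
! Carrier-facing port in its own VRF, kept out of spanning tree.
vrf context CARRIERS
interface Ethernet1/1
  description Carrier router, locally connected single fibre
  vrf member CARRIERS
  spanning-tree port type edge      ! point-to-point router link, no STP role
  spanning-tree bpduguard enable    ! shut the port if a BPDU ever appears

! Filter what we will accept from the carrier over BGP.
ip prefix-list CARRIER-IN seq 5 deny 0.0.0.0/0            ! never accept a default route
ip prefix-list CARRIER-IN seq 10 permit 0.0.0.0/0 le 24   ! prefixes up to /24 only
router bgp 64500
  vrf CARRIERS
    neighbor 192.0.2.1 remote-as 64501
      address-family ipv4 unicast
        prefix-list CARRIER-IN in
        maximum-prefix 100 restart 5  ! cap accepted routes; retry after 5 min
```

The idea in both halves is the same: make the switch reject anything unexpected (a stray BPDU, a surprise route) rather than try to process it.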

The "fix" was rebooting half a dozen cisco switches. On Thursday this worked, but it took some time to conclude that was a sane thing to do, when other options were exhausted.

As I am sure you can appreciate, just "turning it off and back on again", or rebooting the switches, really is a last resort. We have highly skilled engineers who spent some time trying to diagnose the actual issue before taking such a step, and that is one reason these issues can take some time to fix. Sometimes a reboot can fail to solve anything while losing valuable clues.

On Friday that worked too; again we tried to understand the issue first, and got a lot more information. The reboots seem to have triggered a second issue, with one of the switches being stupid (as per my other blog post) and coming up in a half-broken state. Rebooting that one switch again sorted it. It is almost unheard of to have two different issues like this, one after the other, and that really threw us as well.

A lot of this week has been spent understanding the way the Cisco switches are set up in much more detail, adding more logging, and updating processes so we have a better idea what to do if it ever happens again - both fixing things more quickly and finding more clues as to the cause. It may be that the changes already made have mitigated the risk of it happening again. We hope so.

Obviously this sort of thing is pretty devastating - I am really unhappy about this, and really sorry for the hassle it has caused customers.

As I say, it is not really like the "good old days" when BT would have a BRAS crash pretty much every day. These days we expect more, and our customers expect more.

So, please do accept my apologies for the ongoing issues, and my reassurance that they are being taken very seriously.

Adrian
Director, A&A

8 comments:

  1. I appreciate the transparency.

What is the state of play with the vendor? Did you submit dumps to TAC and raise ticket(s)? Also, are all six boxes running the same NX-OS release with the same SMU bundles applied?

I would suggest detailing the problem in a mail to both the UKNOF and c-nsp lists, importantly including which NX-OS release you are running; others are likely to have seen this type of thing before.

Many helpful Cisco techs hang out on the c-nsp list - https://puck.nether.net/mailman/listinfo/cisco-nsp

1. My understanding is that we were not up to speed on doing the dumps and dealing with TAC - which is part of the training this week. Using Cisco at all is a bit new for us and we use these almost entirely as fast switches. The BGP is a small bit we use and we thought we understood how it worked, but clearly we need to understand more.

  2. Are you running BGP on the Cisco switches?

  3. Shit happens, as they say, and it's always good to test my backup connection every now and then 😀

  4. The thing is, this is exactly why we chose to use AA: because when things go wrong, you are open and honest with no bullshit, because you have staff that know what they are doing and give a damn, and because the company has the right attitude. You simply can't ask for more than that. Truly zero downtime is probably a fantasy. But most of those good attributes are in fairly short supply, and finding several of them, or even all of them, in a single company is nearly unheard of, unfortunately.

  5. I've been trying to educate my employer that your service is worth the extra they would have to pay. One nice surprise was that when I told them about last week's issues and showed them the text messages etc., they were actually most impressed. When the cheap system fails they all spend hours trying to figure out what went wrong. Your openness and information sharing may have brought you a new customer.
