Friday, 14 July 2017

Pitfalls of system redundancy

Given the events of last night and today it is worth my writing up a bit about redundancy.

The way most things work in somewhat "industrial" IT is that you have two of everything (at least).

There are many ways this can work. There are things like VRRP, which allows more than one device to be a router. There are things like LACP, which allows more than one Ethernet port to work together as a bundle. Then there are things like BGP, which allows more than one route with fallback.

Now, all of these work on a very simple logic - you have more than one bit of equipment or cable or fibre, and if one breaks you can carry on by using the other(s).

This is good, and this works. Mostly.

But there is a problem!

The problem is when something does not quite break!

All of these systems rely on the broken kit being dead, not working, same as turned off. And then it all detects the failure and falls back.

What if a switch is apparently working, all ports work, and some local communications work, and even some of the ports are passing traffic, but somehow some things are not passing some traffic?

That is bad, really bad. Not only do the fallback systems not realise, so they keep sending some of the traffic to the ill switch, but trying to identify the issue is a nightmare.

What happened today?

I have to say that we had some "special shit" today. At one point we had a case where I could not access a router from my laptop but my colleague sat next to me could! I had two routers that were able to ARP each other, well actually one could ARP the other but not the other way around, and they could not ping!

The usual tools like ping and traceroute to find the break in a network simply did not work!

We had links that allowed some traffic and not others, and that was mental.

The pain in the arse here is something called LACP. This works using two (or more) links, and we have a lot of it for the very purpose of redundancy. The LACP links pick an interface from a set using a "hash", typically of the IP addresses and maybe even the ports.

This means that some IP to IP traffic uses one link, and some uses another. And if one of those links is "ill" that will simply not work and drop the packet, and the BGP session, and all sorts!
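
To make that concrete, here is a minimal sketch of hash-based link selection. The hash used here (a CRC32 of the two IP addresses) is purely an assumption for illustration; real switches use their own, often configurable, algorithms, and `pick_link` and `reachable` are made-up names, not anything from the kit involved.

```python
import ipaddress
import zlib


def pick_link(src_ip: str, dst_ip: str, n_links: int) -> int:
    """Toy stand-in for a switch's hash: map an IP pair to a member link index.

    Assumption: CRC32 over source + destination address. Real kit differs.
    """
    key = ipaddress.ip_address(src_ip).packed + ipaddress.ip_address(dst_ip).packed
    return zlib.crc32(key) % n_links


def reachable(src_ip: str, dst_ip: str, n_links: int, ill_links: set) -> bool:
    """A flow only works if the link its hash selects is not one of the ill ones."""
    return pick_link(src_ip, dst_ip, n_links) not in ill_links


# Two links in the bundle, link 1 is "ill" but still reports as up.
flows = [("192.0.2.1", f"198.51.100.{i}") for i in range(1, 9)]
for src, dst in flows:
    link = pick_link(src, dst, 2)
    status = "ok" if reachable(src, dst, 2, {1}) else "DROPPED"
    print(f"{src} -> {dst}: link {link} {status}")
```

The same pair of addresses always lands on the same link, so a given flow either always works or always fails. Each end of the bundle also makes its own choice for the traffic it sends, so the two directions of one conversation may well use different links.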

A layer 2 problem...

The issue we ended up with was clearly a layer 2 issue, Ethernet. We had equipment only on that one switch that was not responding. We had weird issues when sending traffic to that switch with another on an LACP link. Basically, we ended up with a "half broken" switch.

A lot of Cisco debug (spell correct said "drug" not "debug" on that) was added, and then a reboot. It came back fine and is now working...

So do we have a switch with an intermittent partial failure? Is it hardware (self test says no)? Is it some hack and vulnerability? Who knows?

I am not sure we can fully explain the whole of the issues we have with that one switch being iffy, maybe at a stretch.

Next steps?

We need to make the switch less important (we have) and see if it fails again, or if something affects other switches. It would be simple if this was a hardware issue: we swap out and/or repair.

The problem is that if this is some hack, then we have the same problem on other switches. We are not quite on latest code, but planned upgrades are in the pipeline already.

We have reports of people on TT retail that also dropped and maybe even BT? I am not convinced.

We have to wait and see.

What did we learn?

We went off on some tangents with this - the whole way normal dual redundancy works was not, well, "working". We had to try shutting things down to see what made it work. We even shut down some links that were a bad idea for hosted customers for a bit. Sorry.

We now know to look for the LACP related anomalies in this set up. We have found these in carrier networks before, but we simply were not used to this in our layer 2 network.
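
One way to go looking for this sort of anomaly is to make sure test traffic lands on every member of a bundle, rather than whichever one the hash happens to pick. The sketch below assumes a hash that includes the TCP/UDP ports and that we know (or can guess) the algorithm; the CRC32 hash and the names `pick_link` and `probe_ports` are hypothetical, not anything the switches actually expose.

```python
import zlib


def pick_link(src_port: int, dst_port: int, n_links: int) -> int:
    """Hypothetical hash over layer 4 ports only; a real switch will differ."""
    key = src_port.to_bytes(2, "big") + dst_port.to_bytes(2, "big")
    return zlib.crc32(key) % n_links


def probe_ports(dst_port: int, n_links: int, start: int = 49152, end: int = 65535) -> dict:
    """Find one source port per member link, so each link gets its own probe.

    Returns {link_index: source_port}. If the port range runs out before
    every link is covered, the result simply has fewer entries.
    """
    chosen = {}
    for src_port in range(start, end + 1):
        link = pick_link(src_port, dst_port, n_links)
        chosen.setdefault(link, src_port)  # keep the first port seen per link
        if len(chosen) == n_links:
            break
    return chosen


ports = probe_ports(dst_port=179, n_links=4)
for link, src_port in sorted(ports.items()):
    print(f"link {link}: probe from source port {src_port}")
```

A probe from each of those source ports that fails on just one of them would point at a single ill member link, which plain ping between two fixed addresses may never touch.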

We learned a lot on how to extract info from the Cisco switches.

If it happens again we know where to look, and no, it is not to FireBrick, but to Cisco!

18 comments:

  1. Thanks. Very useful ... first thing I did was look at IRC to see 'if it was me'. I saw the activity and went off to my meeting happy that by the time I got back it would be OK. And it was. (well, I knew because I was getting texts!)

  2. I did wonder if it was my end, but as I have redundant providers, I carried on with the other provider.

    At least AA will tell you when something goes wrong, rather than just blaming a local PC issue.

    Finally, these sorts of things are why Google pays a team to try and break Google, in pretty much any way imaginable, to try and fix the unexpected before they have an outage. That they can just turn off a data centre if it gets a bit warm out is so impressive.

    Replies
    1. Openness is very important - back in the day when Demon were the "ISP for techies" they were always very open about things they got wrong, even if it meant admitting that an outage was caused by someone unplugging the wrong thing. Then they got bought out and the new bosses seemed to consider admitting any failure to be bad PR. After several episodes of spending hours debugging my own end only to find out that it was the ISP's fault (and that they almost certainly knew that all along but weren't telling anyone) I dropped them.

      I do wonder how many other customers "improving their PR image" cost them and whether it actually benefited them. That said, non-techies frequently seem to be mostly interested in finding someone to blame instead of accepting that mistakes happen and you just have to do your best to fix things ASAP, so maybe it did...

    2. It was worse than that, as I recall: they moved from claiming that any admission of problems was bad PR, to claiming, hilariously, that as a public company any problem with their service whatsoever constituted something that might affect the stock price, so *obviously* any status reports whatsoever had to be written by lawyers and had to go through several days to weeks of legal checking before anyone outside the company was allowed to know about it.

      (Because, of course, there's no way anyone could possibly learn about any of their interminable total internal routing failures other than via a lawyer-vetted statement to shareholders. At least, not after everyone had got sick of this standard of mushroom-management "service" and stopped actually using them as an ISP, that is.)

  3. Great to see continued openness from AA, I know how frustrating these types of issues can be to track down, all whilst under pressure.

    Few Q's
    - Did you manage to get a tac-pac before the reboot?
    - Are you running BFD over the port channel?
    - Have you looked at the ASIC/cabling layout to determine any points of commonality (N5K: show hardware internal carmel all-ports)
    - Any changes in traffic profile pre/post incident?
    - Is port hashing working well (e.g. are links in the bundle load sharing reasonably equally)?

    I'm sure you guys were in deeper than this but always good to cover the basics first...

  4. > "So do we have a switch with an intermittent partial failure? Is it hardware (self test says no)? Is it some hack and vulnerability? Who knows?"

    Reminds me of the time I had a setup on my Mikrotik switch so that the LAN switch portgroup (ether1-19) and WAN switch portgroup (ether20-24) were completely separate from each other (even if using the same VLAN ID), and I had loop protect screw up on the LAN switch portgroup but not the WAN, so it took down my LAN but not WAN.

    I just rebooted the switch, didn't even bother to use a serial cable. Silly me.

  5. Interesting. Thanks for the insight. Sounds similar to a Split-Brain scenario in a server cluster, for which Best Practice is generally to apply Fencing, STONITH or similar. I guess those practices are not common in the networking world - at least not yet.

  6. There are protocols to detect these conditions.

    Cisco's proprietary one is UDLD, which all Cisco equipment should support. A newer, standardised equivalent is BFD, which newer or non-Cisco equipment should support.

    https://en.wikipedia.org/wiki/UDLD
    https://en.wikipedia.org/wiki/Bidirectional_Forwarding_Detection
    https://tools.ietf.org/html/rfc7130

  7. Always good to see a breakdown of what happened and I really do feel your pain. I've had a few similarly mystifying half broken Cisco switches myself, although never that exact issue.

    Top marks for the switch that flooded instead of dropped traffic marked with VLANs it didn't have defined. It ate the STP packets from those VLANs correctly though...

  8. We've seen various issues like this in the past - some hosts work fine, others not so much - where there has been packet loss on one leg of an LACP link. Packet loss rather than outright failure has meant that the usual methods of detecting link down didn't apply.

    What would be useful is a deterministic method of being able to send traffic down a specific link in an LACP group, for test purposes, on a (semi-)automated basis. I've yet to see such a feature.

  9. Was one of the LACP members not carrying traffic (in one or both directions) during this time?

  10. It would be interesting to hear from your perspective, what, if anything, clients could have done to avoid being impacted. It sounds like customers on both BT and TT backhaul were affected so the redundancy of Office::1 won't have helped much. But what about your customers who bought your 3G/4G dongles - did their dongles work throughout? Anything else clients could have done?

    Replies
    1. Part of Office::1 is mobile backup, with us or others, and mostly our mobile data was working.

    2. 3G backup wasn't really working any better than DSL. My router was happily switching back and forth between the two but traffic was intermittent on both.
      I'd say it's really only for cases where BT takes out your DSL line (as happened to me recently). It's unlikely to help when the ISP itself is broken.
      Mine also has an issue with failing over the static IP blocks. I have to tweak it manually from the control pages (something to do with the 3G logging in before the ADSL goes down) so that makes it less than useful anyway.

    3. How about your clients on Ethernet-over-FTTC, Ethernet-over-Copper and Ethernet-over-Fibre? Were they impacted?

  11. I've seen this failure model before, right pain in the rear end to identify... In this case a 6509 with a faulty DFC

    What model of switch is it?

  12. We don't use Cisco, but even on HP switches LACP is a pain.

    Replies
    1. We avoided Cisco for a very long time, and now use it only on these high speed switches.
