Given the events of last night and today, it is worth my writing up a bit about redundancy.
The way most things work in somewhat "industrial" IT is that you have two of everything (at least).
There are many ways this can work. There are things like VRRP, which allows more than one device to be a router. There are things like LACP, which allows more than one Ethernet port to work together as a bundle. Then there are things like BGP, which allows more than one route with fallback.
Now, all of these work on a very simple logic - you have more than one bit of equipment or cable or fibre, and if one breaks you can carry on by using the other(s).
This is good, and this works. Mostly.
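To make the logic concrete, here is a minimal sketch in Python. The device names and the "health check" are made up for illustration; the point is just that the redundancy uses the first device that still answers.

```python
# A toy model of simple failover: made-up device names, and a table standing
# in for a real health check (VRRP adverts, BGP keepalives, link state).
ALIVE = {"router-a": False, "router-b": True}  # pretend router-a has died

def is_alive(device):
    # A real check can only see "completely dead": missing hellos, link down.
    return ALIVE[device]

def pick_active(devices):
    # Use the first device that still answers; fall back to the next one.
    for device in devices:
        if is_alive(device):
            return device
    return None  # nothing left to fall back to

print(pick_active(["router-a", "router-b"]))  # prints: router-b
```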
But there is a problem!
The problem is when something does not quite break!
All of these systems rely on the broken kit being dead: not working, same as turned off. Everything then detects the failure and falls back.
What if a switch is apparently working, all ports work, and some local communications work, and even some of the ports are passing traffic, but somehow some things are not passing some traffic?
That is bad, really bad. Not only do the fallback systems not realise, and keep sending some of the traffic to the ill switch, but trying to identify the issue is a nightmare.
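As a sketch of that failure mode (the addresses and the "broken" set are invented purely for illustration): the health check still passes, so no fallback ever triggers, while the data plane quietly drops an arbitrary subset of flows.

```python
# The "not quite broken" case: the switch answers health checks, so the
# redundancy never kicks in, yet some flows through it silently black-hole.
BROKEN_FLOWS = {("10.0.0.1", "10.0.0.9")}  # invented for illustration

def health_check():
    # The check only asks "is the switch alive?", and it is.
    return True

def forwards(src, dst):
    # But the data plane drops traffic for certain source/destination pairs.
    return (src, dst) not in BROKEN_FLOWS

print(health_check())                    # True: no failover happens
print(forwards("10.0.0.1", "10.0.0.2"))  # True: this flow works fine
print(forwards("10.0.0.1", "10.0.0.9"))  # False: this one just vanishes
```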
What happened today?
I have to say that we had some "special shit" today. At one point we had a case where I could not access a router from my laptop, but my colleague sat next to me could! We had two routers that were able to ARP each other, well, actually one could ARP the other but not the other way around, and they could not ping each other!
The usual tools for finding a break in a network, like ping and traceroute, simply did not work!
We had links that allowed some traffic and not others, and that was mental.
The pain in the arse here is something called LACP. This works using two (or more) links, and we have a lot of it for the very purpose of redundancy. The LACP links pick an interface from a set using a "hash", typically of the IP addresses and maybe even the ports.
This means that some IP to IP traffic uses one link, and some uses another. And if one of those links is "ill", the traffic hashed on to it simply does not work: it drops the packet, and the BGP session, and all sorts!
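As a rough sketch of how that selection works (CRC32 and the field choice here are just stand-ins; real switches use vendor-specific hash functions and fields):

```python
# Rough model of LACP-style hashing: each flow is pinned to one member link.
import zlib

def pick_member(src_ip, dst_ip, src_port, dst_port, num_links):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % num_links  # same flow -> same link, every time

# Different flows between the same two hosts spread across the bundle:
for sport in (40000, 40001, 40002, 40003):
    print(sport, "-> link", pick_member("10.0.0.1", "10.0.0.2", sport, 179, 2))
# If one member link is "ill", the flows hashed to it (a BGP session, say)
# just die, while the others carry on fine and everything looks healthy.
```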
A layer 2 problem...
The issue we ended up with was clearly a layer 2 issue, Ethernet. The equipment that was not responding was all on that one switch. We had weird issues when sending traffic to that switch over an LACP link shared with another switch. Basically, we ended up with a "half broken" switch.
A lot of Cisco debug (spell correct said "drug", not "debug", on that) was enabled, and the switch was rebooted. It came back fine and is now working...
So do we have a switch with an intermittent partial failure? Is it hardware (self test says no)? Is it some hack and vulnerability? Who knows?
I am not sure we can fully explain the whole of the issues we had with just that one switch being iffy; maybe at a stretch.
We need to make the switch less important (we have) and see if it fails again, or if something affects other switches. It would be simple if this were a hardware issue: we swap it out and/or repair it.
The problem is that if this is some hack, then we have the same problem on other switches. We are not quite on the latest code, but planned upgrades are in the pipeline already.
We have reports of people on TT retail that also dropped, and maybe even BT? I am not convinced.
We have to wait and see.
What did we learn?
We went off on some tangents with this: the whole way normal dual redundancy works was not, well, "working". We had to try shutting things down to see what made it work. For a bit we even shut down some links, which was a bad idea for hosted customers. Sorry.
We now know to look for LACP-related anomalies in this setup. We have found these in carrier networks before, but we simply were not used to this in our own layer 2 network.
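One trick that helps here, as a sketch: because the hash pins each flow to one member link, you test with many different flows rather than one ping. This assumes the bundle hashes on layer 4 ports (some configurations hash only on MACs or IPs, in which case you vary addresses instead); the target address and port range below are made up.

```python
# Sweep source ports so successive probes hash onto different LACP members.
# A pattern of successes and timeouts then points at one ill link in the bundle.
import socket

def sweep(dst_ip, dst_port, src_ports):
    for sport in src_ports:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.settimeout(2.0)
        try:
            s.bind(("", sport))  # vary the source port, vary the hash
            s.connect((dst_ip, dst_port))
            print(sport, "ok")
        except OSError:
            print(sport, "FAILED")
        finally:
            s.close()

# sweep("192.0.2.1", 22, range(40000, 40016))  # made-up target
```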
We learned a lot about how to extract info from the Cisco switches.
If it happens again we know where to look, and no, it is not at FireBrick, but at Cisco!