2017-07-14

Pitfalls of system redundancy

Given the events of last night and today it is worth my writing up a bit about redundancy.

The way most things work in somewhat "industrial" IT is that you have two of everything (at least).

There are many ways this can work. There are things like VRRP which allows more than one device to be a router. There are things like LACP which allows more than one Ethernet port to work together as a bundle. Then there are things like BGP which allows more than one route with fallback.

Now, all of these work on a very simple logic - you have more than one bit of equipment or cable or fibre, and if one breaks you can carry on by using the other(s).
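
To make that logic concrete, here is a toy Python sketch (purely illustrative, not how VRRP, LACP or BGP actually work internally): keep a preferred path and a backup, and the moment the preferred one is seen to be dead, use the backup instead.

    # Toy sketch of the basic redundancy pattern: prefer the primary path and
    # fall back to a backup as soon as the primary is seen to be dead.
    # (Nothing protocol specific here; VRRP/LACP/BGP each do their own version.)

    paths = [
        {"name": "primary", "alive": True},
        {"name": "backup",  "alive": True},
    ]

    def best_path(paths):
        for p in paths:              # paths are listed in order of preference
            if p["alive"]:
                return p["name"]
        return None                  # nothing left to fall back to

    print(best_path(paths))          # -> primary
    paths[0]["alive"] = False        # primary goes completely dead
    print(best_path(paths))          # -> backup

Note the assumption baked into that sketch: a path is either alive or dead, with nothing in between.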

This is good, and this works. Mostly.

But there is a problem!

The problem is when something does not quite break!

All of these systems rely on the broken kit being dead, not working, same as turned off. And then it all detects the failure and falls back.

What if a switch is apparently working, all ports work, and some local communications work, and even some of the ports are passing traffic, but somehow some things are not passing some traffic?

That is bad, really bad. Not only do the fallback systems not realise, so they keep sending some of the traffic to the ill switch, but trying to identify the issue is a nightmare.
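
As a minimal illustration of why nothing fails over (a toy Python sketch, nothing to do with our actual kit or monitoring), imagine a switch whose links are up and whose management plane answers, so a simple health check passes, yet some flows are silently dropped:

    # Toy model of a "half broken" switch: link state is up and the box
    # answers a management ping, but a subset of flows is silently dropped.

    class HalfBrokenSwitch:
        def __init__(self):
            self.link_up = True            # all ports show "up"

        def ping_management(self):
            return True                    # management plane still answers

        def forwards(self, src_ip, dst_ip):
            # Toy rule: flows to "odd" destination addresses happen to land on
            # the ill part of the forwarding path and are dropped.
            return int(dst_ip.split(".")[-1]) % 2 == 0

    def naive_health_check(sw):
        # Roughly what redundancy logic effectively checks: link state and
        # reachability of the box itself, not every possible traffic flow.
        return sw.link_up and sw.ping_management()

    sw = HalfBrokenSwitch()
    print("health check:", "OK" if naive_health_check(sw) else "FAILED")
    for pair in [("192.0.2.1", "192.0.2.2"), ("192.0.2.1", "192.0.2.9")]:
        print(pair, "forwarded" if sw.forwards(*pair) else "silently dropped")

The health check says OK, so nothing falls back, yet the second pair of addresses simply cannot talk to each other.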

What happened today?

I have to say that we had some "special shit" today. At one point we had a case where I could not access a router from my laptop but my colleague sat next to me could! We had two routers that were able to ARP each other, well, actually one could ARP the other but not the other way around, and they could not ping each other!

The usual tools like ping and traceroute to find the break in a network simply did not work!

We had links that allowed some traffic and not others, and that was mental.

The pain in the arse here is something called LACP. This works using two (or more) links, and we have a lot of it for the very purpose of redundancy. The LACP links pick an interface from a set using a "hash", typically of the IP addresses and maybe even the ports.

This means that some IP to IP traffic uses one link, and some uses another. And if one of those links is "ill", that traffic will simply not work: it drops the packets, and with them the BGP session, and all sorts!
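
As a rough Python sketch of that hashing idea (the hash function and its inputs here are assumptions for illustration only; real switches use their own vendor-specific hashing), a given flow always lands on the same member link, so an ill member breaks a consistent subset of traffic while the rest sails through:

    # Rough sketch of LACP-style load sharing: hash the flow, pick a member.
    import hashlib

    def pick_member(src_ip, dst_ip, src_port, dst_port, members):
        key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
        return members[hashlib.sha256(key).digest()[0] % len(members)]

    members = ["link-A", "link-B"]   # the two physical links in the bundle

    # A BGP session between fixed addresses and ports always hashes the same
    # way, so it is pinned to one member. If that member is the "ill" one, the
    # session dies even though the bundle as a whole still carries traffic.
    print(pick_member("203.0.113.1", "203.0.113.2", 179, 40000, members))

    # Other flows land on whichever member their own hash picks.
    for port in range(50000, 50006):
        print(port, pick_member("203.0.113.1", "203.0.113.2", port, 443, members))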

A layer 2 problem...

The issue we ended up with was clearly a layer 2 issue, Ethernet. We had equipment that was not responding, only on that one switch. We had weird issues when sending traffic to that switch with another switch on an LACP link. Basically, we ended up with a "half broken" switch.

A lot of Cisco debug (spell correct said "drug" not "debug" on that) was added, and then a reboot. It came back fine and is now working...

So do we have a switch with an intermittent partial failure? Is it hardware (self test says no)? Is it some hack and vulnerability? Who knows?

I am not sure we can fully explain the whole of the issues we had by that one switch being iffy; maybe at a stretch.

Next steps?

We need to make the switch less important (we have) and see if it fails again, or if something affects other switches. It would be simple if this was a hardware issue: we just swap it out and/or repair it.

The problem is that if this is some hack, then we have the same problem on other switches. We are not quite on the latest code, but planned upgrades are in the pipeline already.

We have reports of people on TT retail that also dropped, and maybe even on BT? I am not convinced.

We have to wait and see.

What did we learn?

We went off on some tangents with this - the whole way normal dual redundancy works was not, well, "working". We had to try shutting things down to see what made it work. We even shut down some links for a bit, which was a bad idea for hosted customers. Sorry.

We now know to look for the LACP-related anomalies in this set-up. We have found these in carrier networks before, but we simply were not used to this in our layer 2 network.

We learned a lot about how to extract info from the Cisco switches.

If it happens again we know where to look, and no, it is not to FireBrick, but to Cisco!

18 comments:

  1. Thanks. Very useful ... first thing I did was look at IRC to see 'if it was me'. I saw the activity and went off to my meeting happy that by the time I got back it would be OK. And it was. (well, I knew because I was getting texts!)

  2. I did wonder if it was my end, but as I have redundant providers, I carried on with the other provider.

    At least AA will tell you when something goes wrong, rather than just blaming a local PC issue.

    Finally, these sorts of things are why Google pays a team to try and break Google, in pretty much any way imaginable, to try and fix the unexpected before they have an outage. That they can just turn off a data centre if it gets a bit warm out is so impressive.

    Replies
    1. Openness is very important - back in the day when Demon were the "ISP for techies" they were always very open about things they got wrong, even if it meant admitting that an outage was caused by someone unplugging the wrong thing. Then they got bought out and the new bosses seemed to consider admitting any failure to be bad PR. After several episodes of spending hours debugging my own end only to find out that it was the ISP's fault (and that they almost certainly knew that all along but weren't telling anyone) I dropped them.

      I do wonder how many other customers "improving their PR image" cost them and whether it actually benefited them. That said, non-techies frequently seem to be mostly interested in finding someone to blame instead of accepting that mistakes happen and you just have to do your best to fix things ASAP, so maybe it did...

    2. It was worse than that, as I recall: they moved from claiming that any admission of problems was bad PR, to claiming, hilariously, that as a public company any problem with their service whatsoever constituted something that might affect the stock price, so *obviously* any status reports whatsoever had to be written by lawyers and had to go through several days to weeks of legal checking before anyone outside the company was allowed to know about it.

      (Because, of course, there's no way anyone could possibly learn about any of their interminable total internal routing failures other than via a lawyer-vetted statement to shareholders. At least, not after everyone had got sick of this standard of mushroom-management "service" and stopped actually using them as an ISP, that is.)

  3. Great to see continued openness from AA, I know how frustrating these types of issues can be to track down, all whilst under pressure.

    Few Q's
    - Did you manage to get a tac-pac before the reboot?
    - Are you running BFD over the port channel?
    - Have you looked at the ASIC/cabling layout to determine any points of commonality (N5K: show hardware internal carmel all-ports)
    - Any changes in traffic profile pre/post incident?
    - Is port hashing working well (e.g. are links in the bundle load sharing reasonably equally)?

    I'm sure you guys were in deeper than this but always good to cover the basics first...

  4. > "So do we have a switch with an intermittent partial failure? Is it hardware (self test says no)? Is it some hack and vulnerability? Who knows?"

    Reminds me of the time I had a setup on my Mikrotik switch so the LAN switch portgroup (ether1-19) and the WAN switch portgroup (ether20-24) were completely separate from each other (even if using the same VLAN ID), and I had loop protect screw up on the LAN switch portgroup but not the WAN, so it took down my LAN but not my WAN.

    I just rebooted the switch, didn't even bother to use a serial cable. Silly me.

  5. Interesting. Thanks for the insight. Sounds similar to a Split-Brain scenario in a server cluster, for which Best Practice is generally to apply Fencing, STONITH or similar. I guess those practices are not common in the networking world - at least not yet.

  6. There are protocols to detect these conditions.

    Cisco's proprietary one is UDLD, which all Cisco equipment should support. A newer, standardised equivalent is BFD, which newer or non-Cisco equipment should support.

    https://en.wikipedia.org/wiki/UDLD
    https://en.wikipedia.org/wiki/Bidirectional_Forwarding_Detection
    https://tools.ietf.org/html/rfc7130

  7. Always good to see a breakdown of what happened and I really do feel your pain. I've had a few similarly mystifying half broken Cisco switches myself, although never that exact issue.

    Top marks for the switch that flooded instead of dropping traffic marked with VLANs it didn't have defined. It ate the STP packets from those VLANs correctly though...

  8. We've seen various issues like this in the past - some hosts work fine, others not so much - where there has been packet loss on one leg of an LACP link. Packet loss rather than outright failure has meant that the usual methods of detecting link down didn't apply.

    What would be useful is a deterministic method of being able to send traffic down a specific link in an LACP group, for test purposes, on a (semi-)automated basis. I've yet to see such a feature.

  9. Was one of the LACP members not carrying traffic (in one or both directions) during this time?

  10. It would be interesting to hear from your perspective, what, if anything, clients could have done to avoid being impacted. It sounds like customers on both BT and TT backhaul were affected so the redundancy of Office::1 won't have helped much. But what about your customers who bought your 3G/4G dongles - did their dongles work throughout? Anything else clients could have done?

    Replies
    1. Part of Office::1 is mobile backup, with us or others, and mostly our mobile data was working.

    2. 3G backup wasn't really working any better than DSL. My router was happily switching back and forth between the two but traffic was intermittent on both.
      I'd say it's really only for cases where BT takes out your DSL line (as happened to me recently). It's unlikely to help when the ISP itself is broken.
      Mine also has an issue with failing over the static IP blocks. I have to tweak it manually from the control pages (something to do with the 3G logging in before the ADSL goes down) so that makes it less than useful anyway.

    3. How about your clients on Ethernet-over-FTTC, Ethernet-over-Copper and Ethernet-over-Fibre? Were they impacted?

  11. I've seen this failure mode before, right pain in the rear end to identify... In this case it was a 6509 with a faulty DFC.

    What model of switch is it?

  12. We don't use Cisco, but even on HP switches LACP is a pain.

    Replies
    1. We avoided Cisco for a very long time, and now use it only on these high speed switches.


