Monday, 21 March 2016

Replacing switches

The first step in upgrading our network is replacing some of the core switches with new, much faster and more powerful ones.

Replacing switches is always fun!

For a start, they are in pairs to try to ensure continued operation of at least some of the network if one were to fail. Where possible, devices are connected to both switches, and where we have pools of devices they are spread between the two. We have some changes in the pipeline that will allow more of our equipment to use link aggregation over two switches for even better redundancy.

So, to move to a new switch, what do you do?

Well, first off, and surprisingly, you have to make space - you need the new switches basically next to the old ones in the rack. This may not be obvious, but if you are moving cables from one switch to another you need to make the move as short as possible. If not, then you have to re-route the cables or even get longer cables. So you have to shuffle stuff up/down to make space. Thankfully that worked well. You also have to check that the cables are going to be able to move, and that none are too short or snagged on anything.

Then, you make sure the new switch has the same config as the old one. This is not simple, as switch configuration is far from standard. There are VLANs and jumbo frames and all sorts to check very carefully. A lot of double checking is needed.
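As a rough illustration of that double checking, here is a minimal Python sketch that compares the two things mentioned above (VLANs and frame size) for each attached device. Everything in it is an assumption made for illustration: the `PortConfig` record, the `diff_configs` helper, and the idea of keying ports by the name of the attached device are hypothetical and not taken from any real switch. Getting this data out of a real switch is vendor specific and is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortConfig:
    """Hypothetical, vendor-neutral view of one switch port."""
    vlans: frozenset  # VLAN IDs carried on the port
    mtu: int          # larger than 1500 when jumbo frames are enabled


def diff_configs(old, new):
    """Return human-readable mismatches between two {device: PortConfig} maps."""
    problems = []
    for device in sorted(old.keys() | new.keys()):
        if device not in new:
            problems.append(f"{device}: not configured on new switch")
            continue
        if device not in old:
            problems.append(f"{device}: only configured on new switch")
            continue
        o, n = old[device], new[device]
        missing = sorted(o.vlans - n.vlans)
        extra = sorted(n.vlans - o.vlans)
        if missing:
            problems.append(f"{device}: VLANs missing on new switch: {missing}")
        if extra:
            problems.append(f"{device}: extra VLANs on new switch: {extra}")
        if o.mtu != n.mtu:
            problems.append(f"{device}: MTU {o.mtu} on old, {n.mtu} on new")
    return problems


if __name__ == "__main__":
    # Made-up example data, not a real configuration.
    old = {"bgp1": PortConfig(frozenset({10, 20}), 9000)}
    new = {"bgp1": PortConfig(frozenset({10}), 1500)}
    for line in diff_configs(old, new):
        print(line)
```

An empty list means the two maps agree on these two settings only; it says nothing about the rest of the config, which still needs checking by hand.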

You also configure the old and new switch so that all of the VLANs can link between them. This means you can plug the new switches in to the old ones.

Then, on the day, you move one cable at a time. Ideally, you cleanly shut down whatever is running on the device you are moving so that it falls back to other devices, then move the cable, check it, re-enable the functions, and check that. One by one, very carefully. Done right, you can move a lot of things with no impact on service at all - pairs of BGP servers can cleanly switch over, move, and switch back. Some things do cause disruption, such as LNSs, where shutting one down makes its traffic reconnect to other LNSs.
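The one-at-a-time procedure can be sketched as a simple loop. This is only an outline under stated assumptions: the five step functions (`drain`, `move_cable`, `link_ok`, `restore`, `service_ok`) are hypothetical placeholders for whatever a person or script actually does at each stage, and the names are mine, not from any real tool.

```python
def move_one(device, drain, move_cable, link_ok, restore, service_ok):
    """Move a single device's cable; return True only if every check passes."""
    drain(device)             # shut down cleanly so traffic falls back elsewhere
    move_cable(device)        # physically re-patch to the new switch
    if not link_ok(device):   # check the link before re-enabling anything
        return False
    restore(device)           # re-enable the functions that were shut down
    return service_ok(device)


def migrate(devices, **steps):
    """Move devices strictly one by one, stopping at the first failure.

    Returns (moved, failed), where failed is None if everything moved cleanly.
    """
    moved = []
    for device in devices:
        if not move_one(device, **steps):
            return moved, device
        moved.append(device)
    return moved, None
```

Stopping at the first failure is a deliberate choice in this sketch: nothing else is touched until someone has worked out why the last move went wrong.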

There can be (and were) problems! Basically the old switches had a head fit after moving many of the cables! This makes no sense, and meant power cycling the damn things. And, of course, moving cables back. It was not pretty.

We have tried this twice, and the second time Talk Talk suffered a major issue as well, which complicated matters, so even reverting the changes left us with all TT lines off line for a couple of hours.

So, this time, on Thursday, a new approach, called "big bang". The same careful config and checking, but not linking the old switch to the new one, just carefully but quickly moving every cable to the new switch and then spending time checking each one. It will cause more disruption than the more usual step by step approach (when it works), but it is pretty predictable that it should actually work this time. However, there will be a clear time limit, and we will move all the cables back if we cannot get everything working within that time, in the middle of the night.
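The shape of the "big bang" plan, with its time limit and fallback, can be sketched like this. Again this is an illustration rather than what the ops team actually runs: `move`, `check` and `move_back` are hypothetical callbacks, and the clock and sleep functions are passed in as parameters only so the logic can be exercised without waiting around.

```python
import time


def big_bang(cables, move, check, move_back, time_limit_s,
             retry_wait_s=30, clock=time.monotonic, sleep=time.sleep):
    """Move every cable, then keep checking until all pass or time runs out.

    Returns True if everything checked out, or False after moving all
    cables back because the time limit was reached.
    """
    deadline = clock() + time_limit_s

    for cable in cables:          # move everything across first
        move(cable)

    pending = list(cables)        # cables not yet confirmed working
    while pending and clock() < deadline:
        pending = [c for c in pending if not check(c)]
        if pending:
            sleep(retry_wait_s)   # leave time to fix things before re-checking

    if pending:                   # out of time: revert the whole change
        for cable in reversed(cables):
            move_back(cable)
        return False
    return True
```

The point of fixing the deadline before the first cable moves is that the decision to revert is made in advance, not argued about at 3am.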

Good luck to the ops team doing this work...

2 comments:

  1. Sounds a bit better than when someone at work has to do a horrendous patching job, and then it turns out that there's a design flaw in there somewhere. Good luck figuring that one out! (Please don't ask us to check the same thing 5 times like one customer did; it's still correct just as it was the last few times)

    What's also fun with L1 troubleshooting is trying to trace mislabelled cables that go under the floor. By "fun", I really mean "not happening as it's out of scope for me".

    Then there's the time a customer sent through a blatantly incorrect config where the IP and gateway were in different subnets, and this was obvious at first glance since it helpfully was using /24. After I mentioned that this needed fixing as it would never ever connect, I got a somewhat snooty response.

    The next day, the customer meekly emailed the corrected details. Turns out network engineers can spot the most basic config errors. Who knew!

  2. It seems to have worked rather well? Line dropped for a bit, but not that long...
