Sunday, 5 February 2012

Upgrade progress (LINX and transit)

We have managed to move one transit and our LINX peering over now. We have all customers moved over on to the new LNSs, except the wholesale ones.

We even found why data SIMs were not showing graphs, and sorted it.

So it has been a fun weekend.

Next week we get other transit feeds, other peering points, and a whole load of direct peering - which is going to take some co-ordinating.

The major jobs are sorted though, and all is looking very good.

You do then hit fiddly things like making sure Nagios is watching the right boxes, ensuring your Cacti graphs are all running on the new boxes, and checking that the management LAN works, the backup out-of-band access works, and the administration passwords are all set correctly with the right access lists. For the most part it is copy and paste, but you then have to test everything carefully just in case. A never-ending set of silly little details.
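As a rough illustration of the "is monitoring watching the right boxes" check, here is a minimal sketch that compares a list of boxes that should be monitored against the list the monitoring system actually knows about. The function name and the host names are made up for the example, and how you export the host list from Nagios or Cacti is left as an assumption; this is not our actual tooling.

```python
def coverage_gaps(inventory, monitored):
    """Compare two lists of host names (case-insensitive).

    Returns a tuple (unmonitored, stale):
      unmonitored - hosts in the inventory that monitoring does not know about
      stale       - hosts monitoring still watches that are not in the inventory
    """
    inv = {h.strip().lower() for h in inventory if h.strip()}
    mon = {h.strip().lower() for h in monitored if h.strip()}
    return sorted(inv - mon), sorted(mon - inv)


# Hypothetical host names, purely for illustration.
new_boxes = ["lns-a", "lns-b", "switch-1"]
nagios_hosts = ["lns-a", "old-lns", "Switch-1"]

unmonitored, stale = coverage_gaps(new_boxes, nagios_hosts)
print("not monitored:", unmonitored)
print("still monitored but gone:", stale)
```

The same comparison could be run against the graphing system's host list, so that each of the copy and paste jobs gets a quick mechanical check before the careful manual testing.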

At some point we want to go in there on a Sunday and check the dual power, which should be seamless. We also want to check that the network recovers when we take out a whole side of it (turning off a switch). We expect that will take some lines out for a few minutes. We need to make a list of carefully defined tests and make sure people know we are buggering about.

Ideally, at some point, we should test turning off the whole power, and then back on, and seeing how quickly everything recovers. I am not sure if we will do that or not - it is a bit disruptive.

But if you don't test the contingencies they bite you when something does break.

We'll post details of what tests are being done when.

2 comments:

  1. Best to get NAGIOS and Cacti working before you move services over -- makes it easier to be aware of issues and then troubleshoot them when things don't work like they're supposed to.

  2. Indeed, but there are different levels of monitoring and reporting as we move things.
