Tuesday, 16 September 2014
In the line of fire
When working on FireBrick code, one of the final sanity checks after lots of bench testing and verification is to test the code on our own routers that are in front of the office and my home. Then we can release alpha code for people to test in the wild. This is a long way from beta code and finally the regular factory releases.
The office is a pretty demanding customer in that they provide, shall we say, "instant customer feedback" in the event that the code is not stable. Putting the code in the line of fire like this works well. It is very motivating for a s/w developer who is sat in the same room! But if we were not prepared to use this code ourselves, how could we expect customers to?
Whilst an office full of people that rely on the Internet for everything they do can be demanding, there is little that is as demanding as one of your kids playing LoL, especially if they are playing a league game!
I think every router manufacturer should try this challenge - can you s/w upgrade the single router/firewall in front of a LoL gamer in mid game without getting shouted at? That is the challenge!
Now, we know we have done well with the shutdown and startup sequence and the reboot logic that does not even re-set the Ethernet PHYs. We manage to go from shutdown to routing packets in a few hundred milliseconds. Combined with a few specific firewall and routing rules, I think I have got pretty close to the "Not screaming at me" threshold now. In testing he did notice a reboot, but only just, and not every time.
Related to this, we have many core routers in our network serving us, and ultimately serving customers. Can we re-load these with no impact on customers, and importantly - do we pass the scream test? Well, we have been working on the BGP and VRRP shutdown sequences lately. The concept is simple - using pairs of routers it should be possible to take one out cleanly for a re-load without dropping a single IP packet.
Before shutdown we announce lower priority BGP routes. We used to withdraw routes, but that left gaps: adjacent routers had no route from the moment the withdrawal propagated until they were fed an alternative. This way they always have a route one way or the other. We also do a controlled VRRP handover. Once that is all sorted and done cleanly, we restart in a fraction of a second. I think we have that sussed now.
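To illustrate why announcing a lower priority route beats withdrawing it, here is a toy sketch in Python. It is not FireBrick code and not a real BGP implementation: the `Neighbour` class, the single numeric preference (higher wins) and the router names "A" and "B" are all assumptions made up for the example. It simply replays a sequence of updates at one adjacent router and records which next hop it would pick after each update.

```python
from dataclasses import dataclass, field


@dataclass
class Neighbour:
    """Hypothetical adjacent router holding candidate routes for one prefix."""
    routes: dict = field(default_factory=dict)  # peer name -> preference

    def apply(self, event):
        kind, peer, pref = event
        if kind == "announce":
            self.routes[peer] = pref       # new or replacement route from peer
        elif kind == "withdraw":
            self.routes.pop(peer, None)    # peer no longer offers a route

    def best(self):
        # Highest preference wins; None means there is no route at all
        if not self.routes:
            return None
        return max(self.routes, key=self.routes.get)


def replay(events):
    """Start with only router A's route, apply events, log the best hop."""
    neighbour = Neighbour(routes={"A": 200})
    history = []
    for event in events:
        neighbour.apply(event)
        history.append(neighbour.best())
    return history


# Old approach: A withdraws, and only later does the alternative via B arrive.
withdraw_first = [("withdraw", "A", None), ("announce", "B", 100)]

# New approach: A re-announces at low preference, B's route arrives and wins,
# and only then does A go away.
deprioritise_first = [
    ("announce", "A", 50),
    ("announce", "B", 100),
    ("withdraw", "A", None),
]

gap_history = replay(withdraw_first)
clean_history = replay(deprioritise_first)
```

In the first sequence the history contains a `None` entry, which is the gap during which packets would have nowhere to go. In the second the neighbour always has some next hop, first A and then B, so taking A away at the end changes nothing.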
It is amazing what you can do under pressure sometimes.