Set the code free!

Sometimes it is so hard not to just go and update everything to the latest release.

We have spent weeks getting the code ready to release, with the completely new TCP stack, written from scratch, finally being deployed, and lots of testing, and we finally have a factory release candidate.

TCP is only needed for some of the internal features, like the web interface and BGP and so on, but there is no reason not to make it a good quality stack with proper fast recovery, window scaling, SYN cookies and everything else you expect. We have literally speed tested BGP full table transfers between boxes and seen it as low as 6 seconds for over 500k IPv4 routes (though it is currently around twice that, now that it actually bothers to have an Rx window that is not infinite). We have gradually added the new TCP features and tested them, and even had some TCP session leaks we had to plug. The original TCP stack was code I wrote when we first started, and it has lasted very well for many years, but Cliff has redone it from scratch and massively improved it.

Obviously, I am 100% confident in the new code. I would have to be, as otherwise it would not be a release candidate. I am confident enough to deploy it on the A&A network.

But I have to face the reality that there may be something that breaks, and we have to test, test, test, and test again.

So, once again, it is a beta release loaded onto only some of the routers, and overnight we will load it onto just one of the LNSs, so everything has a hot standby just in case and everything is easy to roll back. If there are problems we can sort them quickly.

I know the past says "all code has bugs", but the future always looks so bug free. Often the code released is indeed, for all practical purposes, bug free. We try to do that every time, and manage it a lot. But it is still frustrating having to give it time to go wrong, just in case.

I feel like I can't do anything else - it is like waiting for a baby to be born - you just have to wait until it is finally due.

We were, nonetheless, tested by one of the recent alpha releases, which survived maybe an hour outside the lab before it was hit by bogus SYN packets with bad checksums out in the wild and crashed. That was pulled very quickly. It is amazing how much crap a box on the Internet has to face. At the end of the day there is only so much testing you can do in the lab, and you have to set your code free, out into the wild to fend for itself, and just cross your fingers.
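The post does not say how the new stack handles such packets, so what follows is only a minimal sketch, assuming IPv4 and the standard RFC 1071 one's-complement checksum over the TCP pseudo-header. It shows the usual defence: verify the checksum and silently drop a segment that fails, before a SYN ever reaches connection setup or SYN cookie logic. The function names `ones_complement_sum`, `tcp_checksum` and `tcp_checksum_ok` are hypothetical, chosen for illustration.

```python
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum of data, as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with a trailing zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold any carry back in
    return total


def _pseudo_header(src_ip: bytes, dst_ip: bytes, length: int) -> bytes:
    # IPv4 pseudo-header: source, destination, zero byte, protocol 6 (TCP), TCP length
    return src_ip + dst_ip + struct.pack("!BBH", 0, 6, length)


def tcp_checksum(src_ip: bytes, dst_ip: bytes, segment: bytes) -> int:
    """Checksum for a segment whose checksum field (bytes 16-17) is zero."""
    total = ones_complement_sum(_pseudo_header(src_ip, dst_ip, len(segment)) + segment)
    return ~total & 0xFFFF


def tcp_checksum_ok(src_ip: bytes, dst_ip: bytes, segment: bytes) -> bool:
    """True if the segment, checksum field included, sums to 0xFFFF."""
    total = ones_complement_sum(_pseudo_header(src_ip, dst_ip, len(segment)) + segment)
    return total == 0xFFFF
```

A receive path would call `tcp_checksum_ok` first and discard the segment on failure, so a malformed SYN costs one checksum calculation and nothing more. The check alone cannot stop a crash if later parsing trusts other fields blindly, so header lengths and options still need their own bounds checks.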

The beta should be made into a new factory release next week. That is when we really release it into the wild :-)
