Saturday, 25 May 2013

Unexpected night shift

Well, I ended up working overnight, so it is nice that people on IRC are saying "what action?" when someone said they slept through all the action.

In practice customers will have seen no more of a problem than a PPP restart or two during the early hours, but for some of us it was quite a busy night.

We did one thing wrong, over 4 years ago. It seems some code that was initially just a bit of a test was then incorporated into our TCP stack. The code did not get the same attention to detail that we normally expect. That was the mistake. Sadly mistakes can happen, but we are going through the code again now to see if any other mistakes like this exist. I won't post the exact details here yet as there are a few people who still need to do a planned upgrade. A mistake like this would normally be picked up when the code is written and tested or, failing that, pretty quickly once an alpha is put on to the live Internet. It is one of the reasons we have extensive pre-release testing.

At around 00:36 there was a pretty concerted scan/attack that exploited this specific mistake. We really can't imagine someone targeted FireBrick kit specifically - if they did, then that is a sign we have got big enough to be a target, and I really doubt that. That means there is other kit out there with the same mistake. Interestingly, during the attack, which lasted until around 6am, we saw several unexplained "blips" which appear to be within BT's network, so maybe there is other kit that had issues as well, or maybe BT were just doing some work. I'll be interested to hear if other ISPs had problems, as the attack did not seem to be aimed only at A&A IPs. Even though the bug has been there for over 4 years, this is the first time we have seen the specific (invalid) packets that cause these problems.

So that was the mistake, what did we do right?

The FireBrick code is pretty defensive. For a start, the bug was picked up by the software watchdog. Had it not been, the hardware watchdog would have picked it up.

The watchdog caused the FireBrick to restart. Unlike many devices that take minutes to restart, a FireBrick is back as soon as the ports negotiate, which takes a couple of seconds.
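For anyone not familiar with the idea, here is a very rough sketch of how a software watchdog works in general. This is not FireBrick code: the class, the method names and the 2 second timeout are all made up for illustration, and time is passed in by hand so the example is deterministic.

```python
class SoftwareWatchdog:
    """Toy software watchdog: healthy code must call kick() regularly.

    Hypothetical illustration only, not how any real product does it.
    """

    def __init__(self, timeout_s, on_expire):
        self.timeout_s = timeout_s
        self.on_expire = on_expire  # e.g. record crash details, then restart
        self.last_kick = 0.0
        self.expired = False

    def kick(self, now):
        """Called by the main loop to prove it is still making progress."""
        self.last_kick = now

    def check(self, now):
        """Called from a periodic timer that is independent of the main loop."""
        if not self.expired and now - self.last_kick > self.timeout_s:
            self.expired = True
            self.on_expire(now - self.last_kick)
        return self.expired


events = []
wd = SoftwareWatchdog(timeout_s=2.0, on_expire=events.append)

wd.kick(now=0.0)
wd.check(now=1.0)  # within the timeout, nothing happens
wd.check(now=3.5)  # no kick for 3.5 s, so the handler fires
wd.check(now=4.0)  # already expired, handler is not called again
```

A hardware watchdog is the same idea done by a separate timer circuit, so it still works even if the software doing the checking is itself stuck.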

The watchdog/restart causes the FireBrick to email the support team automatically. This is the default (you can turn it off). We are all used to network appliances freezing, locking up, crashing, and needing rebooting - but not FireBricks. If they crash we get an email and we look into it as a matter of course. Even with hundreds of FireBricks out there we don't get many emails, and those we do get are usually from people choosing to run test/alpha code. But the email includes details of what exactly happened, and normally allows us to pinpoint the problem within seconds.
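To give a feel for the sort of report described above, here is a minimal sketch that builds (but does not send) such an email using Python's standard library. The addresses, field names and version string are all assumptions for the example, and the real FireBrick report format is not shown here.

```python
from email.message import EmailMessage


def build_crash_report(device, version, reason, detail,
                       to_addr="support@example.com"):
    """Build a crash report email. All fields here are hypothetical."""
    msg = EmailMessage()
    msg["From"] = f"{device}@example.net"
    msg["To"] = to_addr
    msg["Subject"] = f"Crash report from {device} ({version})"
    msg.set_content(
        f"Device:  {device}\n"
        f"Version: {version}\n"
        f"Reason:  {reason}\n"
        f"Detail:  {detail}\n"
    )
    return msg


report = build_crash_report(
    device="office-fw",            # made-up device name
    version="1.2.3",               # made-up version string
    reason="software watchdog",
    detail="stalled in TCP input processing",
)
print(report["Subject"])
```

The point is simply that the report carries enough context (which box, which code version, what tripped) for someone to start diagnosing without having to ask the customer anything.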

So, at 00:36 we start getting crash report emails. We ended up with over 200 from FireBricks all over the country on different ISPs and networks. This is where we end up working all night!

We quickly identified that there was clearly a problem, and the crash logs pinpointed the cause. We had just released a beta version that was being tested ready to be made a factory release, but the crash logs confirmed that it was not the cause. All versions of code out there were crashing, and all current makes of FireBrick. We found the cause in the TCP code, made the change, with two people reviewing the change carefully even though it was only one line, and made a new release which was tested and issued by 04:11.

Given the severity of this issue and the fact that the attack was ongoing, I made the decision to release this as a factory issue immediately. We had been running the current code without this patch on most A&A routers for a while so this seemed a safe bet. We upgraded all of the A&A LNSs and routers (over 20 FireBricks), and by 5am we had everything stable. Because of the factory release, any FB2500 or FB2700 that crashed would immediately pick up the new software and be fixed. Over the next 24 hours all FB2500s and FB2700s would update themselves (unless specifically disabled in the config).
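The update behaviour described above boils down to a small decision, sketched below. This is only an illustration under assumptions: the function and parameter names are invented, the dotted version strings are made up, and I have assumed that the config setting which disables automatic updates also applies after a crash restart.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.2.10' into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def should_upgrade(running, factory, auto_upgrade=True, just_restarted=False,
                   hours_since_check=0.0, check_interval_h=24.0):
    """Decide whether to fetch the latest factory release (hypothetical policy)."""
    if not auto_upgrade:
        return False  # updates disabled in the config (assumed to cover restarts too)
    if parse_version(factory) <= parse_version(running):
        return False  # already on the latest factory release, or something newer
    # Upgrade straight after a restart, otherwise at the periodic check.
    return just_restarted or hours_since_check >= check_interval_h


print(should_upgrade("1.2.3", "1.2.4", just_restarted=True))     # crashed box
print(should_upgrade("1.2.3", "1.2.4", hours_since_check=3.0))   # not due yet
print(should_upgrade("1.2.3", "1.2.4", hours_since_check=24.0))  # periodic check
```

So a box that is being crashed by the attack fixes itself on the next restart, and everything else catches up within the check interval.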

We spent the next hour checking the operation of all of the upgraded FireBricks in A&A. All seemed well, but by 06:30 we realised that just one of the A&A boxes, the main office firewall, was not quite right. It was working, but it had an issue with its firewall session table. This was one of the recent changes added before the beta release, and is exactly the sort of thing we expected to pick up by having a beta release. The TCP issue had meant we promoted that to a factory release a bit too quickly. A new release was issued for FB2500 and FB2700 (the firewalling models) by 06:44.

Obviously the FireBrick announcement mailing list has been sent full details.

We are monitoring the situation now, obviously.

Whilst we did make a mistake, 4 years ago, I think we managed to get a lot of other things right over this and show that we can react very quickly when there is an issue, even at silly o'clock in the morning.

I'll be on IRC most of the day if anyone has any questions.

4 comments:

  1. Great work and an example of why it is worth spending more on a product like a FireBrick rather than a cheaper product which may take months to get patched, if at all.

  2. btw, the bit at the end makes me wonder if you might not want separate beta and release branches, so that you can promote emergency changes to the release branch (and thence production) without necessarily also forcing out all the not-fully-baked beta changes too.

    Replies
    1. We could indeed make a branch of old code with this patch. Everything is under proper source control. We really do not like making branches for a lot of reasons. The new release was pretty much ready anyway - very minor tweaks.

      If the current alpha really was in the middle of major updates and simply not safe to release we'd have done a patch on the previous factory release, yes. In practice we like to make alphas usable, incremental developments, with any major reworks that are not yet usable being kept on a local build until ready, so in general we are able to make a new factory release with a few days' notice by tidying up what we have and testing.

      So, it was a judgement call in this case to push forward with the beta, and in practice it was not a problem. Even at 4am, a branch was considered though.

  3. Nice work. I like the factory release idea and the auto email. We, UK Broadband, have pretty much standardised on you for all of our out of band management systems and soon, I hope, on the FireBricks for remote systems. It is this 'it should just work' attitude that makes me really like it all.
