We are about to embark on some major work - not just new features (hopefully a SIP gateway) but some associated rework of parts of the operating system. These will need some careful testing in alpha releases. So this is a perfect time to make a new factory release before we start that work.
Unfortunately, last night, we had a slight issue on one of the A&A LNSs. As we are waiting on BT moving one of the host links, we are still running on two LNSs, one live and one backup, just that they are the new ones in the new rack. At around 8:03pm last night the live LNS stopped doing any RADIUS lookups.
It did not affect any existing connected sessions, but anyone dropping and reconnecting for any reason could not get back online. The problem was not picked up by Nagios, and it is hard to work out what it could test that would reliably detect this - it is a new problem. Thankfully someone did send an MSO text and I was able to re-route new connections to the backup LNS. Before you ask why the backup was not being used anyway: the live LNS was still accepting the connections at the L2TP level, so from outside the behaviour was the same as a duff login, and not a reason to fall back to the backup LNS. This meant the few people who could not reconnect were now back online and everything was working. Overall, almost all of our customers were totally unaffected (apart from not being metered for usage for a while).
This meant I could try to find the cause of the problem. Eventually, at around 1:40am, I moved all lines to the other LNS and reset the problem LNS. This meant everyone reconnecting, which happened really quickly (the recent changes to our RADIUS authentication and accounting servers worked well and kept up). A few lines somehow hit a BT default accept and retried 10 minutes later, but I am not sure we can do anything about that, sadly.
So this leaves me investigating what happened, and potentially delaying the factory release. The analysis last night basically showed that a counter had gone negative when that should not be possible. It points to some unexpected race condition or interrupt case.
Now, before I get a lot of helpful posts from the coders out there, we have been doing this a while. In embedded coding you have to understand what happens at the processor instruction level and allow for interrupts and task switches at any possible point. It is easy to code something where there is a very very slim chance of an interrupt at the wrong moment causing problems. If you do that you can bet that it will happen, one day, and break stuff.
It is probably worth explaining this a bit, as I know we have non-coders reading this as well as people who have never done embedded coding.
In a high level language you might have a simple, innocuous-seeming instruction such as x++. This means adding one to the value of x. Simple. You consider it to be an atomic operation, but actually the processor will be reading x, changing it, and then writing it. It is possible to get a task switch or interrupt between these operations. Now suppose some other task, or the interrupt, does x-- (taking one off x) as well. Normally the two things cancel out: one adds one, and the other takes one away, so the value does not change. But consider the bad timing of the interrupt as follows:-
- x starts with a value of 10, and you are doing an x++ operation.
- You read x in to a register, 10
- You add one to the register, 11
- There is an interrupt which reads x as 10, takes one from it making 9, and writes x as 9, and returns from interrupt.
- Now, back in the main thread, you carry on and write the register, 11, back to x.
The problem then is that the consequences of a wrong value can cause something to break much later on. If x is the number of items in a list, say, then things might not break until x gets down to 1 but the list is already empty.
So, in embedded coding you have to be careful. Very careful. The fact that the compiler will hold variables in registers for extended periods is one issue. The fact that some variables may take more than one memory location (e.g. 64 bit counters) is a problem too, as the read or write of the two locations could have an interrupt in the middle! The issue is even more complex when you have two processors able to access the same memory. This is why we have memory management at a hardware level, schedule locking, mutexes, and interrupt-locked code as appropriate.
My challenge this morning is finding the mistake. It is code we have not changed for some time. Usually bugs happen as a result of coding (or "enbugging"). Mistakes are made, interactions are not thought out, and that results in a bug. But this is not new code. In fact, what may have triggered it is likely the recent change on our RADIUS server making it a lot quicker to respond. This will have affected the timing of the RADIUS requests and replies and the associated L2TP connections. It could be the explanation of "why now?".
The issue, so far, having looked at this until nearly 3am, is that the code is written with belt and braces - the only place the counter is touched is within one very carefully coded short function which inhibits interrupts around the relevant section of code. The counters are even locked in the memory management to the one processor. Basically, this can't have happened. Those are the fun bugs to find!