Saturday, 27 April 2013

Spring cleaning my RADIUS code

I explained how code can wear out for all sorts of reasons, and one of the bits of code we us in the FireBrick has finally reached the point where it is better to start again than bolt on extra bits. It is the RADIUS client code.

The original coding used RADIUS, which is a remote authentication for dial-up, as part of the L2TP code (used to handle broadband lines and mobile these days not actually dial-up).

The RADIUS was, itself, a relatively small part of the L2TP code. It is used when someone connects to get a yes/no as to whether they are allowed and a few details like what IP is assigned. It is also used for accounting to report amounts of usage on a connection. The logic is simple, a message each way for each transaction. There is some authentication using pre-shared secrets. It is very much tied in to the sequencing of the L2TP and PPP operations.

So the original code design built RADIUS in to the L2TP code. It was a sane choice at the time.

Over the years the RADIUS code has grown. We have more options on authentication, and a lot of handling for load and scale - like how many concurrent requests can be outstanding to a RADIUS server and how do we pick the server, and blacklist a dead server, and tell it has come back, and so on. Then RADIUS became more complex with support for "Change of Authority" and "Disconnect" messages.

Even so, the RADIUS code was still reasonably sensible, and still only for L2TP.

Then we added VoIP, and want to use that for RADIUS. Suddenly RADIUS is no longer a small side-effect of L2TP, but a more general facility needed by VoIP and L2TP.

That is the time to change the design - now we need a carefully designed RADIUS client module that can provide RADIUS services for any other part of the code in a safe way. Of course, as soon as you think of this you can dream up lots of ways RADIUS could be useful for other things :-)

When you design a module like this you have to carefully consider where you "draw lines". Drawing lines is important! You have to define clearly the areas for which the module is responsible, and what is expected of the application. These have to be logical, clear, and obvious "lines". This makes it easier to test the module, and decide where the code goes, and avoid duplicating code. The idea of creating clean interfaces and partitioned data and responsibilities is actually key to object oriented coding, but is a good programming principle in any system. Previously the RADIUS was all behind the lines of "what the L2TP module does".

Of course, this has meant a couple of annoyances that come out during the code. I have made the module responsible for picking the RADIUS servers, and knowing (from config) the IPs and secrets used for the servers. The application never sees these. The RADIUS module provides functions and macros to manipulate the AVPs (the attribute/value pairs in the RADIUS messages). The module is responsible for retries, timings, stats, blacklisting RADIUS servers, and so on.

One nice thing with an interface like this is some logical state, and responsibility. i.e. when the application asks for a RADIUS request to be sent, there will always be exactly one callback when the response is received, or there is a timeout, or even if the application cancels the request. It is the responsibility of the module to make sure that happens with whatever tasks and timers it needs.

I realised early on that this means some special handling of a few AVPs. For example, the module has to look in to the RADIUS packet and update the "delay" attribute used on RADIUS accounting. This is a number of seconds since the accounting happened, and, of course, if we are retrying after a delay then that needs to move on in time but we can't expect the application to re-make the message with the extra delay.

There were a couple of other gotchas that caught me out - the PAP User-Password field is encoded using the challenge for the request and the shared secret. This is stuff the RADIUS module knows, but the application does not. So the application adds the password "clear" and the module scrambles it. There is a similar effect for received Tunnel-Password attributes that are scrambled in a way only the RADIUS module can decode. This means these AVPs need special handling as they change for each request.

Of course, to make use of the new module means ripping out all of the old code, and that is the scary bit. I have to look at each part, read the comments, read the code, and see if it makes sense. If it does, then I have to work out how it fits in to the new way of working. If not, I have to work out what I was thinking when I first wrote this. That can be fun. Thankfully RADIUS is pretty well covered by some good RFCs. Then I have to test everything, and, of course, make sure documentation is up to date.

The end result looks OK. Lots of testing, but now I have to test on a larger scale and make sure it works properly. Hopefully next week I can roll it out.

But it does mean I can finally progress my VoIP system for use as a large scale voice server with RADIUS based authentication, call routing and accounting.

1 comment:

  1. If you are indeed rewriting this, we would love to see per-session firewalling configured via RADIUS VSAs sent to the FireBrick; very much like:

    (in particular, the ACLs bit)

    Even if you do not add support for dynamic ACL/firewall rulesets right away as I am aware it would require work in other subsystems; it would be good if your new RADIUS client implementation could be written with this requirement in mind so that it would make it easier later on when you have lots of spare time to implement it :-)