Wednesday, 22 May 2013

Another SIP challenge

I am finally sorting the new A&A voice server based on FireBrick SIP. The FireBrick work has been to create a reliable platform for high-capacity call routing, and this stage is using that to provide an actual service. It is a bit incestuous, as the "real world usage" of the FireBrick provides a lot of feedback into the FireBrick code design, but the end result is a good A&A service and a good FireBrick VoIP platform.

A bit of history...

First we used Asterisk, with a lot of custom back end scripts and database work to provide the features of our voice service. This worked, but does not scale well. Asterisk is an open source (ish) PABX, ideal for a small office, but it does not scale to the sort of general voice service that A&A offer. These days I would, of course, suggest a FireBrick based VoIP PABX instead.

Then I wrote my own SIP platform from scratch on Linux. This was based on the RFCs, and although the SIP RFCs are hard work and the code has got a bit worn out now, it worked well. The nice bit is that SIP allows a pool of registration servers, call routing boxes, and so on - all very scalable. Sadly it turns out that most phones refuse to take calls from a different IP than the one they registered to. This thwarted that plan, and multiple servers collapsed to one. The scaling plan is to put numbers on specific servers. It works well, but does not handle NAT connected endpoints, and there are still issues with firewalls on some kit.

The new FireBrick system is good (it had better be, I wrote it). It is a completely new implementation based on a second look at the RFCs. Some key decisions, based on the experience of the Linux based solution and, importantly, the way devices work, have meant we operate in a specific way. We are no longer working as a relaying proxy but as an endpoint, i.e. devices make calls to the FireBrick and the FireBrick makes calls to devices, and it joins them together at the audio level. This solves many problems, and also means the audio is always to/from the same IP at the FireBrick end, which helps firewalls and may allow some degree of NAT handling.

A key feature we are using is that FireBrick SIP works with IPv6 perfectly, as you would expect.

It also avoids all sorts of issues with switching the audio feed. Previously we passed on all SDP renegotiations end to end, but this meant changes at the RTP level in sequence, source ID and timing, as well as IP and port, plus some race conditions. This sometimes broke things. The FireBrick generates the audio stream with one source ID and sequence regardless, handling changes, call transfer, re-invites, and even tone generation seamlessly as one output stream. We should not have to go to such extremes, but doing so creates the most reliable calls. Importantly, because we are always in the media path we can also see a loss of media, which allows calls to be closed when no BYE is received (kit reboots, stuff disconnects, etc) so that they are not billed incorrectly.
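To illustrate the "one source ID and sequence regardless" idea, here is a minimal Python sketch of re-stamping incoming RTP packets into a single outgoing stream. This is not the FireBrick code: the class name `OutputStream` and its parameters are made up for illustration, and it assumes plain a-law payloads (one byte per sample) with no RTP header extension or padding.

```python
import struct

# Fixed 12-byte RTP header: V/P/X/CC, M/PT, sequence, timestamp, SSRC
RTP_HEADER = struct.Struct("!BBHII")


class OutputStream:
    """Hypothetical helper: one continuous outgoing RTP stream per call leg."""

    def __init__(self, ssrc, first_seq=0, first_ts=0):
        self.ssrc = ssrc
        self.seq = first_seq
        self.ts = first_ts

    def restamp(self, packet):
        """Return the packet's audio under our own sequence, timestamp and SSRC."""
        if len(packet) < RTP_HEADER.size:
            raise ValueError("packet too short for an RTP header")
        b0, b1, _seq, _ts, _ssrc = RTP_HEADER.unpack_from(packet)
        if b0 >> 6 != 2:
            raise ValueError("not RTP version 2")
        if b0 & 0x30:
            # Assumption: padding and header extensions are simply not handled here.
            raise ValueError("padding/extension not supported in this sketch")
        payload = packet[RTP_HEADER.size + 4 * (b0 & 0x0F):]  # skip any CSRC list
        # Version 2, no padding/extension/CSRC; keep payload type, clear marker.
        header = RTP_HEADER.pack(0x80, b1 & 0x7F, self.seq, self.ts, self.ssrc)
        self.seq = (self.seq + 1) & 0xFFFF
        # A-law is one byte per sample, so the timestamp advances by payload length.
        self.ts = (self.ts + len(payload)) & 0xFFFFFFFF
        return header + payload
```

Whatever the far side does (new SSRC after a re-invite, a transfer to a different device, locally generated tones), the receiver only ever sees one sequence and one source ID.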

The FireBrick also does recording. It allows a call leg to be tee'd off to a separate SIP connection which can be a simple SIP endpoint (like asterisk), but if the endpoint claims to handle stereo a-law then the tee'd off call puts each side of the call on a separate channel of a stereo call. We have a simple linux endpoint that does that and emails the calls, and we are basing our A&A call recording on that.
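As a rough illustration of putting each side of the call on its own channel, the sketch below interleaves two mono a-law byte streams into one stereo stream. The function name is hypothetical, and the sample-by-sample left/right layout is an assumption (it is the usual convention for multi-channel RTP audio), not a statement of what the FireBrick actually sends.

```python
def interleave_stereo(left, right, silence=0xD5):
    """Interleave two mono a-law streams as L,R,L,R,... bytes.

    The shorter side is padded with a-law silence (0xD5) so both
    channels cover the same time span.
    """
    n = max(len(left), len(right))
    left = left.ljust(n, bytes([silence]))
    right = right.ljust(n, bytes([silence]))
    out = bytearray(2 * n)
    out[0::2] = left   # even byte positions: one side of the call
    out[1::2] = right  # odd byte positions: the other side
    return bytes(out)
```

A recording endpoint can then write the result out as a two-channel file, which keeps the two parties separable when listening back.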

Most of the work on the A&A side is RADIUS server based. The FireBrick allows all calls, registrations, and so on, to be validated by RADIUS. This allows a pool of servers to handle call routing based on our database back end, and to handle the call logging and billing.

One of the key things, on re-reading the RFCs, is a new way to handle scaling of the service. The RFCs describe a really useful concept of a redirect server. The idea is that registrations and calls go to that server which does not really do calls at all - it just responds with 302 redirect messages telling the caller where to connect to. This means we can share the customers between servers, and take servers out for maintenance and so on.
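As a sketch of what such a redirect server does, the Python below picks a server for a customer and builds a 302 response to a request. The function names, the CRC based choice of server and any host names passed in are assumptions for illustration only; it also ignores compact header names and folded header lines, which a real implementation would have to handle.

```python
import zlib


def pick_server(user, pool):
    """Hypothetical policy: map a user to one server in the pool, stably."""
    return pool[zlib.crc32(user.encode()) % len(pool)]


def build_redirect(request, target_uri, to_tag):
    """Build a 302 response that points the caller at target_uri."""
    head = request.split("\r\n\r\n", 1)[0]
    copied = []
    for line in head.split("\r\n")[1:]:  # skip the request line
        name = line.split(":", 1)[0].strip().lower()
        # The response echoes these headers from the request.
        if name in ("via", "from", "to", "call-id", "cseq"):
            if name == "to" and ";tag=" not in line:
                line += ";tag=" + to_tag  # the responding server adds a To tag
            copied.append(line)
    return "\r\n".join(
        ["SIP/2.0 302 Moved Temporarily", *copied,
         f"Contact: <{target_uri}>", "Content-Length: 0", "", ""]
    )
```

Taking a server out for maintenance is then just a matter of removing it from the pool the redirect server chooses from.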

Sadly we have, again, been thwarted. Whilst the SNOM phones we tested have no issue with this plan, it seems a lot of SIP devices get confused, assume the 302 response is an error and give up, or pop up a "retry" box. The carriers that handle our inbound calls also don't like 302s. So yet again a key part of SIP design that would allow elegant scaling and redundancy is screwed. Why do I even bother? FFS.

So now we are testing using DNS changes to manage pools of registration servers. At least we can have pools of RADIUS servers, and call recording servers and so on as they are no longer in the SIP proxy path.

But we are having fun doing SIP/NAT testing. We want to see if the FireBrick code can work where there is NAT. We have done some testing today using the Technicolor routers we ship as standard on Home::1 lines. To our surprise the ALG in the Technicolor is not bad, and means our end does not see anything odd or NATty. It kind of "just works" which was a slight surprise. We have tested against SNOM and Gigaset kit so far.

We'll be testing with non-ALG NAT devices soon to see if we can make them work. When we are happy with all of the testing we can move customers over to the new service which will be more scalable and reliable.



  1. I have always been curious how SIP is *supposed* to work WRT endpoints vanishing in the middle of a call (and therefore not sending a BYE). Under Asterisk I've certainly noticed the occasional call that claims to still be ongoing even though both endpoints went away hours before. My understanding is that the SIP proxy doesn't poll the client to ask if the call is still in progress, so it's relying entirely on the endpoints not vanishing mid-call. I've also had occasional issues with my POTS<->SIP gateway not hanging up the call in the event the handset crashes, which is a problem when calling a POTS line since POTS doesn't allow the callee to hang up the line!

    Also I note that under Asterisk, some of my handsets very occasionally fail to ring, which I presume is down to a loss of the SIP packet (although this should be resent?!).

    One problem I've noticed with Asterisk being an endpoint rather than a proxy is that media isn't renegotiated - for example, Asterisk will negotiate a video stream with the caller before connecting to the callee, and that video stream will remain in place even when it's discovered that the callee doesn't support video (Asterisk just bitbuckets the video stream, which means lots of wasted bandwidth that could be avoided). I presume though that this is just an artifact of Asterisk's implementation rather than an inherent problem of being an endpoint rather than a proxy?

    1. Indeed. What the FireBrick does is spot the media stopping, as it is always in the path, and then "poll" the endpoint with a re-invite to check. It is valid for media to stop when on hold, for example, so this approach allows that as long as the endpoint still answers the re-invites. Also, we only negotiate a-law audio so we can always join the calls up - it is a deliberate restriction, which means it works the same as the PSTN, but again makes for reliability.

    2. Can you not renegotiate the codec once the call has started? (I would've expected that to be do-able with a reinvite?)

    3. Yes, and indeed, the Linux based system does that, but it is nothing like as "clean" in the way of working and has all sorts of problems of its own. This is the lowest common format but the result "just works", and is good call quality too.

  2. Would you like a Cisco 7960 to add to your collection?