A bit of history...
First we used Asterisk, with a lot of custom back end scripts and database work to provide the features of our voice service. This worked, but it does not scale well. Asterisk is an open source (ish) PABX, and ideal for a small office, but it does not scale to the sort of general voice service that A&A offer. These days I would, of course, suggest a FireBrick based VoIP PABX instead.
Then I wrote my own SIP platform from scratch on Linux. This was based on the RFCs, but the SIP RFCs are hard work, and although it worked well the code has got a bit worn out now. The nice bit is that SIP allows a pool of registration servers, call routing boxes, and so on - all very scalable. Sadly it turns out that most phones refuse to take calls from a different IP than the one they registered to. This thwarted that plan, and the multiple servers collapsed to one. The scaling plan is to put numbers on specific servers. It works well, but does not handle NAT connected endpoints, and there are still issues with firewalls on some kit.
The new FireBrick system is good (it had better be, I wrote it). It is a completely new implementation based on a second look at the RFCs. Some key decisions, based on the experience of the Linux based solution and, importantly, the way devices work, have meant we operate in a specific way. We are no longer working as a relaying proxy but as an endpoint, i.e. devices make calls to the FireBrick and the FireBrick makes calls to devices, and it joins them together at the audio level. This solves many problems, and also means the audio is always to/from the same IP at the FireBrick end, which helps firewalls and may allow some degree of NAT handling.
A key feature we are using is that FireBrick SIP works with IPv6 perfectly, as you would expect.
It also avoids all sorts of issues with switching the audio feed. Previously we passed on all SDP renegotiations end to end, but this meant changes at the RTP level in sequence, source ID and timing, as well as in IP and port, plus some race conditions. This sometimes broke things. The FireBrick generates the audio stream with one source ID and sequence regardless, handling changes, call transfer, re-invites, and even tone generation seamlessly as one output stream. We should not have to go to such extremes, but doing so creates the most reliable calls. It also means, importantly, that we can see a loss of media, as we are always in the media path. That allows calls to be closed when no BYE is received (kit reboots, stuff disconnects, etc) and so avoids incorrectly billing calls.
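To illustrate the idea of one output stream, here is a minimal Python sketch. It is not the FireBrick code. The class names, the 20ms a-law packet size and the timeout are all my assumptions for the example. Whatever arrives from the far end, the payload is re-stamped with our own source ID, sequence and timestamp, and a simple watchdog notes when media has stopped.

```python
import struct

RTP_HEADER = "!BBHII"  # V/P/X/CC, M/PT, sequence, timestamp, SSRC


def rtp_payload(packet):
    """Return the payload of an RTP packet, skipping any CSRC list.

    Assumption for this sketch: no header extension and no padding.
    """
    csrc_count = packet[0] & 0x0F
    return packet[12 + 4 * csrc_count:]


class OutputStream:
    """Re-stamps audio payloads into one continuous RTP stream."""

    def __init__(self, ssrc, payload_type=8, samples_per_packet=160):
        self.ssrc = ssrc
        self.payload_type = payload_type  # 8 is PCMA (a-law) in RTP/AVP
        self.samples_per_packet = samples_per_packet  # 20ms at 8kHz (assumed)
        self.seq = 0
        self.timestamp = 0

    def send(self, payload):
        """Build the next outgoing packet, whatever the payload's origin."""
        header = struct.pack(RTP_HEADER, 0x80, self.payload_type,
                             self.seq, self.timestamp, self.ssrc)
        self.seq = (self.seq + 1) & 0xFFFF
        self.timestamp = (self.timestamp + self.samples_per_packet) & 0xFFFFFFFF
        return header + payload


class MediaWatchdog:
    """Flags a call leg whose media has stopped arriving."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds  # hypothetical value, chosen by caller
        self.last_seen = None

    def packet_received(self, now):
        self.last_seen = now

    def media_lost(self, now):
        # Only report loss once media has been seen at least once.
        return self.last_seen is not None and now - self.last_seen > self.timeout
```

The point is that the far end can change source ID, sequence or timing as often as it likes (transfer, re-invite, tones) and the device receiving the output never sees it.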
The FireBrick also does recording. It allows a call leg to be tee'd off to a separate SIP connection, which can be a simple SIP endpoint (like Asterisk), but if the endpoint claims to handle stereo a-law then the tee'd off call puts each side of the call on a separate channel of a stereo call. We have a simple Linux endpoint that does that and emails the calls, and we are basing our A&A call recording on that.
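As a rough sketch of what putting each side on its own channel means, here is some Python that interleaves two mono a-law streams into one stereo stream. A-law is one byte per sample, so stereo is just alternating bytes. Which side goes on which channel, and padding the shorter side with a-law silence (0xD5), are my assumptions for the example, not a description of what the FireBrick or our recording endpoint does.

```python
ALAW_SILENCE = 0xD5  # a-law encoding of a zero sample


def interleave_stereo(left, right):
    """Interleave two mono a-law byte strings as L,R,L,R,...

    The shorter side is padded with silence so both channels
    cover the same number of samples.
    """
    length = max(len(left), len(right))
    left = left.ljust(length, bytes([ALAW_SILENCE]))
    right = right.ljust(length, bytes([ALAW_SILENCE]))
    stereo = bytearray(length * 2)
    stereo[0::2] = left   # even byte positions: one side of the call
    stereo[1::2] = right  # odd byte positions: the other side
    return bytes(stereo)
```

Keeping the two sides on separate channels means a recording can be replayed, or analysed, with each party's audio isolated.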
Most of the work on the A&A side is RADIUS server based. The FireBrick allows all calls, registrations, and so on, to be validated by RADIUS. This allows a pool of servers to handle call routing based on our database back end, and to handle the call logging and billing.
One of the key things, on re-reading the RFCs, is a new way to handle scaling of the service. The RFCs describe a really useful concept of a redirect server. The idea is that registrations and calls go to that server which does not really do calls at all - it just responds with 302 redirect messages telling the caller where to connect to. This means we can share the customers between servers, and take servers out for maintenance and so on.
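For anyone who has not met a redirect server, the following Python sketch shows roughly what one sends back. The server names and the way a server is picked (a CRC of the user name) are invented for the example, and a real response would also add a tag to the To header and handle multiple Via headers, which I have left out.

```python
import zlib

# Hypothetical pool of call servers; not real host names.
SERVERS = ["sip1.example.net", "sip2.example.net"]


def pick_server(user, servers):
    """Map a user to one server in the pool, the same one every time."""
    return servers[zlib.crc32(user.encode()) % len(servers)]


def redirect_response(request_headers, user, servers):
    """Build a 302 response pointing the caller at the chosen server.

    request_headers is a dict holding the Via, From, To, Call-ID and
    CSeq values from the request, which the response echoes back.
    """
    target = pick_server(user, servers)
    lines = ["SIP/2.0 302 Moved Temporarily"]
    for name in ("Via", "From", "To", "Call-ID", "CSeq"):
        lines.append(f"{name}: {request_headers[name]}")
    lines.append(f"Contact: <sip:{user}@{target}>")
    lines.append("Content-Length: 0")
    return "\r\n".join(lines) + "\r\n\r\n"
```

Taking a server out for maintenance is then just a matter of removing it from the list the redirect server chooses from.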
Sadly we have, again, been thwarted. Whilst the SNOM phones we tested have no issue with this plan, it seems a lot of SIP devices get confused, assume the 302 response is an error, and give up or pop up a "retry" box. The carriers that handle our inbound calls also don't like 302s. So yet again a key part of SIP design that would allow elegant scaling and redundancy is screwed. Why do I even bother? FFS.
So now we are testing using DNS changes to manage pools of registration servers. At least we can have pools of RADIUS servers, and call recording servers and so on as they are no longer in the SIP proxy path.
But we are having fun doing SIP/NAT testing. We want to see if the FireBrick code can work where there is NAT. We have done some testing today using the Technicolor routers we ship as standard on Home::1 lines. To our surprise the ALG in the Technicolor is not bad, and means our end does not see anything odd or NATty. It kind of "just works". We have tested against SNOM and Gigaset kit so far.
We'll be testing with non-ALG NAT devices soon to see if we can make them work. When we are happy with all of the testing we can move customers over to the new service which will be more scalable and reliable.