Friday, 18 July 2014

Software release cycle

Software can have bugs, and any ongoing development has to consider carefully how updates are released.

At one end of the spectrum, and obviously "best practice", you have carefully written and reviewed specifications, not only of the overall system but of each change that is being considered. You have a team generating the necessary test specifications, another team writing the test systems, and another team that does module testing and regression testing. You have the developers and then people who review the code changes. Then you have an alpha release to internal users, a beta release to external testers, and finally a planned and announced release along with a detailed and tested roll-back plan.

At the other end of the spectrum you have code that is hacked around on a live system, not properly tested, constantly changing and often broken.

Both extremes have serious problems.

Interestingly, we are suffering from some of the "best practice" approach with the new roaming SIMs as it is taking ages for the mobile operator to get the new tariffing system in place for this. In the past, even with a serious bug in BT's 20CN network, it has literally taken years to get the bug fixed. The mobile operator is actually very agile for a telephone company.

We are at neither extreme - but annoyingly, over the last couple of weeks, a couple of issues have come up which make us want to improve things. We tend to be agile, which means that we have much more rapid processes for testing and deployment. It does not mean we have none, obviously, but it has the downside that occasionally the live systems can run into problems. Part of being agile is that you also have to be very good at fire fighting to reverse or resolve such problems, and we are good at that. We have a very good team.

I have mentioned some of the processes we go through for things like FireBrick development in the past. We have pretty good procedures for our LNS updates which use FireBrick. The issues of late have specifically related to our VoIP platform (though there was also a database issue which affected VoIP).

We have a couple of legacy platforms that are being phased out, but our main VoIP system is a pair of call servers, and a pool of RADIUS servers to direct calls and to log CDRs for billing. We have a pool of call recording and voicemail servers as well. We also have a separate test VoIP server connected to a separate RADIUS server and a separate call recording server.

When we make changes, fixing bugs or adding features, we test these locally, then we test on our office server and the test VoIP server. We are able to direct SIMs to the test server as well as have customers log in to the test server if needed. Once we are happy with the changes we deploy them on the live servers. In some cases there have been important changes that need to be rolled out to resolve customer issues - lately we have had a customer using a specific make of call server which has a number of quirks. This has meant that we are updating the live servers every couple of days.

We're not entirely happy that this system provides enough of a buffer between the stable customer experience of the service and the active changes we are making, so the plan is to set up two more call servers along with RADIUS and recording platforms.

We'll have our test server, which is really only for our use and the one customer at a time with whom we are working. We'll have the agile test servers, on which we deploy new code regularly and on which we expect there to be a number of customers that need the latest features or just like to be leading edge. We'll then have the two existing servers, which are stable. The idea is that we can have people on the agile test servers indefinitely if necessary, even if it is months before we do a new stable release. These agile test servers should be at least as stable as the existing servers, but the stable servers used by most people will change very infrequently.
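
As a rough illustration of the three tiers described above, here is a minimal Python sketch of how customers might be assigned to a tier. It is not how our systems actually work; the names `Tier`, `Routing`, `opt_in` and `tier_for` are all made up for the example, and it simply assumes that anyone who has not asked for something else stays on the stable servers.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Tier(Enum):
    TEST = "test"      # our own use, plus the one customer we are working with
    AGILE = "agile"    # new code deployed regularly, for customers who opt in
    STABLE = "stable"  # the existing pair of servers, changed very infrequently


@dataclass
class Routing:
    # Hypothetical table of customer -> tier. Anyone not listed is on stable.
    overrides: Dict[str, Tier] = field(default_factory=dict)

    def opt_in(self, customer: str, tier: Tier) -> None:
        """Move a customer to a tier, allowing only one customer on TEST."""
        if tier is Tier.TEST:
            others_on_test = any(
                t is Tier.TEST
                for c, t in self.overrides.items()
                if c != customer
            )
            if others_on_test:
                raise ValueError("test server is for one customer at a time")
        self.overrides[customer] = tier

    def tier_for(self, customer: str) -> Tier:
        """Return the tier a customer's calls should be directed to."""
        return self.overrides.get(customer, Tier.STABLE)


if __name__ == "__main__":
    routing = Routing()
    routing.opt_in("customer-with-quirky-pbx", Tier.TEST)
    routing.opt_in("leading-edge-customer", Tier.AGILE)
    for name in ("customer-with-quirky-pbx", "leading-edge-customer", "everyone-else"):
        print(name, "->", routing.tier_for(name).value)
```

The point of defaulting to stable is that a customer only ever sees the faster-moving code by asking for it, and can stay there as long as they like.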

Because the call servers all work within a specific published range of IP addresses, this work is a bit tricky and tied in with the legacy servers being removed. However, we think it is an important step for ensuring we have a good quality of service for customers.


  1. Release early, release often

    1. Not for business-critical code. The dilemma which the Rev is facing is pretty common: balancing the need to get bug-minimised code into service without huge delays which ensure that known bugs stay in service longer.