2023-05-31

Defensive coding

Business systems can be written over long periods of time and in a variety of scripts and languages, and we (A&A) are no exception. Some of the systems started life more than 26 years ago and even before A&A existed.

Obviously all of the code, scripts, web applications and so on have been updated over the years in many ways. If nothing else, the advances in security best practice alone are reason to update systems.

One aspect to the way some of the code has been written is "defensive coding". This is basically assuming the worst can happen (either by accident or malice by an attacker) and trying to ensure the outcome of such will always be "safe" in some way.

The way we (A&A) do Direct Debits is one example of this. At every stage in a very complicated set of processes and scripts we try to assume the worst and take the safest action on failure. One area of this is the notices sent to customers. There are a load of rules we should follow, and (seemingly unlike so many other companies) we try very hard to follow the rules very, very carefully.

For example, if we have notified a customer that we will "collect £20 on, or immediately after, the 1st of each month" we record that fact and try to ensure that DDs only go in if that is the case. The rules allow up to 3 working days later, but the second we are not collecting £20 exactly, or it is not exactly on the 1st or within 3 working days, we cannot rely on that notice and so the DD is not allowed. Indeed, if we notify the customer of anything else, such as cancelling a collection, we discard the record of that notice so a new notice has to be issued.
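A minimal sketch of that kind of check, purely as an illustration and not the actual A&A code. The names `Notice`, `working_days_after` and `collection_allowed` are made up for this example, and working days are taken as Monday to Friday with bank holidays ignored.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from decimal import Decimal
from typing import Optional


def working_days_after(start: date, end: date) -> int:
    """Count working days (Mon-Fri, bank holidays ignored) after start, up to and including end."""
    count = 0
    d = start
    while d < end:
        d += timedelta(days=1)
        if d.weekday() < 5:  # 0-4 are Monday to Friday
            count += 1
    return count


@dataclass
class Notice:
    amount: Decimal  # the amount the customer was told about
    due: date        # the date the customer was told about


def collection_allowed(notice: Optional[Notice], amount: Decimal, when: date) -> bool:
    """Allow a collection only if it exactly matches a notice still on record."""
    if notice is None:           # no notice, or it was discarded: not allowed
        return False
    if amount != notice.amount:  # must be exactly the notified amount
        return False
    if when < notice.due:        # never earlier than notified
        return False
    return working_days_after(notice.due, when) <= 3


notice = Notice(Decimal("20.00"), date(2023, 6, 1))
print(collection_allowed(notice, Decimal("20.00"), date(2023, 6, 6)))  # True
print(collection_allowed(notice, Decimal("20.00"), date(2023, 6, 7)))  # False
```

The point of the design is that every path that is not an exact match falls through to "not allowed", including the case where the notice record has been discarded.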

Another aspect is checking it all worked. This is a key step for defensive coding: scripts and systems to check that things that should work have worked. Such a system flashed lights, dinged my phone, and even set off the very loud teletype in my hall in an attempt to make quite sure I knew things were not quite right and that some intervention may be needed. That said, the comments on Mastodon and call from staff were also a clue something had happened!

So what did happen yesterday?

With all of this in mind, shit happens, and did yesterday, and a load of A&A customers were told their 1st June DD collection was cancelled. So what happened?

The root cause was an update to one of our billing system SQL servers. We do updates to all sorts of servers all the time, and it seems the timing of this one was unfortunate. Many updates are to ensure all security patches are applied, and so are done as soon as possible.

Once again, defensive code: the systems generating billing records, e.g. for calls, queue them up until the SQL server is back. And the billing system that uses these databases to make bills will re-run a few hours later if there is any issue accessing the data, so bills are just delayed slightly.
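The "queue them up until the server is back" idea can be sketched roughly like this. It is only an illustration under assumptions: `QueuedWriter` and the `write` callable are hypothetical names, and `ConnectionError` stands in for whatever a real database client raises when the server is unavailable.

```python
from collections import deque


class QueuedWriter:
    """Hold records in memory while the store is down and write them, in order, once it is back."""

    def __init__(self, write):
        self._write = write      # hypothetical callable that stores one record
        self._pending = deque()

    def __len__(self):
        return len(self._pending)

    def submit(self, record):
        """Queue the record, then try to write everything that is waiting."""
        self._pending.append(record)
        return self.flush()

    def flush(self):
        """Write queued records oldest first; stop at the first failure and keep the rest."""
        while self._pending:
            try:
                self._write(self._pending[0])
            except ConnectionError:
                return False     # still down: nothing is lost, try again later
            self._pending.popleft()
        return True
```

A record is only removed from the queue after the write has succeeded, so a failure part way through leaves everything not yet stored waiting for the next `flush()`.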

The problem is that there was, apparently, a small window each month where there is a process of "working out what people will be billed in two working days". This is a test/dummy billing run. It is done to confirm the regular direct debits are as expected and ensure they are submitted to BACS two days in advance, so two working days before the 1st of the month, i.e. yesterday. That was the system that failed to access the database, and unlike normal billing runs, which can run a few hours later, it left a load of customers set up to not be expecting a bill on the 1st.

This is a small window, and as a result it meant that a load of customers were not expected to have a bill on the 1st, even though they will. If the bill on the 1st had an issue it would have run a few hours later and still billed. But this was a test/dummy bill run in advance and run only once a day.
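The general trap here is a lookup that could not run being treated the same as a lookup that found nothing. A minimal sketch of keeping those two cases apart, purely as an illustration: the names `Action`, `decide` and `forecast` are hypothetical, `ConnectionError` stands in for whatever a real database client raises, and this is not a description of the actual A&A code or of the change that was made.

```python
from decimal import Decimal
from enum import Enum
from typing import Callable, Optional


class Action(Enum):
    KEEP = "keep the scheduled collection"
    CANCEL = "cancel the collection and discard the notice"
    RETRY = "change nothing and try again later"


def decide(forecast: Callable[[], Optional[Decimal]]) -> Action:
    """Decide what to do with a scheduled collection from a dummy billing run.

    `forecast` is a hypothetical callable returning the expected bill amount,
    or None if no bill is expected. It raises ConnectionError if the database
    cannot be reached.
    """
    try:
        expected = forecast()
    except ConnectionError:
        # "Could not find out" is not the same as "no bill expected".
        return Action.RETRY
    if expected is None or expected == 0:
        return Action.CANCEL
    return Action.KEEP
```

Which action counts as "safe" on failure is a judgement call. The sketch assumes that leaving an existing, valid notice alone and retrying is safer than cancelling.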

It is not just the 1st: we have billing on 4 week cycles, and every full moon, and all of these work in the same way.

What happened then is that the direct debits scheduled for the 1st were cancelled because the test/dummy billing run did not say a bill was expected. The system sending the cancel emails knows to also forget the previous notice of collection. This means that not only were "cancel" emails sent, but once the bills do happen on the 1st, a new Direct Debit collection notice email giving 5 working days' notice will need to be sent, so following the DD rules.

It means people will have a DD collection on the 8th, not the 1st, and then from the following month should be back to normal.
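The date arithmetic behind the 8th can be checked with a few lines. This is a sketch that assumes the new notice goes out on 1st June 2023 and that working days are Monday to Friday with no bank holidays in between; `add_working_days` is a made-up helper, not part of any billing system.

```python
from datetime import date, timedelta


def add_working_days(start: date, days: int) -> date:
    """Step forward the given number of working days (Mon-Fri, bank holidays ignored)."""
    d = start
    while days > 0:
        d += timedelta(days=1)
        if d.weekday() < 5:  # 0-4 are Monday to Friday
            days -= 1
    return d


notice_sent = date(2023, 6, 1)  # Thursday
earliest_collection = add_working_days(notice_sent, 5)
print(earliest_collection)  # 2023-06-08
```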

Whilst it is a mistake, the system, being defensively coded, has followed the DD rules for notices to the letter, delaying collections, and sending new notices.

What have we done?

We have changed the logic slightly, which means this should not happen in quite the same way, but we have also made sure the ops team know not to do an update in the middle of this test billing run in future.

Sorry for any inconvenience caused by an extra week's credit / time to pay on your bills. As always we try to learn from our mistakes. And well done to my staff handling all the calls and emails today on this.

9 comments:

  1. I saw the headline, and guessed it’d be about Direct Debits ;) I checked the status page shortly after I read the cancellation email .. thank you for posting about it there as well — Ben P.

  2. I wondered what was going on, but trusted A&A to Do The Right Thing, so I wasn't too bothered. Thanks for keeping us informed!

  3. Just a suggestion, I think an “oops, sorry, this was a mistake and here’s what to expect next” communication to customers wouldn’t go amiss.

  4. Err, we did exactly that right away!

  5. I didn’t get such a thing! I have “Cancelled Direct Debit collection” on Tuesday and the next email from *@aa.net.uk (apart from the 2FA message when I logged into the accounts system to see what happened) is “Direct Debit Collection”, but neither of them explain that there was an error, they are regular automated emails.

    Replies
    1. We posted to aastatus.net on 30th May around 19:00 and on social.aa.net.uk the following morning at 7am.

  6. Haha I assumed our lot had screwed up the account somehow, so it didn't occur to me to check your status page

  7. The thing is, you understood the mission. Most other companies would code things just enough to make them work.

  8. Meanwhile, with another ISP....

    I signed up and my first 3 months were free, but I was paying £1/m for a static IP.

    This meant my first bill was £0.13.

    I got this threatening email from them:

    Hello , Your most recent Direct Debit payment has failed. Immediate payment is required to prevent a suspension of your service(s). Account Number: Invoice(s): Amount: 0.13 E-check payment was DECLINED. ERROR: validation_failed: Code: 422:
    amount: is lower than the minimum permitted amount Please contact us immediately and advise when payment shall be paid.

    And then when I forwarded them this email and asked them to read it including the error message and get back to me with a human response, got another reply demanding immediate payment.


