2024-06-26

TOTSCO, gets worse

Seriously, this is bad.

TOTSCO have specifications for the whole process, but they are made of cheese. They don't even specify such fundamental things as the basic data type for something like an RCPID (Retail Communications Provider ID). I have argued with them, as one spec does say it is "4 alpha characters, not starting A", but they dismiss this as not actually the spec of an RCPID, and seem to have no issue with not having a specification at all?!?!

To be clear, I would expect it to be something like: "An RCPID is assigned by TOTSCO, and is 4 alpha characters not starting "A", or the 6 character string "TOTSCO". In JSON it is a string type value. By convention it started with an "R", but this is not a requirement and should not be assumed.", and I would even like them to reserve "TEST" as a special RCPID. I'll help them write a spec if they ask!
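
As a quick illustration, and assuming only the wording I suggest above (nothing more official than that), checking the format is a few lines of code:

```c
#include <ctype.h>
#include <string.h>

/* Sketch only: checks the RCPID format suggested above - 4 alphabetic
   characters not starting "A" (assumed case-insensitive), or the
   literal "TOTSCO". Not an official TOTSCO definition. */
int valid_rcpid(const char *id)
{
   if (!id)
      return 0;
   if (!strcmp(id, "TOTSCO"))
      return 1;
   if (strlen(id) != 4 || toupper((unsigned char) id[0]) == 'A')
      return 0;
   for (int i = 0; i < 4; i++)
      if (!isalpha((unsigned char) id[i]))
         return 0;
   return 1;
}
```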

A clear specification to which all CPs can refer is essential. Heck, we are used to this with RFCs: the RFC is the reference, and whether someone has got it right or wrong is determined by reference to the RFC.

But what is worse is the whole testing and integration process!

There seem to be these steps:-

  • A really simple messaging test (their simulator). It is flawed, but checks basic OAUTH2 at least.
  • A CP to CP integration test using their pre-production platform. <-- WE ARE HERE NOW!
  • Then live!

At no point is anything tested to the specification!!!!

I am not sure there is even a process for reporting and resolving a CP not following what little specification there is!

This is a serious problem, and as a simple example, we are currently going through the integration testing process with a buddy CP that has already done it. I won't name them, it is not their fault.

The first test

The first test was actually pretty good in many ways - they misread the details I provided and sent a residentialMatchRequest with an invalid account number, and we replied with an error, saying it was an invalid account number format. Yay, a good test.

So, I take that as a huge success of a test.

But no...

Their request was wrong in other ways, and now that I have seen it, I have updated my system. They sent an envelope destination correlationID on an initial message, which is not in accordance with the specification. We mistakenly used that in our error reply. Oddly, TOTSCO sent us a messageDeliveryFailure even though the other CP got our message, and we then barfed at the correlationID on that, because it was not one we issued!

So why TOTSCO sent the messageDeliveryFailure is unclear, but the other CP got it wrong in the original message anyway. What is worse, at least one messageDeliveryFailure was itself incorrect according to the specification, as it had no source correlationID, which is mandatory.
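
To illustrate the rules in play here, this is roughly the kind of envelope sanity check involved. It is only a sketch: the struct and field names are hypothetical stand-ins, not the actual OTS schema, and the rules are simply the ones just described.

```c
#include <stddef.h>

/* Hypothetical parsed envelope - the field names are illustrative
   stand-ins, not the real OTS message schema. */
typedef struct {
   const char *source_correlation_id;      /* correlationID chosen by the sender */
   const char *destination_correlation_id; /* should echo a correlationID we issued */
   int is_initial;                          /* non-zero if this starts a new conversation */
} envelope_t;

/* Stand-in for a lookup against the correlationIDs we have issued. */
static int correlation_id_known(const char *id)
{
   (void) id;
   return 1;                    /* real code would check its own store */
}

/* Returns NULL if the envelope looks sane, else a reason to reject it. */
const char *check_envelope(const envelope_t *e)
{
   /* An initial message should not carry a destination correlationID,
      as there is nothing of ours for it to refer to yet. */
   if (e->is_initial && e->destination_correlation_id)
      return "unexpected destination correlationID on initial message";
   /* A reply (or a messageDeliveryFailure about one of our messages)
      must refer back to a correlationID we actually issued. */
   if (!e->is_initial && (!e->destination_correlation_id || !correlation_id_known(e->destination_correlation_id)))
      return "correlationID is not one we issued";
   /* A source correlationID is mandatory - this is what the bad
      messageDeliveryFailure was missing. */
   if (!e->source_correlation_id)
      return "missing mandatory source correlationID";
   return NULL;
}
```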

So at this point, we had a few checks missing, but the other CP had their message slightly wrong. They are the ones that have passed integration testing and are sending a wrong message to us. They fixed it and tried again, but TOTSCO then failed to deliver the message to us, which looks like another TOTSCO error.

Naturally my NOTSCO system picks up this stuff now.

Working with them

To be clear, we are working with the other CP here, we want to make it work.

Update: It turns out they had not been through integration testing, which suggests they have been waiting at least a month for someone to buddy with. That points to yet another serious problem in the process!

So, the score so far...

  • Other CP, 1 error (minor), fixed.
  • Us, 1 error (not handling their error well), fixed.
  • TOTSCO, 2 errors, still awaiting a reply.

Update: Not a peep from TOTSCO all day so far, formal tickets raised.

Update: After raising tickets, I have some replies. They claim we did not respond within 2s, but my logs show no request, so some packet dumping next.

Update: One reply is interesting: their invalid message is apparently correct, as two parts of the specification contradict each other.

Update: They said we did not respond within the 2s SLA, but when I asked for the SLA, it actually states 3s (after up to 1s connection time), so no idea where the 2s came from.

Update: and wow...

Don't trust Apache!

This may be of use to other CPs here. The SLAs are tight: they want a response (at the HTTP level) within 3 seconds.

Our endpoint runs as an Apache CGI executable. It writes a Status header, a Content-Type header, and the content (JSON) to stdout, and exits. That should be it. Simples!

My code was responding quickly, indeed usually well under 100ms, as measured both in the code and from an external connection (NOTSCO).

However, TOTSCO were still struggling and saying we were timing out. Very odd indeed, so I did packet dumps to prove them wrong.

To my shock, the packet dump showed a 5 second delay in the middle of the TCP exchange.

After some experimentation, noting that TOTSCO send Connection: keep-alive, I eventually found that if I sent a Content-Length header and then the JSON, Apache no longer fucked about and responded instantly.

I can only assume it is some persistent connection thing, which does not usually play well with CGIs like this. But even so, having closed stdout and exited, I expected Apache not to wait.
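
For anyone wanting the shape of the fix, here is a minimal illustrative CGI (not the actual code, and the JSON is just a placeholder): build the reply first, then send an explicit Content-Length before the body.

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
   /* Build the whole JSON reply first so the length is known. */
   const char *json = "{\"example\":\"response\"}";
   size_t len = strlen(json);

   /* Emit the CGI headers, including an explicit Content-Length, which
      seems to stop Apache sitting on a keep-alive connection working
      out where the body ends. */
   printf("Status: 200 OK\r\n");
   printf("Content-Type: application/json\r\n");
   printf("Content-Length: %zu\r\n", len);
   printf("\r\n");
   fwrite(json, 1, len, stdout);
   fflush(stdout);
   return 0;
}
```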

So, heads up, that 3 second timeout SLA can catch you out!

6 comments:

  1. Sounds like banking, again. The same supplier I mentioned in a previous comment:

    We were processing card payments as an issuer, which required us to be able to authorise and then match said auths during the presentment phase in order to reconcile our internal systems. Presentments are gratuitous notifications of external money movements issued by the card network when they actually claim funds from the issuing bank/payment institution; retail customers don't normally care about them or even see them, and the issuer does not get to decline them, but they are the actual settlement taking place. Sometimes they are shown as 'pending' and 'settled' on bank statements, or similar.

    Normally, a payment authorisation locks the funds on the customer's account (the balance appears to be debited). Internally, the value is transferred to a suspense account in the ledger pending settlement with the network. A few days later, the presentment arrives, and we attempt to identify the previous transactions so as to use funds from the suspense account. However, if not possible, we must debit the customer again - maybe we didn't have a preceding auth (it happens), but the money has been settled with the network either way. This leaves the customer with funds sat in our suspense account (not available to them) and a second transaction.

    Presentment handling has many poorly-documented edge cases and gets hideously complex when dealing with anything other than a simple transaction that was auth'ed and then presented as a single entity each time. Many merchants have opt-outs from the regulations, and behaviours in certain countries and merchant types lead to interesting edge cases, but I digress...

    One of the key pieces of data for matching the transactions is the _auth code_, the six digit value issued by the network during payment processing and often printed on receipts. It's given in the presentment and is used, among various other identifiers, to match with the preceding auth so as to avoid this double debit situation. This field was very strictly documented in the provider's API docs as a string value, which would contain up to six numerals with possible spaces as leading padding. Unfortunately, they seemed to misapply this internally during some code refactor, treating these values as integers subsequently formatted to string literals and of course, not adding the leading zeroes. When everyone normalises deviance, and treats it as an integer internally, it's easy to forget what the standards say.

    They only did this on one of the paths (either auth or presentment), leading to mismatches in our matching code. To us, 12345 ≠ 012345 ≠ p12345 (p = space) but they (and most of their other customers, all of whom were unaffected) seemed to ignore this apparently minor implementation/correctness detail.

    Their response? Oh, "you should be treating them as integers". Not according to your API docs, they are not! So customers got double-debited, in aggregate, several hundred £k until the problem could be identified and reversed. We could have done reversals quickly, but we didn't trust that the provider would not also issue them themselves, thus double _crediting_ the customer - an even worse situation allowing the money to walk out of the door.

    Sadly, the incentives are misaligned, as all of these in-network B2B providers that we depend upon are not visible to the average retail customer, who vents their ire and frustration at their bank/supplier/whomever who has limited recourse to correct a system that is broken and dead on arrival. There is something to be said for engineering with a crystal ball when you don't just read the spec, but have to anticipate how that spec has been (mis)implemented and code defensively for that. Hyrum's law comes to mind too. https://www.hyrumslaw.com/

    Replies
    1. Indeed, and I have a lot of tolerance built into my implementation, which is "turned off" for the integration testing phase deliberately.

  2. "The SLAs are tight, they want a response in 3 seconds."

    Is this for message acceptance, rather than a match response? I think that's 60. I'll try and find it all again; it was buried in some appendix or other...

  3. You might be able to send "Connection: close" instead of "Content-Length: ..." if you don't want to support their keep-alive request.

    Replies
    1. First thing I tried, did not help. Needed Content-Length.


