2024-09-12

OTS correlation ID

It is complicated to report this, as some details would be covered by ramp up rules, where I cannot provide details, but now we are live on One Touch Switching I think I can report. Even so I will not name CPs for now. This is more about the process, the specification, and the fiasco.

To be very fair, the main author of the specifications is someone I feel a lot of sympathy for: he was under pressure, and then under a change freeze he would not have agreed to. He did his best and none of this is a dig at him or his employer. It is much more a dig at the process.

Weirdly correlationIDs have turned out to be a big issue, and continue to be so.

What is a correlationID?

Basically, one of the message fields in One Touch Switching is called a correlationID.

The big issue is the vague specification. It is a field the sender of a message sets so they can correlate the response.

Hindsight

To be clear, in hindsight, and as I have said, correlationIDs should be unique per message and a UUID. Simple. If the spec had said that, a lot of pain and hassle would have been avoided.
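As a minimal sketch of that hindsight rule (not anything the specification mandates), a sender could mint a fresh random UUID for every outgoing message. The function name new_correlation_id is made up for illustration.

```python
import uuid


def new_correlation_id() -> str:
    """Return a fresh random (version 4) UUID string for one outgoing message."""
    return str(uuid.uuid4())


# Every message gets its own ID, so no two messages share one.
ids = [new_correlation_id() for _ in range(1000)]
print(len(set(ids)))  # number of distinct IDs generated
print(len(ids[0]))    # a UUID in its usual text form is 36 characters
```

At 36 characters this also fits comfortably inside every length limit mentioned later in this post.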

Problems?

The problems are various...

  • It is not a defined format
  • It is not a defined maximum length
  • It is not well defined when and how it is unique, or not

TOTSCO 66

So the issue is that some CPs assumed it would be unique per message, per CP, and so used it to identify (and ignore) duplicates. Indeed a notice from TOTSCO suggested it is used to de-duplicate messages.

A real issue is "why duplicates?", which is another issue: they have to be failing to respond in 3 seconds for that, and maybe that is what they should have fixed.

There are also a lot of cases where a duplicate is not an issue, if done right.

But TOTSCO 66 said you can de-duplicate based on per CP correlationIDs being unique per message.

TOTSCO 67

The next notice backpedalled a lot, and I was instrumental in raising this, I think. All because the specification was so vague. The new recommendation was twofold: (a) don't de-duplicate on correlationID, and (b) don't send duplicate correlationIDs. A pragmatic approach without direct blame on either side.

Indeed one idea was, if you use correlationID as a more "overall message flow" ID, append or prepend something to each message so they end up unique.
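A rough sketch of that idea, assuming a sender that wants one ID for the whole flow but still needs each message to be unique: keep a flow-level UUID and append a per-message sequence number. The class name, the "-" separator and the counter are all assumptions for illustration, not anything from the TOTSCO notices.

```python
import itertools
import uuid


class FlowCorrelation:
    """Hypothetical helper: one flow-level ID, plus a per-message suffix."""

    def __init__(self) -> None:
        self.flow_id = str(uuid.uuid4())
        self._seq = itertools.count(1)

    def next_id(self) -> str:
        # Same flow prefix every time, but the suffix makes each message unique.
        return f"{self.flow_id}-{next(self._seq)}"


flow = FlowCorrelation()
first = flow.next_id()
second = flow.next_id()
print(first != second)                 # True
print(first.startswith(flow.flow_id))  # True
```

The sender can still tie a reply back to the flow by stripping the suffix, while anyone de-duplicating on the whole correlationID never sees a repeat. The catch, as covered under Length, is that every suffix makes the ID longer.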

So CPs are, indeed, doing both, yay! We have all seen a lot of work making this happen, and well done to all the CPs doing this.

To clarify we went through something like three iterations to get this sensible on our systems.

Length

Oh, did the specification say how long a correlationID could be? No. It did not. Why would you say that?

Well, maybe it did, sort of. TOTSCO link to some schema thing (swagger?!) which was updated after the frozen spec, and the latest version of that says 256 characters. That is mental long, and I have no clue if that is 256 characters or 256 bytes (they are different in the UTF-8 world of JSON). Just to say, A&A can handle any length up to megabytes, if needed.
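To show why "256 characters" and "256 bytes" are not the same limit, here is a small sketch comparing the character count with the UTF-8 encoded size. The sizes helper and the sample strings are made up for illustration.

```python
def sizes(correlation_id):
    """Return (characters, UTF-8 bytes) for a correlationID string."""
    return len(correlation_id), len(correlation_id.encode("utf-8"))


ascii_id = "a" * 256           # plain ASCII: one byte per character
emoji_id = "\U0001F600" * 256  # U+1F600 takes four bytes in UTF-8

print(sizes(ascii_id))  # (256, 256)
print(sizes(emoji_id))  # (256, 1024)
```

So a correlationID that is within a 256 character limit could be four times over a 256 byte limit.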

Turns out TOTSCO had limits on what they would handle, as this is a message envelope thing. They were ignoring messages, and apparently not rejecting them cleanly, if the correlationID was too long. I have not tested with 256 x big unicode characters, yet!

But we have a big CP that would not handle more than 64 characters, but sorted that before the 12th, well done, I won't say who. It was a very reasonable choice for them, and I understand it. But well done moving to 256 characters, or bytes, in time.

We now have another big CP that would not handle more than 50 characters. Not yet sorted, but will be soon.

Why such long correlationIDs? Well, BECAUSE of the TOTSCO 67 notice, CPs are using a 36 character UUID and adding a timestamp. That just pushes over 50. And to be honest 50 was also a reasonable design choice.
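The arithmetic is easy to check. This sketch assumes a 14 digit YYYYMMDDHHMMSS timestamp joined to the UUID with a "-"; I do not know the format any CP actually uses, so the format, the separator and the function name are purely illustrative.

```python
import uuid
from datetime import datetime, timezone


def uuid_plus_timestamp(when):
    """Hypothetical ID: 36 character UUID, a '-', then a 14 digit UTC timestamp."""
    return f"{uuid.uuid4()}-{when.strftime('%Y%m%d%H%M%S')}"


cid = uuid_plus_timestamp(datetime(2024, 9, 12, 10, 30, 0, tzinfo=timezone.utc))
print(len(cid))  # 51, since 36 + 1 + 14

# Check against the three limits mentioned in this post.
for limit in (50, 64, 256):
    print(limit, len(cid) <= limit)
```

With these assumptions the ID fits 64 and 256 but is one character too long for 50. A longer timestamp format would only make it worse.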

So 256 characters, is that OK? Guess what, the tinytext type in MariaDB is 255 characters, FFS! If I had to make a silly long limit I would have said 255, not 256, really.

Ping pong

One of the mistakes we made at the very start, for a day or so, was assuming correlationIDs were ping ponged over the switch process (match, order, update, trigger). I had fields in the database to do this and code to do it (they were tinytext).

Why did I assume this? Well the specification did not say, but the test cases did, they had correlationIDs on an OrderRequest following on from the MatchRequest. They looked a lot like they should ping pong over the process.

Well I worked it out, but did every small CP that is live today?

The answer is no, they have not, and at least one small CP (I feel sorry for them) very carefully followed the spec, and the examples in the test schedule, and did this, like we did.

They will not work with almost any of the other CPs now live. They have to make major changes, now, when live. Really sorry for them.

Helping?

Seriously, we all need to work together. We have a test system that can help these new small CPs, and I am happy to help. https://notsco.co.uk/
