2024-07-24

TOTSCO 66 is guidance, optional

I feel I need to explain this.

The TOTSCO call today, first I have been on, and wow!

But a key point was TOTSCO bulletin 66, which is actually quite sensible guidance.

So what is the problem? It is guidance, not mandatory. CPs don't have to follow it even.

So let me try to explain.

If ANY CP follows that guidance then ALL CPs have to change how they create a source correlationID to be totally unique.

The API specification does not require that, so it is a real change.

If some other CP does not do that, the recipient CP, following the guidance, may assume a duplicate message and discard it.

This is non-trivial.

2024-07-23

OFCOM disinterest in OTS?

OFCOM sent a mildly threatening letter about One Touch Switching and the impending deadline.

I replied by email, twice, and heard nothing. So I wrote, and again nothing.

So now this.

We'll see if they reply.

Update: OFCOM want a call, yay!

Update: Useful call, OFCOM listening.

OK, being constructive

I am pondering what could be done right now. So some thoughts...

  • Firstly - this is not simply pedantry, or my getting pissed that I misread the spec - this is not hypothetical. Working with other CPs, and monitoring testing on my NOTSCO test platform, yesterday, half the CPs testing were falling foul of the latest checks on source correlationID added because of TOTSCO bulletin 66. I see other CPs are running in to all the issues I have raised with the specifications. Most of the errors my test platform picks up would not be picked up by the existing TOTSCO testing process. 
  • I feel some people with some clue how to write a clear specification, and who understand the challenges of coding systems to meet such a specification, need to be engaged with TOTSCO and taken seriously. I can help (though probably not for free - though I have made many suggestions anyway).
  • I feel the specifications need to be consolidated and simplified and put in one place - there are too many parts, in different places, some freely available, some under a login on the control pages, some XLS, some a web page, some PDF, and so on, it is a total mess. Clear and complete set of specifications in one place.
  • TOTSCO need the specifications updated, and kept updated, and a process to notify updates to all CPs involved, so they can ensure compliance. This means proper change controlled notices of what has changed, not a random bulletin that assumes/implies a serious change to a spec that is in a change freeze! Even if this was a weekly spec update with all changes sent to all CPs.
  • I feel going for 12th Sep is fine - we have to start somewhere, but for the start of full OTS usage, and not a requirement for all CPs to be on line, simply because I am not sure there is time for that. But from that date, all CPs that are live on TOTSCO should offer it as part of their ordering process, related to other CPs that are on TOTSCO.
  • Some later deadline for all CPs, maybe an even later one for small CPs.
  • I definitely think a self service test platform for API and OTS, with all sorts of scenarios (valid and error testing) and messages both ways, needs to be in place, and be a key part of compliance testing. I have one, and I am happy to work with TOTSCO if they want to use it. But it literally took only a couple of days to make, so TOTSCO could make one themselves. Testing should be to a reference implementation and against the specification. In practice making this a CP on pre-production (and even live), called TEST, with a control page on TOTSCO to manage tests and replies and logs, would be ideal.
  • We also seem to lack a way to contact other CPs when live to address issues - and a way for TOTSCO to arbitrate that one CP claims another CP does not meet the spec. A clear spec is needed, but a whole inter CP dispute process needs to be in place - and a reference test system would be invaluable for that.

2024-07-22

TOTSCO moving goal posts, again!

One of the big issues I had in initial coding was the use of correlationID on messages. The test cases showed it being used the same on a sequence of messages, e.g. a Switch Order had a destination correlation which only made sense if it was a response to a Match Confirmation, for example. I was wrong, but not for lack of reading the spec.

The API spec says this: In a source element, the correlationID must always be provided, the format can be anything the originator chooses to support their messaging process but should be sufficiently unique to allow correlation of response with request over a reasonable period.

This makes it clear what purpose the correlation ID has, it matters to sender so they can correlate response with request. It also makes it clear the sender is who chooses the correlationID.

Now, for that purpose a Match Request, and subsequent Switch Order, and Switch Order Trigger could all have the same correlationID. Indeed, arguably, a sender could use the same correlation on all Switch Order related messages because the messages all carry a Switch Order Reference, which can be used to tie the response to a specific order. An obvious choice, and we nearly did this, was to use the actual switch order reference as the correlationID.

Also, there is nothing to stop an originator, when generating a reply, from using correlationIDs differently, as they don't expect a response to that reply, and there is no correlation of response with request. Again, an obvious choice for the various switch order messages would be the switch order reference, as this is the one thing missing from a MessageDeliveryFailure message, and would allow that error to tie to a switch order.

TOTSCO Bulletin 66

TOTSCO just released bulletin 66, on handling received (from hub) messages better, notably on response times and validation, but also on handling duplicate requests. They detail a recommendation that the messages are cached for a while, per originating RCPID and source correlationID, and use this to spot a duplicate.

If a sender chose to use the same correlationID for a Match Request and Switch Order, which is definitely sufficiently unique to allow correlation of response with request as per the spec, the recipient would see the Switch Order as a duplicate message and ignore it, maybe resending the Match Confirmation.

If the sender chose to use the SOR on switch order messages or replies, the recipient would see all messages after the first as duplicates, and ignore them.
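To make the clash concrete, here is a minimal sketch of the kind of recipient-side cache the bulletin recommends, keyed on originating RCPID and source correlationID. The class name, the one hour retention and the "RXXX" and "order-1234" values are all assumptions for illustration, not anything from the bulletin or the spec.

```python
import time


class DuplicateCache:
    """Recipient-side cache keyed on (source RCPID, source correlationID)."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self.ttl = ttl_seconds  # assumed retention period, not from the bulletin
        self.clock = clock
        self.seen = {}  # (rcpid, correlation_id) -> expiry time

    def is_duplicate(self, rcpid, correlation_id):
        now = self.clock()
        # Forget entries that have been held longer than the retention period
        self.seen = {k: exp for k, exp in self.seen.items() if exp > now}
        key = (rcpid, correlation_id)
        if key in self.seen:
            return True
        self.seen[key] = now + self.ttl
        return False


cache = DuplicateCache()
# A sender that reuses one correlationID for a Match Request and the
# later Switch Order (hypothetical RCPID and correlationID values)
match_request_dup = cache.is_duplicate("RXXX", "order-1234")
switch_order_dup = cache.is_duplicate("RXXX", "order-1234")
print(match_request_dup, switch_order_dup)
```

The second message is flagged as a duplicate purely because of the reused correlationID, even though it is a different message type, which is exactly the failure described above.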

So now, in effect, based on just a bulletin, the specification mandates that every message sent (request or reply) has a unique correlationID, something not in the spec. In general this is a good idea, but the API spec should have stated that at the start! It now means the source correlation ID matters to the recipient as well, not just the sender. And they have not changed the spec as it is in a change freeze. Oh, and there is no size limit for a correlationID.

The bulletin does not even actually say the sender correlationID has to be unique, it basically assumes it is and explains how recipients can assume it is for spotting duplicate messages!

Once again, a fiasco.

P.S. Our implementation does unique source correlationID already (uses a UUID).

Also, I have updated the NOTSCO test platform to warn of duplicates, and generate a duplicate as well to test CPs' handling of duplicates.

Just to add, the confusion caused by the poor specifications is real. Not just that we were confused by the examples implying a way of working, but I monitor the NOTSCO testing and see other CPs doing similar things, based on the specification, that are going to be problems. I'm just waiting for this new check to kick off and show a CP assuming they can pick source correlationIDs for their own purposes (this did happen later in the day). In fact, looking at logs today (we only keep for a day) I already see duplicated correlationIDs that will break when sent to any CP following TOTSCO Bulletin 66.

This is a bigger issue than you realise!

We originally coded with a way of working with correlationIDs that would fall foul of any CP following bulletin 66. We changed later once TOTSCO confirmed that basically its test cases are wrong.

I am seeing now half of the CPs testing on NOTSCO hitting the duplicate test.

The whole way TOTSCO do testing is two random CPs testing against each other. That would NOT have picked up this at all. So the CPs carry on.

Then, wham, on 12th Sep, some OTS messaging breaks because one of the CPs followed the spec (which has NOT BEEN UPDATED) and one implements the de-duplication in bulletin 66.

The fact TOTSCO do ZERO formal testing against the spec is just a serious problem - that is just irresponsible. I'm amazed OFCOM allow it.

2024-07-21

Bulk ESP32-S3 programming

Programming an ESP32-S3 is really easy.

The S3 has built-in USB, which means literally just connecting GPIO 19 and 20 to D- and D+ on a USB socket - not even any resistors! It operates as a USB device out of the box, appearing as a serial/JTAG port. It just works on standard USB serial drivers on Linux and macOS (and I assume, Windows).

Using the ESP IDF tools I can type:

idf.py flash

And that is it, it detects the chip, and flashes the bootloader and code.

No special leads, it is that simple.

Smaller footprint

The only issue is that this all works if you have the complete ESP IDF installed, with its python and cross compiler environment, and your code checked out and built (or able to build). This is not hard, there are simple steps to do this, but it takes a lot of space.

So, I wanted something simpler so I could make a small machine, ideally a Raspberry Pi, that just flashed code. Thankfully, all I need is esptool, i.e.

pip install esptool

And then I can flash using that rather than the whole IDF. It is more complex, e.g.

esptool.py --chip esp32s3 -p /dev/ttyACM0 -b 460800 --before=default_reset --after=hard_reset write_flash --flash_mode dio --flash_freq 80m --flash_size keep 0x0 release/LED-S3-MINI-N4-R2-bootloader.bin 0x10000 release/LED-S3-MINI-N4-R2.bin 0x8000 release/LED-S3-MINI-N4-R2-partition-table.bin 0xd000 release/LED-S3-MINI-N4-R2-ota_data_initial.bin

But that is simple to script. One tool installed and the binaries from my repository, and job done!
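As a rough sketch of such a script, the following builds the same esptool command line from a board name. The function name and its defaults are made up for illustration; the flags, offsets and file naming are simply copied from the command above.

```python
import shlex


def esptool_args(board, port="/dev/ttyACM0", release_dir="release"):
    """Build the esptool.py argument list for one board (names are illustrative)."""
    # (flash offset, file name suffix) pairs as used in the command above;
    # None means the main application image with no suffix
    images = [
        ("0x0", "bootloader"),
        ("0x10000", None),
        ("0x8000", "partition-table"),
        ("0xd000", "ota_data_initial"),
    ]
    args = [
        "esptool.py", "--chip", "esp32s3", "-p", port, "-b", "460800",
        "--before=default_reset", "--after=hard_reset", "write_flash",
        "--flash_mode", "dio", "--flash_freq", "80m", "--flash_size", "keep",
    ]
    for offset, suffix in images:
        name = board if suffix is None else f"{board}-{suffix}"
        args += [offset, f"{release_dir}/{name}.bin"]
    return args


cmd = esptool_args("LED-S3-MINI-N4-R2")
print(shlex.join(cmd))
# To actually flash, pass the list to subprocess.run(cmd, check=True)
```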

One device after the next

The challenge is that I want to do bulk programming - i.e. flash a device, get clear confirmation it worked, then just plug in the next device. I don't want to run a command each time.

Getting confirmation it works is easy as all my boards have an LED, usually a tiny 1x1mm WS2812 colour LED, and that starts blinking as soon as the board starts. Indeed, the code is signed and checked on boot, so if any issues flashing it won't start.

Indeed, where I have done this I have had three separate instances running and 3 USB ports and leads, so I could plug in one after the other, unplugging when I see it is flashed and running. Really slick!

What I was doing was

idf.py flash monitor

This flashes, and then runs, and monitors serial output (which can be useful if there are additional diagnostics to show, but the main indicator is the on board LED).

The problem is you then have to kill the monitor for each board (ctrl ]). Even just disconnecting USB appears to wait for device to reconnect. I created a convoluted bit of C code to run monitor, and check output, looking for the string it gets for a new device, and exit. That way I could flash, and then run this, in a loop. Works well.

The problem is that, once again, this is using the whole ESP IDF just to run the idf.py command. And it seems esptool does not do a monitor function!

My own monitor code

In principle it is really easy to make my own C code to open the USB (serial) port directly, and set DTR and RTS appropriately to reset the board in running mode (rather than bootloader mode).

This worked perfectly on my Mac. Some simple code, waits for the right string to indicate a new board, and exits. It also does not need the whole ESP IDF to run.
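The "wait for the right string, then exit" part can be sketched as below. This is not the C code described here, just an illustration of the idea in Python: it reads from any binary stream, leaves out the DTR/RTS handling entirely, and the marker b"NEW-BOARD" is a placeholder rather than the real string a board prints.

```python
import io


def wait_for_marker(stream, marker, chunk_size=64):
    """Read a binary stream until marker appears. True if seen, False at end of stream."""
    window = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return False
        window += chunk
        if marker in window:
            return True
        # Keep only enough of the tail to catch a marker split across two reads
        window = window[-(len(marker) - 1):] if len(marker) > 1 else b""


# Stand-in for serial output; a real run would read the opened serial device
demo = io.BytesIO(b"some boot output\r\nNEW-BOARD\r\nmore output")
found = wait_for_marker(demo, b"NEW-BOARD", chunk_size=4)
print(found)
```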

But no!

  • The first issue is that the ESP32, with no code loaded, seemed to trip the power on the USB port. It is odd, and maybe the regulator I am using creates just enough of a power spike, or something (never bothered my Mac), I don't know. The fix was a powered USB hub.
  • The next issue is that once code is loaded, even with a powered USB hub, it seems the start up with WiFi is enough to then trip the power, so it constantly resets and does not blink the LED.
  • I finally found a power hub that just works with linux.

But there is weirder!

The other weirdness was that on the raspberry Pi, it seems it would not play properly with RTS and DTR and constantly came up in bootloader mode regardless. I simply could not get it to play, it was like DTR was not being set. The only difference seems to be it is using an OTG serial driver. On two separate bigger linux boxes, using a different driver, it works as expected (and ends up in a boot loop, as I said above).

I don't know how one can change the serial driver on a Pi, suggestions welcome (google did not help me).

2024-07-20

TOTSCO - the top level - ordering

This should give you some idea of the issues with a simple matter of providing a broadband service. Bear in mind the broadband service may have a linked telephone service - i.e. be ADSL or VDSL on a phone line, and the customer may, or may not, want that number to carry on working somehow.

It used to be we could take over the broadband and leave the telephone alone, or, we could take over number and broadband as a BT line, or we could take over broadband and port the number to VoIP.

It is more complicated with the retirement of old fashioned phone service - we cannot move the line to broadband with us on a telephone line any more, we have to move to something called SOGEA or SOADSL, which is a broadband service with no telephone service on the line. So we have to offer the customer the choice to lose the number or move it to VoIP.

So let's look at some of the combinations we have to handle, and do One Touch Switching for...

  1. It could be a service that is totally different, like Starlink or something - we provide new broadband and OTS co-ordinates the cease. Simple.
  2. More likely, BT/Openreach broadband and BT/Openreach phone service using a BT number range number. Yes, that specific set (regardless of resellers, which may not be the same for broadband and telephone) is special as we can do an integrated port moving broadband and porting phone as one order in to BT. As you can imagine working out it is this exact combination can be tricky, and end user may not know.
  3. Could be BT/Openreach broadband, and a BT/Openreach phone line, but not a BT number range number, in which case we migrate the broadband and port the number separately as we cannot do an integrated port.
  4. Could be BT/Openreach broadband, and MPF phone line, in which case harder to check, and we can port the number separately as we cannot do an integrated port.
  5. Could be BT/Openreach FTTP with an associated phone number which may even be VoIP, but is linked at the BT account so would die if migrating broadband. I think that has to be a separate number port, but not sure - it may allow an integrated port if a BT number range. We'll have to test that one to be sure.
  6. Could be BT/Openreach broadband and BT/Openreach phone service, but the new service is FTTP, so a separate physical service. This can be coordinated to allow old broadband to be ceased but leave phone line in place, at least for now.
  7. Could be BT/Openreach with no phone number associated, yay! simple migrate.
  8. Could be CityFibre which won't have a phone number, yay! simple migrate.

For the OTS, somehow we have to explain the options so they can make an informed choice!

Porting the number adds an extra step too, now.

  1. The OTS match for broadband using number to identify it may (or may not) come back with an option to retain/cease, or we could do the OTS with IAS and NBICS "port" request, making one "switching order" for broadband and number port, if that is offered as an option.
  2. The OTS match may or may not mention a number linked to the line, depends if the reseller of the broadband knows if there is a number and what it is - the number could be a totally different reseller. But we may be able to work out the service has a BT/Openreach number based on the broadband checking in BT. If the customer knows the number we may be able to do an integrated port on the broadband. It is not impossible for neither the old broadband retailer, nor us, to know there is a number, and then that number gets zapped - so we have to ask the customer if they are sure, regardless.
  3. If the broadband OTS does not have a number port on the same switch order, we'll have to do a secondary OTS for the number port, possibly with different retailer, for the same address. Then we have to manage and track two switch orders. We probably need to do that even for the integrated port option.
  4. Either the broadband, or the number, or both may not be able to do an OTS check if the service is a business, or the retail provider is not on TOTSCO yet, so we have to handle that.

At the end of the day, this is a couple of extra pages of stuff to fill in on our order forms for customers now! It also adds new ways for things to go wrong.

The very small light at the end of the tunnel is the telephone number porting OTS should advise the Network Provider and the CUPID which should allow the port to go smoothly. We're looking forward to testing that!

2024-07-19

TOTSCO journey (to help other CPs)

I have done many blogs, so this is a summary, mainly aimed at other CPs. I hope it helps.

Gotchas

I am going to try and cover some gotchas, but a lot can be avoided by checking out ProposedChanges.md

Get on board

If you are a communications provider doing residential, fixed location, broadband or telephone service you have to do this shit, sorry. OFCOM rules, and so law. Get to totsco.org.uk and sign up. There is a deadline of 12th September 2024 for all retail CPs covered by this to be ready.

You may, instead, be able to use a managed access provider, but that sounds iffy to me as they need access to all your data, and you probably still have to do most of the work to integrate with your systems I expect. To be quite honest, it is not that hard.

How long?

I spent a week on this initially - much was cleaning up our address records, and reading specs, before coding it. I made a test system as part of this, but nothing like NOTSCO. I have done various small bits since, and spent several days integrating to our ordering, but mostly it has been waiting for TOTSCO. I spent another week making the NOTSCO platform from scratch, and wish I had done that first. Total elapsed time a little over 5 weeks. I'll no doubt do more once we are live and we encounter some unexpected edge cases.

The specifications

There are a load of specifications, and a few you need to read are:

  • API specification
  • OTS message specification
  • Industry Process
  • Error codes list
  • Network Operators list
  • Example messages

But there is a lot, and not very well structured, or consistent. I suggest also checking ProposedChanges.md

API to TOTSCO

The basic steps in talking to/from TOTSCO are not hard. They have a choice of authentication and we chose OAUTH2. After the basic OAUTH2 you are basically sending JSON payloads each way to/from the hub, and the format is simple with an envelope and payload. There are loads of tools and libraries to handle such APIs.

Watch out for my correlationID section in ProposedChanges.md as this caused me to waste a day, at least, as I assumed the examples implied a longer term correlationID. You need one on every message you send, and it may as well be unique to each message, so you can correlate a reply. You need to correlate the reply for a match request anyway, and should for other messages as they can come back as messageDeliveryFailure, so can't just use the switch order reference to handle the reply.
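A minimal sketch of that approach: a fresh UUID per outgoing message, with a lookup so that any reply (including a messageDeliveryFailure) can be tied back to the original message and switch order. The class, the message type strings and the "SOR-1" reference are hypothetical, and it assumes the reply carries your source correlationID back as its destination correlationID.

```python
import uuid


class PendingMessages:
    """Sender-side record of what each outgoing correlationID was for."""

    def __init__(self):
        # correlationID -> (message type, switch order reference or None)
        self.pending = {}

    def new_source(self, message_type, switch_order_ref=None):
        # A new UUID for every message sent, request or reply
        correlation_id = str(uuid.uuid4())
        self.pending[correlation_id] = (message_type, switch_order_ref)
        return {"correlationID": correlation_id}

    def resolve(self, destination_correlation_id):
        # Called when a reply or failure arrives; None means we do not know it
        return self.pending.pop(destination_correlation_id, None)


tracker = PendingMessages()
match_source = tracker.new_source("matchRequest")
order_source = tracker.new_source("switchOrder", "SOR-1")
print(tracker.resolve(order_source["correlationID"]))
```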

OTS messages

The next level is the OTS switching messages. These fall in two parts. The match request and reply, and then the set of switch order starting, updating, cancelling, and triggering.

You may want to make a system for sending and receiving these, and then integrating that in to your ordering and customer management systems. I chose to make a library with some good command line options for testing and then a higher level integration of that library in our ordering system.

Making a match request

In principle this is actually quite easy, there are not many options or fields. The only hard part is correctly forming an address, and if you can, using UPRN. For IAS that can be it. If you don't get a match you can ask customer for account number and circuit ID and add that. For number porting you have a telephone number, which is the crucial component. We use BT/Openreach address tools which make this relatively easy, and provide a UPRN, but O/S have an AddressBase product which is good as well.

Matching an address, and sending a response

This is slightly harder, the main gotchas are address matching - the industry process has a huge section on that, and matching surname, on which it says much less. Removing accents and changing ß to ss, are the main things for a surname. Even so, matching is probably not too hard, and you can always expect an account number to enhance the match if not sure. You do need to automate this, as the SLA is 60 seconds. You do need to contact the customer (e.g. email) if you match, including early termination fees.
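For the surname part, a minimal sketch of the accent removal and ß handling might look like this. The function names are made up, the case folding and trimming are my additions rather than anything stated here, and the industry process may well require more than this.

```python
import unicodedata


def normalise_surname(name):
    """Reduce a surname to a comparable form (illustrative rules only)."""
    # Replace sharp s explicitly before anything else
    name = name.replace("ß", "ss").replace("ẞ", "ss")
    # Decompose accented letters, then drop the combining accent marks
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Case folding and trimming are assumptions, not from the spec
    return stripped.casefold().strip()


def surnames_match(a, b):
    return normalise_surname(a) == normalise_surname(b)


print(normalise_surname("Müller"), normalise_surname("Groß"))
```

Note that letters such as ø have no accent to strip under this decomposition, so they pass through unchanged.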

Handling a match response

In many ways this is the hard part as you can get a lot of different answers even for a simple match request - maybe numbers are in a block and some will be ceased if you port one. Maybe there is a number tied to a broadband line - which is a whole new can of worms as losing provider may have a retain option but you know you are changing to SOGEA which kills the number. Ultimately you need to then pick one, or give your customer a choice, somehow explaining the actual meaning of these options. That can be tricky to do in a clear way.

Bear in mind you may have to continue without using One Touch Switching, as before, as the customer may not want to switch their existing service, or they may have a business service.

Getting a switch order

This is actually really easy, you get told the switch order (which you saved from the match), and you just need to put in your customer database and update the customer (e.g. email). Any migration (e.g. broadband or number port) happens as normal anyway. The switch order messages allow you to record the handover date (plannedDate and activationDate). If not a migrate then that is the same as a customer ceasing by other means. You do need to work out early termination charges, but probably have the processes for that, and billing, all in place from the notice of termination system.

Generating a switch order

Again this is pretty easy. We opted to send the order as part of customer order process, and then any other messages on a daily job - advising if a change of date, cancelled, or completed. The actual ordering systems are the same as they were before, doing a migrate or a new install in the same way as normal, but just these extra outgoing messages at the key stages.

Simulator

TOTSCO run a simulator which is crap, but may be worth getting out of the way early. They will allow you to short cut that if you manage a message both ways showing connectivity. We could not go further because their fixed messages were not valid, so we had no switch order to start, so skipped all but one message.

Pre-production testing

The next step is a slight fiasco - you need to team up with a buddy CP who is on the pre-production platform. We did this with two CPs, one of which turned out to be just starting, and it was like watching paint dry. The other was all ready, had a very slick system, and we got through all tests in 90 minutes - most of which was each of us working around our normal systems to fool them that an order had started and finished and so on. But getting TOTSCO to find a buddy CP can take ages, and the whole thing is basically pot luck, there is no checking against a reference system or to the specification.

Oh, and for extra fiasco, we were expected to do 1000 messages. This is not like testing 1000 different addresses or something, just that we did 1000 messages. A command starting "repeat 1000 ..." was used, and took a few minutes.

Production testing

If pre-production testing was a fiasco, production testing is a joke!

They booked a two hour slot for this. It requires one message. They actually insisted on one exchange of messages each way (4 messages in total).

It took 15 minutes, and the reason it took so long was they were seemingly hand crafting the messages they sent, so each message they sent took some minutes to prepare. This is beyond stupid, if you ask me. They even managed to hand craft an invalid message which meant waiting for them to eventually send a correct one.

But the criteria is just that - a pair of message exchanges, and you are live on production. Once again, no test against a reference system, and no testing to the specification.

Want to do it right?

I got so pissed off I created NOTSCO, which provides an unofficial, independent, free, test environment for One Touch Switching development. It allows you to try a lot of messages, each way, with a lot of different combinations - particularly important for handling the myriad of possible switch match responses.

What is really good is that it analyses each message, and reports any issues, with reference to the specification. It allows you a playground to work through development, testing as you go, from the most basic connectivity, to completing the whole sequence of switch match and order, each way.

It makes a scorecard of message types, so you can see you have good coverage, and allows a range of fixed nasty messages to test your error handling and edge cases.

Not over yet

Of course the real fun will start when we start doing live switches with other CPs.

Good luck!

2024-07-12

TOTSCO Integration testing done

Well, we managed to complete tests with one CP yesterday, with us giving them a lot of hand holding.

But separately, today, I had a 90 minute teams call, with TOTSCO, and another CP, and, well, that was it!

We each had a challenge of forcing our systems to create scenarios - faking installations happening when they had not, and so on. And then forcing errors when errors should not happen. That is why it was 90 minutes.

I learned a few things - for example I had one system sending a datetime not a date, in one specific case we had not tested before, which meant they rejected the message (quite correctly), well done. They found issues as well, all minor, and sorted on the call.
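That date versus datetime slip is the sort of thing a small check before sending can catch. A minimal sketch, assuming the date fields are plain YYYY-MM-DD strings (check the OTS message specification for the actual format); the function name is made up.

```python
import re
from datetime import datetime

# Exactly four digits, two digits, two digits; nothing before or after
DATE_ONLY = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_plain_date(value):
    """True only for a valid calendar date with no time part (assumed format)."""
    if not isinstance(value, str) or not DATE_ONLY.fullmatch(value):
        return False
    try:
        # Rejects impossible dates such as 30th February
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


print(is_plain_date("2024-09-12"), is_plain_date("2024-09-12T10:30:00"))
```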

It was nice working with people who had basically got a fully working system, and it was just slight tweaks and edge cases - polishing things a bit - which they were able to sort on the call. All very professional. Thanks guys.

But that was it, a 90 minute call, and TOTSCO confirmed all tests done each way and all passed integration testing.

Next step, production testing, which is ONE MESSAGE... Oooh, I'm scared!
