RevK®'s ramblings: 2024

2024-07-24

TOTSCO 66 is guidance, optional

I feel I need to explain this.

The TOTSCO call today, first I have been on, and wow!

But a key point was TOTSCO bulletin 66, which is actually quite sensible guidance.

So what is the problem? It is guidance, not mandatory. CPs don't even have to follow it.

So let me try to explain.

If ANY CP follows that guidance then ALL CPs have to change how they create a source correlationID to be totally unique.

The API specification does not require that, so it is a real change.

If some other CP does not do that, the recipient CP, following the guidance, may assume a duplicate message and discard it.

This is non trivial.

2024-07-23

OFCOM disinterest in OTS?

OFCOM sent a mildly threatening letter about One Touch Switching and the impending deadline.

I replied by email, twice, nothing. So I wrote, and nothing.

So now this.

We'll see if they reply.

Update: OFCOM want a call, yay!

Update: Useful call, OFCOM listening.

OK, being constructive

I am pondering what could be done right now. So some thoughts...

Firstly - this is not simply pedantry, or my getting pissed that I misread the spec - this is not hypothetical. Working with other CPs, and monitoring testing on my NOTSCO test platform, yesterday, half the CPs testing were falling foul of the latest checks on source correlationID added because of TOTSCO bulletin 66. I see other CPs are running in to all the issues I have raised with the specifications. Most of the errors my test platform picks up would not be picked up by the existing TOTSCO testing process.
I feel some people with some clue how to write a clear specification, and who understand the challenges of coding systems to meet such a specification, need to be engaged with TOTSCO and taken seriously. I can help (though probably not for free - though I have made many suggestions anyway).
I feel the specifications need to be consolidated and simplified and put in one place - there are too many parts, in different places, some freely available, some under a login on the control pages, some XLS, some a web page, some PDF, and so on, it is a total mess. Clear and complete set of specifications in one place.
TOTSCO need the specifications updated, and kept updated, and a process to notify updates to all CPs involved, so they can ensure compliance. This means proper change controlled notices of what has changed, not a random bulletin that assumes/implies a serious change to a spec that is in a change freeze! Even if this was a weekly spec update with all changes sent to all CPs.
I feel going for 12th Sep is fine - we have to start somewhere, but for the start of full OTS usage, and not a requirement for all CPs to be on line, simply because I am not sure there is time for that. But from that date, all CPs that are live on TOTSCO should offer it as part of their ordering process, related to other CPs that are on TOTSCO.
Some later deadline for all CPs, maybe an even later one for small CPs.
I definitely think a self service test platform is needed for API and OTS with all sorts of scenarios (valid and error testing) and messages both ways, needs to be in place, and a key part of compliance testing. I have one, and I am happy to work with TOTSCO if they want to use it. But it literally took only a couple days to make, so TOTSCO could make one themselves. Testing should be to a reference implementation and against the specification. In practice making this a CP on pre-production (and even live), called TEST, with a control page on TOTSCO to manage tests and replies and logs, would be ideal.
We also seem to lack a way to contact other CPs when live to address issues - and a way for TOTSCO to arbitrate that one CP claims another CP does not meet the spec. A clear spec is needed, but a whole inter CP dispute process needs to be in place - and a reference test system would be invaluable for that.

2024-07-22

TOTSCO moving goal posts, again!

One of the big issues I had in initial coding was the use of correlationID on messages. The test cases showed it being used the same on a sequence of messages, e.g. a Switch Order had a destination correlation which only made sense if it was a response to a Match Confirmation, for example. I was wrong, but not for lack of reading the spec.

The API spec says this: In a source element, the correlationID must always be provided, the format can be anything the originator chooses to support their messaging process but should be sufficiently unique to allow correlation of response with request over a reasonable period.

This makes it clear what purpose the correlation ID has, it matters to sender so they can correlate response with request. It also makes it clear the sender is who chooses the correlationID.

Now, for that purpose a Match Request, and subsequent Switch Order, and Switch Order Trigger could all have the same correlationID. Indeed, arguably, a sender could use the same correlation on all Switch Order related messages because the messages all carry a Switch Order Reference, which can be used to tie the response to a specific order. An obvious choice, and we nearly did this, was to use the actual switch order reference as the correlationID.

Also, there is nothing to stop an originator, when generating a reply, from using correlationIDs differently, as they don't expect a response to that reply, and there is no correlation of response with request. Again, an obvious choice for the various switch order messages would be the switch order reference, as this is the one thing missing from a MessageDeliveryFailure message, and would allow that error to tie to a switch order.

TOTSCO Bulletin 66

TOTSCO just released bulletin 66, on handling received (from hub) messages better, notably on response times and validation, but also on handling duplicate requests. They detail a recommendation that the messages are cached for a while, per originating RCPID and source correlationID, and use this to spot a duplicate.

If a sender chose to use the same correlationID for a Match Request and Switch Order, which is definitely sufficiently unique to allow correlation of response with request as per the spec, the recipient would see the Switch Order as a duplicate message and ignore it, maybe resending the Match Confirmation.

If the sender chose to use the SOR on switch order messages or replies, the recipient would see all messages after the first as duplicates, and ignore them.
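
To make that failure concrete, here is a minimal Python sketch of the kind of cache the bulletin recommends, keyed on originating RCPID and source correlationID. The class name, the one hour retention, the RCPID "ABCD" and the correlationID values are all assumptions for illustration - none of them come from the bulletin, the spec, or any real CP's code.

```python
import time


class DuplicateCache:
    """Remember (RCPID, correlationID) pairs for a limited time."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self.ttl = ttl_seconds   # hypothetical retention period
        self.clock = clock       # injectable so the expiry can be tested
        self.seen = {}           # (rcpid, correlation_id) -> time first seen

    def is_duplicate(self, rcpid, correlation_id):
        now = self.clock()
        # Forget entries older than the retention period.
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
        key = (rcpid, correlation_id)
        if key in self.seen:
            return True
        self.seen[key] = now
        return False


cache = DuplicateCache()
# A sender that reuses one correlationID across a whole switch order:
match_request = cache.is_duplicate("ABCD", "order-123")  # first message, accepted
switch_order = cache.is_duplicate("ABCD", "order-123")   # different message, same ID
print(match_request, switch_order)
```

The second call reports a duplicate even though it stands for a completely different message, which is exactly how a Switch Order would get discarded by a recipient following the bulletin.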

So now, in effect, based on just a bulletin, the specification mandates that every message sent (request or reply) has a unique correlationID, something not in the spec. In general this is a good idea, but the API spec should have stated that at the start! It now means the source correlation ID matters to the recipient as well, not just the sender. And they have not changed the spec as it is in a change freeze. Oh, and there is no size limit for a correlationID.

The bulletin does not even actually say the sender correlationID has to be unique, it basically assumes it is and explains how recipients can assume it is for spotting duplicate messages!

Once again, a fiasco.

P.S. Our implementation does unique source correlationID already (uses a UUID).

Also, I have updated the NOTSCO test platform to warn of duplicates, and generate a duplicate as well to test CPs handling of duplicates.

Just to add, the confusion caused by the poor specifications is real. Not just that we were confused by the examples implying a way of working, but I monitor the NOTSCO testing and see other CPs doing similar things, based on the specification, that are going to be problems. I'm just waiting for this new check to kick off and show a CP assuming they can pick source correlationIDs for their own purposes (this did happen later in the day). In fact, looking at logs today (we only keep for a day) I already see duplicated correlationIDs that will break when sent to any CP following TOTSCO Bulletin 66.

This is a bigger issue than you realise!

We originally coded with a way of working with correlationIDs that would fall foul of any CP following bulletin 66. We changed later once TOTSCO confirmed that basically its test cases are wrong.

I am seeing now half of the CPs testing on NOTSCO hitting the duplicate test.

The whole way TOTSCO do testing is two random CPs testing against each other. That would NOT have picked up this at all. So the CPs carry on.

Then, wham, on 12th Sep, some OTS messaging breaks because one of the CPs followed the spec (which has NOT BEEN UPDATED) and one implements the de-duplication in bulletin 66.

The fact TOTSCO do ZERO formal testing against the spec is just a serious problem - that is just irresponsible. I'm amazed OFCOM allow it.

2024-07-21

Bulk ESP32-S3 programming

Programming an ESP32-S3 is really easy.

The S3 has built in USB, which means literally just connecting GPIO 19 and 20 to D- and D+ on a USB socket - not even any resistors! It operates as a USB device out of the box, appearing as a serial/JTAG port. It just works on standard USB serial drivers on linux and MacOS (and I assume, Windows).

Using the ESP IDF tools I can type.

idf.py flash

And that is it, it detects the chip, and flashes the bootloader and code.

No special leads, it is that simple.

Smaller footprint

The only issue is that this all works if you have the complete ESP IDF installed, with its python and cross compiler environment, and your code checked out and built (or able to build). This is not hard, there are simple steps to do this, but it takes a lot of space.

So, I wanted something simpler so I could make a small machine, ideally a Raspberry Pi, that just flashed code. Thankfully, all I need is esptool, i.e.

pip install esptool

And then I can flash using that rather than the whole IDF. It is more complex, e.g.

esptool.py --chip esp32s3 -p /dev/ttyACM0 -b 460800 --before=default_reset --after=hard_reset write_flash --flash_mode dio --flash_freq 80m --flash_size keep 0x0 release/LED-S3-MINI-N4-R2-bootloader.bin 0x10000 release/LED-S3-MINI-N4-R2.bin 0x8000 release/LED-S3-MINI-N4-R2-partition-table.bin 0xd000 release/LED-S3-MINI-N4-R2-ota_data_initial.bin

But that is simple to script. One tool installed and the binaries from my repository, and job done!
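
As a rough idea of what such a script could look like, here is a minimal Python sketch that builds the same esptool.py command from a build name and runs it. The function names and the `directory` parameter are made up for illustration; the offsets, file suffixes and options are simply copied from the command above.

```python
import subprocess

# Flash offsets and image file suffixes, as in the command above.
IMAGES = [
    ("0x0", "-bootloader.bin"),
    ("0x10000", ".bin"),
    ("0x8000", "-partition-table.bin"),
    ("0xd000", "-ota_data_initial.bin"),
]


def esptool_command(build, port="/dev/ttyACM0", directory="release"):
    """Return the esptool.py argument list for one build (hypothetical helper)."""
    cmd = [
        "esptool.py", "--chip", "esp32s3", "-p", port, "-b", "460800",
        "--before=default_reset", "--after=hard_reset", "write_flash",
        "--flash_mode", "dio", "--flash_freq", "80m", "--flash_size", "keep",
    ]
    for offset, suffix in IMAGES:
        cmd += [offset, f"{directory}/{build}{suffix}"]
    return cmd


def flash(build, port="/dev/ttyACM0"):
    """Run esptool.py; raises CalledProcessError if it exits with an error."""
    subprocess.run(esptool_command(build, port), check=True)
```

Calling `flash("LED-S3-MINI-N4-R2")` would then run the command shown above, assuming esptool.py is on the path and the binaries are in `release/`.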

One device after the next

The challenge is that I want to do bulk programming - i.e. flash a device, get clear confirmation it worked, then just plug in the next device. I don't want to run a command each time.

Getting confirmation it works is easy as all my boards have an LED, usually a tiny 1x1mm WS2812 colour LED, and that starts blinking as soon as the board starts. Indeed, the code is signed and checked on boot, so if any issues flashing it won't start.

Indeed, where I have done this I have had three separate instances running and 3 USB ports and leads, so I could plug in one after the other, unplugging when I see it is flashed and running. Really slick!

What I was doing was

idf.py flash monitor

This flashes, and then runs, and monitors serial output (which can be useful if there are additional diagnostics to show, but the main indicator is the on board LED).

The problem is you then have to kill the monitor for each board (ctrl ]). Even just disconnecting USB appears to wait for device to reconnect. I created a convoluted bit of C code to run monitor, and check output, looking for the string it gets for a new device, and exit. That way I could flash, and then run this, in a loop. Works well.

The problem is that, once again, this is using the whole ESP IDF just to run the idf.py command. And it seems esptool does not do a monitor function!

My own monitor code

In principle it is really easy to make my own C code to open the USB (serial) port directly, and set DTR and RTS appropriately to reset the board in running mode (rather than bootloader mode).

This worked perfectly on my Mac. Some simple code, waits for the right string to indicate a new board, and exits. It also does not need the whole ESP IDF to run.
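
The "wait for a string and exit" part is the simple bit. The actual code here is C and is not shown, so the following is only a Python sketch of the same idea, with a hypothetical marker string and fake serial output standing in for a real board. Opening the port and the DTR/RTS handling are deliberately left out.

```python
import io


def wait_for_marker(stream, marker):
    """Read lines from a binary stream until one contains marker.

    Returns True as soon as the marker is seen, or False if the
    stream ends first (e.g. the device was unplugged).
    """
    for line in stream:
        if marker in line:
            return True
    return False


# Fake serial output standing in for a real port; "NEW-BOARD" is a
# made-up marker, not the string the real code looks for.
fake_port = io.BytesIO(b"booting\nNEW-BOARD detected\nmore output\n")
print(wait_for_marker(fake_port, b"NEW-BOARD"))
```

With a real device the stream would be the opened serial port rather than a `BytesIO`, and the function returning is the cue to flash the next board.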

But no!

The first issue is that the ESP32, with no code loaded, seemed to trip the power on the USB port. It is odd, and maybe the regulator I am using creates just enough of a power spike, or something (never bothered my Mac), I don't know. The fix was a powered USB hub.
The next issue is that once code is loaded, even with a powered USB hub, it seems the start up with WiFi is enough to then trip the power, so it constantly resets and does not blink the LED.
I finally found a power hub that just works with linux.

But there is weirder!

The other weirdness was that on the raspberry Pi, it seems it would not play properly with RTS and DTR and constantly came up in bootloader mode regardless. I simply could not get it to play, it was like DTR was not being set. The only difference seems to be it is using an OTG serial driver. On two separate bigger linux boxes, using a different driver, it works as expected (and ends up in a boot loop, as I said above).

I don't know how one can change the serial driver on a Pi, suggestions welcome (google did not help me).

2024-07-20

TOTSCO - the top level - ordering

This should give you some idea of the issues with a simple matter of providing a broadband service. Bear in mind the broadband service may have a linked telephone service - i.e. be ADSL or VDSL on a phone line, and the customer may, or may not, want that number to carry on working somehow.

It used to be we could take over the broadband and leave the telephone alone, or, we could take over number and broadband as a BT line, or we could take over broadband and port the number to VoIP.

It is more complicated with the retirement of old fashioned phone service - we cannot move the line to broadband with us on a telephone line any more, we have to move to something called SOGEA or SOADSL, which is a broadband service with no telephone service on the line. So we have to offer customer choice to lose number to move to VoIP.

So let's look at some of the combinations we have to handle, and do One Touch Switching for...

It could be a service that is totally different, like Starlink or something - we provide new broadband and OTS co-ordinates the cease. Simple.
More likely, BT/Openreach broadband and BT/Openreach phone service using a BT number range number. Yes, that specific set (regardless of resellers, which may not be the same for broadband and telephone) is special as we can do an integrated port moving broadband and porting phone as one order in to BT. As you can imagine working out it is this exact combination can be tricky, and end user may not know.
Could be BT/Openreach broadband, and a BT/Openreach phone line, but not a BT number range number, in which case we migrate the broadband and port the number separately as we cannot do an integrated port.
Could be BT/Openreach broadband, and MPF phone line, in which case harder to check, and we can port the number separately as we cannot do an integrated port.
Could be BT/Openreach FTTP with an associated phone number which may even be VoIP, but is linked at the BT account so would die if migrating broadband. I think that has to be a separate number port, but not sure - it may allow an integrated port if a BT number range. We'll have to test that one to be sure.
Could be BT/Openreach broadband and BT/Openreach phone service, but the new service is FTTP, so a separate physical service. This can be coordinated to allow old broadband to be ceased but leave phone line in place, at least for now.
Could be BT/Openreach with no phone number associated, yay! simple migrate.
Could be CityFibre which won't have a phone number, yay! simple migrate.

For the OTS, somehow we have to explain the options so they can make an informed choice!

Porting the number adds an extra step too, now.

The OTS match for broadband using number to identify it may (or may not) come back with an option to retain/cease, or we could do the OTS with IAS and NBICS "port" request, making one "switching order" for broadband and number port, if that is offered as an option.
The OTS match may or may not mention a number linked to the line, depends if the reseller of the broadband knows if there is a number and what it is - the number could be a totally different reseller. But we may be able to work out the service has a BT/Openreach number based on the broadband checking in BT. If the customer knows the number we may be able to do an integrated port on the broadband. It is not impossible for neither the old broadband retailer, nor us, to know there is a number, and then that number gets zapped - so we have to ask the customer if they are sure, regardless.
If the broadband OTS does not have a number port on the same switch order, we'll have to do a secondary OTS for the number port, possibly with different retailer, for the same address. Then we have to manage and track two switch orders. We probably need to do that even for the integrated port option.
Either the broadband, or the number, or both may not be able to do an OTS check if the service is a business, or the retail provider is not on TOTSCO yet, so we have to handle that.

At the end of the day, this is a couple of extra pages of stuff to fill in on our order forms for customers now! It also adds new ways for things to go wrong.

The very small light at the end of the tunnel is the telephone number porting OTS should advise the Network Provider and the CUPID which should allow the port to go smoothly. We're looking forward to testing that!

2024-07-19

TOTSCO journey (to help other CPs)

I have done many blogs, so this is a summary, mainly aimed at other CPs. I hope it helps.

Gotchas

I am going to try and cover some gotchas, but a lot can be avoided by checking out ProposedChanges.md

Get on board

If you are a communications provider doing residential, fixed location, broadband or telephone service you have to do this shit, sorry. OFCOM rules, and so law. Get to totsco.org.uk and sign up. There is a deadline of 12th September 2024 for all retail CPs covered by this to be ready.

You may, instead, be able to use a managed access provider, but that sounds iffy to me as they need access to all your data, and you probably still have to do most of the work to integrate with your systems I expect. To be quite honest, it is not that hard.

How long?

I spent a week on this initially - much was cleaning up our address records, and reading specs, before coding it. I made a test system as part of this, but nothing like NOTSCO. I have done various small bits since, and spent several days integrating to our ordering, but mostly it has been waiting for TOTSCO. I spent another week making the NOTSCO platform from scratch, and wish I had done that first. Total elapsed time a little over 5 weeks. I'll no doubt do more once we are live and we encounter some unexpected edge cases.

The specifications

There are a load of specifications, and a few you need to read are:

API specification
OTS message specification
Industry Process
Error codes list
Network Operators list
Example messages

But there is a lot, and not very well structured, or consistent. I suggest also checking ProposedChanges.md

API to TOTSCO

The basic steps in talking to/from TOTSCO are not hard. They have a choice of authentication and we chose OAUTH2. After the basic OAUTH2 you are basically sending JSON payloads each way to/from the hub, and the format is simple with an envelope and payload. There are loads of tools and libraries to handle such APIs.

Watch out for my correlationID section in ProposedChanges.md as this caused me to waste a day, at least, as I assumed the examples implied a longer term correlationID. You need one on every message you send, and it may as well be unique to each message, so you can correlate a reply. You need to correlate the reply for a match request anyway, and should for other messages as they can come back as messageDeliveryFailure, so can't just use the switch order reference to handle the reply.
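
One way to picture this is a table of outstanding messages keyed by a fresh UUID per message. This is only a minimal Python sketch - the class, method names and the context dictionary are hypothetical, and it assumes the reply (or messageDeliveryFailure) echoes your source correlationID back to you in some field.

```python
import uuid


class PendingRequests:
    """Track sent messages by a unique correlationID (illustrative only)."""

    def __init__(self):
        self._pending = {}

    def register(self, context):
        """Allocate a new unique correlationID and remember what it was for."""
        correlation_id = str(uuid.uuid4())
        self._pending[correlation_id] = context
        return correlation_id

    def resolve(self, correlation_id):
        """Return the saved context for a reply, or None if it is not ours."""
        return self._pending.pop(correlation_id, None)


pending = PendingRequests()
# The context is whatever you need to tie a reply back to your own records.
cid = pending.register({"message": "match request", "order": "hypothetical-42"})
context = pending.resolve(cid)          # a reply or delivery failure arrives
unknown = pending.resolve("not-ours")   # something we never sent
print(context, unknown)
```

Because every message gets its own ID, a delivery failure can be tied to the exact message that failed, and the same IDs also happen to satisfy any recipient doing duplicate detection.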

OTS messages

The next level is the OTS switching messages. These fall in two parts. The match request and reply, and then the set of switch order starting, updating, cancelling, and triggering.

You may want to make a system for sending and receiving these, and then integrating that in to your ordering and customer management systems. I chose to make a library with some good command line options for testing and then a higher level integration of that library in our ordering system.

Making a match request

In principle this is actually quite easy, there are not many options or fields. The only hard part is correctly forming an address, and if you can, using UPRN. For IAS that can be it. If you don't get a match you can ask customer for account number and circuit ID and add that. For number porting you have a telephone number, which is the crucial component. We use BT/Openreach address tools which make this relatively easy, and provide a UPRN, but O/S have an AddressBase product which is good as well.

Matching an address, and sending a response

This is slightly harder, the main gotchas are address matching - the industry process has a huge section on that, and matching surname, on which it says much less. Removing accents and changing ß to ss, are the main things for a surname. Even so, matching is probably not too hard, and you can always expect an account number to enhance the match if not sure. You do need to automate this, as the SLA is 60 seconds. You do need to contact the customer (e.g. email) if you match, including early termination fees.
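
The surname part can be sketched in a few lines of Python. This is a minimal illustration of removing accents and changing ß to ss, not the industry process rules; the function name is made up, and ignoring case and surrounding spaces is an extra assumption.

```python
import unicodedata


def normalise_surname(name):
    """Reduce a surname to a comparable form (illustrative sketch)."""
    # Change sharp s to ss before anything else.
    name = name.replace("ß", "ss").replace("ẞ", "SS")
    # Decompose accented letters, then drop the combining accent marks.
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Assumption: compare without regard to case or surrounding spaces.
    return stripped.casefold().strip()


print(normalise_surname("Müller") == normalise_surname("MULLER"))
print(normalise_surname("Strauß") == normalise_surname("Strauss"))
```

Letters that are not a base letter plus an accent, such as ø, are not changed by this, so they would need their own handling if you want them to match.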

Handling a match response

In many ways this is the hard part as you can get a lot of different answers even for a simple match request - maybe numbers are in a block and some will be ceased if you port one. Maybe there is a number tied to a broadband line - which is a whole new can of worms as losing provider may have a retain option but you know you are changing to SOGEA which kills the number. Ultimately you need to then pick one, or give your customer a choice, somehow explaining the actual meaning of these options. That can be tricky to do in a clear way.

Bear in mind you may have to continue without using One Touch Switching, as before, as the customer may not want to switch their existing service, or they may have a business service.

Getting a switch order

This is actually really easy, you get told the switch order (which you saved from the match), and you just need to put it in your customer database and update the customer (e.g. email). Any migration (e.g. broadband or number port) happens as normal anyway. The switch order messages allow you to record the handover date (plannedDate and activationDate). If not a migrate then that is the same as a customer ceasing by other means. You do need to work out early termination charges, but probably have the processes for that, and billing, all in place from the notice of termination system.

Generating a switch order

Again this is pretty easy. We opted to send the order as part of customer order process, and then any other messages on a daily job - advising if a change of date, cancelled, or completed. The actual ordering system are the same as they were before, doing a migrate or a new install in the same way as normal, but just these extra outgoing messages at the key stages.

Simulator

TOTSCO run a simulator which is crap, but may be worth getting out of the way early. They will allow you to short cut that if you manage a message both ways showing connectivity. We could not go further because their fixed messages were not valid, so we had no switch order to start, so skipped all but one message.

Pre-production testing

The next step is a slight fiasco - you need to team up with a buddy CP who is on the pre-production platform. We did this with two CPs, one of which turned out to be just starting, and it was like watching paint dry. The other was all ready, had a very slick system, and we got through all tests in 90 minutes - most of which was each of us working around our normal systems to fool them that an order had started and finished and so on. But getting TOTSCO to find a buddy CP can take ages, and the whole thing is basically pot luck, there is no checking against a reference system or to the specification.

Oh, and for extra fiasco, we were expected to do 1000 messages. This is not like testing 1000 different addresses or something, just that we did 1000 messages. A command starting "repeat 1000 ..." was used, and took a few minutes.

Production testing

If pre-production testing was a fiasco, production testing is a joke!

They booked a two hour slot for this. It requires one message. They actually insisted on one exchange of messages each way (4 messages in total).

It took 15 minutes, and the reason it took so long was they were seemingly hand crafting the messages they sent, so each message they sent took some minutes to prepare. This is beyond stupid, if you ask me. They even managed to hand craft an invalid message which meant waiting for them to eventually send a correct one.

But the criteria is just that - a pair of message exchanges, and you are live on production. Once again, no test against a reference system, and no testing to the specification.

Want to do it right?

I got so pissed off I created NOTSCO, which provides an unofficial, independent, free, test environment for One Touch Switching development. It allows you to try a lot of messages, each way, with a lot of different combinations - particularly important for handling the myriad of possible switch match responses.

What is really good is that it analyses each message, and reports any issues, with reference to the specification. It allows you a playground to work through development, testing as you go, from the most basic connectivity, to completing the whole sequence of switch match and order, each way.

It makes a scorecard of message types, so you can see you have good coverage, and allows a range of fixed nasty messages to test your error handling and edge cases.

Not over yet

Of course the real fun will start when we start doing live switches with other CPs.

Good luck!

2024-07-12

TOTSCO Integration testing done

Well, we managed to complete tests with one CP yesterday, with us giving them a lot of hand holding.

But separately, today, I had a 90 minute teams call, with TOTSCO, and another CP, and, well, that was it!

We each had a challenge of forcing our systems to create scenarios - faking installations happening when they had not, and so on. And then forcing errors when errors should not happen. That is why it was 90 minutes.

I learned a few things - for example I had one system sending a datetime not a date, in one specific case we had not tested before, which meant they rejected the message (quite correctly), well done. They too found issues as well, all minor, and sorted on the call.
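
This sort of slip is easy to guard against with a check before sending. A minimal Python sketch, assuming the field is meant to be a plain YYYY-MM-DD date - that format is an assumption here, so check the OTS message specification for the real rule - and with a made-up function name:

```python
import re
from datetime import datetime


def is_plain_date(value):
    """True only for a valid calendar date written as YYYY-MM-DD."""
    # Reject anything with a time part, timezone, or other extras.
    if not isinstance(value, str):
        return False
    if not re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False  # right shape, but not a real date (e.g. 30th Feb)
    return True


print(is_plain_date("2024-07-12"))
print(is_plain_date("2024-07-12T10:30:00"))
```

The same check is just as useful on the receiving side, to decide whether to reject a message.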

It was nice working with people who had basically got a fully working system, and it was just slight tweaks and edge cases - polishing things a bit - which they were able to sort on the call. All very professional. Thanks guys.

But that was it, a 90 minute call, and TOTSCO confirmed all tests done each way and all passed integration testing.

Next step, production testing, which is ONE MESSAGE... Oooh, I'm scared!

2024-07-09

TOTSCO - making it work

We (A&A) are trying to make One Touch Switching work, honest.

Given the many posts I have done, and that I am clear how I feel it is a stupid imposition by OFCOM, and badly implemented, you may think I am not trying to make it work. But I am.

Testing

I have created a complete proper testing platform, because TOTSCO don't have one. But I have been monitoring, and finding issues that other CPs have. The use of "VOIP" for "NetworkOperator", instead of "A000", fooled me, and I have confirmed the TOTSCO spec is wrong, as is the example in the spec. I asked that question, got an answer (even though they will not update the spec) and have added details and notes on NOTSCO so testers can see the issue and why it is there. I am constantly working, every day, to ensure my test system is right even when that means getting the broken specification clarified.

Helping "buddy CP"

We are working with a CP. We will work with more. We had been told by the boss of the CP that they were ready, and would be happy to test with us. To our (and his) surprise, they are not close. So I have done a lot of hand holding here. We found more errors in the specification as a result. My view is we are 100% ready, and have been for a month. But getting through TOTSCO's insane testing process is painful and broken. And no, TOTSCO are not paying us to hand hold, and test with other CPs?!

Fixing the specification

Every error we have found, we have told TOTSCO, and tried to get them to fix the spec. They think a change freeze is a good move - it is not! We tried. So the issues are documented on NOTSCO test system to help other CPs understand.

Going live

I am not convinced by any of the other testing stages, to be honest.

What would I do (ramp up)?

I would have CPs going live, but making clear to customers as part of the order journey that this is not all CPs yet and they can choose not to be part of it. If they say yes, offer choice of CPs that are on-line.

This means real migrations and ports with real CPs, increasing as each CP goes live. The deadline would simply be for when all CPs are live.

Babysitting

A key aspect I have included in our system is babysitting. I personally (as the lead developer) am being notified of each OTS message, so I can review it. A larger CP may need a team to do that.

We also make it optional for customers - we have to - they may not want to cease an existing service at all, so the process has to be optional. But it means any issues also allow a more old school migration or new service instead.

This idea is simple - there will be edge cases - there will be errors - we will have errors - but we can propose changes and review them and make them live in an agile way. With NOTSCO we can test those edge case errors. Indeed I can add test cases to NOTSCO for others as a result.

We lack contacts to other CPs, this will be fun when live as I will raise with TOTSCO every weird or wrong OTS message we get from another CP and ask them to put us in touch with the other CP. What would have been neater would have been to provide all CPs with an email to query OTS issues during deployment. Why is that not a thing?

Ideally we would find all issues during a ramp up, and if that meant changes to TOTSCO specifications, they would be done and notified quickly.

We can make it work!

We can, I am sure, but it is being so badly managed right now (in my honest opinion) it will be a lot more work than it should be.

TOTSCO change freeze

You are approaching a deadline, one that is legally important.

Hundreds of developers are working to meet that deadline. They need to interoperate on the impending deadline.

Your specification has errors, contradictions and vague definitions.

You choose:

Encourage queries, make clarifications, advise all companies of these clarifications and updates in a timely agile way updating the specification.
Change freeze the specification so all companies make their own mind up, get confused, and reach deadline in an incompatible way.

Which did #TOTSCO choose?

2024-07-05

TOTSCO, telcos, a little help please!

There are 47 other companies on the TOTSCO pre-production platform right now. We have been waiting weeks for a buddy CP for testing. We'd love to get more testing with anyone.

Can any one of you spare a few minutes to do some testing?

You don't need to book anything with TOTSCO for this, if you are on pre-production platform, we can exchange messages, and if we exchange the required messages we can complete this stage of testing.

Try a match request to us maybe? I have set up a line on our system, with these details, for testing.

Service: IAS

RCPID: RVWJ

Surname: STARMER

10 Downing Street

LONDON

SW1A 2AA

That should get you a valid match confirmation. Try with surname SUNAK, and you should get a match failure. [I thought it more amusing that way around]

If you get a confirmation, do send a switching order, update, cancellation/trigger as well.

Thanks.

Update: We got a test pretty quickly, which is nice. I got the postcode wrong initially, D'Oh, but the match request included an account number and UPRN. It is a concern that they were included (I did not post a UPRN initially and what was sent was wrong). It suggests the sender expects to send an account and UPRN in all cases, when neither should be mandatory. So interesting test, thank you very much for that.

Comment: Yes, that is all you need to know if there was a broadband service you wanted to port to a new provider under the new system.

Update: I did not include a UPRN or Account number as I would hope CPs can cope without these from a customer. We cope without them in matching. But as TOTSCO don't define it, we also cope if they are present but an empty string!
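To illustrate coping with an optional field that is absent, or present but an empty string, here is a minimal sketch in Python. The function name and the keys "uprn" and "account" are hypothetical, chosen for illustration, and are not taken from the TOTSCO specification or from the real matching code.

```python
def optional_field(message, key):
    """Return the value of an optional field, or None if it is effectively absent.

    Assumption: a missing key, a null, and an empty or blank string are all
    treated the same way, as "not supplied".
    """
    value = message.get(key)
    if value is None:
        return None
    if isinstance(value, str) and value.strip() == "":
        return None
    return value
```

The matching logic can then ignore any field for which this returns None, rather than failing the match because an empty string does not equal the stored value.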

TOTSCO Tick Tock

I may sound like a stuck record, but I learn more as I go, so updating on this seems sensible. I hope it helps other CPs.

TOTSCO do publish the test process, here. But I'll summarise.

A simple connectivity test, to check connectivity, and send some dummy message responses.
Integration testing with another CP, exchange a number of messages of different types.
Production Implementation testing.

To summarise the problems so far.

Simulator

It is meant to do basic connectivity tests, but did not actually pick up the one issue we had that we were too slow responding (a simple apache config tweak fixed). So failed in its one job.

We could not complete tests as the responses were not valid. Heck, we had to bodge things to even send a message as the simulator does not meet the spec for the URLs used. This meant we would not then send messages to progress a switch order, as they wanted, because we had (correctly) rejected the invalid messages they had sent, and had no switch order to progress. Thankfully that step was not mandatory.

Integration Testing

This purports to be more comprehensive testing, but it has a lot of issues.

It is testing with a buddy CP, but it seems it can take weeks to have one assigned, at random. We short circuited this eventually by agreeing with another CP we know. But they were not ready. What is worse is that we now know that at the same time as we are waiting weeks, other CPs are as well, which makes no sense?!?!
The buddy CP is doing testing with us, for free(!), if they are ready. They may not be ready. Even if they are, we are doing tests against their interpretation and implementation of the OTS specifications. It is not testing to a reference implementation or against the specification.
They want us to do all 15 message types (plus the 16th messageDeliveryFailure). One issue is that this is contrived. Our system is designed to send valid messages as part of a switching process, integrated in to our management systems (with just the tiniest tweak for testing to not actually action a cease or migrate at that key point). This means getting failure responses to some messages will not use our normal system, because our normal system would not send incorrect messages.
They then want at least 1,000 messages - why? These are not real switch orders. It will literally be a repeated match request sent 1,000 times. A totally pointless step.

I have had to make the system allow me to send bad messages in some ways in order to get the Failure responses. This means I am not testing the actual OTS system we have made, I am bodging it! If there was a proper test system, one could set up the bad responses even for valid messages so as to test, and to generate bad incoming messages to test error checking. But if you have two CPs that have set up systems correctly, they would not generate bad messages and therefore prompt error responses as a result. You actually need a buddy CP that is set up to deliberately do testing, for free(!).

This is where we are still - I asked for another buddy CP a few weeks ago, and no joy yet, but the original CP may be closer to being able to do basic tests now. I hope so. It could allow us to finally get past this step.

Production Implementation testing

This is the final step before able to go live.

We have to book a test slot 8 weeks in advance. Why in the name of sanity would we have to do that? I mean if the test slot meant tying up TOTSCO staff for hours to go through a series of complicated tests, I could understand - but this is the kicker...
The test is one message exchange. Just one. I don't see how this even takes up TOTSCO staff time. It should not. It could be automated - I fill in details - I send a match request to BT or someone, and get a response, done. Why on earth is a test slot even needed in the first place, let alone booking 8 weeks in advance.

2024-06-29

TOTSCO correlationID

RESOLVED! See below!

My latest concern is understanding TOTSCO specification. This may be that I have mis-read or not read enough. I am fully prepared to accept I have this wrong. It came up because the buddy CP and myself read it differently.

Messages each way have a source and destination correlationID. This is necessary to allow a response to be correlated with a request. An initial request does not need a destination correlationID (indeed, should not have one), but needs a source, and the reply needs a destination correlationID matching that source (and arguably maybe not a source of its own, except it is mandatory §2.1.5, except it is not §2.1.8).

My initial interpretation was that each message type that was a Request would have a response that is a Confirmation or a Failure. And that the Request/Confirmation or Request/Failure would need matching correlationID so the response could be matched to the request, but that was it.

Indeed, all of the messages and responses that progress a switching order also contain a switchOrderReference, so no actual need for correlationID at all anyway in those.

My code would send a Request and wait for a response, using the correlationID to match the response. This is synchronous in the customer order process where the SLA for a match request is 60 seconds. We make the customer wait for the response up to 61 seconds.
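A minimal sketch of this kind of request/response matching, in Python rather than the C used for the real system. The class name, the simplified message layout, and the use of a UUID as the correlationID are all assumptions for illustration, not taken from the TOTSCO specification.

```python
import time
import uuid

TIMEOUT_SECONDS = 61  # wait slightly longer than the 60 second SLA


class PendingRequests:
    """Track outstanding requests by the source correlationID we issued."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._pending = {}  # correlationID -> (routing_id, deadline)

    def send(self, routing_id):
        # Each request gets a new, unique correlationID (assumption: a UUID).
        cid = str(uuid.uuid4())
        self._pending[cid] = (routing_id, self._clock() + TIMEOUT_SECONDS)
        return {"routingID": routing_id, "source": {"correlationID": cid}}

    def receive(self, response):
        # The reply must quote our source correlationID as its destination.
        cid = response.get("destination", {}).get("correlationID")
        entry = self._pending.pop(cid, None)
        if entry is None:
            return None  # unknown, duplicate, or already answered
        routing_id, deadline = entry
        if self._clock() > deadline:
            return None  # too late, the customer is no longer waiting
        return routing_id
```

Because the entry is removed on the first matching reply, a repeated or delayed reply with the same correlationID is simply not recognised, which is the simplicity this interpretation offers.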

But then I saw the published TOTSCO test cases, and they all had a destination correlationID for the ongoing messages, the residentialSwitchOrderRequest, for example.

This only made sense if the whole sequence, such as the following, were all a single message flow with a consistent set of correlationIDs each way for the whole sequence.

residentialSwitchMatchRequest
residentialSwitchMatchConfirmation
residentialSwitchOrderRequest
residentialSwitchOrderConfirmation
residentialSwitchOrderUpdateRequest
residentialSwitchOrderUpdateConfirmation
residentialSwitchOrderTriggerRequest
residentialSwitchOrderTriggerConfirmation

If that is the case I have to hold correlationIDs much longer, and associate with ongoing switch orders. I spent many hours re-working the system to do just that. This had issues with the possibility of delayed/repeated messages, which can happen. A reply may be to an earlier message with the same correlationID. I'd far prefer the previous interpretation where each Request has a new and unique correlationID which has to be quoted in the single corresponding response (Confirmation or Failure). It would be simpler and easier. But the test case examples make it clear that this is not the case, which is messy and a lot more work.

I have now asked TOTSCO to clarify. I have not had a reply yet.

So, even though I did all the extra work, I am happy if they come back and say it is for each message pair distinctly. But they must update the specifications and examples and test case to make that clear, as it is a lot more work to track these over a complete (multiple days, weeks) switch order process than over a simple message pair.

For now my code does both - it tracks and uses consistent correlationID for the whole sequence of messages, but accepts new correlationIDs for each pair of messages if that is what we get.

Update: "The specification does not call for either option to be a requirement, but our expectation and the behavior [sic] we have seen so far in testing is that the second option is being applied by users. There is nothing to stop a CP from wanting to use the same correlation ID throughout a whole switch journey, but the important thing is that they cannot expect their counterpart CP to follow the same behavior [sic]."

This is typically not helpful. If even one CP can expect / require the destination correlationID for a residentialSwitchOrderRequest to be their source correlation ID from previous residentialSwitchMatchConfirmation then that means all CPs will have to track correlationIDs through the sequence else they will not work with that CP. If a CP cannot expect / require that, then no CPs need to do that. The spec needs to say one way or the other. Saying "The specification does not call for either option to be a requirement" is a useless response!

Update: Finally a straight answer - I wasted a day making my code work the same as the test cases, FFS.

"We would like to inform you that, according to the specification, a switch order request is not seen as a response to a match confirmation. Additionally, the TOTSCo hub does not require users to include a destination correlation ID in any request message."

2024-06-28

Will TOTSCO be ready?

The One Touch Switching should be live 12th September. Will the "industry" be ready?

I am not sure.

We are on the pre-production platform now, doing integration testing. There are 47 CPs on the system, including us. And yes, please, any other CPs on there try sending us match requests. And if you need more testing try https://notsco.co.uk/

So I tried sending a match request to each.

The responses were interesting. A lot did respond, which is good, but what is fun is the range of different errors. This is a reflection of how badly the specification has been written. All should have failed to find any service for the name at the address. But the actual error codes and error texts varied a lot. If the specification was good, the response would have been consistent. It is not. Fun!

Quite a few did not respond, fair enough, they may only have their pre-production on line for testing.

Some failed with delivery timeouts, and one with an invalid API Key!

I really am not sure this will all be working. I mean, I think we are 100% ready according to my reading of the spec, and if I have the spec wrong, I am 100% confident I can address that within minutes. But I am not sure of others.

My biggest mistake today was finding apache had a weird 5 second delay. Seems I am not alone if you google that, and a simple fix for it (Content-Length). The CP we are working with may have the same issue, but I am not sure they have the means to debug at the right level to see and resolve it. I'm glad we fixed this, and embarrassed it was wrong.

What is fun is today TOTSCO also failed to meet their own SLA on response times to messages. No reply on that yet.

But all of this is "nuts and bolts" of messaging, and nothing close to the high level issues I fully expect to stem from the whole system. CP to CP messages going wrong has a whole new level of possible issues, and I am not sure we are close to tackling those.

Wow, and one replied after 4 minutes, and replied twice!!! (the SLA is 60 seconds). Their reply had incorrect auditData, and incorrect content in the payload!

2024-06-26

TOTSCO, gets worse

Seriously, this is bad.

TOTSCO have specifications for the whole process, but they are made of cheese. They don't even specify such fundamental things as the basic data types for something like an RCPID (Retail Communications Provider ID). I have argued with them, as one spec does say it is "4 alpha characters, not starting A", but they dismiss this as not actually the spec of an RCPID, and seem to have no issue with not having a specification?!?!

To be clear, I would expect it to be something like: "An RCPID is assigned by TOTSCO, and is 4 alpha characters not starting "A", or the 6 character string "TOTSCO". In JSON it is a string type value. By convention it starts with an "R", but this is not a requirement and should not be assumed.", and I would even like them to reserve "TEST" as a special RCPID. I'll help them write a spec if they ask!
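With a definition like that, a syntax check becomes trivial. A minimal sketch in Python of the wording suggested above. It assumes "alpha" means upper-case A to Z only, which is a guess, as nothing quoted here says whether lower case is allowed, and the function name is hypothetical.

```python
import re

# Assumption: "alpha" means upper-case A-Z. Four letters, the first not "A".
_RCPID_RE = re.compile(r"[B-Z][A-Z]{3}")


def is_valid_rcpid(value):
    """Check a value against the suggested RCPID definition."""
    if not isinstance(value, str):
        return False  # in JSON it should be a string type value
    if value == "TOTSCO":
        return True  # the one 6 character exception
    return _RCPID_RE.fullmatch(value) is not None
```

Note that "TEST" passes this check as an ordinary RCPID, which is why it would need reserving explicitly.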

A clear specification to which all CPs can refer is essential. Heck, we are used to this with RFCs. The RFC is the reference and who has got it right or wrong is by reference to the RFC.

But what is worse is the whole testing and integration process!

There seem to be these steps:-

A really simple messaging test (their simulator). It is flawed, but checks basic OAUTH2 at least.
A CP to CP integration test using their pre-production platform. <-- WE ARE HERE NOW!
Then live!

At no point is anything tested to the specification!!!!

I am not sure there is even a process for reporting and resolving a CP not following what little specification there is!

This is a serious problem, and as a simple example, we are currently going through the integration testing process with a buddy CP that has already done it. I won't name them, it is not their fault.

The first test

The first test was actually pretty good in many ways - they misread the details I provided and sent a residentialMatchRequest with an invalid account number, and we replied with an error, saying it was an invalid account number format. Yay, a good test.

So, I take that as a huge success of a test.

But no...

Their request was wrong in other ways, and now I see it, I have updated my system. They sent an envelope destination correlationID on an initial message which is not according to the specification. We mistakenly used that in our error reply. Oddly TOTSCO sent us a messageDeliveryFailure even though the other CP got our message, and we then barfed at the correlationID on that, because it was not one we issued!

So why TOTSCO sent the messageDeliveryFailure is unclear. But the other CP got it wrong in the original message anyway. What is worse is at least one messageDeliveryFailure is incorrect as well, according to the specification as it had no source correlationID, which is mandatory.

So at this point, we had a few checks missing, but the other CP had their message slightly wrong. They are the ones that have passed integration testing and are sending a wrong message to us. They fixed it and tried again, but TOTSCO then failed to deliver the message to us, which looks like another TOTSCO error.

Naturally my NOTSCO system picks up this stuff now.

Working with them

To be clear, we are working with the other CP here, we want to make it work.

Update: They had not gone through integration testing, which suggests they have been waiting at least a month for someone to buddy with, which suggests yet another serious problem in the process!

So, the score so far...

Other CP, 1 error (minor), fixed.
Us, 1 error (not handling their error well), fixed.
TOTSCO 2 errors, still awaiting a reply.

Update: Not a peep from TOTSCO all day so far, formal tickets raised.

Update: After raising tickets, I have some replies. They claim we did not respond within 2s, but my logs show no request, so some packet dumping next.

Update: One reply is interesting - their invalid message is apparently correct as two parts of the specification contradict each other.

Update: They said we did not respond in the 2s SLA, but when I asked for the SLA it states 3s (after up to 1s connection time), so no idea where the 2s came from.

Update: and wow...

Don't trust apache!

This may be of use to other CPs here. The SLAs are tight, they want a response (at http level) within 3 seconds.

It is run as an apache CGI function executable. It responds to stdout with Status, Content-Type, and content (JSON), and exits. That should be it. Simples!

My code was responding quickly, indeed, usually well under 100ms. This was as measured in the code, and measured from an external connection (NOTSCO).

However, TOTSCO were still struggling and saying we were timing out. Very odd indeed, so I did packet dumps to prove them wrong.

To my shock, the packet dump showed a 5 second delay in the middle of the TCP.

After some experimentation, noting TOTSCO send Connection: keep-alive, I eventually found that if I sent a Content-Length, then the JSON, apache no longer fucked about, and responded instantly.

I can only assume some persistent connection thing - which is not usually very good with CGIs like this. But even so, having closed stdout and exited, I expected apache not to wait.
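For anyone wanting to see the shape of the fix, here is a minimal sketch in Python of a CGI style response that includes a Content-Length. The real code is a C executable, so this is only an illustration, and the function name and the example status and payload are assumptions.

```python
import json


def build_cgi_response(status, payload):
    """Build the bytes a CGI program writes to stdout, with Content-Length."""
    body = json.dumps(payload).encode("utf-8")
    headers = (
        f"Status: {status}\r\n"
        "Content-Type: application/json\r\n"
        # Stating the exact body length up front means the server does not
        # have to work out for itself where the response ends.
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return headers + body
```

A CGI script would then do something like sys.stdout.buffer.write(build_cgi_response("200 OK", {})) followed by a flush before exiting. The length must be counted in bytes after encoding, not in characters.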

So, heads up, that 3 second timeout SLA can catch you out!

2024-06-12

NOTSCO (Not TOTSCO) One Touch Switching test platform (now launched)

I posted about how inept TOTSCO seem to be, and the call today with them was no improvement.

It seems they have test stages...

A "simulator" to prove basic connectivity, well, sort of. See blog!
Pre production (i.e. live with another CP, but not testing against the specification in any way).
They may have a wider general pre-production stage as well.

They seem to be missing the obvious, a proper simulator platform that can simulate communications with another CP using TOTSCO, both ways. This has the aspect that the testing is against the spec, not against other CPs and their interpretation of the spec. It would be something to use whilst developing OTS for an ISP, and before going on to preproduction testing.

Missing link

So how do we address this missing link, a platform to test TOTSCO as if talking to another CP, but without actually doing so. Testing against the specification?

Well, we, like other CPs, I am sure, made some simple test systems before going to TOTSCO. But external testing is invaluable. Even if the external systems have it wrong in terms of following the spec (as long as they will fix it), they won't have the same errors as you have. The best external test would be TOTSCO, making a proper CP to CP simulator system.

But it does not exist - so, as you might expect, if you know me, I have made it, for free.

No need to book test slot, just sign up and use for as little or as long as you need.
Configure the responses you want to a match request.
Send a match request with various options.
Send and receive the various messages for a switching order.
Send deliberately wrong messages to test your error checking.
Test as you go, an ideal way to test your code as you develop it.
Logging and reporting messages each way, in detail, with errors and warnings detected, quoting the specification and section that applies for anything it finds wrong.

From a privacy perspective, I am not expecting personal data to be stored, but we are deleting all test at the end of each day anyway. I did wonder about a report download option maybe.

Now launched

It is now launched at https://notsco.co.uk It took me a few days to create all this, about the same as it took TOTSCO to actually reply when we asked to go on pre-production testing (and they still have not actually set that up). Thank you all for your patience.

Discussions, bugs, feature requests - on GitHub please.

I have told TOTSCO about it as well...

2024-06-10

Working with TOTSCO

This is hopefully going to help other small ISPs that will have the same challenges.

As I explained in my previous post, we have to work with TOTSCO to set up One Touch Switching. Well, we are doing that now that TOTSCO actually exists. The new deadline is September, but we want to ensure we are working well before that.

Specifications

The specifications are not too bad. They have a few inconsistencies, which I have fed back to them. But I was able to code the system reasonably quickly. I created my own test system to act like TOTSCO so I could test my code with messages in and out in advance.

The underlying system is, as I say, just a messaging process between telcos. It can use OAUTH2, which is simple, and involves JSON messages each way, which is also simple. I use C and a load of long standing in-house JSON libraries, but for most people they would use some other platform with standard JSON libraries I am sure. It should be pretty simple. Obviously the hard part is integrating with whatever back end systems and processes the ISP uses, oh, and checking data for clean address data for matching services including UPRNs.

Simulator

TOTSCO have a simulator, which is good. It will allow testing against them. It has been two weeks since I finished coding it all, and only just on the simulator, but it is a mess, so far.

The token issuing URL had an invalid certificate (wildcard, but one level too high). I ignored that to get further testing.
The directory URL did not work (404). This provides (or should provide) the list of ISPs, basically.
The messaging URL simply said "Error connecting to the back end".

Well, that is not a good start, but chasing up, after several days they finally want me to check I am using the correct URLs. Good thing to check, but I was, as per the spec.

They fixed the token certificate, good, but the reply did not say they fixed it. The new cert now uses a different CA that libcurl does not know, or some such, which is fun. But at least is valid.
They told me to use the directory path but on the token issuing host, which makes no sense. Re-reading the documentation it certainly implies the directory URL is an "API" and so you would expect to use the API host. So that is weird. But it still did not work (404 Not Found). I eventually found it works if I add the optional parameter &identity=all. Well, it is meant to be optional, and is a GET form style argument, so how it was giving 404 is beyond me. Interestingly, with that, it works on token host and API host, so even weirder.
They told me to use a path for the messaging that starts /testharness/ which is not as per the specification (which states /letterbox/). So basically the simulator does not follow the specification! Using testharness gets further but a different error this time.
Oh, and the directory I get has RCPIDs (Retail Communications Provider IDs) which don't meet the specs, so, of course, my code barfs trying to put them in the database which was set for 4 characters, as per the specification. So again, the simulator does not meet the specification.

Some progress

Well, surprisingly, we have a quick response now.

They say that the duff RCPIDs are dummy entries. OK, but surely they should at least have correct syntax, as otherwise it is sensible for my end to reject them.
They just say testharness should work, but I have to use specific RCPIDs for testing, good (would be nice if that was documented, maybe I missed something). But they really need to fix it to actually follow the spec and use letterbox.
I got as far as testing a match request and them trying to send a reply. They get an OAUTH2 Bearer token, and then try and post a message, but the message they post does not use the same bearer token I issued to them, so is rejected.
I can see what they tried to post and it does not have the right source and target RCPIDs or correlationIDs, so again I would reject them if they actually authenticated.
Oddly, after more tests, they are using the right bearer now, but still wrong IDs

The irony here is that part of my coding was to make a simulator for my own testing before going to TOTSCO, and so far my simulator is way better than theirs!

Next steps

I have come to the conclusion that the simulator is actually useless. It does not simulate either the TOTSCO messaging platform (as it does not actually use the right URLs, or provide a sensible directory, or actually do OAUTH2) nor actual end to end messaging (as it does not do source/target RCPID or correlationID correctly).

What really puzzles me is that we know we are not the first to do this, and we know some of the big telcos have done this. So how have other ISPs not ripped TOTSCO to pieces over this stupidity already?

Follow up call

We have had a call. They explain that the simulator is totally dumb, it cannot be told to initiate any messages, and all it does is send one of two fixed replies to a match request (depending on the RCPID to which it is sent). It is meant to test connectivity.

But they want to do more than just two match requests and replies, they want us to send the order, update, trigger, and cancel requests.

This makes no sense, as the match requests test connectivity both ways already. And, of course, my system will not do that as it has not received a valid switch order confirmation reply. The fixed text they send is not valid as wrong RCPID and correlationID, so we don't accept it and don't store the switch order reference. And as such it does not see a switch order we can place or update or trigger or cancel.

I could fake such messages, but that is not testing my system.

They say that if I email explaining this, they will move to pre-production platform. This is the same as live, but with other CPs.

What they seem to lack is any sort of useful simulator that handles messages both ways as if to another CP. This would seem a sensible step before going to pre production testing.

Pre-production testing

We have moved on. Yay!

But the simulator test is meant to test connectivity, and seriously, does no more than that.

So you would hope and expect it simulates the real system.

But no!

The pre-production system has a stupidly big Bearer token, which breaks SQL tinytext. The simulator was way smaller, so not representative of the live system.
The pre-production system can't talk to us, not sending an Authorization header, WTF!?

I can confirm we have pre-production testing, and now we have to work with a buddy CP to test. They spent a week not finding one? So we suggested someone, and we are now ready to send and receive messages to complete the integration testing.

This whole process would be literally weeks quicker if they had something like my NOTSCO system.

More challenges

TOTSCO seem to see no issue with the fact they have not defined key data types, such as an RCPID. Well, they do, in one document, but they refuse to follow that spec and insist they have not specified. How they can even start without specifying key data types is beyond me.

2024-06-07

One Touch Switching

OFCOM have come up with a few things that are perhaps a tad questionable in terms of their benefit or practical application (in my personal opinion, of course). Sanity checking CLIs is one which created back scatter and broke useful services, but putting that aside, the latest is "One Touch Switching".

So what is it?

The concept seems relatively simple - a residential/consumer with a fixed location broadband (i.e. internet access) or telephone ("Number Based Interpersonal Communications Service") should be able to easily switch to a new provider. They should be able to do it as "one touch", i.e. their one order with the new provider.

Does this make sense?

Well, maybe. From a consumer point of view, for many people, the fact that moving from one "Openreach back end" broadband provider to another is different to moving from one technology to another may be confusing. Fair enough.

It is different for a reason - if you have a broadband service provided over Openreach based copper (or worse, aluminium) wires, you can change provider by the new provider working with Openreach to change what is attached to those wires and the ISP to which it is routed, and pretty much seamlessly move from one ISP to another. Of course ISPs vary, some don't even have IPv6, and some use CGNAT, and some filter or log stuff, so not really "switching", but OK.

But if it is a different technology, e.g. moving from VDSL on wires, to some radio (WiFi) service, or Starlink, or Virgin, or mobile, or, well, anything that is a different technology, the process is different.

But it is not complicated! It is order new service and cease old service. If you have any sense you arrange an overlap to ensure new service works well for you before old service stops, as it is not the same "wires". (and no, you cannot easily arrange that overlap now!)

OFCOM could have mandated that "ceasing" a service has to be simple and easy. That would have made any change of technology simple. They chose a different path.

How does it work?

Well, that's another problem, as OFCOM said to "industry, you have to do this", and expected something magically to happen. It did not, and has been delayed. Eventually some new company called TOTSCO has been created that is co-ordinating it.

This new system is simply a way for one telco to talk to another, with some quick, well defined (ish), messages to handle the process. Spoiler, it is JSON!

Basically the new provider ("gaining" provider) messages the old provider ("losing" provider) to match a customer and address, and if that all works they can start, and then later finish, a "switch". Old provider is expected to email customer with any early termination charges and stuff, good.

What it does not do?

It does not actually change the switching, migrating, or porting systems in place now. It simply adds a new layer.

If the process involves some migration or porting, that happens the same as it ever did. If it does not, e.g. changing broadband from Virgin to Starlink, all it does is coordinate the cease of the old service when the new one starts.

More work for customer!

Our broadband provide and migrate order forms are complex enough, we have to know exact address and what service we can offer, and if migrating from another Openreach service. But now we have an extra layer on top to match the service from the old provider. It saves the customer ceasing the old service if it is a change of technology, but if a migrate then it makes no difference, just adds more that we have to ask and more that can go wrong.

But for some people it may help, especially if ceasing an old service would be hard work. Some ISPs seem to make it hard work. So some good, maybe.

It seems to also stop most "anti-slamming" measures - not allowing losing ISP to cancel a migration now!

The old systems still needed!

However, the new system is only fixed location internet access or telephony, and only consumers. Anything else still has to work as before, business services, and services that are not fixed location. And even for the cases the new system applies, the old systems to migrate and port are still needed to make it happen.

Some hope?

Maybe, just maybe, number porting, which seems to involve a lot of manual work now, could be improved using some new messaging system used for One Touch Switching. If so, that will be good.

The issue here is many VoIP services are not "fixed location", so outside the scheme. We have had lots of issues with people porting numbers to us where the "address" did not match, when in fact the losing provider's idea of "address" is years old before it was moved to VoIP. The new system simply does not apply to non "fixed location" services, so that will be no help at all. A system like mobile ports, using a "PAC", may be way better, and not location dependent.

For us, porting a telephone service, from a fixed location, it may help, as it may confirm address match and confirm losing access provider, so ensure porting (which still has to use the same old system) may be more reliable. We hope so.

What's in a surname?

I mentioned a lack of any means to avoid "slamming", forced change of ISP/telco. This could be someone hijacking customers, or some end user being malicious and migrating someone's service for fun or malice or fraud.

The one thing the new system expects is a match of surname. They have a cryptic requirement to remove accents, but that is messy, depending on language and alphabet, simply "removing" an accent is far from "equivalent" to non accented. But we have done that in a crude way. But we do have to match surname.
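To show what a crude approach can look like, here is a minimal sketch in Python. It is not the actual code (which is C), and the choice of Unicode NFKD decomposition plus case folding is an assumption for illustration, not something the TOTSCO specification states.

```python
import unicodedata


def normalise_surname(name):
    """Crudely reduce a surname to an unaccented, case-folded form."""
    # Decompose characters so "ü" becomes "u" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


def surnames_match(a, b):
    return normalise_surname(a) == normalise_surname(b)
```

This shows why it is messy: "Müller" matches "MULLER", but a letter like "ø" has no decomposition so is left alone, and in German "Mueller" would arguably be the better equivalent anyway.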

So we have allowed customers to set the surname on their broadband services. This is not for VoIP as our VoIP is not fixed location, so will never match for One Touch Switching anyway, and needs old school porting out.

What I have now put on the web site re slamming is:-

For a long time we have operated an anti-slamming option where you tell us in advance that you do not wish your broadband to be migrated to a new provider. You could then change that at any time.

However, the new One Touch Switching system works differently. We will no longer be able to reject switching. However, to start switching the new provider needs an address and surname to match. They can start a switch process in BT without, but this is less likely as the normal process for consumers, and probably most businesses, will be One Touch Switching.

Because the surname has to match, we now allow you to edit the contact name on each line you have with us. Your name is what you want it to be, so picking any name for any circumstance is your right, and we have to respect that and allow you to change your name under GDPR, even if only on that very specific part of our system - the contact name for a broadband service.

If you change your surname, even if it is to PSJKHGJGEXC, then that is your choice. And any One Touch Switching match request would fail unless using the surname PSJKHGJGEXC.

Obviously this is meant to be for your surname not really as a pseudo password, but, well, it is up to you.

2024-05-30

Hot tubs are expensive (again)

Yes, my hot tub is expensive.

Our whole house total power consumption was, typically, 55 to 60 kWh per day. Which is a lot. I have some excuses, servers in the loft, air-con for heating and cooling in various ways, and, of course, the hot tub.

The average hot tub usage was 20kWh per day.

Simple change

The simple change anyone can do is insulation. The hot tub bucket has some foam coating stuff to insulate, but there are a lot of pipes connecting and holding (hot) water. These are inside some simple panels, and are not insulated.

The first surprise was how much difference the panels make. They are just thin fibre board of some sort, not obviously designed to insulate.

This is previous normal hot tub power profile :-

As you see, it is high when heating in the morning after being off all night, and when in use, but when idle is around 25% duty cycle maybe.

We removed the panels (to help turn it around, and ready for installing a heat pump). This was a surprising difference :-

The duty cycle, when not in use, when idle, was more like 75% or more. I emptied and refilled the tub from cold and it took 24 hours of full power to get to temperature. Yes, the lid was on.

This shows the side panels make a massive difference!

You can see why!

So what is the simple fix?

Lagging, multi layered loft insulation in fact, and a lot of silver tape, and quite a few hours.

The problem is I don't know how much this has helped, as it was done on the same day as the heat pump conversion - two changes at once. But it is a cheap change and I bet it helped a lot.

I should have done this years ago!

More expensive fix

The more expensive fix is a heat pump conversion. I spent £2299 in total on heat pump and installation.

It took a few hours...

It works by sitting in line with the circulation pump and with the internal resistive heater disabled (it actually has a relay to allow it to be used if really too cold for heat pump to work). The heat pump then operates whenever the circulation pump is on, leaving the hot tub to control temperature as normal, thinking it is working the resistive heater.

So, what's the difference

Firstly the power usage is way lower: the total for the heat pump and the circulation pump is around 1kW. Before it was 3.5kW. The other change is the duty cycle, which is lower. But I cannot be sure how much is down to heat pump and how much is down to insulation.

One big statistic is heating from cold, after a change of water.

With no side panels, resistive load 24 hours at 3.5kW. So around 84kWh.
With side panels, resistive load, back in January, 12 hours at 3.5kW. So around 42kWh.
With side panels, insulation, heat pump, 6 hours at 1kW. So around 6kWh.
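
The sums behind those three figures are just hours times power. A quick sketch, assuming the load is constant for the whole heat-up (which is what the numbers above assume anyway); the labels and the function name are only for illustration.

```python
# (label, hours to reach temperature, electrical load in kW)
runs = [
    ("No side panels, resistive", 24, 3.5),
    ("Side panels, resistive (January)", 12, 3.5),
    ("Side panels, insulation, heat pump", 6, 1.0),
]

def energy_kwh(hours: float, power_kw: float) -> float:
    """Energy used, assuming constant power for the whole period."""
    return hours * power_kw

for label, hours, power_kw in runs:
    print(f"{label}: {energy_kwh(hours, power_kw):.0f} kWh")

# Saving per refill compared with the January figure
saving = energy_kwh(12, 3.5) - energy_kwh(6, 1.0)
print(f"Saving per refill vs January: {saving:.0f} kWh")
```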

So what do the bigger stats say?

Average usage for May, 43kWh/day. I am seeing examples as low as 30kWh/day though. It seems the whole exercise has saved maybe 15kWh/day. But May is disproportionate with over 102kWh of tumble drier not a normal 42kWh due to someone having a broken bathroom :-)

It also means I am now regularly making enough solar, with battery storage, to run the house on overnight charge only, and have net profit on export, even in May, even on some gloomy days!

Last week's total electricity bill was 41p.

2024-05-22

ISO8601 is wasted

Why did we even bother?

Why create ISO8601?

A new API, new this year, as an industry standard, has JSON fields like this
"nextAccessTime": "2023-May-18 04:43:00+0000 UTC"

I mean, pick a lane, why "+0000" and "UTC"?

Why "YYYY-MName-DD" FFS, that is not *any* standard in RFC or ISO?!

I just don't know how they could have come up with that in any sane way.

The xkcd "cat" format would be saner!

(FYI, it is TOTSCO)
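
For anyone stuck consuming this, a rough sketch of turning that string into a proper ISO8601 timestamp. It is based only on the one example above, so everything about the format is an assumption: three letter English month names, a numeric offset, and a literal " UTC" stuck on the end. The function name is made up.

```python
import re
from datetime import datetime, timedelta, timezone

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

# Assumed layout, from the single example: YYYY-Mon-DD HH:MM:SS+hhmm UTC
PATTERN = re.compile(
    r"(\d{4})-([A-Za-z]{3})-(\d{2}) (\d{2}):(\d{2}):(\d{2})([+-])(\d{2})(\d{2}) UTC"
)

def parse_odd_timestamp(text: str) -> datetime:
    """Parse the non-standard timestamp into a timezone-aware datetime."""
    match = PATTERN.fullmatch(text)
    if match is None or match.group(2) not in MONTHS:
        raise ValueError(f"unrecognised timestamp: {text!r}")
    year, month, day, hh, mm, ss, sign, off_h, off_m = match.groups()
    offset = timedelta(hours=int(off_h), minutes=int(off_m))
    if sign == "-":
        offset = -offset
    return datetime(int(year), MONTHS[month], int(day),
                    int(hh), int(mm), int(ss), tzinfo=timezone(offset))

parsed = parse_odd_timestamp("2023-May-18 04:43:00+0000 UTC")
print(parsed.isoformat())  # 2023-05-18T04:43:00+00:00
```

A month table is used rather than strptime with %b, as %b depends on the locale the code happens to run in.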

2024-05-12

Debugging

There are lots of ways to debug stuff, but at the end of the day it is all a bit of a detective story.

Looking for clues, testing an hypothesis, narrowing down the possible causes step by step.

It is even more, shall we say, "fun", when it is not definitely a software or definitely a hardware issue. Well, to be honest, we know it is hardware related, but it could be hardware because the software has set something up wrong, or is doing something wrong, maybe. Really a processor hang should not be something software can ever do no matter how hard it tries, in my opinion, but in a complicated system with complicated memory management hardware, it is possible that a hang can be the side effect of something wrong in software.

I was going to say that "when I was a kid, software could never cause a hardware hang", but I am reminded not only of the notorious "Halt and Catch Fire" accidental processor operation, but that one could walk in to a Tandy store and type the right POKE command on one of the earliest Apple machines and turn it in to toast, apparently. So maybe there has always been this risk.

The latest step in the "watching paint dry" process of trying to diagnose the small issue we have with the new FireBricks is underway now. It has been a long journey, and it is too soon to say it is over. It will be an awesome blog when it is over, honest.

One of the dangers with software is the classic Heisenbug: a bug that moves or goes away when you change something. We are chasing something which, by our best guess, is related to some aspect of memory access. This means that even the smallest change to software can have an impact. Make the code one byte shorter and you move all the interactions with cache lines when running code, and change the timing of everything as a result. When chasing a bug like this, you cannot rule out those being an issue. So a change of one thing may result in a change in behaviour somewhere else. We have seen a lot of red herrings like this already.

The latest test is unusual for us. It is a change to an auxiliary processor that controls a specific clock signal to the processor before the code even starts to run. One we don't currently need. And we are removing anything we don't need, no matter how unlikely it is to be the cause.

What is fun is that this means we have not changed a single byte of the main code we are running.

If this works, and only time will tell, we can be really quite sure it is not some side effect of simply recompiling the code. It pretty much has to be the one thing we really did change.

Being able to test something so specific by a software change is quite unusual.

Data packages

Our old SIP2SIM was "pay as you go", and the new one has monthly capped data packages.

To be honest, people have asked for this for a long time, but as ONSIM are selling us data packages, it makes sense to do the same, at least for now. Monthly 2GB, 4GB, 10GB, 20GB, 40GB. It is also more sanely priced than before.

But, of course, it is not simple.

So, for a start, adding data to a non data SIM, mid month, is a pro-rata data for rest of month at a pro-rata price. So far so good.

But what of an increase of data package mid month? My thought on this (and it depends on ONSIM), is we update to the new monthly package, pro-rata if data started mid month, and the same for price. Mostly it will be an increase for the whole month to the new monthly rate and the difference in monthly price.

But what of a decrease? Well, I guess, maybe, the same logic could apply, but only if you have used less than you would now have for the month. My thought is no, lowering the package is setting a new lower level for next month. This is far simpler, with no billing implication and no change to this month.

Of course if you then increase again, we have to allow for the fact that this month you are on a higher package than you will be next month, and only consider it an increase relative to that.
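
As a sketch of the logic I am thinking of (not final, and it depends on what ONSIM do), something like this. The function names and the prices in the example are made up, and counting the start day as a full day is an assumption.

```python
import calendar
from datetime import date

def month_fraction(start: date) -> float:
    """Fraction of the month remaining, counting the start day itself (assumption)."""
    days = calendar.monthrange(start.year, start.month)[1]
    return (days - start.day + 1) / days

def add_data(package_gb: float, monthly_price: float, start: date):
    """Adding data to a non data SIM mid month: pro-rata data at a pro-rata price."""
    fraction = month_fraction(start)
    return package_gb * fraction, monthly_price * fraction

def increase(old_price: float, new_gb: float, new_price: float, data_started: date):
    """Increase mid month: new allowance and extra charge for this month.

    old_price is the package in force this month, even if a lower one is
    already pending for next month. data_started is the 1st unless data
    was only added part way through this month.
    """
    fraction = month_fraction(data_started)
    return new_gb * fraction, (new_price - old_price) * fraction

# A decrease changes nothing this month; it just sets next month's package.

# Hypothetical prices: 10GB at 10.00 a month, added on 16th June (15 of 30 days left)
print(add_data(10, 10.00, date(2024, 6, 16)))
# Hypothetical upgrade from a 5.00 package to 20GB at 9.00, data since the 1st
print(increase(5.00, 20, 9.00, date(2024, 6, 1)))
```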

This is never simple, is it.

Hopefully we have something soon, sorry for the delay, waiting on ONSIM to do the necessary APIs for us.