2024-05-12

Debugging

There are lots of ways to debug stuff, but at the end of the day it is all a bit of a detective story.

Looking for clues, testing a hypothesis, narrowing down the possible causes step by step.

It is even more, shall we say, "fun", when it is not definitely a software or definitely a hardware issue. Well, to be honest, we know it is hardware related, but it could be hardware because the software has set something up wrong, or is doing something wrong, maybe. Really a processor hang should not be something software can ever do no matter how hard it tries, in my opinion, but in a complicated system with complicated memory management hardware, it is possible that a hang can be the side effect of something wrong in software.

I was going to say that "when I was a kid, software could never cause a hardware hang", but I am reminded not only of the notorious "Halt and Catch Fire" accidental processor operation, but that one could walk into a Tandy store and type the right POKE command on one of the earliest Apple machines and turn it into toast, apparently. So maybe there has always been this risk.

The latest step in the "watching paint dry" process of trying to diagnose the small issue we have with the new FireBricks is underway now. It has been a long journey, and it is too soon to say it is over. It will be an awesome blog when it is over, honest.

One of the dangers with software is the classic Heisenbug: a bug that moves or goes away when you change something. We are chasing something which, by our best guess, is related to some aspect of memory access. This means that even the smallest change to software can have an impact. Make the code one byte shorter and you move all the interactions with cache lines when running code, and change the timing of everything as a result. When chasing a bug like this, you cannot rule out those being an issue. So a change of one thing may result in a change in behaviour somewhere else. We have seen a lot of red herrings like this already.
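
To give a feel for the one byte effect, here is a minimal sketch. It is not the real code or the real hardware: the 64 byte cache line size, the addresses, and the 16 byte "hot loop" are all made up for illustration. It simply works out which cache lines a block of code occupies before and after everything ahead of it gets one byte shorter.

```python
CACHE_LINE_BYTES = 64  # assumed line size; the actual processor may differ


def lines_touched(start, length, line_bytes=CACHE_LINE_BYTES):
    """Return the cache line indices covered by code at [start, start + length)."""
    first = start // line_bytes
    last = (start + length - 1) // line_bytes
    return list(range(first, last + 1))


# Hypothetical 16 byte hot loop that currently starts at address 0x1031.
before = lines_touched(0x1031, 16)

# Make some earlier code one byte shorter and the loop slides down to 0x1030.
after = lines_touched(0x1030, 16)

print("before:", before)  # the loop straddles two cache lines
print("after: ", after)   # the same loop now fits in a single line
```

The loop itself has not changed at all, yet it goes from spanning two cache lines to one, which is exactly the sort of shift that can hide or expose a timing-sensitive problem.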

The latest test is unusual for us. It is a change to an auxiliary processor that controls a specific clock signal to the processor before the code even starts to run. One we don't currently need. And we are removing anything we don't need, no matter how unlikely it is to be the cause.

What is fun is that this means we have not changed a single byte of the main code we are running.

If this works, and only time will tell, we can be really quite sure it is not some side effect of simply recompiling the code. It pretty much has to be the one thing we really did change.

Being able to test something so specific by a software change is quite unusual.

7 comments:

  1. The first step to debugging anything is to get a reliable method of reproducing the bug. And if changing unrelated code changes whether the bug occurs due to cache line issues etc, then as you say it becomes a right pain.

  2. Debugging is "fun", particularly when you trip across unrelated bugs which screw up your results - I have some silent voicemails (which play as silent on multiple devices), and the outgoing message (RevK?) is missing from that line but not others. Trying to investigate, I end up in an accidental conference call with my mother and A&A's voicemail system ('on fail' and voicemail both triggered?), and the voicemail MP3 played as silent on my iPhone ... but not my MacBook Pro. (Rebooting the phone fixed that one, at least.)

  3. Well, if anyone can fix it, it's yourselves.

    From an outside perspective, it looks like most of your customers wouldn't have noticed, as they'll flip to another LNS. Obviously the hardcore users with monitoring will notice it.

    Once it is all fixed, it would be interesting to see if the customers who were upset, were actually impacted or they just like a nice graph. I'm one of those who like a nice graph and it unsettles me when it isn't.

    I'd go as far as to say I've got a leased line in the countryside, lol!


  4. I left because of the issues. I’m on FTTP which should be reliable, my own kit has hundreds of days of uptime and the ONT is never power cycled, I’ve never had an issue with the physical fibre. So, A&A only had one thing to do, supply a reliable service and they were failing on that which made me question what I was paying their premium for. I didn’t join them to be an unpaid tester for kit they are looking to sell elsewhere for a profit.

    The reason they do their updates/downgrades in the early hours is the same reason a lot of people may be up and working at the same time. As I work in support I also have to update systems out of hours so I want something reliable 24/7 and it was a headache what was going on. An update to routers once every few months causing a drop overnight, perfectly okay and in line with most ISPs, but every other day it seemed something was being updated or downgraded on top of random crashes and it went on for too long. The option to select a time for when to be dropped seemed to go out the window and I just got dropped at random times for the updates. They just kept taking the full premium for the service whilst offering the worst reliability I’ve had from any ISP and I had less drops on a long line with ADSL. The issues are not confirmed fixed yet so the cycle of updates, crashes and downgrades could all start again, but at least they stopped using their customers to experiment with during my remaining time with them, so I’m grateful for that at least.

    Suffice to say, new ISP has same speeds, fantastic support, unlim

    Replies
    1. The drops were not planned, that was the point, and customer equipment could reconnect within seconds - but it depends on the equipment and some would take minutes. We do planned work over night as you would expect. For some reason the issue was only present on the live servers, not the many test systems we have. But it looks like we are finally making progress, as you can see.

  5. I'm on 300/50 G.Fast (and I get those speeds) and barely noticed there was an issue, despite the fact I do pay attention to the CQM graphs.

  6. "when I was a kid, software could never cause a hardware hang" -- I grew up with the original 6502/6502A, and if it ingested any opcode ending in #x2 it was game over. Aside from one legal opcode #A2 (LDX imm) it irretrievably halts the processor. Fetch, Decode, Execute -> Death.


