Looking for clues, testing an hypothesis, narrowing down the possible causes step by step.
It is even more, shall we say, "fun", when it is not definitely a software or definitely a hardware issue. Well, to be honest, we know it is hardware related, but it could be hardware because the software has set something up wrong, or is doing something wrong, maybe. Really a processor hang should not be something software can ever do no matter how hard it tries, in my opinion, but in a complicated system with complicated memory management hardware, it is possible that a hang can be the side effect of something wrong in software.
I was going to say that "when I was a kid, software could never cause a hardware hang", but I am reminded not only of the notorious "Halt and Catch Fire" accidental processor operation, but that one could walk in to a Tandy store and type the right POKE command on one of the earliest Apple machines and turn it in to toast, apparently. So maybe there has always been this risk.
The latest step in the "watching paint dry" process of trying to diagnose the small issue we have with the new FireBricks is underway now. It has been a long journey, and it is too soon to say it is over. It will be an awesome blog when it is over, honest.
One of the dangers with software is the classic Heisenbug: a bug that moves or goes away when you change something. We are chasing something which, by our best guess, is related to some aspect of memory access. This means that even the smallest change to software can have an impact. Make the code one byte shorter and you move all the interactions with cache lines when running code, and change the timing of everything as a result. When chasing a big like this, you cannot rule out those being an issue. So a change of one thing may result is a change in behaviour somewhere else. We have seem a lot of red herrings like this already.
The latest test is unusual for us. It is a change to an auxiliary processor that controls a specific clock signal to the processor before the code even starts to run. One we don't currently need. And we are removing anything we don't need, no matter how unlikely it is to be the cause.
What is fun is that this means we have not changed a single byte of the main code we are running.
If this works, and only time will tell, we can be really quite sure it is not some side effect of simply recompiling the code. It pretty much has to be the one thing we really did change.
Being able to test something so specific by a software change is quite unusual.