I'd thought I'd share one of the challenges of my day today - a very minor thing but it shows why some software can be such a nightmare. Maybe I can explain it in a way that is easy enough for non engineers to understand.
Sometimes a computer may be doing something wrong. That happens. One example which customers will have noticed is our "blip graph".
What did you change?
One of the key steps in diagnosis of something like this is to look at what you changed. You then try and see if there is some link between what you changed and what is going wrong. In many cases you can just look at the changes and the error sticks out like a sore thumb.
A perfect example would be if I had, for some reason, been working on the code that creates the blip graph from the database, or if I had been working on the code that puts the blip counts in to the database.
I would be able to look at my change, and wonder why my own testing had not shown the problem as well. There are tools to show me exactly what I changed.
It is also really useful if a problem is reported quickly as you also remember why you changed something and what you were trying to actually do as well.
We changed everything and nothing!
The problem is that we changed everything because we have done a major upgrade on clueless. We have also changed nothing, in that none of the code has been changed, just built on the new machine.
The code that makes the blip graph has not changed, and the code that displays the blip graph has not changed. Clearly the database is working as we have some of the blip graph. Indeed, it really made no sense.
One of the key things that lots of systems have are error logs, and we check these. But there are no errors being reported by the system that generates or displays the blip graphs after the upgrade, and were not in the past. So no clues there...
How did it ever work?
After a lot of digging I have found the cause, and it leads to one of those special things that can so often happen with software. HOW THE F*CK DID THIS EVER WORK?!
The "digging" took quite a few hours, because there simply was no logic to it. Nothing had been changed recently in the code, and no errors showing.
I quickly worked out that the displaying side was probably OK, but the database has zeros for the "logouts". The code to record the data looked the same for both login and logout, so how could it only be recording one side?
The eventual bug was a stupid mistake on my part in the code, written 8 years ago. I was comparing a data and time value with a time field in one case because of a simple typo. For the login side I did not have the same typo. It was subtle.
The problem is that the database server used to (silently) decide that I meant to just compare the time part, and get one with it and "just worked". Now, some change in the date/time logic in the database means that it considers the comparison not to match - though not an error, so it (silently) does nothing, instead.
The fix was therefore very simple, and now we have working blip graphs. Just one of dozens of small things to check today. So, if you do see thinking no quite right on clueless, do let us know.
I hope that gives some insight in to the perils of programming.