Tuesday, 20 November 2012

More than "complete" broadband?

Had an interesting complaint today - well more of a query - apparently we provided a customer with more than 100% uptime in October!

Well, you can guess what it was, I hope. It was 28th October. It has 25 hours in the day. Customers were seeing 104% uptime for October 28th.

It struck me as odd - as I am not usually that foolish, and I coded this.

This is where it gets interesting though. I checked my code. We log uptime in seconds, and we work out percentages based on local time. That means we were working on a 90,000 seconds day for 28th October. So we allowed for the extra hour in the percentage calculation... I thought I was not that foolish. So how do we see 104% uptime?

Looking in to it more, we are seeing the uptime for the day recorded at 93,600. That is 26 hours in a 25 hour day...

Arrrg!

I believe I have eventually tracked it down though.

Each hour we get RADIUS accounting. We record the change since the last hour. The "last hour" record, well, "last update", is stored in an sql database. It includes the tx and rx, byte and packet, counts and the time of the last update.

When we process the new update we can see the time and usage has moved on, and we update the stats. The uptime normally adding 3600.

The catch is that the time is recorded in local time.

At 00:00:00BST we record stats and save the time as 00:00:00.
At 01:00:00BST we see a last time as 00:00:00 and assume BST, so record 3600 seconds, and record time as 01:00:00
At 01:00:00UTC we see a last time as 01:00:00 and assume BST, so record 3600 seconds, and record time as 01:00:00
At 02:00:00UTC we see a last time as 01:00:00 and assume BST, so record 7200 seconds.
At 03:00:00UTC we see a last time as 02:00:00 and assume UTC, so record 3600 seconds.

If a customer logs out and back in between 1am and 1am then we amplify this, recording yet another hour. We had one customer with 107% uptime for that day!

The fix is simple - those fields are now assumed to be UTC. Lets hope sql has no issue with March where a time of 02:00:00 is not valid.

Just goes to show, you can give "more than 100%" :-)

P.S. Before you look, I am fixing the erroneous stats.

2 comments:

  1. Time is a wonderful way of confusing the hell out of people as they can't seem to handle the fact that there's more than one time system.

    It's for this exact reason why I always code to UTC & only convert to another system when displaying it. It solves so many issues but I still get asked "Why are the logs not in local time" gah!

    Anyhowm, there's still another spanner to throw, the leap second... ;-)

    ReplyDelete
    Replies
    1. Leap seconds are being conveniently ignored :-)
      I was tempted to use them on billing once, bills due 23:59:60
      I use UTC a lot, but some things are organised by local day, so get local time.
      I try to be careful, as you can see, but still get caught out...

      Better than "The GMT offsets shown do not reflect daylight savings adjustments"

      Delete