Friday, 24 April 2015

What is a broadband fault?

Exterminate the orc?
Continuing a theme here, but my mate Mike came over. One of the things he is rather good at is playing devil's advocate, and he made some really good arguments on the issue of SFI charges and faults. I'd like to thank him for that, oh, and for the dalek he brought over to surprise us!

The crux of his point is that the very definition of a broadband service is that it is rate adaptive: it is a service that will try to get the best performance it can on the voice grade copper pair to which it is attached. It is possible for the characteristics of a voice grade copper pair to be so bad (a very long line) that it cannot do broadband at all, and that would be tough. Basically, the performance of the broadband line is whatever it is. Indeed, a comment on my previous post about TalkTalk raised this very point.

He has a very good point, and it has allowed me to hone the arguments I can use against such points with the likes of BT and TalkTalk.

Yes, the broadband is what it is, "as well as it can do" on the voice grade copper pair to which it is attached. We do accept that to some extent. There are caveats - in that the copper pair will have a forecast speed, and if the line is way out of that, then that probably means something is wrong. But in general, you get what you get. However, this relates to the rate of the broadband, as it is rate adaptive.

I would argue that there is a huge difference between a line that runs at a low rate because that is all the line can do (even if the rate has reduced a bit over time due to increases in crosstalk from other lines and so on) and an actual fault.

The issue comes down to how you define a fault in the broadband. This is different to how you define a fault in the voice grade copper pair.

One of the key aspects in defining a fault - perhaps the key aspect - is that a fault can be fixed. It is a condition resulting from some aspect of the copper pair not being as good as it could be, given the length and routing of the pair. Faults could be down to poor joints, corrosion, water ingress, poor insulation, bad electrical contacts or degraded materials. They could be fixed by using a different pair, or by repairing joints or equipment.

So how do you spot a fault? Well, BT plc t/a BT Wholesale actually defined one way of doing this. They monitor the line in the first 10 days and find the maximum stable rate for the line, and set a fault threshold rate (FTR) for the line based on that (with a percentage taken off). This defines a sync speed below which the line is considered to have a broadband fault. That is excellent as it gives a crystal clear metric which we can all agree defines a fault. TalkTalk don't do this, but it would not be hard for us to do the same based on sync speed history and agree a metric with them.
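The FTR idea is simple enough to sketch in a few lines. This is a hypothetical illustration of the principle, not BT's actual calculation: the 20% margin and function names here are assumptions for the sake of the example.

```python
def fault_threshold_rate(sync_history_kbps, margin=0.20):
    """Return the sync rate (kbps) below which the line would be
    considered to have a broadband fault: the maximum stable rate
    seen in the initial monitoring period, with a percentage off.
    The 20% margin is an invented figure, not BT's."""
    max_stable = max(sync_history_kbps)
    return max_stable * (1 - margin)

def has_rate_fault(current_sync_kbps, sync_history_kbps):
    """True if the current sync rate has dropped below the FTR."""
    return current_sync_kbps < fault_threshold_rate(sync_history_kbps)
```

So a line that trained at up to 7800 kbps in its first 10 days would get a threshold of 6240 kbps, and syncing at 4000 kbps later would count as a fault, while 7000 kbps would not.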

Another way to spot a fault would be to consider a line that loses sync lots of times. To be honest, once a day is too often. Normal lines stay in sync indefinitely and would only lose sync due to some external interference, a power blip or some such. So frequent loss of sync should be a fault. Bear in mind, even a long and highly lossy line should stay in sync, albeit at a slower speed.
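A resync-frequency check could look something like this sketch, which counts sync losses in the last 24 hours against a limit. The once-a-day threshold follows the argument above; the function and parameter names are made up for illustration.

```python
from datetime import datetime, timedelta

def too_many_resyncs(resync_times, now, per_day_limit=1):
    """True if the line lost sync more than per_day_limit times in
    the 24 hours before `now`. resync_times is a list of datetimes
    recorded whenever the line re-trained."""
    window_start = now - timedelta(hours=24)
    recent = [t for t in resync_times if t >= window_start]
    return len(recent) > per_day_limit
```

Three resyncs in a day would flag a fault; a single resync would not.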

We also have metrics in terms of packet loss. Whilst we measure latency, that is not normally a clue to any sort of line/sync issue, and is more likely to be backhaul congestion (or a router moving enough traffic to build up a queue). I think packet loss (when there is no traffic) is a really good indicator of a fault. Also, if the line has any internal error metrics (OAM frames, header errors, FEC errors), that is an indication of a fault.

Indeed, the DSLAM can almost certainly report if there are lots of bit swaps, problem frequency bins, FEC errors, HEC or other error stats that indicate a fault, as opposed to simply being a poor/long line.
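Combining quiet-period packet loss with DSLAM error counters gives a crude fault indicator along these lines. This is a sketch only: the 1% loss threshold, the CRC-errors-per-hour limit and the counter names are all invented here, as neither BT nor TalkTalk publish such metrics.

```python
def quiet_period_loss_pct(sent, received):
    """Packet loss percentage from probes sent while the line is
    otherwise idle (so congestion is ruled out)."""
    return 100.0 * (sent - received) / sent

def indicates_fault(stats, loss_threshold_pct=1.0, crc_per_hour_limit=10):
    """stats: dict of probe counts and 24-hour DSLAM error counters.
    Thresholds are illustrative assumptions, not published figures.
    A long but healthy line syncs low and clean; sustained CRC
    errors or real loss on an idle line suggest something fixable."""
    loss = quiet_period_loss_pct(stats["probes_sent"],
                                 stats["probes_received"])
    crc_rate = stats.get("crc_errors_24h", 0) / 24.0
    return loss > loss_threshold_pct or crc_rate > crc_per_hour_limit
```

The point of the CRC rate is exactly the distinction above: it separates a line with a fault from a line that is simply long and slow.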

Unfortunately, neither BT nor TalkTalk define metrics for packet loss, or line errors or re-syncs as a fault metric. I think they should.

Fortunately, when there is a fault, the levels are pretty clear cut. It is rare for a line to have a "tiny bit of loss", though not impossible.

In practice, the fact that the broadband has a fault of some sort, based on our monitoring, or the DSLAM stats, is not itself usually the issue. Usually we can agree that there is an issue with the broadband. We can also usually agree that the copper pair meets the voice spec, when it does.

One of the key things that tells us we have a fault is when a line that used to work just fine (at whatever speed it had) now has problems, by any of those metrics, that it did not have before. That suggests it can be fixed and made to work the way it used to by some means. We know what is possible at that point.

I think we just need to agree ground rules with providers for what is and is not a broadband fault, and pin down that they are responsible for broadband faults, not us.

So let's start the discussions with them, and try to make things better. Well done TalkTalk for being the ones wanting to talk at this stage. Well, either that or we send round the exterminator!

Did not fit through the door


  1. I've a little sympathy for BT but they do make life difficult for themselves. Sympathy in the sense that the copper pair, only ever envisaged for voice, has been asked to support ever increasing amounts of bandwidth/data. BT engineers are now expected (or should be) to understand RF carrier systems and I think the reality is that they don't. To some extent BT have probably not had their expectations realistically set by the broadband equipment manufacturers. The bottom line is that the rate achievable on any particular pair is down to pair quality properties (which BT have some control over) and local RF interference (which they don't!). As you say, it is rate adaptive... ADSL Max went as fast as it could; before that things were banded.

    BT are exceptionally poor at monitoring & aggregating line performance issues, and make life difficult for themselves, ISPs and end users. As a (now ex) AAISP customer I'm exceptionally grateful for the effort put in by you in the early days of my moving to the current property some 5 years ago. My line graph still appears in your support pages as the backdrop for a line with problems. It took an incredible amount of effort to get BT to get an RF aware team out to track down a local rogue RF source that was blasting all over broadband frequencies across phone & power lines (REIN, as they call it). Packet loss was intermittently very high (and actually varied with packet size) and the sync rate dropped like a stone. BT insisted that REIN is a rare event and that if this was REIN lots of people would be complaining. Lots of people were... but all to different ISPs.

    I've subsequently had 3 more REIN events in the local area, averaging 1 every 2 years: a faulty Sky Italia satellite box PSU was the first, a faulty Thomson Sky+ PSU the second, and most 'amusingly' the third was a faulty ADSL router PSU... that family in particular were at their wits' end dealing with their ISP trying to get their broadband service up & running. REIN being rare is bollox, simple as. All these cases affected many households but no one at BT joined the dots, collated all the calls and thought 'hmmm, REIN?'. Some of that I'm sure is down to ISPs raising the 'SFI charge' flag.

    BT could help themselves out here by monitoring ADSL line parameters (sync, SNR, error rates) and aggregating changes by geographical location. They could even keep a log of such figures to act as a baseline for fault investigation & new line installs. But ultimately, when BT provide a broadband service they confirm the status of that service using the wrong measures. It's not really down to pair quality, it's about how well, or even whether, the line passes data. Latency, packet loss/corruption and bandwidth (in that order) are the real measures. SFI doesn't test the correct things.

  2. TalkTalk have an extra issue here, however, which is that they give you control of the line profile.

    If you choose a 6 dB fast path profile, and I have a local noise source that cuts in and out, but adds 8 dB of noise for up to 10 ms as it starts up and shuts down, you're going to have packet loss, if not loss of sync, every time that local noise source kicks in.

    You (or your customer) could trivially fix it by either (a) moving to a 9 dB fast path profile if latency is more important than throughput, or (b) moving to an interleaved profile with at least 16 ms of interleaving. However, any DSLAM level monitoring is simply going to show CRC, HEC, LOM and LOS errors; exactly the same as it would show if 6 dB fastpath was reasonable for this line, but there's a loose joint at the exchange that causes problems every time a HGV thunders past.

    Add in things like the annual cheap Christmas lights problem, and you've got a deep issue for the wholesaler; for every complaint where there really is a fault, you get several where the issue is either a bad choice of profile by the customer, or a local interferer added by the customer who's unaware of RF issues with their new toy.

    The first is fixable - compelled DLM before you are allowed to report a fault, and the DLM's profile choice is final. The second is harder to cope with - what do you do about the customer who decides that the modem lead is "unsightly" along the wall, and can best be improved if he covers it with Christmas lights?