Saturday, 8 February 2014

What is Packet Loss?

The Internet uses a system of packets to send information. This means that whatever you are doing, whether accessing Facebook, making a Skype call, playing an on-line game, downloading a file or reading an email, the information is broken down into packets. These are not always the same size, and typically carry up to around 1500 bytes (or characters) of data at a time.

Each of these packets carries some addressing information, and some data. The fact that packets are used means it is possible to have lots of things happening at once, with bits of one thing in one packet followed by bits of something else in another packet and so on, mixing up multiple things on one Internet connection. This is how it is possible for lots of people to use an Internet connection at once. The addressing data in the packet makes sure the right things go to the right place and are put back together at the far end.

This is all very different to old fashioned phone calls which work on circuits. They work by creating a means to send data (e.g. voice) continuously at a specific speed between two points, reserving the capacity for that link for the duration of the call. You either manage to establish the call (the circuit), or not, at the start. Once you have it, you have the circuit in place until you finish. It is a very different way of working to packets.

One of the problems you get is where a link of some sort gets full.

With a circuit based system like phone calls a full link (i.e. one already carrying as many calls as it can) will mean you get an equipment engaged tone. The call fails to start.

However, with a packet based system, when a link gets full you start with a queue of packets waiting to go down the link (adding delay) and ultimately you drop packets. That means the packets are thrown away. This can, and does, happen at any bottleneck anywhere in the Internet. The most likely place is the point where the Internet connects to your own Internet connection, which creates a bottleneck.
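
As a rough illustration of the queue-then-drop behaviour, here is a minimal sketch in Python. It assumes the simplest possible policy (a fixed-length first-in, first-out queue that throws away anything arriving when it is full), and the function name, the "tick" time unit and the numbers are all made up for the example. Real routers are rather more subtle.

```python
from collections import deque


def simulate_link(arrivals_per_tick, capacity_per_tick, queue_limit):
    """Simulate a bottleneck link with a bounded FIFO queue (tail drop).

    arrivals_per_tick: list of packet counts arriving in each time tick.
    capacity_per_tick: packets the link can send per tick.
    queue_limit: maximum packets that can wait in the queue.
    """
    queue = deque()          # holds the arrival tick of each waiting packet
    sent = dropped = 0
    max_delay = 0
    for tick, arrivals in enumerate(arrivals_per_tick):
        for _ in range(arrivals):
            if len(queue) < queue_limit:
                queue.append(tick)   # room in the queue: packet waits (delay)
            else:
                dropped += 1         # queue full: packet is thrown away
        for _ in range(min(capacity_per_tick, len(queue))):
            arrived = queue.popleft()
            max_delay = max(max_delay, tick - arrived)
            sent += 1
    return {
        "sent": sent,
        "dropped": dropped,
        "max_delay_ticks": max_delay,
        "still_queued": len(queue),
    }


if __name__ == "__main__":
    # Link can carry 10 packets per tick; queue holds 20.
    print("half full:", simulate_link([5] * 10, 10, 20))
    print("over full:", simulate_link([15] * 10, 10, 20))
```

With the link half full nothing is delayed or dropped. Offer it more than it can carry and the queue fills first (delay), then packets start being dropped.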

So packet loss is normal. It is what happens when a link is full.

The result of this packet loss depends on the protocol. The overall effect on any sort of data transfer, such as downloading a file, or sending an email, is that the transfer happens at a slower speed. The end points send packets of data at a slower speed so that they don't get dropped packets. Importantly, with a lot of protocols, the missed packets are re-sent which means the data does not have gaps in it.

Some protocols do not allow resending or slowing down. These include things like VoIP calls (Skype, for example), where you can't slow down a phone call. What happens in such cases is you get gaps in the call: break-up, pops, etc.

Some systems are clever and decide which packets to drop when a link is full, giving protocols like VoIP a chance to get through and dropping packets for protocols that can back-off if needed. We do this in A&A, for example.

However, there is another scenario where you can get packet loss, and this is where there is a fault. In the case of a fault you will find some packets are dropped at random. What usually happens is some of the data in the packet is corrupted (changed) by random noise or errors from the fault, and this means that the packet no longer checks out when it gets to the other end. Packets have built in checks to confirm nothing was changed, and if that check fails the packet is dropped.

The effect of fault based packet loss depends on the protocol.

For protocols like VoIP, the dropped packet simply means break up in the call. Even a low level of packet loss can mean annoying pops and gaps in the call.

For protocols that can back off and slow down, well, that is what they do. They cannot tell that the packet loss is the result of a fault and not of a full link, so they slow down. But even when they slow down, they still get packet loss as it is random. So they slow down even more. They don't understand the problem, and just assume that a link must be getting full no matter how slow they go.

Imagine driving a car with no speedo, but with a light that says "driving too fast". That is fine: when you see the light, you slow down, and you stop seeing the light. That means you drive at the right speed. But if the light is faulty and keeps saying "driving too fast" at random, you will slow down, and still see the light, so slow down more, and before you know it you are crawling along at walking speed.

This means that even low levels of random packet loss can massively slow down data transfers.

Packet loss when a link is otherwise idle is a fault.

The problem is that when you measure packet loss you do not always know if the link is full or not. Your tests of packet loss, usually a protocol called ping, could be losing packets because a link is full sending an email, or it could be losing packets because of a fault.

The key is to measure packet loss when a link is otherwise empty of traffic, so that the only reason to drop packets is because of a fault.

The other problem with measuring loss is how you measure it. The normal measure is percentage loss: if you send 100 packets, how many arrive and how many are lost? This is fine, but random corruption causing loss will have a much higher chance of causing a packet to be lost if the packet is bigger. So you have to look at packet loss and packet size. From this you can work out a rate of corruptions on a link and predict the loss for other packet sizes.
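
To make that concrete, here is a minimal sketch of the sums in Python. It assumes the corruption is truly random and independent for each byte, so a packet survives only if every one of its bytes survives, and it ignores header overheads. The function names and the example sizes are mine, purely for illustration.

```python
def per_byte_survival(loss_fraction, packet_bytes):
    """Chance that a single byte gets through uncorrupted.

    Assumes independent random corruption, so a packet of n bytes
    survives with probability s**n, where s is the per-byte chance.
    """
    if not 0.0 <= loss_fraction < 1.0:
        raise ValueError("loss_fraction must be in [0, 1)")
    if packet_bytes <= 0:
        raise ValueError("packet_bytes must be positive")
    return (1.0 - loss_fraction) ** (1.0 / packet_bytes)


def predicted_loss(loss_fraction, measured_bytes, target_bytes):
    """Predict loss at target_bytes from loss measured at measured_bytes."""
    survival = per_byte_survival(loss_fraction, measured_bytes)
    return 1.0 - survival ** target_bytes


if __name__ == "__main__":
    # Hypothetical measurement: 0.1% loss seen on 64 byte test packets.
    for size in (64, 500, 1500):
        loss = predicted_loss(0.001, 64, size)
        print(f"{size:5d} bytes: {loss * 100:.2f}% predicted loss")
```

Under these assumptions a loss of 0.1% on small 64 byte packets works out at a little over 2% on 1500 byte packets. A fault that is not random (a bug that only hits certain sizes, say) will not follow this rule.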

The best measure of loss as a simple percentage is the loss when sending full size packets (1500 bytes) which is what the data transfer protocols (like TCP) use. Even 1% or 2% loss of such packets can cause TCP to slow down massively. It does not work like taking away a couple of percent of speed - the data transfers keep slowing down as they keep thinking the line must be full.

2% loss is not like 98% working speed!
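
For a feel of the numbers, there is a well known rule of thumb (the Mathis approximation) for the ceiling on a single classic TCP transfer: roughly (MSS / RTT) x (1.22 / sqrt(loss)). The sketch below simply evaluates it. It is only an approximation for older Reno-style TCP, newer congestion control will behave differently, and the 1460 byte segment size and 20ms round trip time are values I have assumed for the example.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_seconds, loss_fraction):
    """Approximate ceiling on one TCP flow's throughput, in bits per second.

    Uses the Mathis et al. approximation for Reno-style TCP under
    random loss: rate <= (MSS / RTT) * (C / sqrt(p)), C = sqrt(3/2).
    """
    if not 0.0 < loss_fraction <= 1.0:
        raise ValueError("loss_fraction must be in (0, 1]")
    if rtt_seconds <= 0 or mss_bytes <= 0:
        raise ValueError("mss_bytes and rtt_seconds must be positive")
    c = math.sqrt(3.0 / 2.0)
    return (mss_bytes * 8 / rtt_seconds) * c / math.sqrt(loss_fraction)


if __name__ == "__main__":
    # Assumed example values: 1460 byte segments, 20ms round trip.
    for loss in (0.0001, 0.001, 0.01, 0.02):
        mbps = mathis_throughput_bps(1460, 0.020, loss) / 1e6
        print(f"{loss * 100:5.2f}% loss: about {mbps:.1f} Mbit/s at best")
```

With those assumed figures, 2% loss limits a single transfer to roughly 5 Mbit/s however fast the line is. Note that the limit goes with the square root of the loss, so four times the loss means half the speed.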

A simpler, and less intrusive, measure of loss is a short LCP echo. LCP echoes are a normal part of most Internet links, and A&A do them every second and record the loss for every line. This is only a few bytes, and so packet loss that is a fraction of a percent could mean several percent at full packet sizes. This is why it is so important to take even very low levels of LCP echo loss seriously.

This is why packet loss needs to be a clear metric of quality and faults and why companies like BT need documented packet loss measures that are considered a fault. For some inexplicable reason such a simple metric is not part of any service level guarantee, and not considered a "fault" by BT!

Oddly, buying transit, which means sending and receiving packets from thousands of places all around the world (not just exchanges in the UK) and even laying cables under the ocean, one can get a service level guarantee of ZERO packet loss ever. This shows how seriously transit providers take such things. They even guarantee latency (the time taken to transfer packets). Even more oddly, such services are typically around a 50th of the cost of BT's connectivity to exchanges around the UK where no service level guarantee exists for packet loss. It is a strange world we live in sometimes, isn't it?


  1. I'd be wary of predicting packet loss of one size packet to another. This is because packet loss due to a bug can be packet size dependent, for two reasons: 1) there can be different code paths for different sized packets, and 2) there can be a different code path when you get near the end of a circular buffer, and the alignment of packet to buffer is affected by its size.
    This isn't a theoretical comment: this kind of bug has been present in DSLAMs which you probably have had to work with.
    Of course, that's not the only kind of bug. A DSL line failing to adapt when the noise level has increased will act in the way you describe.

    1. Yes, bug based loss is another category I did not go in to here and can be a bugger to find.

    2. I had wondered about bugs, particularly since the loss seems to be upstream only - mis-measuring the line and giving a "perfect" sync wrongly, or just mangling certain packets like the HG612 bug RevK found last year. That, or the extensive construction work around the cabinet recently (it's on one end of a bridge which has just been replaced!) having damaged it or its backhaul fibres.

      Interestingly, while (some of) BT seems to dismiss it as trivial at present, packet loss and latency are the key measurements the Samknows devices track for Ofcom's official broadband monitoring programme, along with measured throughput and link uptime, even publishing comparisons of ISPs on these metrics, so BT can expect extra pressure to reconsider this attitude soon from other ISPs who fear being penalised in ratings for it.

    3. I am wondering if it is a "dirty fibre". The uplink will typically be two fibres, one for tx, one for rx, so loss in one direction is perfectly sensible. Even when using one fibre for both ways, the possibility of a faulty emitter or receiver can give one way errors. The fact that LCP is showing so much lower loss suggests it is a raw bit level fault rather than a packet level bug, IMHO.

  2. Then there is packet loss due to deliberate action by the ISP, throttling certain connection types...(not suggesting AA do this!) this is a variant of "link full" but could certainly happen on an idle link.

    Something I would add to your explanation is that the corrupted packet will not be delivered all the way to the end point as the checksum is verified at each router along the way.
    It may be possible by playing with a variant of traceroute to see approximately which section of the link has the most traffic loss....

    1. Throttling shouldn't impact an idle link, unless they're trying to throttle traffic down below the c. 10 bytes per second of the 1 Hz LCP echo A&A use - and even then, only if the ISP is trying to throttle their own link to the customer, rather than the customer's IP traffic as would normally happen.

      I'd like to think no ISP would try throttling "broadband" down below 80 bps in either direction!

      Of course, even when the end user's link is idle, there will be some traffic on the other network sections - something A&A specifically monitor for and report - so if you were to see packet loss on, say, an idle Plusnet line, it could just indicate that Plusnet's own connection to BT is congested.

      Unfortunately, the use of PPPoE means there is only a single visible network hop between end users and the ISP: either the PPPoE packets get through that hop, or not. A&A then do some detective work to analyse which groups of links are losing packets: all the lines on a particular exchange, for example, or all the lines through a particular BRAS.

      In my case, we can tell the problem is not between my exchange's Ethernet switch and A&A, because packets to/from other customers on the same switch are getting through OK. Unfortunately, I'm the only A&A customer on this cabinet, so there's no easy comparison to rule that out without help from BT.

    2. Clearly A&A providing better insight into a problem than the average ISP again! :-)

      With respect to throttling I was referring to the dubious practice of deliberately targeting specific destinations, e.g. I read about connections to AWS being hit by a particular ISP's policy.... Just giving a more general category for Andrew's excellent piece covering the issue.

      I've always been a bit suspicious regarding the backhaul on the BT network from a local exchange... nothing A&A can do about it... but I do wonder if a quiet rural exchange like mine which missed out on 21CN and now is being listed as FTTC this calendar year (believe it when I see it!) really has appropriate bandwidth upstream.....

      Good luck with getting your link fixed :-)

    3. It's believable that 21CN backhaul (which is a requirement for FTTC) is much, much larger than 20CN backhaul to the same exchange.

      An exchange with 34 Mbit/s backhaul in 20CN can (for similar costs) have 1 gigabit/s 21CN backhaul; 155 Mbit/s 20CN backhaul ends up closer to 10 gigabit/s 21CN backhaul in costs to BT, and 622 Mbit/s 20CN backhaul costs BT the same as somewhere between 40 and 100 gigabit/s 21CN backhaul.

      This is a huge disparity in costs; it's why the 21CN project happened at all - basically, the cheapest 21CN backhaul link was larger than the majority of 20CN links.

  3. Could you ask someone to take a look at my constant packet loss? many thanks -cwcc@a

    1. I have passed this on to support for you.

  4. What do you use to issue the LCP requests? Is there a command line tool like ping which does similar?

    1. You can only do that from a PPP endpoint. It is normal for the PPP endpoint to send them periodically.