Sunday, 15 July 2012

Zero packet loss

I was pondering the concept of a zero packet loss service, following some comments on a post on ispreview. The commenter was adamant that it is impossible to provide a zero packet loss service. Of course, this was rather beside the point, as what we actually claimed is that the Ethernet service allowed us to do zero packet loss maintenance on our routers, which is not the same thing at all.

But I was pondering what was meant by a zero packet loss service anyway.

Zero is a problem, for a start. With a lot of metrics that one is trying to achieve in a service, one can design the service to exceed the required metric by more than any margin of error so as to guarantee you achieve it. When talking of zero loss, you can't do that - there is no way to do better than zero loss, is there? So one is working against a brick wall of a target. This means you have to define a tolerance or carefully define the measurement parameters.

The closest one could consider the services we offer to zero loss would be a point to point uncontended link. These used to be bare fibre with termination equipment (WES), but these days such links are switched at the exchange (EAD). Either way, if one has a 100Mb/s uncontended point to point Ethernet link, then that can be zero packet loss as a service. Any packet you put in one end will come out of the other end. Obviously, if you want to send 101Mb/s on a 100Mb/s link then it won't work, but it won't be the service which is dropping packets. In that case it will be your switch or computer trying to send more data, which has to delay the data or drop packets in order to fit what it is sending down a 100Mb/s interface. The service can be zero packet loss.
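To make that concrete, here is a toy Python model (the link rate, queue size, and tick counts are all invented for illustration): the link itself forwards every packet it is handed, and any drops happen in the sender's own queue when it offers more than the link rate.

```python
# Toy model: an uncontended link forwards every packet handed to it.
# Drops only happen at the *sender* when offered load exceeds link rate.
# All numbers are illustrative, not taken from any real service.

LINK_RATE = 100      # packets per tick the link can carry
QUEUE_LIMIT = 50     # sender's finite transmit queue

def simulate(offered_per_tick, ticks):
    queue = 0
    sent = dropped = 0
    for _ in range(ticks):
        queue += offered_per_tick
        # Sender's queue overflows: these drops are the sender's, not the link's.
        if queue > QUEUE_LIMIT + LINK_RATE:
            dropped += queue - (QUEUE_LIMIT + LINK_RATE)
            queue = QUEUE_LIMIT + LINK_RATE
        tx = min(queue, LINK_RATE)   # the link carries tx packets, losing none
        queue -= tx
        sent += tx
    return sent, dropped

# Offer 101 packets/tick on a 100/tick link: the sender ends up dropping,
# while the link still delivers every packet it was actually given.
print(simulate(101, 1000))
print(simulate(100, 1000))
```

Running it shows the sustained overload case accumulating drops at the sender, while offering exactly the link rate loses nothing.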

Is it really zero though? Well, the problem is that any outage whatsoever, any time, ever, in the life of the service, even for a microsecond, means the service is not zero packet loss any more. So actual zero is probably impossible. It has to be zero packet loss (when the service is working), and then have caveats on repair times for when it is not. But, within normal tolerances of Ethernet links, one can offer a zero packet loss service.

Better than zero? There is also the risk that a stray particle flips a gate on a receiver somewhere and a bit is received wrongly, so a packet is dropped. Interestingly, the newer standards for Ethernet at very high speeds have error correction, just like disk drives and indeed many communications systems these days. So actually, you end up with a case where packets get through even with a specified level of interference in the medium. In a way, this is making a system that is better than zero, in that it is still zero loss in the face of certain levels of error. Normal EAD links don't have this, but I think the FTTC VDSL does have it in some configurations, which means stating zero loss is more feasible. Sadly the FTTC is normally a shared link back-haul to the exchange, so contended, and so not something we would sell as zero loss anyway. In the future, more and more links will have inherent error correction.
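As a sketch of the error-correction idea, here is a toy 3x repetition code in Python. Real Ethernet FEC uses far stronger codes (Reed-Solomon variants in the 100G-and-up standards), but the principle is the same: the receiver recovers the original bits despite a bounded level of errors in the medium.

```python
# Toy forward error correction: a 3x repetition code with majority vote.
# Real link-layer FEC is much stronger, but the idea is identical:
# a corrupted transmission can still yield the correct packet.

def encode(bits):
    # Each bit is transmitted three times.
    return [b for b in bits for _ in range(3)]

def decode(coded):
    out = []
    for i in range(0, len(coded), 3):
        triple = coded[i:i + 3]
        out.append(1 if sum(triple) >= 2 else 0)  # majority vote per bit
    return out

data = [1, 0, 1, 1]
coded = encode(data)
coded[4] ^= 1                  # a "stray particle" flips one bit in transit
assert decode(coded) == data   # still recovered: zero loss despite the error
```

A single flipped bit per codeword is always corrected here; the real codes tolerate far denser error patterns for far less overhead.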

Internet services are a tad special in that Internet access is never uncontended or zero loss. We can (and do) have services that are zero loss uncontended links from customers to us, and then we connect on to the Internet. Transit providers can (and some do) offer zero loss guarantees over their transit network, and even compensate if that is not the case. But that is to their border only. The very nature of the Internet means packets to a specific end point could be lost due to congestion on a link. Thankfully we don't try and offer zero loss services over the Internet, obviously.

Zero packet loss router maintenance is what we actually claimed. This is much easier, and even industry standard. The principles are very simple indeed - you have more than one path the traffic can take (in each direction), and you ensure that traffic is switched from one path to another, so as to allow one bit of equipment to be worked on when it is carrying no traffic.

There are several means to do this, including routing protocols like BGP and OSPF, or low level protocols like VRRP. Virtual Router Redundancy Protocol is mainly used for fall-back, i.e. if something breaks, and can react within as little as 30ms (with version 3). However, it can be used to manage which is the active router as a deliberate step as part of router maintenance. With the FireBricks we have a built-in controlled shutdown and startup sequence which means VRRP and BGP both actively change incoming traffic to the other router before rebooting to run new code. The reboot is well under a second, and the startup is sequenced to ensure we have routing for traffic before taking over as master again.
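To illustrate the VRRP side of this, here is a much-simplified Python sketch (not FireBrick's actual code, and ignoring timers, the virtual MAC, and the real packet format): the highest-priority advertising router is master, and a controlled shutdown just means withdrawing advertisements before rebooting, so the peer is already forwarding when you go down.

```python
# Illustrative sketch of VRRP-style mastership, not any vendor's real code.
# The router with the highest advertised priority owns the traffic; a
# planned reboot withdraws first, so there is never a moment with no master.

class VrrpRouter:
    def __init__(self, name, priority):
        self.name = name
        self.priority = priority
        self.advertising = True

def master(routers):
    # Highest-priority advertising router is master for the virtual address.
    live = [r for r in routers if r.advertising]
    return max(live, key=lambda r: r.priority)

a = VrrpRouter("router-a", priority=200)
b = VrrpRouter("router-b", priority=100)

assert master([a, b]) is a   # a is master in normal operation

# Controlled shutdown: a withdraws *before* rebooting, so b is already
# master (and forwarding) by the time a actually goes away.
a.advertising = False
assert master([a, b]) is b

# After reboot, a only resumes advertising once its routing is ready,
# at which point it deliberately reclaims mastership.
a.advertising = True
assert master([a, b]) is a
```

The real protocol does this with multicast advertisements and a shared virtual MAC, but the ordering of the steps is the part that matters for zero loss.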

Whatever the technique, the trick is switching the traffic from one router to another. With routing protocols, this is part of the protocol itself - you simply change what you announce. With VRRP the switching means a different device becomes master, and it uses the VRRP MAC address to convince a switch to change where it sends packets for that MAC.

In either case you want the old router to still accept and forward traffic during the switch over. This means that the sending end can take what time it needs to do the switch. At no point is the sending end unsure where to send a packet, it is always either the old router or the new. Whichever it sends to, the packet is sent on to where it needs to go.

This means that no matter how fast the packets are flowing, no packet is lost by the switch over process. There is no fine timing and co-ordination required, as the old router can accept traffic for as long as necessary (seconds even) before the sending end switches over.
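That ordering guarantee can be captured in a tiny model (the timings are invented): as long as the old router keeps forwarding until after the sender has switched, every packet has somewhere to go at every instant.

```python
# Tiny model of the make-before-break invariant. A packet at a given tick
# is delivered if whichever router the sender currently targets is still
# forwarding. Tick values are arbitrary illustration, not real timings.

def delivered(tick, sender_switch, old_router_off):
    target = "new" if tick >= sender_switch else "old"
    if target == "old":
        return tick < old_router_off   # old router still forwarding?
    return True                        # new router is up throughout here

# Correct ordering: sender switches at t=10, old router only stops at t=15.
# There is overlap, so every packet in the window is delivered.
assert all(delivered(t, sender_switch=10, old_router_off=15)
           for t in range(30))

# Wrong ordering: old router gone at t=5, sender still using it until t=10.
# Packets in the gap are lost, which is exactly what the sequencing avoids.
assert not all(delivered(t, sender_switch=10, old_router_off=5)
               for t in range(30))
```

The overlap window can be seconds long, which is why no fine-grained co-ordination is needed.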

Once traffic is switched off the old router it is no longer involved, and so can be worked on, rebooted, upgraded, or whatever.

So, I stand by our claim that we can do zero packet loss maintenance on our routers for our Ethernet services.


  1. You have fallen for the troll. That guy is never happy and won't listen to other people's facts.

    1. You are probably right - but it is useful for people to get a handle on what can and cannot be done. Contention, Congestion, and Packet Loss are complicated subjects for most people to grasp.

  2. P.S. testing with simple pings to customer on Ethernet FTTC from outside (so BGP for replies) doing a standard router s/w upgrade and re-boot, showed no loss. We may test with faster pings some time, but basically, at each step, the pings still answered so should be zero loss.

  3. Yes unfortunately we do have one particularly difficult individual who keeps churning up news comments like there's nothing better for him/her to be doing. Generally it's best just to ignore those people but, setting that aside, it's good that you managed to turn one of their rants into a constructive article that examines the issue in quite some depth.

  4. Isn't a zero packet loss service one provided over TCP?

  5. DT: Packet loss is actually the signalling for TCP congestion control. The transport protocol used doesn't determine whether packets can be dropped, though it (and the specifics of its implementation) can influence how likely that is to happen on a given link.