I was pondering the concept of a zero packet loss service, following some comments on a post in ispreview. The commenter was adamant that it is impossible to provide a zero packet loss service. Of course, this was silly anyway as what we claimed is that the Ethernet service allowed us to do zero packet loss maintenance on our routers, which is not the same thing at all.
But I was pondering what was meant by a zero packet loss service anyway.
Zero is a problem, for a start. With a lot of metrics that one is trying to achieve in a service, one can design the service to exceed the required metric by more than any margin of error, so as to guarantee you achieve it. When talking of zero loss, you can't do that - there is no way to have better than zero loss, is there? So one is working against a brick wall of a target. This means you have to define a tolerance or carefully define the measurement parameters.
The closest of the services we offer to zero loss would be a point to point uncontended link. These used to be bare fibre with termination equipment (WES), but these days such links are switched at the exchange (EAD). Either way, if one has a 100Mb/s uncontended point to point Ethernet link, then that can be zero packet loss as a service. Any packet you put in one end will come out of the other end. Obviously, if you want to send 101Mb/s on a 100Mb/s link then it won't work, but it won't be the service which is dropping packets. In that case it will be your switch or computer, trying to send more data than the link can carry, that has to delay the data or drop packets in order to fit what it is sending down a 100Mb/s interface. The service can be zero packet loss.
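To put a rough number on the 101Mb/s case, here is a small sketch of how long the sending device can buffer the excess before it has to drop something. The function name and the 1MB buffer are purely illustrative assumptions, not figures for any real switch.

```python
def seconds_until_drop(offered_bps, link_bps, buffer_bytes):
    """Time until the sender's queue overflows, or None if it never does.

    Assumes a constant offered rate and a simple FIFO buffer in the
    sending device (illustrative model only).
    """
    excess_bps = offered_bps - link_bps
    if excess_bps <= 0:
        return None  # the link keeps up, so the queue never grows
    return buffer_bytes * 8 / excess_bps


# Hypothetical example: 101Mb/s offered to a 100Mb/s link, 1MB of buffer.
print(seconds_until_drop(101_000_000, 100_000_000, 1_000_000))
```

Either way, the dropping happens in the device feeding the link, not in the link itself.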
Is it really zero though? Well, the problem is that any outage whatsoever, any time, ever, in the life of the service, even for a microsecond, means the service is not zero packet loss any more. So actual zero is probably impossible. It has to be zero packet loss (when the service is working), and then have caveats on repair times for when it is not. But, within normal tolerances of Ethernet links, one can offer a zero packet loss service.
Better than zero? There is also the risk that a stray particle flips a gate on a receiver somewhere and a bit is received wrongly, so a packet is dropped. Interestingly, the newer standards for Ethernet at very high speeds have error correction, just like disk drives and indeed many communications systems these days. So actually, you end up with a case where packets get through even with a specified level of interference in the medium. In a way, this is making a system that is better than zero, in that it is still zero loss in the face of certain levels of error. Normal EAD links don't have this, but I think the FTTC VDSL does have it in some configurations, which means stating zero loss is more feasible. Sadly the FTTC is normally a shared link back-haul to the exchange, so contended, and so not something we would sell as zero loss anyway. In the future, more and more links will have inherent error correction.
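To get a feel for how rare those stray bit errors need to be, here is a minimal sketch of the chance that a frame is lost on a link with no error correction. It assumes bit errors are independent and that any single flipped bit causes the frame to be discarded. The bit error rate and frame size used are illustrative assumptions, not measurements from any of our links.

```python
import math


def frame_loss_probability(bit_error_rate: float, frame_bytes: int) -> float:
    """Chance that at least one bit in a frame is corrupted.

    Assumes independent bit errors and no error correction, so a single
    flipped bit fails the frame check and the frame is discarded.
    """
    bits = frame_bytes * 8
    # 1 - (1 - ber)^bits, computed via log1p/expm1 to stay accurate
    # when the error rate is tiny.
    return -math.expm1(bits * math.log1p(-bit_error_rate))


# Illustrative figures only: a 1e-12 bit error rate and 1500-byte frames.
p = frame_loss_probability(1e-12, 1500)
print(f"per-frame loss probability: {p:.3e}")
print(f"expected frames lost per billion sent: {p * 1e9:.1f}")
```

The point is simply that without error correction the figure is small but never actually zero, which is why the tolerance has to be defined.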
Internet services are a tad special in that Internet access is never uncontended or zero loss. We can (and do) have services that are zero loss uncontended links from customers to us, and then we connect on to the Internet. Transit providers can (and some do) offer zero loss guarantees over their transit network, and even compensate if that is not the case. But that is to their border only. The very nature of the Internet means packets to a specific end point could be lost due to congestion on a link. Thankfully we don't try and offer zero loss services over the Internet, obviously.
Zero packet loss router maintenance is what we actually claimed. This is much easier, and even industry standard. The principles are very simple indeed - you have more than one path the traffic can take (in each direction), and you ensure that traffic is switched from one path to another, so as to allow one bit of equipment to be worked on when it is carrying no traffic.
There are several means to do this, including routing protocols like BGP and OSPF, or low level protocols like VRRP. Virtual Router Redundancy Protocol is mainly used for fall-back, i.e. if something breaks, and can react within as little as 30ms (with version 3). However, it can also be used to manage which is the active router as a deliberate step as part of router maintenance. With the FireBricks we have a built-in controlled shutdown and startup sequence which means VRRP and BGP both actively change incoming traffic to the other router before rebooting to run new code. The reboot is well under a second, and the startup is sequenced to ensure we have routing for traffic before taking over as master again.
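The 30ms figure for fall-back can be sanity checked against the VRRP version 3 timers. This is a rough sketch assuming the Skew_Time and Master_Down_Interval formulas as I understand them from RFC 5798, where intervals are in centiseconds. The function name is my own for illustration and is not anything from the FireBrick code, and the arithmetic is done in floating point rather than whatever a real implementation uses.

```python
def master_down_interval_ms(advert_interval_cs: int, priority: int) -> float:
    """How long a backup waits without adverts before taking over, in ms.

    Based on the VRRPv3 timer definitions (RFC 5798):
      Skew_Time            = ((256 - priority) * Master_Adver_Interval) / 256
      Master_Down_Interval = 3 * Master_Adver_Interval + Skew_Time
    Intervals are in centiseconds (1 cs = 10 ms).
    """
    if not 1 <= priority <= 254:
        raise ValueError("backup priority must be in the range 1..254")
    if not 1 <= advert_interval_cs <= 4095:
        raise ValueError("advertisement interval must be 1..4095 centiseconds")
    skew_cs = (256 - priority) * advert_interval_cs / 256
    return (3 * advert_interval_cs + skew_cs) * 10


# Fastest advert interval (1 cs = 10 ms) with a high priority backup.
print(master_down_interval_ms(1, 254))
# Default advert interval (100 cs = 1 s) with priority 100.
print(master_down_interval_ms(100, 100))
```

With the shortest advert interval the backup takes over just past three missed adverts, which is where a figure of about 30ms comes from. That is the failure case, though. For planned maintenance the hand-over is deliberate, so this timer does not come into it.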
Whatever the technique, the trick is switching the traffic from one router to another. With routing protocols, this is part of the protocol itself - you simply change what you announce. With VRRP the switching means a different device becomes master, and it uses the VRRP MAC address to convince a switch to change where it sends packets for that MAC.
In either case you want the old router to still accept and forward traffic during the switch over. This means that the sending end can take what time it needs to do the switch. At no point is the sending end unsure where to send a packet: it is always either the old router or the new. Whichever it sends to, the packet is sent on to where it needs to go.
This means that no matter how fast the packets are flowing, no packet is lost by the switch over process. There is no fine timing and co-ordination required, as the old router can accept traffic for as long as necessary (seconds even) before the sending end switches over.
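The argument can be shown with a toy model. This is only a sketch under simple assumptions: one packet is sent per tick, the sender targets the old router until it switches, and a packet is delivered only if the router it was sent to is forwarding at that moment. The function and parameter names are made up for illustration and do not describe how any real router behaves internally.

```python
def lost_packets(total_ticks, sender_switch_tick, old_stop_tick, new_ready_tick=0):
    """Count packets lost during a switch over in a one-packet-per-tick model.

    sender_switch_tick: tick from which the sender targets the new router
    old_stop_tick:      tick from which the old router stops forwarding
    new_ready_tick:     tick from which the new router is forwarding
    """
    lost = 0
    for tick in range(total_ticks):
        if tick < sender_switch_tick:
            # Still sending to the old router.
            forwarding = tick < old_stop_tick
        else:
            # Now sending to the new router.
            forwarding = tick >= new_ready_tick
        if not forwarding:
            lost += 1
    return lost


# Old router keeps forwarding well past the point where the sender switches.
print(lost_packets(1000, sender_switch_tick=500, old_stop_tick=800))
# Old router stops first, and the sender takes 30 ticks to notice.
print(lost_packets(1000, sender_switch_tick=530, old_stop_tick=500))
```

In the first case the exact moment the sender switches does not matter, as long as it is before the old router stops. In the second, every packet sent in the gap is lost, and the faster the packets flow the more of them there are.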
Once traffic is switched off the old router it is no longer involved, and so can be worked on, rebooted, upgraded, or whatever.
So, I stand by our claim that we can do zero packet loss maintenance on our routers for our Ethernet services.