Thursday, 2 October 2014

BGP

This is going to be a tad technical - it makes our heads spin and we have people with a lot of experience of BGP. We don't always follow convention in what we do anyway, but we try and ensure things work well.

BGP allows routes to be announced, and there are some basics to this.

For a start, you want the BGP to connect via the same physical media as the traffic to which the BGP announcements will relate - this is basic stuff - you don't want a working control link and a failed forwarding link. So most BGP is done on a LAN of /30 or /29 between routers. That is not too hard.

You also want two routers wherever you can, in case one fails. The classic peering is two routers at each side, linked via a /29 LAN.
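As a quick check on the sizing, here is a small sketch using Python's standard ipaddress module. The 192.0.2.0 prefixes are documentation addresses picked purely for illustration. A /30 only has room for one router at each end, whereas a /29 fits two routers per side with addresses to spare.

```python
import ipaddress

def usable_hosts(prefix: str) -> int:
    """Count the usable host addresses in an IPv4 subnet."""
    return len(list(ipaddress.ip_network(prefix).hosts()))

print(usable_hosts("192.0.2.0/30"))  # 2: one router each side
print(usable_hosts("192.0.2.0/29"))  # 6: room for two routers each side
```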

But there are a few things you want to try and handle - like route reflectors. We have a LAN in Telehouse North with a lot of edge BGP routers getting transit and a lot of edge LNSs getting routes to customers, and they all need to see each other. So we run a pair of reflectors with which they all peer over IBGP. This allows them all to see the real next hop for every route. So far so good.

We also link out via HEX and Maidenhead via multiple links and use IBGP and reflect routes out.

So, all works well, but this week as the final steps in a long term project, I have been working on "clean" BGP shutdown and restart. The simple goal is to try and allow a router reload with zero packet loss for customers.

Shutdown

The first step is shutdown, and my initial idea was simple. The router being reloaded should cleanly withdraw all announcements, and continue routing for a while before shutdown.

This is fine in principle. If the main and backup routers talk to the same peer router(s) then they will immediately see the backup route when the primary is withdrawn, so not drop a packet.

The issue was where the peer or transit had multiple separate routers talking to us - the withdraw creates a black hole until it is propagated to a common point where the backup route then comes back down and it gets a new route.

The solution to this black hole is not to withdraw but to re-announce at a way lower priority (AS path prepended, MED worse, etc). This means that at each stage all of the peer's routers have a route: the low priority route goes up to a common point, where it is replaced with the backup, which then comes back down. All the time packets flow.
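To make the difference concrete, here is a minimal sketch in Python. It is a deliberately simplified best-path comparison (AS path length, then MED) and not the full BGP decision process, and the Route class, the best function and all the values are made up for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    via: str          # which of our routers announced the route
    as_path_len: int  # shorter is preferred
    med: int          # lower is preferred

def best(routes):
    """Pick the preferred route, or None if there is no route at all."""
    return min(routes, key=lambda r: (r.as_path_len, r.med), default=None)

backup = Route("backup", as_path_len=1, med=100)
# The router about to reload re-announces with a prepended path and worse MED.
depreferred = Route("primary", as_path_len=4, med=1000)

# A peer router that has not yet heard the backup route:
print(best([]))                     # None: a plain withdraw leaves a black hole
print(best([depreferred]).via)      # primary: still a usable route
# Once the backup route has propagated to it:
print(best([depreferred, backup]).via)  # backup
```

At no point in the second and third cases is the peer router left without a route, which is the whole point of announcing low priority instead of withdrawing.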

So, the final solution is a configuration option on firebricks to allow each peer to be either (a) simply closed, (b) withdrawn and a configurable delay, or (c) announced low priority and a configurable delay, before closing and rebooting.
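The three per-peer options can be sketched as follows. This is only an illustration in Python: ShutdownMode, shutdown_steps and the step descriptions are hypothetical names, and none of it is actual FireBrick configuration syntax.

```python
from enum import Enum

class ShutdownMode(Enum):
    CLOSE = "close"                # (a) simply close the session
    WITHDRAW = "withdraw"          # (b) withdraw, wait, then close
    LOW_PRIORITY = "low-priority"  # (c) announce low priority, wait, then close

def shutdown_steps(mode: ShutdownMode, delay_s: int) -> list:
    """Return the ordered steps taken for one peer before the reboot."""
    if mode is ShutdownMode.CLOSE:
        return ["close session"]
    if mode is ShutdownMode.WITHDRAW:
        first = "withdraw all announcements"
    else:
        first = "re-announce all routes at low priority"
    # In both delayed modes the router keeps forwarding while peers converge.
    return [first, f"keep forwarding for {delay_s}s", "close session"]

for step in shutdown_steps(ShutdownMode.LOW_PRIORITY, 30):
    print(step)
```

Making this a per-peer choice means a peer with a single router facing us can use the simple withdraw, while a transit with several separate routers gets the low priority announcement.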

We think we have shutdown sussed...

Start up

The next issue is startup. We need to make sure nobody sends us traffic until we know how to route it. This is one of the fundamental rules of BGP really, but it has issues.

If we announce a locally connected subnet, to which we can route, but the secondary router such as a route reflector, announces routes that ultimately resolve to that subnet from the transit routers, then we create a black hole. Even though we have not sent transit routes, we become the target for transit traffic because we have announced the local subnet.
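This black hole comes from recursive next hop resolution, which the sketch below tries to show in Python. All the addresses are documentation prefixes invented for the example, and lookup and resolve are simplified stand-ins for a real routing table, not how any particular router implements it.

```python
import ipaddress

def lookup(table, dest):
    """Longest-prefix match; returns the next hop, or None if no route."""
    addr = ipaddress.ip_address(dest)
    matches = [(net, hop) for net, hop in table.items() if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]

def resolve(table, dest, neighbours):
    """Follow next hops until one is a directly connected neighbour."""
    hop, seen = lookup(table, dest), set()
    while hop is not None and hop not in neighbours and hop not in seen:
        seen.add(hop)
        hop = lookup(table, hop)
    return hop

RESTARTING = "198.51.100.2"  # the router that has just come up
net = ipaddress.ip_network

# A remote router's view: transit routes from the reflector point at a transit
# router (192.0.2.1) on the local subnet, and that subnet was announced by
# the restarting router.
remote = {net("0.0.0.0/0"): "192.0.2.1", net("192.0.2.0/24"): RESTARTING}
# The restarting router itself only knows the connected subnet so far.
restarting = {net("192.0.2.0/24"): "connected"}

print(resolve(remote, "203.0.113.9", {RESTARTING}))  # 198.51.100.2
print(lookup(restarting, "203.0.113.9"))             # None: packet dropped
```

The remote router sends transit traffic to the restarting router purely because of the subnet announcement, and the restarting router has nowhere to send it yet.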

This was our latest startup black-hole in our testing yesterday.

Now, we have damn fast BGP. We can send a full table (500k routes) over BGP in around 4 seconds. The receiving side can process within 6 seconds, and have in the forwarding table within about 20. Obviously lots of transit and peers slows this a tad, but it is fast. The latest TCP work has ensured very fast and efficient BGP TCP handling.

But how do we solve this startup black-hole issue which could last 20 seconds? Well, the answer is VRRP.

VRRP

VRRP is great on a LAN, and the whole shutdown and startup for VRRP is well handled. We drop to low priority, and so become backup, before we even touch BGP shutdown. At startup we don't become high priority until the route forwarding is up to date, and even then we add an extra delay.
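The priority gating can be sketched roughly like this in Python. The Router class, the priority values and the timings are all assumptions made for illustration, and real VRRP has more to it (advertisement timers, tie-breaks on address) than a simple "highest priority wins".

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Router:
    name: str
    high: int                             # priority when fully ready
    low: int                              # priority when starting or stopping
    shutting_down: bool = False
    fib_ready_at: Optional[float] = None  # when forwarding became up to date

    def priority(self, now: float, extra_delay: float) -> int:
        if self.shutting_down or self.fib_ready_at is None:
            return self.low
        if now < self.fib_ready_at + extra_delay:
            return self.low  # routes installed, but still in the extra delay
        return self.high

def master(routers, now, extra_delay):
    """Highest advertised priority wins (simplified election)."""
    return max(routers, key=lambda r: r.priority(now, extra_delay)).name

a = Router("A", high=200, low=50)
b = Router("B", high=100, low=100)  # the backup, always at priority 100

print(master([a, b], now=5, extra_delay=10))   # B: A has no routes yet
a.fib_ready_at = 20
print(master([a, b], now=25, extra_delay=10))  # B: A is still in its delay
print(master([a, b], now=30, extra_delay=10))  # A: ready and delay elapsed
a.shutting_down = True
print(master([a, b], now=40, extra_delay=10))  # B: A dropped priority first
```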

The trick is to announce our locally connected subnets using a next hop of the VRRP address and not our address!

This means, as we start, all of the routes from the backup, and even the routes we announce, are all sent to the backup which is master for VRRP and knows how to route. Only when fully up and routes installed do we become master and get the traffic.
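Here is a small Python sketch of why the choice of next hop matters during startup. The addresses and the names VRRP_ADDR, OWN_ADDR, receives_traffic and dropped are invented for the example; the only assumption about the network is that the VRRP address is answered by whichever router is currently master.

```python
VRRP_ADDR = "192.0.2.1"
OWN_ADDR = {"192.0.2.2": "restarting", "192.0.2.3": "backup"}

def receives_traffic(next_hop: str, vrrp_master: str) -> str:
    """Which router on the LAN gets packets sent to this next hop."""
    if next_hop == VRRP_ADDR:
        return vrrp_master       # the virtual address follows the master
    return OWN_ADDR[next_hop]    # a real address always hits its owner

def dropped(next_hop: str, vrrp_master: str, routes_installed: bool) -> bool:
    """True if traffic lands on the restarting router before it can route."""
    target = receives_traffic(next_hop, vrrp_master)
    return target == "restarting" and not routes_installed

# During startup the backup is VRRP master and our routes are not installed.
print(dropped("192.0.2.2", "backup", routes_installed=False))     # True
print(dropped(VRRP_ADDR, "backup", routes_installed=False))       # False
# Once fully up, we take over as master and the same next hop reaches us.
print(dropped(VRRP_ADDR, "restarting", routes_installed=True))    # False
```

With the VRRP address as next hop, nothing announced over BGP needs to change when the router becomes master; the traffic simply follows the virtual address.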

The whole process is a complex sequence with interactions of BGP, core routing logic and VRRP, but it should work.

The trick is telling BGP to use its local VRRP address, in place of its own address, as the next hop!

If you cannot reload a router with zero packet loss, you are not doing it right :-)

Update: We reloaded four separate routers this morning (Saturday), each in a different part of the network. The shutdown and startup sequence looked perfect. However, running a ping through them we did see a dropped packet, and we should see none. So something was still routing to them at shutdown. It sounds like a config issue at this point, or possibly a cached route, and we're looking into it. We do now have nearly perfect router reloads, and hopefully very soon we will have this working so that we don't drop a single packet.

3 comments:

  1. For the next trick, could you look at what might be involved to allow hitless reloads of the LNS routers? Currently PPP session uptimes on A&A are typically much lower than bogons, zen, etc.

    Replies
    1. We did have that in the past - it did not scale as well as we would like. Even so, whilst uptimes are lower, the downtime is tiny - i.e. as fast as your router will reconnect (which can be well under a second with the right kit).

    2. There's another element to consider: currently, switching LNS switches the IP address of the endpoint, which happens to influence BT's internal routing of that traffic. That's how a bad BT Wholesale 10G backbone port was found by BT TSOps earlier this year (and how RevK was able to work around it for me while BT were tracking it) - if the IP address were kept constant by VRRP, we'd probably never have been able to find that BT fault to get it fixed.

      Having said that, syncing session state between a pair of Firebricks in a master-slave pair might be workable, pfsync-style, and avoid losing the session in almost every case...
