Thursday, 2 February 2012

Two steps forward, one step back

Well, today has been interesting.

First thing was that we had a report of a /16 not routing to the Internet... The result was baffling and led to finding a rather obscure bug in our BGP code when using route reflectors (yes, Jon, OSPF OSPF OSPF, I know).

Basically, there are reasons to ignore a route - the route reflector RFC specifies these (CLUSTER_LIST containing our cluster ID, ORIGINATOR_ID being us, etc.). We do this. Good.

Sadly, though, we were actually ignoring the whole update, including any withdrawn prefixes that happened to be in the same update. Bugger...
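For the curious, here is a rough sketch of what the corrected logic looks like. It is not our actual code: the names (Update, must_ignore_announce, apply_update) and the dict used as a routing table are made up for illustration, and a real implementation has a lot more to do. The point is only that withdrawals get applied regardless, and the route reflector checks only decide whether the announced prefixes are accepted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Update:
    """One BGP UPDATE, cut down to the fields that matter here (hypothetical)."""
    withdrawn: tuple = ()                 # prefixes being withdrawn
    announced: tuple = ()                 # prefixes being announced (NLRI)
    originator_id: Optional[str] = None   # ORIGINATOR_ID attribute, if present
    cluster_list: tuple = ()              # CLUSTER_LIST attribute, if present


def must_ignore_announce(update, router_id, cluster_id):
    """Route reflector loop checks: we originated it, or it has been through our cluster."""
    return update.originator_id == router_id or cluster_id in update.cluster_list


def apply_update(rib, peer, update, router_id, cluster_id):
    """Apply an UPDATE from `peer` to `rib`, a dict keyed by (peer, prefix)."""
    # Withdrawals first, and unconditionally: the path attributes in the
    # UPDATE describe the announced prefixes, not the withdrawn ones.
    for prefix in update.withdrawn:
        rib.pop((peer, prefix), None)

    # Only the announced prefixes are subject to the loop checks.
    if must_ignore_announce(update, router_id, cluster_id):
        return
    for prefix in update.announced:
        rib[(peer, prefix)] = update
```

The buggy version was, in effect, doing the loop check before the withdrawal loop and returning early, so the withdrawn prefixes never left the table.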

So, after upgrading around 15 boxes during the day (and, I am pretty sure, without losing a packet - win!), we have that fixed, and all seems fine.

Now to start seriously moving stuff over. Seems a visit to site is needed: one cable showing as unplugged?!; a DSL router to install (backup management LAN); and some nice environmental sensors to install. That will be tomorrow.

DNS resolvers all working - linked in to route reflectors as local versions of our published resolvers. In fact everything now linked to two core route reflectors. Yay!

Tonight I started allowing lines on to the new LNSs as a test - i.e. any lines that reconnected were sent to the new LNSs. We had tested a lot. We got Be, BT 20CN and BT 21CN online and working... Good!

Then a snag - at least one wholesale L2TP customer did not route back to us on the new LNSs. Some worked, some did not. So the job for tomorrow is to chase them all to ensure routing is all in place and that the new LNS IP addresses are allowed through firewalls, etc. Fun!

So lines back to existing LNSs for now. If we can sort that tomorrow we can move everyone at the weekend.

We will probably set up at least one transit and one peering link on new kit tomorrow as well. Should be pretty simple and low risk (we always say that).

Still - progress...

4 comments:

  1. Can you explain what "withdraw prefixes" are? Your concern was about a /16 not being advertised, so how are withdraw prefixes related?

  2. A BGP update includes announced and withdrawn prefixes, though they are unrelated. In this case a /16 was withdrawn but that was not seen because of the bug (the withdrawal was sent in the same update as an announcement that was to be ignored under route reflector rules), so the route stayed in the table. The route then came back, but as a longer prefix. The result was a bogus route in the route reflectors that was the shortest path, causing traffic to loop rather than go where it should.

  3. I know, I know... We have several really subtle things to change in the code first, including stuff to handle non-standard BGP.

    But I can take a hint re OSPF, honest.
