Wednesday, 1 February 2012

Ooops

Well, what can I say - sorry to customers for the blip Tuesday evening. In fact there were a few "issues" in the late afternoon and then something of a more major "blip" lasting around 15 minutes just before 7pm.

So time to 'fess up as to what actually happened. It was us this time.

As per planned work notices we are in the middle of a major network upgrade. We have 10 shiny new routers/LNSs in the new rack and we are gradually moving things over. We want to ensure we are not the bottleneck and this means a bigger network that goes over a gigabit on various links.

One of the first steps is bringing these new routers on to our existing network. This means establishing some internal BGP links. Once this is done we can move various of the external links from one part of the network to the other in controlled steps. We are using IBGP not OSPF for various historical reasons and to date it has done what we want perfectly - we understand BGP quite well (or so we thought).

However, the main downside of IBGP is you have to mesh all of your routers. Not a problem when you have 4 of them, but when you have 10, and when connecting to the existing 4, that is a lot of BGP sessions. This is why internal routing protocols like OSPF win in such cases.
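To put rough numbers on that, here is a minimal sketch of the full mesh arithmetic. It assumes every router peers with every other router exactly once, and that the total is the 4 existing routers plus the 10 new ones. The function name is made up for illustration.

```python
def full_mesh_sessions(routers: int) -> int:
    """Number of IBGP sessions needed to fully mesh `routers` routers.

    Each pair of routers needs one session, so this is n choose 2.
    """
    return routers * (routers - 1) // 2


# 4 existing routers, the 10 new ones, and both sets together.
for n in (4, 10, 14):
    print(n, "routers ->", full_mesh_sessions(n), "sessions")
```

So 4 routers need 6 sessions, but 14 routers need 91, which is why the full mesh stops being fun quite quickly.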

However, not a problem, we'll use route reflectors - they allow internal routing to be relayed within the internal network avoiding having to fully mesh the routers.

This is where the fun starts. Even though we have people working on this that have used BGP before joining A&A, and even though I coded the BGP in the routers myself, carefully following the RFCs, including route reflector logic, we have not actually used route reflectors in anger before.

Well, now we know - the trick is not to make a loop of route reflectors. The problem you get is that a route gets injected into this loop and then it sticks. Even if you withdraw the original announcement, the loop sees its own copy (reflection) of that announcement from another route reflector and so keeps announcing it. They also tell your edge routers about the route!

To add to the fun, if you have anything not set up right in setting the next hop, you can end up with routes that go to places that don't know what to do with them (black holes).

Re-reading the RFCs this is actually quite simple, and the next test will follow the guidelines somewhat better. We will not have a loop of route reflectors but a pair of them, and the edge routers will be normal IBGP to them. This will allow the redundancy, simplicity and scalability that we want. We have fixed the next hop set up as well.
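As a rough illustration of why a pair of reflectors scales better, the sketch below counts sessions for that layout. It assumes the two reflectors peer with each other and every other router is a client with one session to each reflector. The function name and the figure of 14 routers in total are assumptions for the example, not the actual design.

```python
def reflector_pair_sessions(routers: int) -> int:
    """IBGP sessions when two routers act as route reflectors.

    Assumes one session between the two reflectors, plus two sessions
    (one to each reflector) for every remaining client router.
    """
    if routers < 2:
        raise ValueError("need at least the two reflectors")
    clients = routers - 2
    return 1 + 2 * clients


print(reflector_pair_sessions(14), "sessions for 14 routers")
```

Under those assumptions 14 routers need 25 sessions, and each extra edge router adds only two more.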

To be honest, this was a silly mistake, and one we won't make again. The impact was the minor issues in the afternoon. The actual issues were very hard to pin down as they meant some routes were broken and some were iffy (taking the wrong path in some way), but overall traffic levels stayed the same, so clearly not a major issue generally. Thanks to the customers that reported what they were seeing.

Then we come to the bigger outage of 15 minutes or so. This was part of simply tidying up after the earlier problems and making the routing configs consistent. Again, a very low risk activity. We are still trying to get to the bottom of that though, as it should not have caused an issue. The fix was "have you tried turning it off and then back on again" in that we reset the LNS completely, clearing all of the BGP sessions and starting from scratch. One of the jobs we still have to do this morning is trawl the logs to find why things got messed up. I would love to have spent more time tracking the problem as it happened, but getting things working was somewhat more important. The symptoms were damn strange as sessions appeared to start up but have issues with RADIUS, even though RADIUS was apparently working and there was no apparent reason for the sessions to have gone down in the first place. The reset meant we lost graphs for the day, which is always a nuisance.

Anyway, today's job will be carefully planning the next stages and deploying them very carefully and slowly.

Of course, and I am sure some customers will be asking this, why the hell is this not done at 3am on a Sunday morning or something? Well, yes, if this was work that was going to take out service, it would be. This is, however, work that should not actually impact service at all - it is very routine low risk stuff. It is also a case where the impact of something not being right is hard to see. If we had done this overnight it would not be until 9am, when a few customers say there is some "odd routing", that we would find this issue - everything looked fine when we did it!

In general we find between 5pm and 6pm to be a good time for some of this "at risk" work as it is after most business customers are finished (not all, we know), but is before the home users start (mostly) and at a time when people are still around to tell us if they can see anything not quite right.

Overnight work is ideally suited to cases that take out part of the network - where the work is simple mechanical stuff - moving cables and the like - where those working on it can see they have done the job right immediately and there is nothing new. Telco work that takes out network links is scheduled for overnight for this very reason.

The end result of all of this will be much more capacity in our network, and some major increases in bandwidth to our favourite telco... So sorry for the inconvenience, and we really will try not to break it like this again. Thanks for your patience.

P.S. it does seem odd not blaming our favourite telco for something. After all, over the last week we have seen BRASs reset and take out services for hundreds of customers for similar periods, but we are all kind of used to that...

P.P.S. A simple loop of route reflectors is not enough to break things - you need an ordinary IBGP link in between to lose the cluster ID.
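For anyone curious why the cluster ID matters, here is a toy model of the loop check, not our router code. In the route reflection RFC (RFC 4456) each reflector adds its cluster ID to the route's CLUSTER_LIST when it reflects it, and ignores any route that arrives with its own cluster ID already in the list. The sketch passes one route around a ring of reflectors and optionally wipes the list on one hop, which is how this post describes the ordinary IBGP link behaving. The function name, the ring of three and the `strip_on_hop` parameter are all made up for illustration.

```python
def circulate(ring, strip_on_hop=None, max_hops=20):
    """Pass one route around a ring of route reflectors.

    Returns the hop number at which a reflector spots its own cluster ID
    and drops the route, or None if it is still going after max_hops.
    """
    cluster_list = []
    for hop in range(max_hops):
        reflector = ring[hop % len(ring)]
        if reflector in cluster_list:
            return hop  # loop detected, the route is ignored here
        # Reflecting the route: prepend our own cluster ID.
        cluster_list = [reflector] + cluster_list
        # Toy assumption: this one link loses the CLUSTER_LIST entirely.
        if strip_on_hop is not None and hop % len(ring) == strip_on_hop:
            cluster_list = []
    return None


ring = ["A", "B", "C"]
print(circulate(ring))                  # list intact all the way round
print(circulate(ring, strip_on_hop=2))  # list lost on the link out of C
```

With the list intact the route is dropped when it gets back to A on hop 3. With the list wiped on one link nobody ever sees their own ID, so in this model the route just keeps going round.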

6 comments:

  1. A very refreshing and honest reflection, and the reason I use AAISP. Cheers.

  2. Something odd about the later 15min outage was that for a while ICMP seemed to be working (I could ping 8.8.8.8) but TCP was not (I could not ssh into a remote server --- note that this was by ip address, no dns lookup involved).

  3. Well - all worked today - phew.
    We understand route reflectors now - kings of BGP.
    Now we can move some of the interconnects over...

  4. @RevK: thanks for the explanation it's much appreciated especially as this caught me at just the "wrong" time:

    My daughter was using her laptop watching some kids stuff on the BBC. I fired up my PC just as she was coincidentally shutting down the laptop.

    When the PC got going there was no net, but I had just seen it working on the laptop (it was "The Sarah Jane Adventures") so it must be a problem with my PC or new-ish router, mustn't it?

    Well no - I did consider that something external to my house may have failed but I thought the chances of it happening just then must be astronomical!

    Just goes to show that the heuristics one develops through decades of fault-finding can sometimes fail.

    Luckily it wasn't down for long (thought I'd fixed it by rebooting my router).

  5. I'm not really familiar with IBGP. Can you expand on the historical reasons you use it? Or is it simply a case of "we've always used it - if it ain't broke..."?

  6. IBGP is just BGP anyway. You have to use BGP externally and we have grown from two routers. When adding a couple of LNSs the full mesh downside of BGP is not really an issue, but doing something new like OSPF was more work. We then added code that injects BGP from a database and a few other things. They don't talk SPF as they are just simple code we wrote to work with the existing setup. And so on. So a lot of "if it ain't broke" in there.
