Friday, 6 March 2015

Helping BT

Just to counter the idea that we are always shouting at BT, it is worth explaining that we do help them as well!

Today we have been helping them diagnose a link fault. We have seen a BRAS that is not working for a load of lines, but only to one of our LNSs. We have found another BRAS with the same issue from the same time but to a different LNS. We change the IP of the LNS and the lines start working (albeit on the same box).

Some time ago we set up alternative IP addresses on each LNS because of an occasional fault condition where one member link of a Link Aggregation Group develops some sort of fault. The typical way a LAG works is that the specific link used for any traffic is based on a hash of the traffic (IP addresses at each end, maybe port numbers, and even MAC addresses). The idea is that a "flow" of traffic ends up on the same link, avoiding any reordering issues, but load gets reasonably well shared out.

The impact of a link failure on a LAG is that specific combinations of IP addresses at each end will go via the faulty link, meaning that from a specific BRAS to a specific LNS does not work.
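
As a rough illustration of that effect, here is a minimal sketch in Python. It is not how any particular switch or BT's kit does it: real LAG hashing is vendor specific, and the function names, the SHA-256 hash, the four-link group and the documentation-range addresses below are all assumptions made up for the example. It only shows why one BRAS/LNS address pair can be stuck on a bad link while an alternative LNS address on the same box may hash to a healthy one.

```python
import hashlib


def pick_link(src_ip: str, dst_ip: str, n_links: int) -> int:
    """Choose a LAG member link from a hash of the two end addresses.

    Hypothetical stand-in for a switch's (vendor-specific) hash function.
    The same address pair always lands on the same link, so a flow is
    never reordered across links.
    """
    digest = hashlib.sha256(f"{src_ip}>{dst_ip}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_links


def working_addresses(bras_ip, lns_ips, n_links, faulty_link):
    """Return the LNS addresses whose traffic from this BRAS avoids the faulty link."""
    return [ip for ip in lns_ips if pick_link(bras_ip, ip, n_links) != faulty_link]


# Made-up example: one BRAS, one LNS with a primary and some alternative IPs.
BRAS = "192.0.2.1"
LNS_PRIMARY = "198.51.100.10"
LNS_ALTERNATIVES = [f"198.51.100.{n}" for n in range(11, 27)]
N_LINKS = 4

# Suppose the link that the primary address hashes to is the one that has failed.
faulty = pick_link(BRAS, LNS_PRIMARY, N_LINKS)
usable = working_addresses(BRAS, [LNS_PRIMARY] + LNS_ALTERNATIVES, N_LINKS, faulty)
```

In this sketch the primary address never appears in `usable`, because it is by construction the pair that hashes to the failed link, whereas any alternative address that hashes elsewhere carries on working even though it terminates on the same LNS.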

We have improved our system of managing alternative IP addresses now so that we are able to quickly switch addresses more easily to allow customers to get on line when we have this sort of issue.

But today I, and Shaun at the office, have been on calls with senior network engineers in BT (they called us!) to help them out and to pin down which IPs and which BRASs are affected, so they can try to locate the underlying problem.

We could just say "It's broken, fix it! fix it! fix it!", but no - we are working with BT to help resolve issues and provide as much information as possible. We know these things can be a pig to find and the more accurate data you can get, the better chance of finding it.

And what do we charge for this valuable and special fault investigation service - bugger all!

4 comments:

  1. Maybe tweet this to BT also RevK :)

  2. "We have improved our system of managing alternative IP addresses now so that we are able to quickly switch addresses more easily to allow customers to get on line when we have this sort of issue."

    I'm glad to see this in place, having been a customer affected (the first?) and suggested exactly that last month.

    Is the BT fault-finding entirely ad-hoc, or is there a more sensible fault-handling process in place now for "that line can't communicate reliably with our LNS - this is not a line fault so do not attempt SFI"? (My best guess last time was to try raising it as a fault against the A&A end of the link rather than the end-user end.)

    Replies
    1. The alternative IP was all manually adjusted before. This sort of thing happens rarely, but it is clear we needed to make it slicker. It is, however, not always obvious that it is a LAG issue. When we saw this particular fault it was all lines, but by fluke all lines were on the same LNS at the time. Had there been more lines, the LNS-specific effect would have been more obvious, as we would have tried the alternative IPs anyway. The big concern is that BT seem to lack any monitoring or alarms for links within a LAG.

    2. What surprised me in that tale was how reactive BT's fault-ignoring process was - to the extent that when you eventually managed to prod them into investigating my fault properly, they found and fixed multiple unrelated BRAS faults in the process (which were presumably affecting other lines) before eventually stumbling across and bypassing the faulty 10G card/link that was to blame.

      I'd always assumed they had monitoring in place that would alert the NOC when a backbone segment was showing errors like that. Perhaps they only monitor the aggregate as a whole, so data loss on one component gets diluted below the alarm threshold?

      Things seem to have gone awfully quiet since the initial flurry of network diagrams and presentations about 21CN; a Google search still shows some people panicking about BT shutting 20CN down in 2014, and a distinct lack of information from recent years! About time somebody got an update out of BT on that front.
