RevK®'s ramblings: Helping BT

2015-03-06

Helping BT

Just to counter the idea that we are always shouting at BT, it is worth explaining that we do help them as well!

Today we have been helping them diagnose a link fault. We have seen a BRAS that is not working for a load of lines, but only to one of our LNSs. We have found another BRAS with the same issue from the same time but to a different LNS. When we change the IP of the LNS, the lines start working (albeit on the same box).

Some time ago we set up alternative IP addresses on each LNS because of an occasional fault condition where one part of a Link Aggregation Group develops some sort of fault. The typical way a LAG works is that the specific link used for any traffic is based on a hash of the traffic (IP addresses at each end, maybe port numbers, and even MAC addresses). The idea is that a "flow" of traffic ends up on the same link, avoiding any reordering issues, but load gets reasonably well shared out.

The impact of a link failure on a LAG is that specific combinations of IP addresses at each end will go via the faulty link, meaning that from a specific BRAS to a specific LNS does not work.
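As a rough illustration of why only some BRAS/LNS combinations break, here is a minimal sketch in Python. It assumes a hash over just the two IP addresses and uses SHA-256 purely for convenience; real kit uses its own vendor-specific hash and may include ports and MAC addresses. The function names, the documentation-range addresses and the choice of faulty link are all made up for the example.

```python
import hashlib


def pick_link(src_ip: str, dst_ip: str, n_links: int) -> int:
    """Map a flow to one LAG member link by hashing its endpoint addresses.

    Assumption: only the two IP addresses feed the hash. Real equipment
    uses its own (usually much cheaper) hash and may add ports or MACs.
    """
    digest = hashlib.sha256(f"{src_ip}|{dst_ip}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_links


def broken_pairs(bras_ips, lns_ips, n_links, faulty_link):
    """List the (BRAS, LNS) address pairs whose traffic lands on the faulty link."""
    return [
        (bras, lns)
        for bras in bras_ips
        for lns in lns_ips
        if pick_link(bras, lns, n_links) == faulty_link
    ]


if __name__ == "__main__":
    # Hypothetical addresses from the documentation ranges.
    bras_ips = [f"198.51.100.{i}" for i in range(1, 9)]
    lns_ips = ["192.0.2.1", "192.0.2.2"]  # a normal and an alternative address
    for bras, lns in broken_pairs(bras_ips, lns_ips, n_links=4, faulty_link=2):
        print(f"{bras} -> {lns} goes via the faulty link")
```

The same BRAS can be fine to one LNS address and dead to another, which is why switching to an alternative address on the same box can get lines working again.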

We have improved our system of managing alternative IP addresses now so that we are able to quickly switch addresses more easily to allow customers to get on line when we have this sort of issue.

But today Shaun at the office and I have been on calls with senior network engineers in BT (they called us!) to help them out and to pin down which IPs and which BRASs are affected, so they can try to locate the underlying problem.
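The sort of data that helps here is simply which BRAS to which LNS address never works. A minimal sketch of that tally follows, assuming you have a list of connection attempts as (BRAS, LNS IP, succeeded) tuples. The record format, the names and the threshold are hypothetical, not a description of our actual tools.

```python
from collections import defaultdict


def failing_pairs(attempts, min_attempts=3):
    """Return (bras, lns_ip) pairs where every attempt failed.

    attempts: iterable of (bras, lns_ip, ok) tuples (hypothetical format).
    Pairs with fewer than min_attempts attempts are ignored as too thin
    to draw a conclusion from.
    """
    counts = defaultdict(lambda: [0, 0])  # pair -> [failures, total]
    for bras, lns_ip, ok in attempts:
        entry = counts[(bras, lns_ip)]
        entry[1] += 1
        if not ok:
            entry[0] += 1
    return sorted(
        pair
        for pair, (failed, total) in counts.items()
        if total >= min_attempts and failed == total
    )
```

A pair that fails every time while the same BRAS works to a different address, and the same address works from a different BRAS, points at a path-specific fault such as one bad link in a LAG rather than at the line, the BRAS or the LNS as a whole.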

We could just say "It's broken, fix it! fix it! fix it!", but no - we are working with BT to help resolve issues and provide as much information as possible. We know these things can be a pig to find and the more accurate data you can get, the better chance of finding it.

And what do we charge for this valuable and special fault investigation service? Bugger all!

4 comments:

JohnnyD, Friday, 6 March 2015 at 22:05:00 GMT
Maybe tweet this to BT also RevK :)
jas88, Saturday, 7 March 2015 at 10:12:00 GMT
"We have improved our system of managing alternative IP addresses now so that we are able to quickly switch addresses more easily to allow customers to get on line when we have this sort of issue."

I'm glad to see this in place, having been a customer affected (the first?) and suggested exactly that last month.

Is the BT fault-finding entirely ad-hoc, or is there a more sensible fault-handling process in place now for "that line can't communicate reliably with our LNS - this is not a line fault so do not attempt SFI"? (My best guess last time was to try raising it as a fault against the A&A end of the link rather than the end-user end.)