2012-06-06

Can BT run a back-haul network?

Well, for 5 days now we have seen some major issues on all 21CN Guildford BRASs.

BT have not been able to fix this, or even investigate this!

The latest is that Kevin from BTW said his shift was ending and told us to reconnect, and now I get :-
So what can I say? You decide if they are competent to run a major UK broadband back-haul network.

The latest issue, FYI, is complicated, but we are more than happy to provide as much diagnostic information as possible. We monitor every line every second thanks to FireBrick LNSs.

We see two issues, no doubt related, and only on 21CN Guildford BRASs.
  1. Packet loss in the evenings (and in the afternoons as well over the weekend), but only on some specific BRAS/LNS connections.
  2. Huge latency (evenings), but only on some lines.
The first point is specific to LNS/BRAS links, and adding new IP addresses to our LNSs can fix the problem. My conclusion is that this is a typical LAG (link aggregation) hash issue: traffic is spread over the member links according to a hash of the IP addresses/ports, so some of it goes via a congested link and some does not. We see loss of over 5% on this.
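To illustrate why a LAG hash would give loss on only some LNS/BRAS pairs, and why adding new LNS addresses can clear it, here is a minimal Python sketch. It assumes a 4-link LAG with one congested member and a hash over the L2TP flow's addresses and UDP ports (L2TP normally uses UDP port 1701). CRC32 is only a stand-in: the hash BT's equipment actually uses is not known, and `NUM_LINKS`, `CONGESTED` and `member_link` are hypothetical names.

```python
import ipaddress
import zlib

# Hypothetical LAG between an LNS and a BRAS: 4 member links, of which
# link 2 is assumed to be congested. The real hash used by BT's kit is
# unknown; CRC32 over the flow's addresses and ports is only a stand-in.
NUM_LINKS = 4
CONGESTED = {2}
L2TP_PORT = 1701  # L2TP normally runs over UDP port 1701


def member_link(src_ip: str, dst_ip: str,
                src_port: int = L2TP_PORT, dst_port: int = L2TP_PORT,
                proto: int = 17) -> int:
    """Choose a LAG member link by hashing the flow's addresses, ports and protocol."""
    key = (
        ipaddress.ip_address(src_ip).packed
        + ipaddress.ip_address(dst_ip).packed
        + src_port.to_bytes(2, "big")
        + dst_port.to_bytes(2, "big")
        + bytes([proto])
    )
    return zlib.crc32(key) % NUM_LINKS


if __name__ == "__main__":
    bras = "198.51.100.1"  # documentation address standing in for a BRAS
    for last_octet in range(1, 9):
        lns = f"192.0.2.{last_octet}"  # candidate LNS addresses (documentation range)
        link = member_link(lns, bras)
        state = "congested" if link in CONGESTED else "clean"
        print(f"{lns} -> {bras}: link {link} ({state})")
```

Because the L2TP ports are fixed, the link chosen depends only on the two IP addresses, so every session between a given LNS address and a given BRAS shares the same fate. Adding a new LNS address is, in this model, simply a new throw of the dice.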

The second is more random and transient. Lines lose or gain this issue on PPP restart. My guess is that it is the same issue on the other side. It probably includes PPPoE aspects, like the session number, in the hash, and has different characteristics when full. We see latency of up to 100ms on this.
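If, as guessed, the hash on the access side includes PPPoE fields such as the session ID, then a PPP restart would put a line onto a possibly different member link, which would explain lines gaining or losing the latency on reconnect. The sketch below assumes a hypothetical hash over the two MAC addresses and the 16-bit PPPoE session ID, with one member link (`HOT_LINK`) full; all names, addresses and values are illustrative.

```python
import random
import zlib

# Hypothetical access-side LAG: 4 member links, link 1 assumed to be full.
NUM_LINKS = 4
HOT_LINK = 1


def link_for_session(client_mac: str, bras_mac: str, session_id: int) -> int:
    """Pick a member link from a hash that includes the PPPoE session ID (assumed)."""
    key = client_mac.encode() + bras_mac.encode() + session_id.to_bytes(2, "big")
    return zlib.crc32(key) % NUM_LINKS


def new_session_id(rng: random.Random) -> int:
    """Simulate a PPP restart: a fresh PPPoE session ID (0x0000 and 0xFFFF are reserved)."""
    return rng.randint(0x0001, 0xFFFE)


if __name__ == "__main__":
    rng = random.Random(2012)  # fixed seed so the run is repeatable
    client, bras = "02:00:00:00:00:01", "02:00:00:00:00:fe"
    for attempt in range(1, 6):
        sid = new_session_id(rng)
        link = link_for_session(client, bras, sid)
        print(f"restart {attempt}: session {sid:#06x} -> link {link}"
              + (" (high latency)" if link == HOT_LINK else ""))
```

Same line, same MAC addresses: only the session ID changes, yet in this model the line can move on and off the hot link at each restart.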

Some lines have no issue, some have latency, some loss, some both.
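With per-second monitoring of every line, sorting lines into those four groups is straightforward. A minimal sketch, assuming each line's samples are round-trip times in milliseconds with `None` for a lost probe; the thresholds `LOSS_THRESHOLD` and `LATENCY_THRESHOLD_MS` are illustrative assumptions, not values from any real monitoring setup.

```python
from statistics import median
from typing import Optional, Sequence

# Illustrative thresholds (assumed, not taken from any real monitoring config).
LOSS_THRESHOLD = 0.01          # flag a line above 1% loss
LATENCY_THRESHOLD_MS = 50.0    # flag a line whose median RTT exceeds 50 ms


def classify_line(samples: Sequence[Optional[float]]) -> str:
    """Classify one line from per-second probes: RTT in ms, or None if lost."""
    if not samples:
        raise ValueError("no samples for this line")
    replies = [rtt for rtt in samples if rtt is not None]
    loss = 1 - len(replies) / len(samples)
    lossy = loss > LOSS_THRESHOLD
    slow = bool(replies) and median(replies) > LATENCY_THRESHOLD_MS
    if lossy and slow:
        return "both"
    if lossy:
        return "loss"
    if slow:
        return "latency"
    return "no issue"


if __name__ == "__main__":
    print(classify_line([20.0] * 94 + [None] * 6))  # prints "loss"
```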


But are BT up to the job of finding the cause? Who knows.

And as for account managers in BT who actually "care". We had one once, but no more. Other ISPs please note - Guildford BRAS issues.

P.S. I have spent hours on this this evening alone, allocating new IPs to LNSs and bouncing links to get them clean. If BT cannot sort it soon, I will be working on new systems to allocate specific IPs for specific BRAS links to bypass this. It is complex working around such a major outage, but it is something we are willing to do to get AAISP customers the best service, even if this does affect all other BT-connected ISPs.
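Allocating specific IPs for specific BRAS links could be sketched as a search: for each BRAS, try addresses from a pool until one gives an L2TP flow that hashes onto a member link not known to be bad. This is a minimal sketch that assumes the LAG hash can be modelled (here by a hypothetical CRC32 over the addresses and UDP port 1701) and that the bad links per BRAS are known. In reality the hash is unknown, so finding clean addresses means trying them and measuring, much like the manual process described above. `member_link`, `pick_clean_ip` and `allocate` are hypothetical names.

```python
import ipaddress
import zlib
from typing import Dict, Iterable, Optional, Set

NUM_LINKS = 4      # hypothetical number of LAG member links
L2TP_PORT = 1701   # L2TP normally runs over UDP port 1701


def member_link(lns_ip: str, bras_ip: str) -> int:
    """Hypothetical stand-in for the LAG hash on an LNS -> BRAS L2TP flow."""
    key = (
        ipaddress.ip_address(lns_ip).packed
        + ipaddress.ip_address(bras_ip).packed
        + L2TP_PORT.to_bytes(2, "big") * 2  # source and destination port
    )
    return zlib.crc32(key) % NUM_LINKS


def pick_clean_ip(pool: Iterable[str], bras_ip: str, bad_links: Set[int]) -> Optional[str]:
    """Return the first pool address whose flow to bras_ip avoids every bad link."""
    for ip in pool:
        if member_link(ip, bras_ip) not in bad_links:
            return ip
    return None


def allocate(pool: Iterable[str], bad_links_by_bras: Dict[str, Set[int]]) -> Dict[str, Optional[str]]:
    """Choose one LNS address per BRAS; None means no clean address in the pool."""
    candidates = list(pool)  # materialise so the pool can be reused for each BRAS
    return {bras: pick_clean_ip(candidates, bras, bad)
            for bras, bad in bad_links_by_bras.items()}


if __name__ == "__main__":
    pool = [str(ip) for ip in ipaddress.ip_network("192.0.2.0/28").hosts()]
    bad = {"198.51.100.1": {2}, "198.51.100.2": {0, 3}}
    for bras, ip in allocate(pool, bad).items():
        print(f"BRAS {bras}: use LNS address {ip}")
```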

Update: After several days BT have found a VLAN that is running hot. You would hope they monitor these things, wouldn't you! Waiting on a PEW to get it fixed now.

P.S. The incident team are actually quite helpful, but at the end of the day they are just calling BT Operate. Most of the people we know in BT Operate are quite helpful too. Nice, helpful people being unable to actually sort the problem makes it more frustrating in many ways. I hope they do get their act together so that they can see this sort of stuff and not wait for us to report it.

Update: Well over a week later, we still have the latency issue... Thankfully the one man in BT with clue is back off holiday and will be on the case for me soon.

5 comments:

  1. We're seeing the same and have been for the last 3 weeks .. Finally last week BT took us seriously .. But still no response !!

    Ps. Thanks for the Demo FB's

  2. If it helps.. there are graphs of affected lines here:

    http://beusergroup.co.uk/index.php?id=861

    All combinations. High ping only, PL only and lines with both.

    Replies
    1. Sadly they are live graphs and today show nothing special. They may show the issue again this evening.

    2. That is true. If you think it would help prod BT into action.. I am happy to give you the archived copies of any graphs showing symptoms.

    3. Speak with Drsox; he may be able to provide historical graphs. I believe F8lure stores the past 30 days' worth of tests.

