Saturday, 7 February 2015

Congestion case study

The BT kind, not the Sudafed kind...

I have not really needed to talk about backhaul congestion for some time. Many years ago, when BT first had congestion in their network, they did not appear to have proper means to plan and manage the capacity. Thanks to our work constantly identifying congested links, they do now have departments to handle this. We've worked closely with BT on these issues over the years in a genuine effort to help them solve their problems and provide a quality service to all ISPs. It has been a bit of a roller coaster, and occasionally there have been problems lasting many months (such as when BT's BRAS backhaul links had to be upgraded to 10G). Generally things had been OK for quite a while, until a few months ago. Oddly, we are seeing quite a few issues at the moment that are taking some time to get fixed, but BT are working on it.

The main reason we can see congestion is that we have very good monitoring - an LCP echo every second on every line allows us to see packet loss and latency clearly. We then correlate trends across exchanges, BRASs and carriers to identify congestion before customers even need to contact us.
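As a rough illustration of the idea (a hypothetical sketch, not our actual monitoring code), per-second echo results can be reduced to a loss fraction and an average latency per line, and then grouped by exchange to spot a shared problem. The data layout, the function names and the thresholds are all assumptions made up for this example.

```python
from collections import defaultdict
from statistics import mean


def summarise(samples):
    """Reduce one line's echo results to (loss fraction, mean latency in ms).

    samples: one entry per second; a round-trip time in ms, or None if
    no reply was received (hypothetical layout).
    """
    replies = [s for s in samples if s is not None]
    loss = 1 - len(replies) / len(samples) if samples else 0.0
    latency = mean(replies) if replies else None
    return loss, latency


def congested_exchanges(lines, loss_threshold=0.01, min_fraction=0.5):
    """Return exchanges where many lines show loss at the same time.

    lines: dict of line id -> (exchange name, samples).
    An exchange is flagged when at least min_fraction of its lines have
    loss above loss_threshold (both values are illustrative only).
    """
    by_exchange = defaultdict(list)
    for exchange, samples in lines.values():
        loss, _ = summarise(samples)
        by_exchange[exchange].append(loss > loss_threshold)
    return sorted(
        exchange
        for exchange, flags in by_exchange.items()
        if sum(flags) / len(flags) >= min_fraction
    )
```

Loss on a single line usually points at that line; loss on many lines sharing an exchange or BRAS at the same time of day is what suggests congestion.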

However, there is still the old 20CN ATM-based network in BT and many 20CN-only exchanges. These are still growing, with more demand for bandwidth from existing lines and more lines being added, which can also result in congestion. Obviously, over time, these are being upgraded to either 21CN ADSL or FTTC (or both).

Over the last two weeks I have done a bit of a case study on one customer on the SOUTH RAUCEBY exchange. Here are my findings...

Seeing the congestion

The first concern is that we cannot see the congestion any more! The LCP echoes are not showing loss or latency even when the exchange has a lot of congestion. We think this is a change BT made many years ago to prioritise the LCP echo. This may have been to ensure routers do not drop the link due to a lost LCP echo/reply, but it could also be to make our graphs "look good" I suppose. Thankfully 20CN is a minority now, but we do have to rely on customers telling us about 20CN congestion issues.

Overloaded link

This particular customer contacted us some time ago advising that there was congestion, and we contacted BT. As it happens, BT said they had just set up another DSLAM (or another shelf in a DSLAM, I am not sure), and they would move this customer over to that. This solved the problem for our customer, great. However, some months later, he once again has congestion. This does rather suggest they have issues with monitoring the links, upgrading links, and planning rules, one way or another.

This is an example :-


This shows the upload (red) and download (green). This is an attempt to fill the line, which should be able to get close to 7Mb/s of IP throughput but is in fact only getting 1Mb/s at best.

What can be done?

Sadly, we have not yet managed to get BT to do anything to fix this - it is often an uphill struggle and they may even deny there is an issue. So we considered two alternatives. The first is to order "premium" on the 20CN service. This offers a higher upload speed and also elevated weighting in the network.

However, we also had another cunning plan. We lent the customer a FireBrick FB2700. We then told our end to mark all of the IP packets as if they were LCP. This is a feature in the FireBrick which was added to work around problems with a faulty DSLAM that refused to allow IPv6 PPP packets, but we had a hunch it might help here. The fact that our LCP echoes always looked fine suggests that BT prioritise LCP traffic. So we gave it a try.

This is the result :-


As you can see, the download is close to the line (which shows the BRAS rate that should be possible). It is not 100% perfect (the line is a bit wobbly), but very close, and massively better than before.

Does "premium" help?

We also added premium to see if that would help, and it does (as expected). This is an example of premium, with and without the IP over LCP being used.


Just before 20:00 is premium using normal PPP coding for IP packets. After 20:00 is premium and LCP coding for IP packets. As you can see, after 20:00 there is a solid line at the limit - the best performance yet. The graph is a log scale, so it may not be obvious, but without LCP marking the line is achieving approximately half the full speed.

Conclusion

Premium improved throughput from under 1Mb/s to around 3.5Mb/s. However, marking IP as LCP improved both premium and non-premium to full line speed (with premium slightly better).

Will BT fix the problem?

We hope so - we will, of course, continue to pursue BT over this congestion. However, our customer has three lines, and all of them can now hit the full line rate of around 7Mb/s at once when he is downloading. He has done speed tests showing nearly 20Mb/s over the three lines. Considering how poor the performance was previously, it seems likely that when our customer is downloading, everyone else on the same VP backhaul in SOUTH RAUCEBY will have totally unusable Internet. With any luck they will complain to their ISP (probably BT retail) and help get the backhaul fixed properly.

I'd like to thank domb for all his help on this.

Tech note: The customer is considering patching pppd. What we actually do is simply mark an IPv4 or IPv6 packet as LCP, provided the packet starts 4X for IPv4 or 6X for IPv6 (in hex). At the receiving end an LCP packet starting 4X or 6X is assumed to be IPv4 or IPv6 respectively, so there is no extra overhead. Genuine LCP codes do not get anywhere near as high as 0x40. Maybe we should do an RFC :-)
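For anyone wanting to experiment, here is a minimal sketch of the marking logic described in the tech note. It is not the FireBrick code and not a pppd patch: the function names are made up for illustration, and it assumes the PPP protocol field and payload have already been separated (no framing or protocol field compression is handled). The protocol numbers are the standard PPP values for IPv4, IPv6 and LCP.

```python
PPP_IPV4 = 0x0021  # standard PPP protocol number for IPv4
PPP_IPV6 = 0x0057  # standard PPP protocol number for IPv6
PPP_LCP = 0xC021   # standard PPP protocol number for LCP


def mark_as_lcp(protocol, payload):
    """Sender side: relabel an IP packet as LCP (hypothetical helper).

    Only relabels when the first nibble of the payload matches the IP
    version, since that nibble is all the receiver has to go on.
    """
    if payload:
        version = payload[0] >> 4
        if (protocol, version) in ((PPP_IPV4, 4), (PPP_IPV6, 6)):
            return PPP_LCP, payload
    return protocol, payload


def unmark(protocol, payload):
    """Receiver side: recover the real protocol (hypothetical helper).

    An "LCP" packet starting 4X is treated as IPv4 and one starting 6X
    as IPv6. Genuine LCP starts with a small code byte, so it is left
    alone.
    """
    if protocol == PPP_LCP and payload:
        version = payload[0] >> 4
        if version == 4:
            return PPP_IPV4, payload
        if version == 6:
            return PPP_IPV6, payload
    return protocol, payload
```

The payload itself is never changed, only the protocol field, which is why there is no extra overhead. Both ends have to understand the trick, though: a standard PPP implementation receiving such a packet would treat it as LCP with an unknown code.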

10 comments:

  1. Probably being silly, but out of interest does the latency show if ICMP Echo is used instead? I guess it's all down to what BT want to prioritise.

    1. Yes, and we can set up additional ICMP based graphing where we need. That shows the normal loss/latency you expect for congestion.

    2. So to the layman it would appear that BT are more interested in making the graphs look good rather than making the graphs actually good. Do you experience this sort of figure fiddling with your other providers?

    3. Well, being generous to them, it may be that they are trying to ensure no LCP timeouts rather than "looking good", but who knows. They are well aware of our graphs and what happens when they don't "look good", so who knows. Other carriers seem OK.

    4. Being ungenerous but reasonable, I suspect it's about ensuring that the "Internet" light on your Home Hub doesn't stay out due to congestion.

      Assume that the mass market ISP helldesks advise all users with problems to restart the ISP-provided router as part of the troubleshooting process. If LCP isn't getting through due to congestion, restarting the router will cause the Internet light to go out and stay out. This in turn will trigger helldesk into raising a line fault due to lost PPP but sync present, and cause BT Faults a whole load of head-scratching.

      On the other hand, if LCP is high priority, restarting the router won't fix things, but it won't break things either; you log back in, and hey presto! congestion still present. This causes helldesk to raise a congestion fault, which will not cause BT Faults any headaches, as they know about it and can simply close it as "within service parameters".

  2. This customer should consider themselves lucky. I get 3 megabits download on a very unstable long line that drops out in wet weather. Some people don't know when to count their blessings.

    1. He is lucky that he is close enough to get the better rate even on 20CN, and he pays for three lines (so not really luck involved there). He should be even more lucky soon when FTTC arrives.

    2. Yeah, and I've got about another year to wait for FTTC. Sometimes I feel I'm getting paid back for bad broadband karma or something.

  3. Rather than mark the packets as LCP, why not just color the packets with the appropriate QoS (DSCP?) values?

    1. Not sure if that works on 20CN anyway, but it would only do something useful if we paid for BT QoS stuff as well.
