I have not really needed to talk about backhaul congestion for some time. Many years ago, when BT first had congestion in their network, they did not appear to have proper means to plan and manage the capacity. Thanks to our work constantly identifying congested links, they do now have departments to handle this. We've worked closely with BT on these issues over the years in a genuine effort to help them solve their problems and provide a quality service to all ISPs. Over the years this has been a bit of a roller coaster, and occasionally there have been problems for many months (such as when BT's BRAS backhaul links had to be upgraded to 10G). Generally things had been OK for quite a while, until a few months ago. Oddly, we are now seeing quite a few issues that are taking some time to get fixed, but BT are working on it.
The main way we can see congestion is through very good monitoring - an LCP echo every second on every line lets us see packet loss and latency clearly. We then correlate trends across exchanges, BRASs and carriers to identify congestion before customers even need to contact us.
However, there is still the old 20CN ATM based network in BT and many 20CN only exchanges. These are still growing with more demand for bandwidth from existing lines, and more lines being added. This can also result in congestion. Obviously, over time, these are being upgraded to either 21CN ADSL or FTTC (or both).
Over the last two weeks I have done a bit of a case study on one customer on the SOUTH RAUCEBY exchange. Here are my findings...
Seeing the congestion
The first concern is that we cannot see the congestion any more! The LCP echoes are not showing loss or latency even when the exchange has a lot of congestion. We think this is a change BT made many years ago to prioritise LCP echoes. This may have been to ensure routers do not drop the link due to a lost LCP echo/reply, but it could also be to make our graphs "look good" I suppose. Thankfully 20CN is a minority now, but we do have to rely on customers telling us about 20CN congestion issues.
This particular customer contacted us some time ago advising that there was congestion, and we contacted BT. As it happens, BT said they had just set up another DSLAM (or another shelf in a DSLAM, I am not sure), and they would move this customer over to that. This solved the problem for our customer, great. However, some months later, he once again has congestion. This does rather suggest they have issues with monitoring the links, upgrading links, and planning rules, one way or another.
This is an example :-
What can be done?
Sadly, we have not yet managed to get BT to do anything to fix this - it is often an uphill struggle and they may even deny there is an issue. So we considered two alternatives. The first is to order "premium" on the 20CN service. This offers a higher upload speed and also elevated weighting in the network.
However, we also had another cunning plan. We lent the customer a FireBrick FB2700. We then told our end to mark all of the IP packets as if they were LCP. This is a feature in the FireBrick which was added to work around problems with a faulty DSLAM that refused to allow IPv6 PPP packets, but we had a hunch it may help here. The fact that our LCP echo always seemed fine seems to suggest that BT prioritise LCP traffic. So we gave it a try.
This is the result :-
As you can see, the download is close to the line (which shows the BRAS rate that should be possible). It is not 100% perfect (the line is a bit wobbly), but very close, and massively better than before.
Does "premium" help?
We also added premium to see if that would help, and it does (as expected). This is an example of premium, with and without the IP over LCP being used.
Just before 20:00 is premium using normal PPP coding for IP packets. After 20:00 is premium and LCP coding for IP packets. As you can see, after 20:00 there is a solid line at the limit - the best performance yet. The graph is a log scale, so it may not be obvious, but without LCP marking the line is achieving approximately half the full speed.
Premium improved the download speed from under 1Mb/s to around 3.5Mb/s. However, marking IP as LCP improved both premium and non-premium to full line speed (with premium slightly better).
Will BT fix the problem?
We hope so - we will, of course, continue to pursue BT over this congestion. However, our customer has three lines, and all of them can now hit the full line rate of around 7Mb/s at once when he is downloading. He has done speed tests showing nearly 20Mb/s over the three lines. Considering how poor the performance was previously, it seems likely that when our customer is downloading, everyone else on the same VP backhaul in SOUTH RAUCEBY will have totally unusable Internet. With any luck they will complain to their ISP (probably BT retail) and help get the backhaul fixed properly.
I'd like to thank domb for all his help on this.
Tech note: The customer is considering patching pppd. What we actually do is simply mark an IPv4 or IPv6 packet as LCP, provided the first byte of the packet is 4X (hex) for IPv4 or 6X for IPv6. At the receiving end, an LCP packet whose first byte is 4X or 6X is assumed to be IPv4 or IPv6 respectively, so there is no extra overhead. Genuine LCP codes do not get anywhere near as high as 0x40. Maybe we should do an RFC :-)
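
For anyone wanting to experiment, the marking described in the tech note could be sketched roughly as below. This is only an illustration in Python, not the FireBrick code and not a pppd patch: the function names `mark_as_lcp` and `unmark_lcp` are made up for this example, and it assumes you already have the PPP protocol number and payload separated out. The protocol numbers themselves (0x0021 for IPv4, 0x0057 for IPv6, 0xC021 for LCP) are the standard PPP values.

```python
from typing import Tuple

# Standard PPP protocol field values.
PPP_IPV4 = 0x0021
PPP_IPV6 = 0x0057
PPP_LCP = 0xC021


def mark_as_lcp(protocol: int, payload: bytes) -> Tuple[int, bytes]:
    """Sender side (hypothetical helper): relabel an IP packet as LCP.

    Only relabels when the first nibble of the payload matches the IP
    version implied by the protocol, so the receiver can undo it.
    The payload itself is not changed, so there is no extra overhead.
    """
    if payload:
        version = payload[0] >> 4
        if (protocol == PPP_IPV4 and version == 4) or (
            protocol == PPP_IPV6 and version == 6
        ):
            return PPP_LCP, payload
    return protocol, payload


def unmark_lcp(protocol: int, payload: bytes) -> Tuple[int, bytes]:
    """Receiver side (hypothetical helper): recover IP packets sent as LCP.

    Genuine LCP packets start with a small code byte (top nibble 0),
    so a first byte of 4X or 6X is treated as IPv4 or IPv6.
    """
    if protocol == PPP_LCP and payload:
        version = payload[0] >> 4
        if version == 4:
            return PPP_IPV4, payload
        if version == 6:
            return PPP_IPV6, payload
    return protocol, payload


if __name__ == "__main__":
    ipv4_packet = bytes([0x45, 0x00, 0x00, 0x14]) + bytes(16)
    sent = mark_as_lcp(PPP_IPV4, ipv4_packet)
    print(hex(sent[0]), hex(unmark_lcp(*sent)[0]))
```

A real LCP echo request (code 9) passes through both functions untouched, which is why both ends can keep using normal LCP alongside the marked IP traffic.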