Saturday, 19 August 2017

A report from the front lines at A&A

I don't really have anything to rant about this week, sorry...

A lot is happening at A&A, and that is really quite exciting.

Blips

It is probably worth starting by saying more about the issues we had with our Cisco switches some weeks ago. We do not use much Cisco kit, and indeed, until a couple of years ago I was quite proud to say we had no Cisco kit at all - our routers and LNSs are all FireBrick. However, whilst FireBrick are working on 10Gb/s routers, we don't have any FireBrick switches. So we did get Cisco switches, which do a tiny bit of BGP to carriers. Apart from those blips, these have worked well. We employed Cisco-trained staff to set them up in the first place, we have other staff who have been on Cisco courses, and we have engaged another expert Cisco engineer on some occasions, including for the post mortem of these issues.

It still looks a lot like one main issue that happened twice, plus a different issue that happened when we rebooted one of the switches the second time. We can be sure the issue is somehow in the Cisco switches. It also seems unlikely that it could be spanning tree or anything else like that - we had all BGP sessions to carriers stop, and each of these is a simple direct BGP session to a directly connected endpoint on a specific port on the switch in question. Failure of BGP in that case should not be possible: even if every other port was shut down, even if all inter-switch links had failed, BGP on the directly connected port should stay working. The fact this happened on all of these links, covering four separate switches, suggests something "upset" all of the switches by some means, and we have failed to get to the bottom of it conclusively.

We have, however, set up a lot more logging, and made a number of "defensive" config changes which could cater for possible causes, albeit clutching at straws. It does mean that if it happens again we will be in a far better position to diagnose properly and involve Cisco TAC, as we will have the logs needed. I appreciate this does not sound good, and to be frank, it is not.

However, they are being very stable now, and we do have all the redundant links back in operation, and all seems well! I hope customers can appreciate that we take this seriously. I hope we can put this behind us now, and the capacity of these switches allows for a lot more expansion of our network without adding any more complexity to their configuration.

LNS blips

Two of our LNSs, careless and e.gormless, have had some issues in the last couple of weeks and needed a reboot. This affected a small portion of customers, but it is a pain none the less. It turns out the cause is the same for both, and we have actually found the bug (this is why I have Internet access when on a cruise ship). It is an issue in the FireBrick LNS code for a really specific edge case (aren't they all) which was causing a memory leak. We have a fix, obviously, and we have also managed to deploy a workaround for the one specific customer line that was triggering this. This means the next LNS rolling update will include the fix.

Talk Talk packet loss

Another issue we have had, which now looks to be behind us, is the low level of packet loss on the Talk Talk back-haul. Again, I think this is all sorted, and it comes down to Talk Talk involving Juniper JTAC and making some significant changes to the way their network works just before it connects to us (and a lot of other ISPs). It is not just us that can have unexpected issues like this with industry standard routers and switches.

Moving forward...

But there are more things happening, and I thought I would touch on them. For obvious reasons you have to take this all with a pinch of salt, as things are not set in stone yet.

The new FireBrick...

Again, I cannot say a lot - we are launching a successor to the FB2700. The real news will come soon, when we have final application software running and can fully benchmark it. It should be a lot faster, as the FB2500 did 100Mb/s max and the FB2700 did 350Mb/s max. I am hoping for nearer 1Gb/s throughput. We are, however, pretty sure it will not do full table BGP. At this stage we are sorting EMC testing and final artwork and many other things - stuff can go wrong and delay for weeks or months.

When launched, which could be within a couple of months, it will have this extra performance, but we hope soon after to have additional software features if possible. I am hoping for much faster crypto (IPsec) to be honest, but again, until we finally get to benchmark it, we cannot tell. We just know that the underlying specs of the chipsets, even with the same s/w, should be a lot faster than the FB2700.

One of the reasons I am a tad vague is that the throughput of things like this can massively depend on some of the low level features of the chipset. It is not enough to just say that the CPU is faster and the RAM is faster - a lot of time is taken by cache management. Sadly the exact way the cache works in practice is not something one can glean from a data sheet as fully as you would expect. We have been caught out in the past with an Intel based chipset for the prototypes of the current FB6000, where some simple operations that should ideally be one clock cycle literally took many hundreds of clock cycles and were needed on every interrupt, none of which was in the data sheet. We had to change the chipset for the current FB6000 series. I am optimistic for the successor to the FB2700, and expect things to come out well as the new chipsets "seem" to be really good. If they are as good as we hope, we will have a really really nice FireBrick. Worst case, we will have something better and faster than the FB2700. I also hope it will be cheaper, but that too is yet to be finalised.
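As a rough illustration of why a per-interrupt cost like that matters so much, here is a minimal sketch of the arithmetic. Every number in it (the clock speed, the interrupt rate, and the 400 cycle figure standing in for "many hundreds") is an assumption picked for the example, not a measurement from any FireBrick or from the chipset in question.

```python
# Illustrative only: all figures below are assumptions, not FireBrick measurements.

def interrupt_overhead_fraction(clock_hz, interrupts_per_sec, cycles_per_interrupt):
    """Fraction of all CPU cycles consumed by a fixed cost paid on every interrupt."""
    return interrupts_per_sec * cycles_per_interrupt / clock_hz

CLOCK_HZ = 1_000_000_000        # assumed 1GHz CPU
INTERRUPTS_PER_SEC = 500_000    # assumed interrupt rate under heavy packet load

# What the data sheet would lead you to expect: one cycle per operation.
expected = interrupt_overhead_fraction(CLOCK_HZ, INTERRUPTS_PER_SEC, 1)

# What happens if the same operation really takes hundreds of cycles (400 assumed).
actual = interrupt_overhead_fraction(CLOCK_HZ, INTERRUPTS_PER_SEC, 400)

print(f"one cycle per interrupt:  {expected:.2%} of the CPU")
print(f"400 cycles per interrupt: {actual:.2%} of the CPU")
```

With these assumed figures the cost goes from a rounding error to a fifth of the CPU, which is the sort of thing that only shows up when you benchmark real software on real hardware.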

There are, however, a couple of things I can confirm. For a change, one thing we have previously announced as "coming soon" is now reality, and that is 19" rack mounting!


We have ears to allow one or two of the new FireBricks in a 19" rack mount fitting, or one in a wall mount fitting.

The other more subtle feature is a completely new power supply system. This means that, in addition to mains (110V/220V) we have DC supply options - two versions, one for automotive (12V and 24V), and one for telecoms racks (-48V). The DC options are actually a lot more complex than you would imagine as automotive has to handle some nasty spikes in some edge cases. I made the decision to have DC as an option, even if we expect relatively few customers needing them. It should also be a lot cooler!

As you will see from the picture, the final part is the SFP slot, which will allow fibre, copper, and maybe even VDSL based SFP modules to be used. Note VDSL SFP is outside SFP spec on power, so we are not sure yet, but it looks encouraging so far.

More capacity in A&A core

We have a lot of capacity now, and are not the bottleneck (which is always our aim), but we are working on yet more capacity. We have massive headroom on the Talk Talk backhaul, and we are adding more headroom to the BT back-haul. We are also updating the links we have on some peering to allow for more capacity to the likes of Netflix. A lot more 10Gb/s links are involved. This is all well ahead of usage, by some large margin. We are taking the "not the bottleneck" aim very seriously and making sure we are well ahead of the game in terms of increasing internet usage.

I know we are not the cheapest ISP, even if reasonably competitive in many cases, but making sure we are not the bottleneck so that you get the speed your line can handle is quite an undertaking. Quality matters.

Better tariffs

This is where things really are up in the air - they depend not only on things like increased capacity (as above) but also on complex negotiations with multiple carriers, increased capacity on peering and transit, and then a lot of work on our internal systems and ordering processes.

What am I hoping? Well, no commitments yet, but I am hoping for more download allowance on the Home/SoHo non terabyte tariffs, i.e. increased allowance at same price. I am also hoping to extend the terabyte packages to allow for more lines to have this, and upgrades to these packages to be easier. I am really hoping for better minimum terms, but that really is tricky as we can so easily be stung by carriers.

One thing I am really keen on is making the tariffs simpler and easier to understand, something we always strive for. I also want to make them more available to all, not just those where we can get Talk Talk back-haul. Sadly old 20CN lines will always be the legacy and exception, sorry, but these are gradually getting upgraded.

As always, new tariffs are available to existing customers when they come out. Some will be automatic (e.g. if we can increase usage allowances) and for some you can order a regrade to a new tariff when you want. If you join A&A today, then you will benefit from new tariffs I hope to have in a couple of months' time.

As a slight insight, trying to get better back-haul rates out of one carrier led to our lawyer calling the contract they sent "opaque as a brick", which says a lot for how hard some of this can be. He could not even advise if we should sign it or not and he is a really good lawyer.

Please do not hassle staff!

Some of my staff will be annoyed that I have posted this all as they will be fielding questions! Seriously, they do not know more than I have posted here. I do not know more than I have posted here yet. Please, just wait and see.

40 comments:

  1. Always nice to see this level of detail coming out of a company. You mention increasing bandwidth to Netflix, but do you see enough traffic to warrant an "open connect appliance"? As an amateur network enthusiast it's always interesting to hear about CDN nodes being deployed to transparently save bandwidth and latency

    1. @Matthew have a read of.... https://openconnect.netflix.com/en/requirements-for-deploying/

      A cache is worth while when an ISP is up to 3-5Gbps of NFLX traffic, but will still need ~1Gbps to keep itself up to date. We have single OCAs with 100G NICs that can saturate the port! Peering at 10Gbps over a public IXP is a good first step.

      Nat, (NFLX snr software engineer and happy AA customer)

  2. Is the memory leak caused by the specific customer having BGP peering over bonded EoFTTP ? ;-)

    1. LOL, no, a specific, unexpected PPP sequence which was repeating every few seconds.

  3. > We actually have the list of ISPs (by mistake) and it is quite a big list.

    The classic 'CC Fuck Up' ?

  4. There was a post on Mikrotik forums about VDSL2 SFP modules.

    I'm wondering if such modules support VDSL pair bonding (as I used to have a pair bonded VDSL connection to get 50mbps) or does Openreach not use that?

    1. AFAIK they are one pair, and BT would have to support, which I don't think they do. However, we do bonded FTTP (bonded at the IP level) as standard, and get the throughput of both lines.

  5. I suppose I ought to sell the FB2700 I bought 18 months ago and could never get my head round how to configure it. Barely used, and no doubt going down in value more than normal given the FB2900 developments. And I still don't have working IPv6 on the Zyxel, sigh.

    1. I am sure we said support could help with your config, it really is not that complex. I also said we'd send you a new zyxel, have you been in touch?

    2. You said you'd help with individual questions on the FB2700 but that isn't the problem. The problem is I don't get the entire thing, it's overwhelmingly different to anything I've seen before. So answers to individual questions won't help. I did originally ask tech support if they had something that would convert a Zyxel config to a Firebrick config but they said no. That's the kind of thing I need, a working copy of my current config and then I suspect I'll be able to make changes from there.

      I've kind of lost the will to do anything with it, I have working internet (except IPv6) so it's just time down the drain. And I detest XML, it's just complexity for the sake of it for no actual benefit.

      You never said you'd send me a new Zyxel, we've had no contact since before the new model came out. What does it do that's different? There's not much detail on the AA pages at the moment.

    3. Well, hard to know where to start. There is a manual for the FireBrick, but it should be pretty simple. In a PPPoE set up it literally connects on port 4 and provides internet on ports 1/2/3 with maybe the only setting needed being username - but the default connects on an A&A line anyway.

      Anyway, as for zyxel, I am sure we can send you a new one. My understanding is that the main difference is IPv6 does work much better.

    4. If you described what you wanted to achieve, someone might be able to give you a rough template for a FireBrick config?

    5. I found the firebrick quite different, but the underlying principles are the same. Andrew / Rev have been very helpful on IRC, so try reaching out in #Firebrick. Otherwise, you can use the XML if you're comfortable with networking concepts or point and click in the gui if you're not.

  6. That's a really brilliant suggestion from Matthew. I use Netflix, Amazon even more so, and Apple's content servers occasionally. These servers account for 95% of my download usage (wild guess). It would be good if AA could save some money. And possibly even increase reliability too (who knows) by removing external link dependencies.

    1. We are already using peering for things like netflix, apple's CDN, etc, and yes, more direct peering for such things is exactly what we are doing. As you say, it is a large part of bandwidth usage.

  7. Any hints from BT about what will be happening to those 20CN lines in future? ISPreview's post back in February said the plan was for all of them to be migrated to 21CN by the end of 2018, which would eliminate that problem - if that actually happens...

    1. I am not sure if we have anything concrete from them, but we are seeing them gradually disappear. In some cases it goes straight from 20CN to having FTTC available, but not 21CN ADSL (which makes sense when you realise how these small exchanges work). So yes, the problem will go away, eventually.

    2. I suspect 20CN is probably going to die quickly simply because the cost of the logistics of keeping it running eventually approaches a point where it's cheaper to upgrade.

      Rev, what do you think will finally tip the scales far enough for BT to scrap ADSL as an offering entirely? It seems nonsensical to continue to maintain five sets of consumer service infrastructure in exchanges and cabs (I'm counting 20CN ADSL, 21CN ADSL, VDSL2, G.Fast, FTTP) when they could just migrate all ADSL to capped VDSL2 or G.Fast and only have to maintain two or three sets.

    3. I heard rumour of an ADSL from the cab option at some point as well. Given that some lines are too long for VDSL, they need some alternative.

    4. I have a line which syncs considerably slower on VDSL than it did on ADSL2. Now stuck waiting for minimum term to pass in order to be able to regrade downwards.

      Wouldn't want ADSL to be dropped just yet in this case!

    5. I believe the alternative BT have in mind is their "Long Reach VDSL" but from what I can make of the details out so far that impinges on the ADSL bandplan and may require ADSL to be turned off in that exchange. This will probably complicate matters where LLU kit is present though.

    6. Additional distribution points on poles could be a way to deal with the distance issue on VDSL. ADSL from the cabinet sounds like an interesting way to increase the speed for the longer lines, though surely more cabinets or distribution points closer to the premises would work better.

    7. LR-VDSL2 involves running VDSL2 at powers that are so high they splat ADSL2+ completely. As a result, LR-VDSL2 in the real world has (so far) always synced at a higher speed than ADSL2+ on the same line (whereas VDSL2 may not sync higher, because it uses less power in the ADSL2+ spectrum, but gains higher speeds by using more spectrum).

  8. G.Fast is now live and in the wild. Is that affecting planning and how? What will you need to do to take your first G.Fast orders?

    1. This partly depends on how this filters through BT Wholesale. A subject for another blog post at some point.

  9. Great. Once our clients start seeing BT advertise their own Infinity-Max or Infinity-Ultra or Infinity-Mega-Super-Deluxe-Better-Than-The-Rest-Of-The-Whole-World or whatever they brand G.Fast as then they will all start asking "why can't your favoured A&A deliver this to me?" so will be good to say "they have it coming".

  10. Looks like the NTE will need one of these new g.fast faceplates fitted:

    https://i0.wp.com/blog.cerberusnetworks.co.uk/wp-content/uploads/2017/07/gfast1.jpg?resize=300%2C225

    so that would mean any NTE5A needs replacing with an NTE5C as part of the g.fast install.

    1. Oh dear, another ugly as sin BT NTE sticking out of the wall like a large wart. I use the Solwise ones that look much nicer, the faceplate is very visible in my lounge and in my parents' hall so appearance matters whatever BT appear to think about that:

      https://www.solwise.co.uk/adsl_splitters-faceplates.htm

    2. Only if you intend to use the line for voice. Surely nobody actually does that anymore :-))

    3. As soon as Ofcom ends the insanity of mobiles being premium rate numbers, I'll switch... (Yes, the worst of the ripoff has ended, but until they reach parity any business trying to use a mobile number can get lost.)

    4. Err, you can use VoIP on "normal" numbers you know...

    5. Odd that we dropped the ball there - renumber and export is now a standard thing and can be ordered - it works well.

    6. I have used Solwise faceplates in the past but 12 years ago a BT technician (not engineer: technician is the right word) whinged chronically about non-standard parts causing her problems, so since then I have always used BT branded faceplates. It eliminates one more potential reason for a BT technician to moan/complain/whinge/levy charges/blame me/etc.

    7. Funny, when I had one the BT engineer actually fitted it for me! (This was back around 2005, ADSL - BT weren't providing splitter plates any more, it was all micro-filters, but since the engineer was there anyway to fix a voice fault he was curious about the benefits.) Worked fine for years, until replaced by the Openreach VDSL one as part of that install.

      Yes, in the deleted comment I mentioned I'd joined the A&A renumber-with-export trial, but never got processed. Now there's the possibility of a house move ahead, so I think I'll leave that plain PSTN for now and do a regular number-port-and-cease on the way out.

    8. Found a practical issue with the NTE5C today. If you mount a row of them horizontally, side-by-side, then it is very difficult to remove the lower faceplate without some kind of implement to press in the side lugs. This defeats the purpose of the faceplate design being tool-less. So I mounted 4 of them vertically, thus ensuring the side lugs of each are fully accessible and press-able, which happened to look better in this particular scenario, but won't in many circumstances.

  11. jas88 I think I zapped a comment, sorry - you said you tried the renumber and export when trialling and it did not work.

  12. Just out of curiosity, which VDSL SFP modules have you been testing with? I'm currently building a home router with the Marvell MacchiatoBIN and would like to be able to ditch the Vigor 130 I'm currently using.

    1. I am not actually sure, sorry. I'll see if I can find out. Someone in ops team managed to source one!

  13. One thing I'd like you to suggest to BT when you have one of your regular ISP meetings is the idea of being able to, at BT's sole discretion, "fix" FTTC faults by replacing the copper with FTTP.

    Basically, give Openreach the flexibility to decide that, given the nature of the fault, it'll be cheaper to install FTTP for a group of premises rather than continue to repair copper; this would be entirely at Openreach's discretion, and would not be something you could force them to do.
