Tuesday, 17 August 2010

Making a resilient network

Interesting thought came up today - can we market a resilient network at all?

Then I got to thinking about what that actually means. To me, it means that the network can tolerate various types of unexpected issues and still work. The dictionary says "springing back; rebounding" and "recovering readily from illness, depression, adversity, or the like; buoyant", so that sort of fits.

A resilient network is one that recovers from problems.

So, we try to do that - we have dual links to our favourite telco and our other favourite telco ( :-) ), and we have dual LNS and dual switches and dual routers and dual DNS and dual RADIUS and dual transit and even dual local peering, all in a data centre with redundant power supplies and so on.

The theory is that if any one of these breaks, things carry on. Ideally they do so seamlessly, but if there is an outage it should be a matter of a few seconds and recover all by itself without us having to intervene.

Now, it is not really possible to cover every scenario. Multiple failures are unlikely, thankfully. The biggest issue in any such system is partial failure, where something is working well enough not to trip the fallback systems but is in some way broken or intermittent. Thankfully, in such a case, manually shutting down the ill part of the system forces a switch to the other half.
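To put a rough number on why multiple failures are unlikely, here is a small sketch. It assumes the two halves of a pair fail independently, which is exactly the assumption a bulldozer or a shared power feed breaks, and the 0.1% downtime figure is made up purely for illustration.

```python
def both_down_probability(p_a: float, p_b: float) -> float:
    """Chance that two components are down at the same moment.

    Only valid if the two fail independently of each other.
    """
    return p_a * p_b


# Hypothetical figure: each link is down 0.1% of the time.
p_single = 0.001
p_pair = both_down_probability(p_single, p_single)
print(f"one link down: {p_single:.4%}, both down at once: {p_pair:.6%}")
```

The point is only the order of magnitude: a pair of independent links is down together far less often than either one alone, but the sum says nothing about partial failures or shared causes.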

Of course we have to have lots of monitoring. If you have dual systems and don't know one has died then you are in the shit. We also have spares on site to allow for quick replacement or repair when things do break, or for the dreaded multiple failure scenario.
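As a minimal sketch of the sort of check this implies, the snippet below classifies each half of a dual system from its recent probe results, so that a half that is intermittent (rather than cleanly dead) still gets flagged. The names, the 95% threshold and the idea of feeding in a list of probe outcomes are all assumptions for illustration, not a description of our actual monitoring.

```python
from enum import Enum


class Health(Enum):
    UP = "up"
    DEGRADED = "degraded"  # partly working: the case failover tends to miss
    DOWN = "down"


def classify(probe_results: list[bool], degraded_below: float = 0.95) -> Health:
    """Classify a component from recent probe results (True = probe succeeded)."""
    if not probe_results:
        # No data at all is treated as down so that somebody looks at it.
        return Health.DOWN
    success_rate = sum(probe_results) / len(probe_results)
    if success_rate == 0:
        return Health.DOWN
    if success_rate < degraded_below:
        return Health.DEGRADED
    return Health.UP


def needs_attention(components: dict[str, list[bool]]) -> list[str]:
    """Names of components that are not fully healthy, in insertion order."""
    return [
        name
        for name, results in components.items()
        if classify(results) is not Health.UP
    ]


# Hypothetical pair: one half is fine, the other drops every second probe.
pair = {"lns-a": [True] * 10, "lns-b": [True, False] * 5}
print(needs_attention(pair))
```

Anything reported as degraded is a candidate for being shut down by hand so that traffic moves to the healthy half.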

It all sounds good, but to be honest it is just best practice and business as usual for any ISP, even a small one like us. Indeed, if not for the way bandwidth gets charged by our favourite telco, we would probably have dual-site redundancy with a second dual fibre interconnect to them, even as a small ISP. I know far smaller ISPs with just as much attention to resilience, if not more. I can't say we have always managed it, and a few years ago we found to our peril that we had a single point of failure and the backup kit was never tested, but these days I can say we are a lot more careful. We can never be totally foolproof, but we do take it seriously.

The problem comes with the telcos. Sadly, even telcos that have decades of experience with the telephone network seem to lack the basics when it comes to data networking. They routinely have single points of failure and links that do not automatically fall back. It is scary. Any fault that can take hours to fix smacks of not having a resilient network. Any planned maintenance that can take services down for hours smacks of not having a resilient network.

Take a BRAS that suddenly starts crashing and that you can't get working - well, switch to the backup BRAS, surely? What, no backup BRAS? What, need a card and there is not a spare on site? WTF? Not a resilient network.

The only real option we have, and it is what we do, is to use two independent telcos, so that we can extend our resilience out to multiple back-haul networks, multiple DSLAMs, and multiple copper pairs to multiple routers. It works well. It does not mean one is inherently better than the other (even if that does seem to be the case), just that it is really unlikely to have an issue on both at once (bulldozers permitting).

Of course, we can, and do, go one step further. There is always going to be a single point of failure in our network whatever we do - and that is the administration/management. We could fail in the way we run things. We could accidentally break the whole thing. So we work with at least one other ISP where customers want dual ISP resilience too. That is something no one company can ever offer.

So, I think we offer what people need to have resilience even if the telcos do not.

3 comments:

  1. Two major outages in RandomMajorTelco's network in the last week: one wiping out ambulance call centres in Scotland, the other taking out 45k lines around Wrexham...

    http://www.bbc.co.uk/news/uk-wales-11028719
    http://www.bbc.co.uk/news/uk-scotland-10955903

    All very resilient.

  2. Considering how toothless Ofcom and the ASA are, you can market anything you like as 'resilient' (cf. 'unlimited').

    At least you are sticking to the spirit of such terms...

  3. Let's not forget the drill that went through a cable tunnel in East London and took out vast swathes of t'innernet for days. A couple of years back, IIRC.
