2011-08-30

Redundancy

Well, some people might like this to be about making the man driving the JCB redundant, but not quite. More a matter of how to get a reliable Internet connection...

I do feel a tad sorry for some of our Ethernet customers, a few of whom have had service long enough to have three outages. The latest was today, but the previous ones were power issues in the data centre. One involved a cat turning to charcoal on an HV transformer and one involved some sort of fire. Very bad luck to have this many issues, really!

The problem is that nothing is 100% reliable. We even have a data centre that guarantees 100% uptime for power, and so pays a few hundred quid if the power goes off. So that is a few pence per affected customer if we shared it out. Hmmm.

I hope we make it clear what risks there are and what can be done to reduce them when we sell our services. We aim to, certainly.

The two key topics here today are power and fibre.

Power: I know if I had put the stuff in my garage at home it would have had better uptime. That would have been a fluke, though, and it would still have had at least one outage. But data centres do a lot to ensure power works. They have generators and UPSs, which mean that if the mains fails they keep working without missing a beat. So when there is an outage, it is a failure of these lovely complex systems they have installed. One data centre power issue fried half our equipment! Even so, a data centre is a really reliable place to put equipment. Power outages should simply not happen, and should be short if they do. Adding a UPS in a rack would cause problems (both in terms of the current used to charge it if everyone did this, and by adding a new single point of failure). Sadly most data centres make it a PITA to have two separate feeds to a rack for dual fed kit, which would be the best option. At least we can dual feed from separate power control bars to help against some errors and failures.

Fibre: It really is inherently better than copper pairs. It has a really strong fibre tube surrounding it, and is vulnerable to almost nothing that copper is, including RF interference and damp corroding connectors. Fibre just works until you break it. Don't touch it and it will work forever, pretty much.

Sadly the natural predator of fibre is the JCB. They are predators for copper too (along with some people nicking it to sell for scrap metal). A JCB, or just a badly placed jack hammer for some road works, can break stuff. It will break stuff.

The big problem is the logistics of running fibre (or copper). Ducts are expensive, and ducts from A to B (data centre to exchange) tend to take a single obvious route. In some cases there is no choice but for them all to pass through one physical point (especially if bridges are involved). So you will have a single point of failure.

End result, yes, a fibre break today affecting our office and our Ethernet customers (separate fibres) which also affected other circuits in blue square. Let's hope BT sort it this evening.

We are considering a second POP for the Ethernet services. It will be a costly option but people will be able to buy two links separately terminated at our end.

Obviously we rely on the Internet ourselves. We don't just have the fibre, we have DSL lines and 3G backup as well. We have phones set up to divert to staff mobiles. We can cope. We found our plans lacking a tad and that delayed matters a while (stupid damn routers) so we'll be testing plans every month now to make sure they are not out of date. But we do have a plan.

The problem is people who do not. No matter how reliable data centre power is, no matter how reliable fibre links are, they will break one day. So have backup! Have a second backup! Have a third backup... Have a plan!
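To put rough numbers on why a second and third backup help, here is a minimal sketch. The availability figures are made up purely for illustration, and the sum assumes each link fails independently of the others, which is exactly what does not hold if two links share a duct or a bridge.

```python
from math import prod

def combined_availability(availabilities):
    """Chance that at least one link is up, ASSUMING independent failures."""
    # All links must be down at once for a total outage.
    return 1.0 - prod(1.0 - a for a in availabilities)

# Hypothetical figures for illustration only, not measured values.
links = {"fibre": 0.999, "dsl": 0.99, "3g": 0.95}

figures = list(links.values())
for n in range(1, len(figures) + 1):
    # Add one more backup each time round and see how the odds improve.
    print(n, "link(s):", f"{combined_availability(figures[:n]):.7f}")
```

If the links share any single point of failure, the real figure will be worse than this suggests, so treat it as an upper bound at best.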

We have managed to get our IPv4 and IPv6 connectivity, phones and the various direct connect customers to us, all back on line this afternoon. With revised plans we'll be able to do it in seconds "next time". If we check the plan each month then that will not go wrong.

If 100% Internet access is essential to your business then you have to have plans, and processes to check those plans still work, and backup links, and so on, all in place, ready for something you hope will never happen. Redundancy...
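As a rough sketch of what "checking the plan" could look like: pick the first working link in priority order, but also test every link on each drill so that a dead backup is found before it is needed. The link names, the `pick_link` and `failed_in_drill` helpers and the stubbed health check are all hypothetical; a real check would need to actually send traffic over each link.

```python
from typing import Callable, List, Optional, Sequence

def pick_link(priority: Sequence[str], is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first link, in priority order, whose health check passes."""
    for name in priority:
        if is_up(name):
            return name
    return None  # nothing works: time for the plan after the plan

def failed_in_drill(priority: Sequence[str], is_up: Callable[[str], bool]) -> List[str]:
    """Check every link, not just the active one, and list those that fail."""
    return [name for name in priority if not is_up(name)]

# Hypothetical state: fibre cut, DSL fine, 3G backup dead without anyone noticing.
state = {"fibre": False, "dsl": True, "3g": False}
priority = ["fibre", "dsl", "3g"]

active = pick_link(priority, state.get)
broken = failed_in_drill(priority, state.get)
print("active link:", active)
print("needs fixing:", broken)
```

The point of the second function is that a backup which is never exercised tends to be found broken at the worst possible moment.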

Think about it before something breaks!

3 comments:

  1. Migration to DC customer equipment and DC power distribution would eliminate another possible point of failure, as well as making significant savings due to the higher efficiency and ease of cooling of DC plant compared to UPSes and onboard AC-DC converters in the equipment. I wonder when, if ever, data centres will follow the lead of telcos.

  2. I work from home... the office is little over an hour's drive away, so I do anything I can to avoid going there... I have my DSL, I have my 3G wireless router, I have my 3G phone that can act as a hotspot (on a different network that's not mast sharing with the one on the 3G router), I have a UPS to buy me time so I can wheel out the generator (with spare fuel on hand) and start it up... if that all fails, a short drive away in the next town there's free municipal wireless that's not reliant on FaveTelco...

    I think I might have taken it a little bit too far, although when we had a utility company engendered power cut (they JCBed through the local loop cable in error) this week ours was the only house that remained well-lit and fully internet connected throughout :)

    The most common thing to let me down is FaveTelco of course with the DSL (not locally, the ESR that steers the PPP AUTH in their backbone seems to regularly keel over after a "transparent upgrade" has taken place).

  3. This is why, as much as we use (and love) you, we use BeThere (natively) for offices who want their Internet connection enough to pay for a second line. While line issues & BT wholesale are a major cause of failure, nothing beats avoiding all of the issues at AAISP & BT & line issues.

    Testing monthly is good. Today the Moorgate exchange went down. We and one other company in our office using AA & Be were fine and didn't miss a beat. The other company using AA & Be found out at this point that their BeThere box was faulty.... :)

