Wednesday, 28 March 2012

Night shift

So, it is 02:26 and my phone goes mental with texts from Nagios in Amsterdam. I get up, check the laptop and *I* have no connectivity!

We have Nagios at the office monitoring a lot, and then we have another Nagios in Amsterdam to monitor the monitoring. These texts tell me everything at the office is down...

Fortunately, using 3G on the iPad I can see that the rest of the world is working (yes, all of it... Google was there, as well as our broadband lines).

So I am thinking power outage at office? But we have a UPS on the fibre, switch and router. How is it I cannot even see the office from home? That can't be it then.

So I am thinking fibre break. But that would mean both fibres (one to office and one to my house). That is possible maybe, but the DSL backup at the office should kick in - we test it every month!

So I am thinking I have no idea what is going on - cycle to office at 2:30 in the morning. Not funny.

Guess what. The damn UPS has died, so no fibre, no DSL backup, no switch and no fibre links. Arrrrrrg!

Now, this was just our office, but people have asked, on the rare occasions that there are power issues in the data centre, why we don't have UPSs ourselves. The answer is that (a) that almost certainly means a container of acid in a data centre; (b) it means our rack is live when power is off for safety reasons such as a fire; (c) it is probably against DC rules (see a and b); (d) people we connect to would be off so not a big help; (e) the DC has good UPSs; (f) it adds a new point of failure... So if we ever did have a UPS in the data centre (and were allowed to) we would only have it on one side of the dual feed kit. Today proves how having a UPS can make things less reliable than not having one.
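
As a rough illustration of that last point, here is a minimal sketch in Python. The availability figures are made up for the example, not measurements. The model assumes the UPS fails independently of the mains, and that both sides of the dual feed kit ultimately hang off the same mains supply.

```python
# Rough availability model. All figures are hypothetical, for illustration only.


def behind_ups_only(mains, ups_ok, battery_cover):
    """Kit powered solely through a UPS.

    mains:         fraction of time mains power is present
    ups_ok:        fraction of time the UPS itself is working
    battery_cover: fraction of mains-outage time the battery bridges
    """
    # A dead UPS takes the load down whether or not mains is present.
    return ups_ok * (mains + (1 - mains) * battery_cover)


def one_side_on_ups(mains, ups_ok, battery_cover):
    """Dual feed kit: one side direct on mains, the other side via the UPS."""
    # Up whenever mains is present; during an outage, up only if the UPS
    # is working and its battery covers the gap.
    return mains + (1 - mains) * ups_ok * battery_cover


if __name__ == "__main__":
    mains, ups_ok, battery_cover = 0.9995, 0.999, 0.9  # assumed values
    print(f"mains only:      {mains:.6f}")
    print(f"behind UPS only: {behind_ups_only(mains, ups_ok, battery_cover):.6f}")
    print(f"one side on UPS: {one_side_on_ups(mains, ups_ok, battery_cover):.6f}")
```

With these made-up numbers the UPS-only arrangement comes out slightly worse than plain mains, while putting the UPS on one side of dual feed kit comes out better than either.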

Anyway, back to bed...

8 comments:

  1. Reliability is sometimes quite a difficult concept. I have had to explain on more than one occasion that if you start with a system that has a single point of failure, and you add a new device without thinking it through properly, you will then have two single points of failure, not zero.

  2. Maybe it's just a coincidence, but our UPS reports a 1 second power outage last night @ 2:25, then another @ 3:05. This is about a mile away from your Bracknell offices.

  3. Unfortunately I've seen quite a few problems caused by UPS failures - my favourite being one where the protected mains powering our desktops failed, but the unprotected circuit powering the coffee machine was fine.

    I've taken to wrapping UPSs in an automatic transfer switch. They take two mains inputs and connect one through to a single output directly via a relay. If that source fails, it switches to the other. As it's just a relay they're quite reliable, though you should ensure that the UPS feed is the primary.

  4. As Timthorn was saying, look at ATSs from WTI (http://www.wti.com/p-166-pts-8ne15-1-automatic-power-transfer-switch.aspx) and TrippLite. Basically you feed a PDU both UPS and secondary power (could be another UPS, could be commercial power) and if the active one fails it switches to the other. It would have addressed the issue you had.

  5. I've never seen a UPS fail out of the blue, only when there's a power cut and it fails to take over (dead batteries are a favourite).

    If you have devices with redundant PSUs (servers, usually) then you connect one direct to the mains and the other via a UPS, assuming you can't afford two UPSs.

    And change the batteries at regular intervals! (Sad, when they cost so much, but the alternative is a surprisingly quick running out of juice).

    Cheers,
    Howard

    Replies
    1. Be careful with connecting redundant PSUs to both mains and a UPS, especially if you have diverse power feeds. If your UPS is set to auto-powerdown the servers, power failure on the UPS feed will still initiate shutdown even if the redundant feed is still live. Also remember that the reported runtime of the UPS during normal ops will be 2x the real runtime or more - I've seen some awkward moments as a result of that!

    2. APC let you run their UPSs as an N+1 farm with the right software. If set up properly, when one UPS fails the other won't initiate the shutdown unless the relevant redundancy has been breached.
      The load is evenly split across the UPSs for servers with dual feeds etc. For routers, firewalls etc. you have to take a chance, unless you have dual links etc. Cisco StackPower is quite funky for situations like this - no need to have additional PSUs and reduce your MTBF....
      I did my loadings based on one UPS having to run the rack - I've done a full load test and it works fine.
      Also set your UPSs to self load test regularly and at different times in case one dies - which I have seen before.
      Also, it's a good idea, as with anything, to pull the plug and see what happens every now and then :)

  6. I stopped using UPSs at home because they kept failing and killing everything. Plus they take up a lot of space ;)

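
The two points raised in the replies to comment 5 (only shut down when redundancy is actually breached, and don't trust the runtime a UPS reports while it carries only part of the load) can be sketched roughly as below. This is an illustrative Python sketch, not how APC's or any other vendor's software works. The `Feed` class, the function names and the thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Feed:
    """One power feed into dual-fed kit (hypothetical model)."""
    name: str
    live: bool                 # is this feed delivering power right now?
    on_battery: bool = False   # True if a UPS on this feed is running on battery
    runtime_min: float = 0.0   # runtime the UPS reports, in minutes


def should_shut_down(feeds, min_usable_feeds=1, low_runtime_min=5.0):
    """Shut down only when too few feeds can keep the kit running.

    A feed counts as usable if it is live and is not a UPS on battery
    that is about to run out.
    """
    usable = [
        f for f in feeds
        if f.live and not (f.on_battery and f.runtime_min <= low_runtime_min)
    ]
    return len(usable) < min_usable_feeds


def runtime_if_carrying_full_load(reported_runtime_min, share_of_load):
    """Upper bound on runtime if this UPS had to carry the whole load.

    Assumes runtime is inversely proportional to load. Real batteries tend
    to do worse than this at higher load, so treat it as optimistic.
    """
    return reported_runtime_min * share_of_load


if __name__ == "__main__":
    feeds = [
        Feed("mains", live=True),
        Feed("ups", live=True, on_battery=True, runtime_min=2.0),
    ]
    # The UPS side is nearly flat, but the direct mains side is still live.
    print(should_shut_down(feeds))
    # A UPS carrying half the load and reporting 40 minutes.
    print(runtime_if_carrying_full_load(40.0, 0.5))
```

The design choice here is to decide on the set of feeds as a whole rather than letting any single UPS trigger a shutdown on its own.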