Comments on RevK®'s ramblings: Pitfalls of system redundancy
Updated: 2024-03-28T09:19:27.451+00:00

2017-07-17T15:24:25.067+01:00
We avoided Cisco for a very long time, and now use it only on these high-speed switches.
Posted by RevK

2017-07-17T15:22:34.079+01:00
We don't use Cisco, but even on HP switches LACP is a pain.
Posted by big D

2017-07-16T20:45:10.140+01:00
How about your clients on Ethernet-over-FTTC, Ethernet-over-Copper and Ethernet-over-Fibre?
Were they impacted?
Posted by Anonymous

2017-07-16T18:56:15.874+01:00
It was worse than that, as I recall: they moved from claiming that any admission of problems was bad PR, to claiming, hilariously, that as a public company any problem with their service whatsoever constituted something that might affect the stock price, so *obviously* any status reports whatsoever had to be written by lawyers and had to go through several days to weeks of legal checking before anyone outside the company was allowed to know about it.

(Because, of course, there's no way anyone could possibly learn about any of their interminable total internal routing failures other than via a lawyer-vetted statement to shareholders. At least, not after everyone had got sick of this standard of mushroom-management "service" and stopped actually using them as an ISP, that is.)
Posted by Nick Alcock

2017-07-16T10:35:38.982+01:00
Openness is very important - back in the day when Demon were the "ISP for techies" they were always very open about things they got wrong, even if it meant admitting that an outage was caused by someone unplugging the wrong thing. Then they got bought out and the new bosses seemed to consider admitting any failure to be bad PR. After several episodes of spending hours debugging my own end only to find out that it was the ISP's fault (and that they almost certainly knew that all along but weren't telling anyone) I dropped them.

I do wonder how many other customers "improving their PR image" cost them and whether it actually benefited them.
That said, non-techies frequently seem to be mostly interested in finding someone to blame instead of accepting that mistakes happen and you just have to do your best to fix things ASAP, so maybe it did...
Posted by Steve Hill

2017-07-16T03:07:23.348+01:00
3G backup wasn't really working any better than DSL. My router was happily switching back and forth between the two but traffic was intermittent on both.
I'd say it's really only for cases where BT takes out your DSL line (as happened to me recently). It's unlikely to help when the ISP itself is broken.
Mine also has an issue with failing over the static IP blocks. I have to tweak it manually from the control pages (something to do with the 3G logging in before the ADSL goes down) so that makes it less than useful anyway.
Posted by Alan

2017-07-15T19:13:19.748+01:00
I've seen this failure mode before, right pain in the rear end to identify...
In this case a 6509 with a faulty DFC.

What model of switch is it?
Posted by Steve

2017-07-15T18:53:04.348+01:00
Part of Office::1 is mobile backup, with us or others, and mostly our mobile data was working.
Posted by RevK

2017-07-15T18:24:36.139+01:00
It would be interesting to hear, from your perspective, what, if anything, clients could have done to avoid being impacted. It sounds like customers on both BT and TT backhaul were affected so the redundancy of Office::1 won't have helped much. But what about your customers who bought your 3G/4G dongles - did their dongles work throughout? Anything else clients could have done?
Posted by Anonymous

2017-07-15T05:12:48.030+01:00
Was one of the LACP members not carrying traffic (in one or both directions) during this time?
Posted by Frank Bulk

2017-07-14T23:02:43.766+01:00
We've seen various issues like this in the past - some hosts work fine, others not so much - where there has been packet loss on one leg of an LACP link.
Packet loss rather than outright failure has meant that the usual methods of detecting link down didn't apply.

What would be useful is a deterministic method of being able to send traffic down a specific link in an LACP group, for test purposes, on a (semi-)automated basis. I've yet to see such a feature.
Posted by ajv

2017-07-14T22:53:21.967+01:00
Always good to see a breakdown of what happened and I really do feel your pain. I've had a few similarly mystifying half broken Cisco switches myself, although never that exact issue.

Top marks for the switch that flooded instead of dropped traffic marked with VLANs it didn't have defined. It ate the STP packets from those VLANs correctly though...
Posted by Ian

2017-07-14T21:30:25.903+01:00
There are protocols to detect these conditions.
Cisco's proprietary one is UDLD, which all Cisco equipment should support. A newer, standardised equivalent is BFD, which newer or non-Cisco equipment should support.

https://en.wikipedia.org/wiki/UDLD
https://en.wikipedia.org/wiki/Bidirectional_Forwarding_Detection
https://tools.ietf.org/html/rfc7130
Posted by Anonymous

2017-07-14T21:05:43.187+01:00
Interesting. Thanks for the insight. Sounds similar to a Split-Brain scenario in a server cluster, for which Best Practice is generally to apply Fencing, STONITH or similar. I guess those practices are not common in the networking world - at least not yet.
Posted by dMb

2017-07-14T20:46:27.617+01:00
> "So do we have a switch with an intermittent partial failure? Is it hardware (self test says no)? Is it some hack and vulnerability? Who knows?"

Reminds me of the time I had a setup on my Mikrotik switch so the LAN switch portgroup (ether1-19) and WAN switch portgroup (ether20-24) were completely separate from each other (even if using the same VLAN ID) and I had loop protect screw up on the LAN switch portgroup but not the WAN, so it took down my LAN but not WAN.

I just rebooted the switch; didn't even bother to use a serial cable.
Silly me.
Posted by JTL

2017-07-14T20:13:24.736+01:00
Great to see continued openness from AA, I know how frustrating these types of issues can be to track down, all whilst under pressure.

Few Q's
- Did you manage to get a tac-pac before the reboot?
- Are you running BFD over the port channel?
- Have you looked at the ASIC/cabling layout to determine any points of commonality (N5K: show hardware internal carmel all-ports)?
- Any changes in traffic profile pre/post incident?
- Is port hashing working well (e.g. are links in the bundle load sharing reasonably equally)?

I'm sure you guys were in deeper than this but always good to cover the basics first...
Posted by Anonymous

2017-07-14T20:10:46.235+01:00
I did wonder if it was my end, but as I have redundant providers, I carried on with the other provider.

At least AA will tell you when something goes wrong, rather than just blaming a local PC issue.

Finally, these sorts of things are why Google pays a team to try and break Google, in pretty much any way imaginable, to try and fix the unexpected before they have an outage. That they can just turn off a data centre if it gets a bit warm out is so impressive.
Posted by Algernon J Forthcumming

2017-07-14T19:40:50.949+01:00
Thanks. Very useful ... first thing I did was look at IRC to see 'if it was me'.
I saw the activity and went off to my meeting happy that by the time I got back it would be OK. And it was. (Well, I knew because I was getting texts!)
Posted by Bob