Wednesday, 26 August 2015

Received Packets With Error

This is driving me nuts.

I have changed a switch to a new HP 1820-24G, which seems quite a nice switch.


But I started seeing Rx packet errors in the stats. Now, there is a fibre involved in the uplink on this, and I had moved the endpoint of that fibre to the loft and put an extension fibre patch lead in line, so naturally, seeing Rx packet errors, I assumed I had screwed up.

I have spent hours on this, new fibre patch leads, cleaning fibre ends, and so on. No joy. Still packet errors. I am coming to the conclusion the switch is lying to me, especially as, finally, it is getting to the point that it shows 1/3 of all Rx packets are in error, and that would be visible in other ways. Pings are clean, and no signs of issues with traffic apart from the reported errors.

But it is not as simple as Rx packet errors.

First off, I did not have a lot on the switch: an "uplink" on port 24, and a "downlink" on port 22 going to the switch in the loft and APs and a load of other stuff in the house. What was especially odd is that the Rx error counts on the two ports (24 and 22) stayed the same as each other!

So, I moved the uplink from 24 to 23. The count on 23 started going up but the total of 23+24 was the same as the Rx errors on 22!!!

I did the same the other way, moving the downlink from 22 to 21, but again the total Rx errors across the downlink ports was the same as the total Rx errors across the uplink ports.

This really made no sense, and the error rates were low. If I disconnected the downlink I did not see any errors from the uplink. I had an AP and a laptop connected on another port so could confirm all was working. Whether the downlink was connected seemed to affect the Rx error count on the uplink.

I spent ages checking and re-crimping cat5 cables, and cleaning the fibres, and changing patch leads and so on - no luck.

Eventually I decided it was clearly the switch being silly, and went on to the other job: reconnecting my neighbour on port 1 on the switch! This involved a lot of messing about drilling holes and James crawling around in the loft to see where I was poking it through and so on. Eventually I connected port 1.

Now things changed, the uplink Rx errors went through the roof, but the downlink did not - it was still low, and was no longer the same as the uplink. Port 1 showed no Rx errors. But if I disconnect port 1 then the uplink Rx errors go back as before, quite low. If I disconnect port 1 and 22, then the uplink errors stop completely.

I have to say WTF?

Update: If I send packets with no VLAN tag that have a 1500 byte payload (so 1518 total) then no errors. If I send packets with a VLAN tag and a 1500 byte payload (so 1522 total) then the error count goes up. This is even when the switch is set to allow jumbo frames. A clue is that if I turn jumbo frames off it says the MTU is 1518 on all ports, not 1522. It is clearly a bug in the switch.
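
For anyone wanting to check the arithmetic, here is a minimal Python sketch of the frame sizes involved. The header, tag and FCS lengths are standard Ethernet; the 1518 byte limit is an assumption taken from the MTU the switch reports with jumbo off, and the function and constant names are purely illustrative, not anything from the switch itself.

```python
# Standard Ethernet overheads, in bytes.
ETH_HEADER = 14  # destination MAC (6) + source MAC (6) + EtherType (2)
ETH_FCS = 4      # frame check sequence (CRC-32)
VLAN_TAG = 4     # 802.1Q tag: TPID (2) + TCI (2)

# Assumption: the limit the switch seems to apply when counting Rx errors.
ASSUMED_LIMIT = 1518


def frame_size(payload: int, vlan_tagged: bool = False) -> int:
    """Total frame length in bytes, excluding preamble and inter-frame gap."""
    tag = VLAN_TAG if vlan_tagged else 0
    return ETH_HEADER + tag + payload + ETH_FCS


def exceeds_limit(payload: int, vlan_tagged: bool, limit: int = ASSUMED_LIMIT) -> bool:
    """True if a frame of this shape is longer than the assumed limit."""
    return frame_size(payload, vlan_tagged) > limit


print(frame_size(1500))                    # untagged, 1500 byte payload
print(frame_size(1500, vlan_tagged=True))  # tagged, 1500 byte payload
print(exceeds_limit(1500, vlan_tagged=True))
```

If the guess about the limit is right, only the tagged full-size frames go over it, which would fit the counters going up on tagged traffic while the packets themselves are forwarded fine.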

21 comments:

  1. If the errored packet is not dropped by the switch but forwarded then surely it's an error on the Rx on one port and Tx on the other?

    Out of interest have you tried another SFP module on the fibre side?

  2. I have the same model switch in my dining room, also with an SFP providing connectivity out to the garage. According to stats, no RX errors on any ports except for one (and that port has a Homeplug connected, so not surprising).

    Worth trying a firmware update perhaps? (there have been quite a few firmware updates for these units)

    If that doesn't help, wonder if you've a dodgy unit?

  3. Are you sending in any VLAN tagged packets - if the switch isn't doing the right thing there the ethernet checksums might be failing and being counted as errors?

  4. Any chance these are jumbo packets?

    Replies
    1. 1522 is not "jumbo" when using a VLAN tag, *and* the switch was set to allow jumbo packets anyway.

  5. Off-topic: how do you manage a switch in the loft? I want to put one up in our loft but the loft gets so hot during the summer I can't imagine it will last very long before it or a bad switch mode power supply craps out.

    Replies
    1. It is a fanless switch. The new one claims to be using 6W. Not had an issue with overheating (yet).

  6. If it's a software fault, I'd raise a case with HP support. Certainly the last time I dealt with them on a bug, they were actually pretty good and did release a fix.

  7. It sounds an awful lot like a bug we encountered with some ProCurve 2848 switches - http://support.hp.com/gb-en/document/c02597240

  8. Keep us updated. I'm in the market for a new small gigabit switch so will be following with interest.

  9. Replies
    1. See the PS, it is a red herring.

  10. BTW, does this switch handle 9k packets?

    Replies
    1. Yes, but it counts 1522 as an Rx error and forwards the packet anyway!

  11. Any resolution for this issue as I am seeing the same behaviour on 2x new 1820-24G (J9980A) switches?

  12. Same here. 2x 1820-24G and same behaviour. But in my case I'd say they are stalling my network.
    We used to have a couple of old Cisco 2960 10/100 switches and never had an issue. But now with both these HP switches it seems as if the network is very slow. Pings work fine in <1ms and big file transfers don't seem to be a problem. But accessing HTTP services on the servers from the desktops is wayyyy too slow and I would bet it's got something to do with the switches...
    Cheers

  13. Count me in. Two 1820-8G switches here. All VLAN tagged ports have errors. I have an 1810-8G as well, and it has no problems. This error degrades the network. I get picture breakup when streaming media through the VLAN. No breakup over regular LAN. I filed a case with HP, but I may just end up returning these switches and going with another brand.

    Replies
    1. My issue was the counters being wrong, not actual error packets.

  14. Same here. Opened a case with HP HK and they replied the switch is designed to work this way. :(
    I am now setting all devices to work at MTU 1496 to prevent those "error" packets.

  15. It looks as if this is fixed in PT.01.14
