2016-03-25

CISCO and ARP?

The FireBrick has quite a good ARP handling subsystem, including exponential back-off, configurable ARP timeouts and so on. It has served us well, but we have recently encountered a slight problem talking to a CISCO Nexus switch.

So I did some tests - and would love to know if this is typical. Any CISCO experts reading this may be able to comment.

Testing using arping from Linux, I could see that the CISCO would respond to only some of my ARP requests: maybe one in five, but not very consistently. This is a tad odd, and may be down to some general ARP rate limiting.

On top of that, when it did respond, it did so after 2.99 seconds. This was very consistent - I had to use arping one ARP request at a time to confirm this.

I have to wonder what the hell it is doing! From a coding point of view, holding on to the ARP request or reply for that length of time is more work than just answering the ARP right away. I am at a loss as to what is going on.

For comparison, Linux timed a FireBrick's response at 180µs, and it answered every ARP.

Anyway, it means I have had to tweak the way the ARP system renews ARPs to try a bit longer, otherwise every now and then the CISCO vanishes for a few seconds.
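To illustrate why "try a bit longer" matters, here is a minimal sketch of an exponential back-off retry schedule checked against a peer that takes 2.99 seconds to reply. The function names, intervals and give-up times are all assumptions made up for illustration; they are not the FireBrick's actual settings or code.

```python
def retry_schedule(first_interval, factor, max_interval, window):
    """Return the send times (seconds after the first request) of ARP
    requests sent with exponential back-off, up to `window` seconds.
    All parameters are hypothetical, not FireBrick defaults."""
    times = []
    t = 0.0
    interval = first_interval
    while t <= window:
        times.append(round(t, 3))
        t += interval
        # Grow the gap between retries, but cap it.
        interval = min(interval * factor, max_interval)
    return times


def covers_delay(schedule, reply_delay, give_up_at):
    """True if at least one request in the schedule would have its reply
    arrive before we give up, assuming that request is answered at all."""
    return any(sent + reply_delay <= give_up_at for sent in schedule)


# Hypothetical schedule: first retry after 100ms, doubling, capped at 1s.
schedule = retry_schedule(0.1, 2.0, 1.0, 5.0)
print(schedule)
# Giving up after 2s can never see a reply that takes 2.99s...
print(covers_delay(schedule, 2.99, 2.0))
# ...whereas a 5s window leaves room for it.
print(covers_delay(schedule, 2.99, 5.0))
```

Note this only covers the delay. With the switch answering perhaps one request in five, a longer window also helps simply because more requests are sent before giving up.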

Oh, and yes, they still look like this, with some arbitrary padding to the minimum packet size for Ethernet.

09:40:20.688429 ARP, Request who-has 91.240.176.1 tell 91.240.176.254, length 46
0x0000:  0001 0800 0604 0001 0003 971d c009 5bf0  ..............[.
0x0010:  b0fe 0000 0000 0000 5bf0 b001 474e 5520  ........[...GNU.
0x0020:  5465 7272 7950 7261 7463 6865 7474       TerryPratchett
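For anyone wanting to pick that dump apart, here is a rough Python sketch that decodes the 46 bytes shown above. The first 28 bytes are the standard ARP fields for Ethernet and IPv4; the remaining 18 are the padding that brings the payload up to the Ethernet minimum of 46 bytes. The variable names are just my own labels.

```python
import socket
import struct

# The 46 bytes from the capture above (ARP packet plus padding).
RAW = bytes.fromhex(
    "0001 0800 0604 0001 0003 971d c009 5bf0"
    "b0fe 0000 0000 0000 5bf0 b001 474e 5520"
    "5465 7272 7950 7261 7463 6865 7474"
)

ARP_LEN = 28  # size of an ARP packet for Ethernet/IPv4

# Hardware type, protocol type, address lengths, opcode, then the
# sender and target hardware (MAC) and protocol (IP) addresses.
(htype, ptype, hlen, plen, opcode,
 sender_mac, sender_ip, target_mac, target_ip) = struct.unpack(
    "!HHBBH6s4s6s4s", RAW[:ARP_LEN]
)
padding = RAW[ARP_LEN:]


def mac_str(raw_mac):
    """Format six raw bytes as a colon-separated MAC address."""
    return ":".join(f"{b:02x}" for b in raw_mac)


print("opcode:", opcode)  # 1 is a request, 2 is a reply
print("sender:", mac_str(sender_mac), socket.inet_ntoa(sender_ip))
print("target:", mac_str(target_mac), socket.inet_ntoa(target_ip))
print("padding:", padding.decode("ascii"), f"({len(padding)} bytes)")
```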

P.S. It was CoPP, but we don't understand why it would delay ARPs 3s in that process.

8 comments:

  1. Is dynamic ARP inspection enabled on the Nexus?

  2. The specific model of Nexus would be useful, as features vary across the range. As Edward suggested, CoPP (control plane policing) allows you to rate limit traffic to the control plane, potentially including ARP. There may also be specific hardware rate-limiters. ARP would be considered a lower priority task, so I would also check to see if the CPU on the Nexus is running high; perhaps your ARP issue is a side effect of another issue. Dynamic ARP inspection wouldn't be my first thought, unless DHCP assigned addresses are in use.

  3. ARP a "lower priority task" ?? Since no IP traffic can pass until that transaction can complete, I find that an interesting design decision.

    Replies
    1. ARP as lower priority makes sense in a switch, where the main job is to switch Ethernet frames without considering any IP headers - the switching fabric can handle the former in hardware very quickly, leaving the CPU to handle "unusual" packets (like ARP) separately. Even if it stopped responding to ARP for a while, its main job of switching packets would be unaffected.

      Different on a router, of course; a Nexus being a bit of both should really do a better job of handling ARP, you'd think. 3 seconds makes me wonder if there's some sort of timeout involved, perhaps trying to check with an absent backup device whether or not to handle that address itself?

  4. Sorry, the sigmonster burped this out shortly after reading this article, so I felt I should share;
    I know that it seems strange about the nat thing, but the explanations from cisco are very similar to the explanations given by my girlfriend. Neither make any sense, up means down and no means yes more often than not - but you can never be certain until you try and fail a few times and only once in a while you get lucky and it works.


