Tuesday, 4 April 2017

Next step in AP testing here

I have tried quite hard to get the three APs here to break when using a FireBrick FB2700 as gateway on a separate subnet (i.e. WAN side of FB2700 on my main LAN here).

What we did was move from a set-up that broke on my main LAN to a separate subnet off the main LAN and a Ubiquiti EdgeRouter. That worked! So I tried an FB2700 instead in the same set-up, and that worked too. So it was splitting off to a separate subnet with some sort of gateway that seemed to fix this somehow (rather than the specific choice of gateway equipment).

My working theory was that some aspect of the network set-up must be triggering this issue (whether or not that set-up is itself a bug or error). This would account for why FireBricks seem to be a common factor as well as UniFi and Apple. FireBricks are not an off-the-shelf Linux system, so they have very different default settings, and maybe that makes the problem set-up much more common. Well, it was an idea.

Ubiquiti had the problem immediately with an FB2700 that we sent them, so it sounds like a default set-up with very few changes would trigger it, but it did not do so here. I have now gone through matching settings to the gateway on my main LAN. This includes things like leaving DNS to automatic, which announces the FireBrick itself as the only DNS server on each of IPv4 and IPv6. I even set up the extra VLAN for guest WiFi, which is separately firewalled but on the same subnet with proxy ARP/ND between the two LANs, just in case that was a trigger somehow. After some days of doing this now, it really is "just working", which is rather frustrating.

So this morning I am back on the main LAN as before. Hopefully this will "break" things once again and hopefully quite quickly. It may be a few days to be sure.

The techies at Ubiquiti have advised that a pcap taken on the actual AP itself may help, so the plan is, when it breaks, to leave my phone in the broken state (don't move it) and try to diagnose with pcaps on the APs.
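As a rough illustration of what a pcap from the AP could be checked for, the sketch below reads a classic pcap file and reports when a given MAC address was last seen sending and last seen being sent to. It is a minimal sketch under assumptions, not anything Ubiquiti or FireBrick supply: it assumes a classic (non-pcapng) capture with microsecond timestamps and Ethernet framing, as from a wired interface, and the function names `read_pcap` and `last_seen` are made up for this example.

```python
import struct
from typing import Dict, Iterator, Optional, Tuple


def read_pcap(data: bytes) -> Iterator[Tuple[float, bytes]]:
    """Yield (timestamp, frame) pairs from a classic pcap capture."""
    magic = struct.unpack("<I", data[:4])[0]
    if magic == 0xA1B2C3D4:
        endian = "<"  # file written little-endian
    elif magic == 0xD4C3B2A1:
        endian = ">"  # file written big-endian
    else:
        raise ValueError("not a classic microsecond pcap file")
    offset = 24  # skip the 24-byte global header
    while offset + 16 <= len(data):
        # Per-record header: seconds, microseconds, captured length, original length.
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", data[offset:offset + 16]
        )
        offset += 16
        frame = data[offset:offset + incl_len]
        offset += incl_len
        yield ts_sec + ts_usec / 1e6, frame


def last_seen(data: bytes, mac: str) -> Dict[str, Optional[float]]:
    """Return the last time `mac` appeared as Ethernet source and as destination."""
    target = bytes.fromhex(mac.replace(":", ""))
    seen: Dict[str, Optional[float]] = {"from": None, "to": None}
    for ts, frame in read_pcap(data):
        if len(frame) < 14:  # too short for an Ethernet header
            continue
        if frame[6:12] == target:  # source MAC
            seen["from"] = ts
        if frame[0:6] == target:  # destination MAC
            seen["to"] = ts
    return seen
```

If the phone's MAC keeps appearing as a source but nothing is ever addressed to it (or the reverse), that at least narrows down which direction is broken at that point in the network.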

To further diagnose, I also plan to set the iPhone up with a static IPv4 config, as some sort of "DHCP throttling" may supposedly be to blame for this. I have double checked with the other developer on FireBrick, as we have both worked on the DHCP server, and neither of us knows of this "feature". However, it is worth investigating every avenue. Previous tests (albeit years ago, I expect) showed the issue still happened with no DHCP involved. The problem may have changed since, so I'll repeat those tests to confirm. I'm not going to dismiss any ideas.

In case it is not obvious, when this started, years ago, the first assumption we had was that it had to be the FireBrick at fault, and I spent a long time testing things like static config to eliminate DHCP, and checking packet dumps very carefully for DHCP, ARP, ND, RA, RS protocols to try and find anything that would point to the FireBrick as the cause. Only after all of that testing did we raise it with Ubiquiti.
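For anyone wanting to do a similar pass over their own packet dumps, here is a minimal sketch of sorting raw Ethernet frames into the protocols mentioned above. It is an illustration only, not the tooling used here: the function name `classify` is made up, it handles at most one 802.1Q VLAN tag, and it assumes IPv6 packets carry ICMPv6 directly with no extension headers.

```python
import struct

# ICMPv6 message types used by neighbour discovery (RFC 4861).
ICMPV6_NAMES = {133: "RS", 134: "RA", 135: "ND-NS", 136: "ND-NA"}


def classify(frame: bytes) -> str:
    """Label an Ethernet frame as ARP, DHCP, RS, RA, ND-NS, ND-NA or other."""
    if len(frame) < 14:
        return "other"
    ethertype = struct.unpack("!H", frame[12:14])[0]
    payload = frame[14:]
    if ethertype == 0x8100 and len(payload) >= 4:
        # Single 802.1Q tag: skip the 2-byte TCI, then read the real ethertype.
        ethertype = struct.unpack("!H", payload[2:4])[0]
        payload = payload[4:]
    if ethertype == 0x0806:
        return "ARP"
    if ethertype == 0x0800 and len(payload) >= 20:
        ihl = (payload[0] & 0x0F) * 4  # IPv4 header length in bytes
        if payload[9] == 17 and len(payload) >= ihl + 4:  # protocol 17 = UDP
            sport, dport = struct.unpack("!HH", payload[ihl:ihl + 4])
            if {sport, dport} <= {67, 68}:  # DHCPv4 server/client ports
                return "DHCP"
        return "other"
    if ethertype == 0x86DD and len(payload) >= 41:
        if payload[6] == 58:  # next header 58 = ICMPv6
            return ICMPV6_NAMES.get(payload[40], "other")
    return "other"
```

Counting these labels per client over time, before and after a failed roam, is one way to spot whether requests go out and nothing comes back.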

I'll keep you posted...

P.S. Finally (Thursday) my phone failed. I confirmed that even with a static config it could not send or receive packets, even to a device on the same AP. I confirmed that roaming to another AP does fix it. I am leaving it on static IPv4 config now to test.

18 comments:

  1. What about a separate machine with a wifi card logging what is actually being sent over the air at the 802.11 protocol level? You should be able to see the client sending authentication requests, DHCP, etc (or not) and whether anything is actually replying to that traffic.

    Replies
    1. You're probably going to want to simultaneously log both the wired backbone and the wireless traffic to form a full view. If there's some kind of inter-access point management traffic on the backbone you could miss it if you're only monitoring 802.11. At one point I think Ubiquiti supported 802.11F/IAPP but I assume they replaced that with something else when it was withdrawn.

  2. I don't know whether you have one to hand (I know at least one of your staff uses them) but MikroTik wireless access points have some pretty comprehensive WiFi sniffing capabilities.

  3. Btw there are bugs in iOS which mean that when it gets into a certain _state_ you can't save static IPv4 settings in the "settings > wifi" app. Once it gets in this state it just goes back to the dhcp pane when you go out and back into settings, and keeps doing this again and again. The fix is to do a "forget network" which seems to delete the problematic state information. Other users have complained about this in Apple forums. I have submitted detailed bug reports to Apple, via two different channels in the vain hope they might fix it. Just don't get caught out, as it is maddening and frustrating, and spread the word.

  4. Wireshark on Linux with a suitable WiFi dongle (I was using one of OmniPeak's adaptors) does a good job of capturing all on-air traffic when in monitor mode.
    If the WPA key used for the test is provided, then traffic can be decrypted.
    I do remember that it took some manual convincing at the command line, outside of Wireshark, to get the adaptor into the right mode, and some trouble keeping it there.

    Replies
    1. "Trouble keeping it there" might be NetworkManager's fault - make sure it's shut down and stays shut down.

  5. I had another bad case of this last night. I get the distinct feeling this happens when I am getting roughly equal signals from each AP, perhaps something in the roaming logic is broken on UBNT's side. Sometimes it resolves quickly so I just notice a long pause in IPv4 connectivity. However, sometimes the iOS device reverts to a 169.254 address (but maintains the IPv6) and I have to disconnect/reconnect to get my IPv4 back exactly as you described.

    This is using a Mikrotik, not Firebrick router - no fancy/complex setups just DHCP IPv4 and SLAAC'd IPv6 and router in the same building :).

    Replies
    1. Thank you for confirming, that is exactly what we were seeing. And I am glad it is not using a FireBrick as Brandon seems to latch on to any explanation and repeat it ad nauseam. Now we know for sure it is not FireBrick specific. Can you confirm what switches you use?

    2. One AP is connected directly to a gigabit port on the Mikrotik router (RB2011). The other is connected to the same switch group on the router via two bog-standard Netgear unmanaged switches (router - GS108 - GS208 - AP). Both APs are UAP AC Lite.

    3. We had some Netgear here. I thought I had eliminated switches as the cause. I wonder.

    4. Seems unlikely a dumb switch would fail in a way that would break roaming on their APs, but nothing else and it *not* be UBNT's fault. I'll try and pull an extra cable so the APs can both be directly in the router's switch...

  6. I hesitate to make a suggestion based on very circumstantial evidence after so much intensive investigation by so many but, for what it's worth...
    I had apparently the same problem last year where the only common factors were i-things and IPv6, no Ubiquiti, no Firebrick. Well, the problem I had was more consequential but, once my ears had recovered and, on the occasions that I caught the i-thing mid-flight, I was able to confirm that it had just a 169... address. The real sufferer had generally just roamed from the kitchen with a cup of tea looking forward to continuing with her article on Mumsnet (or somesuch) from an armchair.
    The three APs involved are just old home routers (but all 5GHz). DHCP (IPv4 and IPv6) was by ISC DHCP servers on an OpenSUSE machine. I rebuilt the OpenSUSE machine at the New Year and 'temporarily' enabled DHCP on the Draytek router and SLAAC for IPv6. Temporarily hasn't ended yet and the i-things have behaved perfectly for 3 months. What can this mean?
    Could it be that something about a DHCP lease with certain characteristics sets an i-thing up for failure at its next WiFi roaming event? A race between IPv4 and IPv6 assignment? A clever option offered by the more sophisticated DHCP server that is 'mis-interpreted' by the i-thing in the context of a proprietary roaming extension? I don't know but if I had the tools and skills I'd try to correlate failed roams with whatever happened at the previous address assignment. All with apologies for lack of knowledge and too much guess-work.

    Replies
    1. I am glad you found a workaround. That is another piece of the puzzle, certainly. Here I cannot now make it go wrong so cannot repeat my non-DHCP testing, sadly. This is also useful as another non-FireBrick case, suggesting it is something in the network set-up (we know IPv6 is a factor) such as DHCP settings which may be more common on some gateways.

    2. Is there nothing you can put on the iphone like wireshark - or do you have to "jailbreak" it?

      I suspect you're going to need to run a tcpdump on both ends (iPhone & AP) on a daily basis until it fails.

      That's the only way to see what triggers it at the IP layer.

      However I suspect that what you're seeing is at the PHY layer and is a bug or a workaround in iOS for some engineering defect in the device's design.

      I speculate here, but given Apple has history for this on the RF side I'd think it likely if not probable the latter is the case.

      Best of luck - if you do work out what it is and it is in fact iOS then I'd charge Apple for your work.

      If iOS was even as "open-source" as Android then no, but it's Apple's walled garden so they can pay the maintenance costs ;)

      Didn't Apple stuff used to "just work"?

      I refer to your TV stuff (not the first tale of woe I've heard on that) and various iOS upgrade nightmares I've had recounted to me over the last decade.

      Maybe it "just worked" when they didn't have many customers - or when they weren't so interested in monetising them. I digress :)

      Best of luck, if anyone can nail the cause here it might just be you....

  7. For what it's worth, we've seen roaming problems with iOS devices on Ruckus wifi kit (no Firebricks), but this seemed to be specifically related to using 802.1x and didn't occur with plain old WPA, so possibly not the same problem. I've also seen an article on the Cisco knowledgebase saying there are known problems with iOS devices roaming on Cisco access points.

    Although the factors that set this problem off in each case seem to be different (e.g. Ubiquiti + Firebrick in one case, Ruckus + 802.1x in another, etc.) it may all be the same iOS bug that is somehow being triggered by something that's common to all of these setups, even if that's something completely obtuse like "the bytes at offset in one of the wifi packets happen to spell 0xc0ffee" :)

  8. This is sounding more and more like the problem a client had (makoto's previous place). There was a subnet off the FB2700 with three UniFi APs (WPA2) and various wired machines for 'guest' and non-domain computers. Periodically clients couldn't get DHCP assignments and it looked as if the FB was ignoring requests. I think I observed it on other devices but it was primarily Apple clients that were affected. In an attempt to partition the problem I split the wired and wireless onto separate subnets with separate DHCP ranges and the problem simply went away.

  9. Don't know if you saw this (dated 8/4/17) :

    https://community.ubnt.com/t5/UniFi-Wireless/Problem-with-iPhones-roaming-between-access-points-including-in/td-p/1892537

    tl;dr of the post seems to be the iPhone not renewing the IP lease even when specifically told to do so.

    Replies
    1. That thread progresses and a couple of people are suggesting that "Wifi Assist" may be an issue for link-local addresses.

      Seems entirely plausible to me that "Wifi Assist" may be causing a race condition when roaming between APs with the same SSID.

      As is often the case, the "community" may know more than the manufacturer's support people :)
