RevK®'s ramblings: Next step in AP testing here

2017-04-04

Next step in AP testing here

I have tried quite hard to get the three APs here to break when using a FireBrick FB2700 as gateway on a separate subnet (i.e. WAN side of FB2700 on my main LAN here).

What we did was move from a set-up that broke on my main LAN to a separate subnet off the main LAN behind a Ubiquiti EdgeRouter. That worked! So I tried an FB2700 instead in the same set-up, and that worked too. So it was splitting off to a separate subnet with some sort of gateway that seemed to fix this somehow (rather than the specific choice of gateway equipment).

My working theory was that there must be some aspect of the network set-up that is somehow triggering this issue (whether or not that set-up is a bug or error). This would account for why FireBricks seem to be a common factor as well as UniFi and Apple. FireBricks are not an off-the-shelf Linux system, so they have very different default settings, and maybe that makes the problem set-up much more common. Well, it was an idea.

Ubiquiti had the problem immediately with an FB2700 that we sent them, so it sounds like a default set-up with very few changes would trigger it, but it did not do so here. I have now gone through matching the settings to those of the gateway on my main LAN. This includes things like leaving DNS on automatic, which announces the FireBrick itself as the only DNS server on each of IPv4 and IPv6. I even set up the extra VLAN for guest WiFi, which is separately firewalled but on the same subnet with proxy ARP/ND between the two LANs, just in case that was a trigger somehow. After some days of doing this now, it really is "just working", which is rather frustrating.

So this morning I am back on the main LAN as before. Hopefully this will "break" things once again and hopefully quite quickly. It may be a few days to be sure.

The techies at Ubiquiti have advised that a pcap on the actual AP itself may help, so the plan is, when it breaks, to leave my phone in the broken state (don't move it) and try to diagnose with pcaps on the APs.

To further diagnose I also plan to set the iPhone up with a static IPv4 config, as some sort of "DHCP throttling" may supposedly be to blame for this. I have double checked with the other developer on FireBrick, as we have both worked on the DHCP server, and neither of us knows of this "feature". However, it is worth investigating every avenue. Previous tests (albeit years ago I expect) showed the issue still happened with no DHCP involved. The problem may have changed since, so I'll repeat those tests to confirm. I'm not going to dismiss any ideas.

In case it is not obvious, when this started, years ago, the first assumption we had was that it had to be the FireBrick at fault, and I spent a long time testing things like static config to eliminate DHCP, and checking packet dumps very carefully for DHCP, ARP, ND, RA and RS traffic to try to find anything that would point to the FireBrick as the cause. Only after all of that testing did we raise it with Ubiquiti.
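For anyone wanting to do a similar check on their own dumps, here is a minimal sketch in Python of sorting raw Ethernet frames into those protocol buckets. It is not the tooling used for the tests described here. The function name `classify_frame` and the constants are purely illustrative, and it assumes plain Ethernet framing (optionally one 802.1Q VLAN tag), IPv4 DHCP on UDP ports 67/68, and no IPv6 extension headers in front of ICMPv6.

```python
import struct

# EtherType values (assumed plain Ethernet II framing).
ETH_ARP = 0x0806
ETH_IPV4 = 0x0800
ETH_IPV6 = 0x86DD
ETH_VLAN = 0x8100

# ICMPv6 message types used by neighbour discovery.
ICMPV6_TYPES = {133: "RS", 134: "RA", 135: "NS", 136: "NA"}


def classify_frame(frame: bytes) -> str:
    """Label a raw Ethernet frame as ARP, DHCP, RS, RA, NS, NA or other."""
    if len(frame) < 14:
        return "other"
    ethertype = struct.unpack("!H", frame[12:14])[0]
    offset = 14
    # Skip a single 802.1Q VLAN tag if present.
    if ethertype == ETH_VLAN and len(frame) >= 18:
        ethertype = struct.unpack("!H", frame[16:18])[0]
        offset = 18
    if ethertype == ETH_ARP:
        return "ARP"
    if ethertype == ETH_IPV4:
        ip = frame[offset:]
        if len(ip) < 20 or ip[9] != 17:  # IP protocol 17 = UDP
            return "other"
        ihl = (ip[0] & 0x0F) * 4  # IPv4 header length in bytes
        udp = ip[ihl:ihl + 8]
        if len(udp) < 8:
            return "other"
        sport, dport = struct.unpack("!HH", udp[:4])
        # DHCP for IPv4 uses UDP ports 67 (server) and 68 (client).
        return "DHCP" if {sport, dport} & {67, 68} else "other"
    if ethertype == ETH_IPV6:
        ip6 = frame[offset:]
        # Next header 58 = ICMPv6; extension headers are not handled.
        if len(ip6) < 41 or ip6[6] != 58:
            return "other"
        return ICMPV6_TYPES.get(ip6[40], "other")
    return "other"
```

Feeding each frame from a dump through this and counting the labels gives a quick view of whether the DHCP, ARP and ND exchanges are happening at all before going through them by hand. DHCPv6 (UDP ports 546/547) is deliberately left out of this sketch.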

I'll keep you posted...

P.S. Finally (Thursday) my phone failed. I confirmed that even with a static config it could not send or receive packets, even to a device on the same AP. I confirmed that roaming to another AP does fix it. I am leaving it on static IPv4 config now to test.

18 comments:

Steve Hill, Tuesday, 4 April 2017 at 10:59:00 BST
What about a separate machine with a wifi card logging what is actually being sent over the air at the 802.11 protocol level? You should be able to see the client sending authentication requests, DHCP, etc (or not) and whether anything is actually replying to that traffic.
John, Tuesday, 4 April 2017 at 13:00:00 BST
I don't know whether you have one to hand (I know at least one of your staff uses them) but MikroTik wireless access points have some pretty comprehensive WiFi sniffing capabilities.
Cecil Ward, Tuesday, 4 April 2017 at 23:09:00 BST
Btw there are bugs in iOS which mean that when it gets into a certain _state_ you can't save static IPv4 settings in the "settings > wifi" app. Once it gets in this state it just goes back to the dhcp pane when you go out and back into settings, and keeps doing this again and again. The fix is to do a "forget network" which seems to delete the problematic state information. Other users have complained about this in Apple forums. I have submitted detailed bug reports to Apple, via two different channels in the vain hope they might fix it. Just don't get caught out, as it is maddening and frustrating, and spread the word.
Anonymous, Tuesday, 4 April 2017 at 23:37:00 BST
Wireshark on Linux with a suitable wifi dongle (I was using one of OmniPeak's adaptors) does a good job of capturing all on-air traffic when in monitor mode.
If the WPA key used for the test is provided, then traffic can be decrypted.
I do remember that it took some manual convincing at the command line, outside of Wireshark, to get the adaptor into the right mode, and some trouble keeping it there.
Unknown, Wednesday, 5 April 2017 at 10:02:00 BST
I had another bad case of this last night. I get the distinct feeling this happens when I am getting roughly equal signals from each AP, perhaps something in the roaming logic is broken on UBNT's side. Sometimes it resolves quickly so I just notice a long pause in IPv4 connectivity. However, sometimes the iOS device reverts to a 169.254 address (but maintains the IPv6) and I have to disconnect/reconnect to get my IPv4 back exactly as you described.

This is using a MikroTik, not a FireBrick, router - no fancy/complex set-ups, just DHCP IPv4 and SLAAC'd IPv6 and the router in the same building :).

Don't argue with the cook, Wednesday, 5 April 2017 at 21:23:00 BST
I hesitate to make a suggestion based on very circumstantial evidence after so much intensive investigation by so many but, for what it's worth...
I had apparently the same problem last year where the only common factors were i-things and IPv6, no Ubiquiti, no FireBrick. Well, the problem I had was more consequential but, once my ears had recovered and on the occasions that I caught the i-thing mid-flight, I was able to confirm that it had just a 169... address. The real sufferer had generally just roamed from the kitchen with a cup of tea looking forward to continuing with her article on Mumsnet (or somesuch) from an armchair.
The three APs involved are just old home routers (but all 5GHz). DHCP (IPv4 and IPv6) was by ISC DHCP servers on an OpenSUSE machine. I rebuilt the OpenSUSE machine at the New Year and 'temporarily' enabled DHCP on the Draytek router and SLAAC for IPv6. Temporarily hasn't ended yet and the i-things have behaved perfectly for 3 months. What can this mean?
Could it be that something about a DHCP lease with certain characteristics sets an i-thing up for failure at its next WiFi roaming event? A race between IPv4 and IPv6 assignment? A clever option offered by the more sophisticated DHCP server that is 'mis-interpreted' by the i-thing in the context of a proprietary roaming extension? I don't know but if I had the tools and skills I'd try to correlate failed roams with whatever happened at the previous address assignment. All with apologies for lack of knowledge and too much guess-work.
Steve Hill, Thursday, 6 April 2017 at 09:59:00 BST
For what it's worth, we've seen roaming problems with iOS devices on Ruckus wifi kit (no FireBricks), but this seemed to be specifically related to using 802.1x and didn't occur with plain old WPA, so possibly not the same problem. I've also seen an article on the Cisco knowledgebase saying there are known problems with iOS devices roaming on Cisco access points.

Although the factors that set this problem off in each case seem to be different (e.g. Ubiquiti + FireBrick in one case, Ruckus + 802.1x in another, etc.) it may all be the same iOS bug that is somehow being triggered by something that's common to all of these set-ups, even if that's something completely obtuse like "the bytes at offset in one of the wifi packets happen to spell 0xc0ffee" :)
JJ, Friday, 7 April 2017 at 19:38:00 BST
This is sounding more and more like the problem a client had (makoto's previous place). There was a subnet off the FB2700 with three UniFi APs (WPA2) and various wired machines for 'guest' and non-domain computers. Periodically clients couldn't get DHCP assignments and it looked as if the FB was ignoring requests. I think I observed it on other devices but it was primarily Apple clients that were affected. In an attempt to partition the problem I split the wired and wireless onto separate subnets with separate DHCP ranges and the problem simply went away.
Anonymous, Sunday, 9 April 2017 at 09:52:00 BST
Don't know if you saw this (dated 8/4/17):

https://community.ubnt.com/t5/UniFi-Wireless/Problem-with-iPhones-roaming-between-access-points-including-in/td-p/1892537

tl;dr of the post seems to be the iPhone not renewing the IP lease even when specifically told to do so.