This means it is taking days to "catch it in the act". The good news is that this happened last week, and I confirmed there was good signal but no connectivity - no IP or anything, even to a device on the same AP. So I changed the config to be fixed IP.
Today it has happened again and we have learned some concrete details of the problem. Also, it has happened in my study, and so I have the phone in the state, captured, and on charge, sat here. It is not between two APs, so should stay broken.
So what have we learned so far?
The phone was set to a completely static IPv4 config, so no DHCP. This means the problem is not triggered by the way DHCP works or by the FireBrick or gateway doing DHCP in an odd way - that eliminates a load of possible concerns from previous testing. The fact that many people came forward with the same issue on non-FireBrick networks was also a relief.
The controller for the APs claims the phone is not attached: it shows that it was, but that it is not now. This is a clue. The phone thinks it is, and shows full signal. So the underlying issue here is a mismatch: the phone thinks it is associated and the APs think it is not. This has to be a big step forward and suggests it is the roaming process itself failing somehow.
In this state (perhaps unsurprisingly), even with the fixed config, we cannot get any packets to flow, even to another device on the same AP (and subnet).
At this point, I am keeping the phone on charge in here in the broken state as long as possible, and have set up firewall access for Ubiquiti engineers to have full access to the APs and the controller and see what they can find. I hope they find more clues to the problem, but I appreciate it is tricky with some issues like this.
We're doing all we can to get to the bottom of this.
The phone was in the same state having left it all night. So I started to do monitor-mode wifi dumps on my MacBook as requested (Wireshark is working quite well on macOS now). On the AP in here I did not see the MAC of the iPhone at all. I've sent them the dump anyway.
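For anyone wanting to check a monitor-mode dump for a particular MAC without scrolling through Wireshark, here is a minimal sketch of the idea. It is not the tool used here, and it makes some assumptions: the capture is saved in classic pcap format (not pcapng, which Wireshark uses by default), the link type is 802.11 with a radiotap header, and the function name `count_frames_from` is made up for illustration. It simply counts frames whose transmitter address (address 2 in the 802.11 header) matches the MAC you give it.

```python
import struct

LINKTYPE_RADIOTAP = 127  # pcap link type: 802.11 frames preceded by a radiotap header


def count_frames_from(pcap: bytes, mac: str) -> int:
    """Count 802.11 frames in a classic pcap whose transmitter address is `mac`."""
    target = bytes.fromhex(mac.replace(":", ""))

    # The magic number tells us the byte order of the file and record headers.
    magic = pcap[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"
    else:
        raise ValueError("not a classic pcap file (pcapng is not handled here)")

    # The link type is the last field of the 24-byte global header.
    (linktype,) = struct.unpack_from(endian + "I", pcap, 20)
    if linktype != LINKTYPE_RADIOTAP:
        raise ValueError("capture is not 802.11 with radiotap headers")

    count, pos = 0, 24
    while pos + 16 <= len(pcap):
        # Per-packet record header: ts_sec, ts_usec, captured length, original length.
        _sec, _usec, incl_len, _orig_len = struct.unpack_from(endian + "IIII", pcap, pos)
        frame = pcap[pos + 16 : pos + 16 + incl_len]
        pos += 16 + incl_len
        if len(frame) < 4:
            continue
        # Radiotap fields are always little-endian; bytes 2-3 hold the header length.
        (rt_len,) = struct.unpack_from("<H", frame, 2)
        dot11 = frame[rt_len:]
        # Address 2 (transmitter) is bytes 10-15 of the 802.11 header.
        # Short control frames such as ACK and CTS have no address 2 and are skipped.
        if len(dot11) >= 16 and dot11[10:16] == target:
            count += 1
    return count
```

Calling `count_frames_from(open("dump.pcap", "rb").read(), "02:11:22:33:44:55")` (file name and MAC are placeholders) returning zero would match the situation described above: nothing from the phone seen on that channel.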
Sadly, while trying to get the laptop onto another channel to dump that, I made a config change to the APs, which made the phone spring into life... That has to be a clue for them, I suspect.
- Not DHCP related
- Failure mode is phone thinks it is associated and AP thinks not
- We know wifi off/on on the phone fixes it
- We know a roam to another AP by the phone fixes it
- We now know reconfiguring the AP (even leaving the SSID in place) fixes it
Ubiquiti think that any packet from a phone which thinks it is associated should cause a de-auth from the AP, which should cause the phone to re-connect. They can't dump that on the AP, hence monitor mode. Sadly I did not capture any packets from the phone on that channel, so it is not conclusive.