Showing posts with label UNIFI. Show all posts

2017-05-04

Apple and Unifi

Quick update...

I have seen this on the MacBook as well as the iPhone.

Still bugs me at home reading twitter in the bath. Switch changes have not helped.

Symptoms are that the device thinks it is connected to (strong) wifi but the Unifi APs say not. Happens even with no DHCP involved. Happens between devices on the same AP, so not switch related!

I did the dump on all channels of the APs and it showed the Apple device not trying to send anything.

Seems to be triggered by IPv6, which commonly means FireBrick, but not always by any means, and a few people say they have seen the same even when not using Unifi!

So may be totally Apple borked.

2017-04-10

Progress on iPhone roaming

For whatever reason, the instances of the roaming issue have massively reduced in my house. The main difference was that all APs on same PoE switch, but could be the phase of the moon for all I know at this stage. It is a bugger to track down this one.

This means it is taking days to "catch it in the act". The good news is that this happened last week, and I confirmed there was good signal but no connectivity - no IP or anything even to a device on same AP. So I changed config to be fixed IP.

Today it has happened again and we have learned some concrete details of the problem. Also, it has happened in my study, and so I have the phone in the state, captured, and on charge, sat here. It is not between two APs, so should stay broken.

So what have we learned so far?

The phone was set to a completely static IPv4 config, so no DHCP. This means the problem is not triggered by the way DHCP works or by the FireBrick or gateway doing DHCP in an odd way - that eliminates a load of possible concerns from previous testing. The fact that many people came forward with the same issue on non FireBricks was also a relief.

The controller for the APs claims the phone is not attached, it shows it was, but that it is not now. This is a clue. The phone thinks it is, and shows full signal. So the underlying issue here is a mismatch so the phone thinks it is associated and the APs think not. This has to be a big step forward and suggests it is the roaming process itself failing somehow.

In this state (perhaps unsurprisingly), even with the fixed config, we cannot get any packets to flow, even to another device on the same AP (and subnet).

What next?

At this point, I am keeping the phone on charge in here in the broken state as long as possible, and have set up firewall access for Ubiquiti engineers to have full access to the APs and the controller and see what they can find. I hope they find more clues to the problem, but I appreciate it is tricky with some issues like this.

We're doing all we can to get to the bottom of this.

Update...

The phone was in the same state having left it all night. So I started to do monitor-mode wifi dumps on my MacBook as requested (wireshark is working quite well on MacOS now). On the AP in here I did not see the MAC of the iPhone at all. I've sent them the dump anyway.

Sadly, trying to get laptop on another channel to dump that I made a config change to APs, which made the phone spring in to life... That has to be a clue for them I suspect.

So...
  • Not DHCP related
  • Failure mode is phone thinks associated and AP thinks not
  • We know wifi off/on on phone fixes
  • We know roam to another AP on phone fixes
  • We now know reconfiguring the AP (even leaving SSID in place) fixes it
Ubiquiti think that any packet from the phone which thinks it is associated should cause a de-auth from the AP which should cause the phone to re-connect. They can't dump that on the AP, hence monitor mode. Sadly I did not capture any packets from the phone on that channel so not conclusive.
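
The reasoning above can be sketched as a small check over a capture. This is only an illustration, assuming the monitor-mode dump has already been decoded into simple records; the `Frame` type, the `classify_capture` function and the MAC addresses are all made up for the example and are not anything Ubiquiti or Wireshark provide.

```python
from typing import Iterable, NamedTuple


class Frame(NamedTuple):
    """Hypothetical decoded record from a monitor-mode capture."""
    src: str   # transmitter MAC address
    dst: str   # receiver MAC address
    kind: str  # simplified frame type, e.g. "data", "deauth", "beacon"


def classify_capture(frames: Iterable[Frame], phone_mac: str, ap_mac: str) -> str:
    """Say what a capture tells us about the expected de-auth behaviour."""
    phone = phone_mac.lower()
    ap = ap_mac.lower()
    phone_sent = False
    for frame in frames:
        src, dst = frame.src.lower(), frame.dst.lower()
        if src == phone:
            # The phone transmitted something, so the AP had a chance to react.
            phone_sent = True
        elif phone_sent and src == ap and dst == phone and frame.kind == "deauth":
            return "deauth seen"
    if phone_sent:
        return "no deauth after phone traffic"
    # No frames from the phone at all: this is the case described above.
    return "inconclusive"
```

With a capture that contains nothing from the phone, as happened here, the function returns "inconclusive", which is why that dump neither confirms nor rules out the theory.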

2017-04-04

Next step in AP testing here

I have tried quite hard to get the three APs here to break when using a FireBrick FB2700 as gateway on a separate subnet (i.e. WAN side of FB2700 on my main LAN here).

What we did is move from a set-up that broke on my main LAN, to a separate subnet off the main LAN and a Ubiquiti EdgeRouter. That worked! So I tried an FB2700 instead in same set up, and that worked too. So it was splitting off to a separate subnet with some sort of gateway that seemed to fix this somehow (rather than specific choice of gateway equipment).

My working theory was that there must be some network set-up aspect that is somehow triggering this issue (whether that set-up is a bug or error or not). This would account for why FireBricks seem to be a common factor as well as Unifi and Apple. FireBricks are not an off the shelf linux system so have very different default settings, and maybe that leads to the problem set-up to be much more common. Well, it was an idea.

Ubiquiti had the problem immediately with an FB2700 that we sent them, so sounds like a default setup with very few changes would trigger it, but it did not do so here. I have now gone through matching settings to the gateway on my main LAN. This includes things like leaving DNS to automatic which announces the FireBrick itself as one DNS server only on each of IPv4 and IPv6. I even set up the extra VLAN for guest WiFi which is separately firewalled but on the same subnet with proxy ARP/ND between the two LANs, just in case that was a trigger somehow. After some days of doing this now, it really is "just working", which is rather frustrating.

So this morning I am back on the main LAN as before. Hopefully this will "break" things once again and hopefully quite quickly. It may be a few days to be sure.

The techies at Ubiquiti have advised that a pcap on the actual AP itself may help, so the plan is, when it breaks, leave my phone in the broken state (don't move it) and try and diagnose with pcaps on the APs.

To further diagnose I also plan to set the iPhone with static IPv4 config, as some sort of "DHCP throttling" may supposedly be to blame for this. I have double checked with the other developer on FireBrick, as we have both worked on the DHCP server, and neither of us know of this "feature". However, it is worth investigating every avenue. Previous tests (albeit years ago I expect) showed the issue still happened with no DHCP involved. The problem may have changed since, so I'll repeat those tests to confirm. I'm not going to dismiss any ideas.

In case it is not obvious, when this started, years ago, the first assumption we had is that it has to be the FireBrick at fault, and I spent a long time testing things like static config to eliminate DHCP, and checking packet dumps very carefully for DHCP, ARP, ND, RA, RS protocols to try and find anything that would point to FireBrick as the cause. Only after all of that testing did we raise it with Ubiquiti.

I'll keep you posted...

P.S. Finally (Thursday) my phone failed, I confirmed even a static config could not send or receive packets, even to a device on same AP. I confirmed roaming to another AP does fix. I am leaving on static IPv4 config now to test.

2017-04-03

Working with ubiquiti

This is a separate post as something seems to have kicked off on twitter this morning. And first off I'd like to apologise to Brandon from Ubiquiti for swearing.

Ubiquiti have been very helpful trying to get to the cause of a long standing issue impacting a small number of people, but including myself. It is a very frustrating issue which has led me to consider scrapping using the Unifi APs on more than one occasion, but I do like the Unifi kit and I would like to get this actually resolved and continue selling it.

What do we think we know?
  • This only seems to impact Apple - it is seen on iPhones mostly - not android.
  • This only seems to impact Unifi APs - not seen using other APs yet.
  • This almost always seems to be FireBrick as gateway router (at least one case of not FireBrick)
  • This is a rare situation, with many people using hundreds of Unifi APs with no problem. Similarly lots of people using Apple with no problem. Similarly lots of people using FireBricks with no problem.
  • It seems sticky - when a set up has the issue, it stays. When a set up does not have the issue, that stays OK. It is also very intermittent and can seem to take days to be sure if fixed or not.
  • This seems to be only where IPv6 is on the network, which is one reason most people don't see it, and may also be a reason why cases where an IPv6 friendly router sold by an IPv6 friendly ISP is the most common case we have seen (i.e. why FireBricks in almost all cases).
As I say, Ubiquiti have been very helpful - they sent us two switches, an EdgeRouter and a security gateway. I was only expecting a switch from what was said, so thank you. It has allowed more testing. We sent an FB2700, which has also allowed more testing. The results are interesting, to say the least.
  • Brandon has advised that using FB2700 they see the problem right away. This is good, we have created a set up with the problem. He confirms that using other gateways he does not see it. So something about the network when using a FireBrick seems to be able to trigger this somehow. Oddly he has also seen up to 60 seconds "delay getting an IP" which is not one we have seen. The problem we have seen is permanent - you lose all IPv4 and IPv6 on a roam (intermittently) and do not get an IP even after 60 seconds, all you see is the 169.254 address for when you don't get a reply. I assume that is not what Brandon was seeing, but actually a "delay", which is rather odd. If it is, then that explains the phantom delay and means he has exactly reproduced the problem.
  • Here, we tried moving all APs on to a unifi switch connected to our main LAN (and using FB6000 as gateway). It did not help. That eliminates the switches I have which could have been messing with multicast or something.
  • So I set up a separate subnet for the APs, connected to a Unifi switch, and that then connected via their EdgeRouter. Sadly I needed help setting up IPv6, but got there, in spite of some of my typos. It seemed to fix things - great.
  • So I changed to using an FB2700 on the same separate subnet and same Unifi switch, just swapping one box, and again it is working. I have made the set up as close to the main LAN as I can, same VLANs etc, and the APs are the same config exactly - not changed.
This means the separate subnet appears to be the fix rather than change of router.

It also means a really simple set up of FB2700, switch, and three APs here just worked, but Brandon, with presumably a similarly simple set up, immediately failed. That would be nice to try and compare.

The roaming also seems to happen, apparently as expected, with no interaction with the gateway. No DHCP or anything, just switches over from one AP to another. So it is hard to see how any gateway can be the cause of the problem.

At this point I am wondering if somehow it is a specific configuration of a network that breaks it - I hesitate to suggest the actual IPs in use somehow. I also wonder if it is something else on the LAN causing this - but that does not fit with Brandon's comments.

Unfortunately we have reached an impasse with Ubiquiti - they have been very helpful up until now, and thanks for that. But even though this only happens with their APs, and only happens with Apple products, they have now concluded it must be FireBrick and "So at this point I don't think it's fair for you to ask us to help you resolve this.  In doing so your are asking us to help your company make a competing product, for free." and now "So I'm out. Refuse to interact under such disrespectful terms."

We'll continue to look for the issue. I suspect, when we find it, it will not be something where any finger of blame can be pointed at a single bit of kit. But nice to know the spirit of co-operation is alive and well, up to a point. Thanks for your help so far.

FYI, I don't care that Ubiquiti have a "competing product". As an ISP we work with competition all of the time for the greater good. I'd be happy to continue to work together to get to the bottom of this anyway - all of our customers would benefit from that. I will, of course, share our findings, even if we find a bug in something FireBrick is doing.

P.S. My next avenue of investigation is differences in configuration, no matter how small, to try and see if we can find a network set-up difference. It is very likely that a typical (mostly default) FireBrick network will have some notable differences to a typical (mostly default) non FireBrick set-up...

P.P.S. You gotta love it - Brandon has complained to FireBrick about one of their employees (me) swearing at him. This is from the country that actually believes in free speech.

2017-03-30

Where are we with Unifi and iPhone roaming?

As you will know I have spent a long time trying to understand the issues we see with the Unifi access points and roaming between them using an iPhone.

A&A sell these, and some of their PoE switches as well. We may start selling more stuff in due course. Overall the Ubiquiti stuff is pretty impressive and there is an increasingly large range of devices. The WiFi is technically very good at the hardware level, and we sell in boxes of three even for businesses.

So it is important to us that they work. I also use them at home, and my family treat me as tech support (obviously) so it is important to me if I want a quiet life. They were all round this Sunday - we had sort of cancelled Mothering Sunday for obvious reasons, but everyone came round and we had pizza and chatted. They all told me in no uncertain terms that the WiFi here is crap and they even turn it off on their phones and use 3G/4G when round the house. They all use iPhones. That really is a bad sign.

I myself spend a lot of my time in my office at home, but whenever I leave for the rest of the house I find I have to turn wifi off and back on. Though, technically, it is far from every time and can even be the odd day with no apparent problem, whilst other days I see it many times. The problem is, as always, you remember the times it breaks.

This also makes testing hard - something changes and you watch it, and see you spend all day with no issues and think it fixed, when actually it is just intermittent, still, just as before.

I have an AC Pro and two AC LR in the house, and they are now on latest firmware. I thought that may have helped, but no. We also tried changing switches, and thought that had helped, but no.

The current state is that I have managed to mess with wiring enough in the house to actually have all three APs off a single Ubiquiti EdgeSwitch8 - one of their switches - so as to eliminate the switches as the cause of the issue.

Tip: Some of the Ubiquiti kit is still passive 24V PoE, and their switches are great as they support that, but you have to configure on the switch! It is not automatic as PoE normally is.

We also did tests with just IPv4 on the LAN, only for a few days, but that seemed to just work. This means the current thinking is that it is the IPv6 being present that is causing the issues. It could be some combination of bugs in iPhone, Ubiquiti, and even FireBrick code, for all we know. Reports from others that use this kit say no problems. We did a lot on FireBrick to try and eliminate that as the cause. However, with IPv6 on the LAN, even with IPv4 being static on the iPhone and no DHCP, it can still fail. Setting up DHCPv6 on the LAN does not seem to change things, we normally use just RA/SLAAC.

The symptoms are a sudden lack of connectivity when it roams. For a few seconds the phone may show the old IP addresses, but quickly switches to showing no IPs and then to showing the 169 auto addresses. Wait as long as you like, it is broken. You need to turn WiFi off and on (on the phone) to fix it.
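
As a rough illustration of spotting that last state from a script, here is a minimal sketch using Python's standard `ipaddress` module. The "169 auto addresses" are the IPv4 link-local range 169.254.0.0/16, which a device assigns itself when it gets no other address. The function name `is_auto_ipv4` and the sample addresses are just for this example.

```python
import ipaddress

# IPv4 link-local range that a device falls back to when it has no other address.
AUTO_RANGE = ipaddress.ip_network("169.254.0.0/16")


def is_auto_ipv4(address: str) -> bool:
    """Return True if the address is a self-assigned 169.254.x.x address."""
    parsed = ipaddress.ip_address(address)
    # Only IPv4 is of interest here; IPv6 link-local (fe80::/10) is normal on any interface.
    return parsed.version == 4 and parsed in AUTO_RANGE


if __name__ == "__main__":
    for sample in ("169.254.17.3", "192.168.1.20"):
        state = "auto address, looks broken" if is_auto_ipv4(sample) else "configured"
        print(sample, state)
```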

Part of the reason for writing this up again is for the engineers at Ubiquiti - they are trying to fix this. Good news (though I seem to have to poke on twitter to get things to progress, sorry guys). They sent me some switches and a router and gateway. Big thank you - nice to eval the kit as some of it we may start selling. We sent them a fully loaded FireBrick FB2700.

At this point the next stage is for me to try and create a setup using their kit as the gateway on the LAN and so doing IPv4 DHCP and IPv6 RA/SLAAC, and see if that breaks still. It is a pain as I cannot exactly replace my router as it is the office router. So I have set up a new IPv4 and IPv6 subnet for WiFi use. Not ideal, but will do for testing.

They, for their part, need to try and set up with a FireBrick to do the same. Can they make it break? Obviously I am on hand to help them set that up.

So setting up the Edge Router. It is a simple set up. No NAT. Fixed IP /24 IPv4 and /64 IPv6 on LAN with DHCP serving IPv4, and RA for SLAAC doing IPv6. On WAN is a simple IPv4 which can be DHCP client or static, and a simple IPv6 which can be SLAAC or static. Obviously need to set IPv6 DNS servers for RA on LAN.

So far I have managed to set up:-
  • Firewall off
  • NAT off
  • Static IPv4 on WAN (a /24 for testing)
  • Gateway 0.0.0.0/0 route on WAN, can ping out to internet
  • Static IPv6 on WAN (a /64, obviously, from my PI block)
  • Gateway for IPv6 on WAN
  • Static IPv4 on LAN
  • DHCP IPv4 on LAN
  • Static IPv6 on LAN
  • RA on LAN configured by ubiquiti for me
And I am stuck. So waiting on Ubiquiti at this stage. Suffice to say I don't think they are a threat to FireBrick as this is all pretty simple on a FireBrick.

No word on where they are with FireBricks. Obviously keen to help them test the other way around. To be fair, if this is either a bug in FireBrick in some way, or more likely, something we can work around by changing FireBrick in some way, I am more than happy to do the work to make that happen. We have implemented a number of "pragmatic" aspects to the way the FireBrick works (sometimes on a config setting so as to be "standard" by default) and I'd really like this WiFi kit to work...

I think best if I update this post as we make progress for a bit rather than new posts. Let's get to the bottom of this, shall we?

Updates:
  • From comments, it is not just FireBrick, but is some rare combination of things clearly, and seems to be Ubiquiti APs and iPhones and "something" else.
  • IPv4 gateway not working was user error, I mistyped as 0.0.0.0/24 for some reason
  • Someone from Ubiquiti, in Austin, Texas, in the middle of the night, is working with me on this now.
  • IPv6 gateway was not working as I was using the zero address in the /64 which the ER had assumed it can have making it a router on the WAN side, which is unexpected. I changed to the ::1 in the /64.
  • Now wifi all on ER not using FireBrick, thanks to guys from Ubiquiti working in middle of the night. Roaming appears to be working, more testing to do. I am being sent a cap of working roaming as seen by ER, and will get same from FireBrick.
  • We now have two interchangeable set-ups. Both on same sets of IPs as a separate subnet for my WiFi run as a LAN side of a router. I have the Ubiquiti EdgeRouter set up, and the same set up on an FB2700. At present both seem to "just work" but as I say, this can take a while to see the fault. I have lots of logging. One clue is that I am sure I have seen the iPhone re-do DHCP on roam, and the current testing (on both set-ups) does not do that - it just flips over to new AP basically seamlessly. So, just more testing for now. If both these "just work" we have to go back and see what else on the main LAN could be upsetting things in any way.
  • This morning (Saturday), still no apparent roaming issues! This is using a FireBrick but on a separate LAN the same as the ER set-up. Again, if the roaming happens without involving the gateway router, no way the FireBrick can be to blame. If it is OK for a few days I look to swap back to main LAN and see if that shows the problem again.
  • Sunday, still using a separate FireBrick as gateway, and have set up the second VLAN that was being used before on it. Still not failing. This makes no sense at all.
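
The two gateway mistakes in the updates above are easy to show with Python's standard `ipaddress` module. This is only a sketch: the prefix `2001:db8:0:1::/64` is a documentation prefix standing in for the real WAN /64, and `8.8.8.8` is just an example destination. The zero address of a /64 is the subnet-router anycast address, which is why a router will assume it can use it.

```python
import ipaddress

# IPv4: a default route must be 0.0.0.0/0; the mistyped /24 only covers 0.0.0.0 to 0.0.0.255.
default_route = ipaddress.ip_network("0.0.0.0/0")
mistyped_route = ipaddress.ip_network("0.0.0.0/24")
destination = ipaddress.ip_address("8.8.8.8")  # example internet address

default_matches = destination in default_route    # True: everything matches /0
mistyped_matches = destination in mistyped_route  # False: so no route to the internet

# IPv6: the all-zeros address in a /64 is the subnet-router anycast address.
wan_prefix = ipaddress.ip_network("2001:db8:0:1::/64")  # stand-in for the real WAN /64
zero_address = wan_prefix.network_address               # 2001:db8:0:1::
gateway_address = wan_prefix.network_address + 1        # 2001:db8:0:1::1

if __name__ == "__main__":
    print("0.0.0.0/0 matches:", default_matches)
    print("0.0.0.0/24 matches:", mistyped_matches)
    print("zero address:", zero_address)
    print("gateway to use instead:", gateway_address)
```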

QR abuse...

I'm known for QR code stuff, and my library, but I have done some abuse of them for fun - I did round pixels  rather than rectangular, f...