RevK®'s ramblings: Working with ubiquiti

2017-04-03

Working with ubiquiti

This is a separate post as something seems to have kicked off on twitter this morning. And first off I'd like to apologise to Brandon from Ubiquiti for swearing.

Ubiquiti have been very helpful trying to get to the cause of a long standing issue impacting a small number of people, but including myself. It is a very frustrating issue which has led me to consider scrapping using the Unifi APs on more than one occasion, but I do like the Unifi kit and I would like to get this actually resolved and continue selling it.

What do we think we know?

This only seems to impact Apple - it is seen on iPhones mostly - not android.
This only seems to impact Unifi APs - not seen using other APs yet.
This almost always seems to be FireBrick as gateway router (at least one case of not FireBrick)
This is a rare situation, with many people using hundreds of Unifi APs with no problem. Similarly lots of people using Apple with no problem. Similarly lots of people using FireBricks with no problem.
It seems sticky - when a set up has the issue, it stays. When a set up does not have the issue, that stays OK. It is also very intermittent and can seem to take days to be sure if fixed or not.
This seems to happen only where IPv6 is on the network, which is one reason most people don't see it. It may also be why the most common case we have seen is an IPv6 friendly router sold by an IPv6 friendly ISP (i.e. why it is FireBricks in almost all cases).

As I say, Ubiquiti have been very helpful - they sent us two switches, an edge router and a security gateway. I was only expecting a switch from what was said, so thank you. It has allowed more testing. We sent an FB2700, which has also allowed more testing. The results are interesting, to say the least.

Brandon has advised that using the FB2700 they see the problem right away. This is good, we have created a set up with the problem. He confirms that using other gateways he does not see it. So something about the network when using a FireBrick seems to be able to trigger this somehow. Oddly he has also seen up to 60 seconds "delay getting an IP" which is not something we have seen. The problem we have seen is permanent - you lose all IPv4 and IPv6 on a roam (intermittently) and do not get an IP even after 60 seconds, all you see is the 169.254 address you get when there is no DHCP reply. I assume that is not what Brandon was seeing, but actually a "delay", which is rather odd. If it is, then that explains the phantom delay and means he has exactly reproduced the problem.
Here, we tried moving all APs on to a unifi switch connected to our main LAN (and using FB6000 as gateway). It did not help. That eliminates the switches I have which could have been messing with multicast or something.
So I set up a separate subnet for the APs, connected to a Unifi switch, and that then connected via their EdgeRouter. Sadly I needed help setting up IPv6, but got there, in spite of some of my typos. It seemed to fix things - great.
So I changed to using an FB2700 on the same separate subnet and same Unifi switch, just swapping one box, and again it is working. I have made the set up as close to the main LAN as I can, same VLANs etc, and the APs are the same config exactly - not changed.

This means the separate subnet appears to be the fix rather than change of router.

It also means a really simple set up of FB2700, switch, and three APs here just worked, but Brandon, with presumably a similarly simple set up, immediately failed. That would be nice to try and compare.
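To compare set-ups like this, it helps to record which state a client ends up in after each roam. The following is only a minimal sketch using Python's standard `ipaddress` module; the function name and labels are made up for illustration and are not part of any FireBrick or Unifi tooling. It labels an address as the self-assigned 169.254 fallback, an ordinary IPv6 link-local address, or a properly configured address.

```python
import ipaddress

def classify_address(addr: str) -> str:
    """Return a rough label for an address seen on a client after a roam.

    The labels are hypothetical names chosen for this sketch.
    """
    ip = ipaddress.ip_address(addr)
    if ip.version == 4 and ip.is_link_local:
        # 169.254.0.0/16: the client gave up waiting and assigned itself an address
        return "self-assigned"
    if ip.version == 6 and ip.is_link_local:
        # fe80::/10: normal, every IPv6 interface has one of these
        return "ipv6-link-local"
    return "configured"

if __name__ == "__main__":
    # Documentation/example addresses only, not taken from any real network
    for a in ("169.254.37.12", "192.0.2.10", "fe80::1", "2001:db8::10"):
        print(a, classify_address(a))
```

A client that shows only "self-assigned" and "ipv6-link-local" after a roam would match the permanent failure described above.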

The roaming also seems to happen, apparently as expected, with no interaction with the gateway. No DHCP or anything, just switches over from one AP to another. So it is hard to see how any gateway can be the cause of the problem.

At this point I am wondering if somehow it is a specific configuration of a network that breaks it - I hesitate to suggest the actual IPs in use somehow. I also wonder if it is something else on the LAN causing this - but that does not fit with Brandon's comments.

Unfortunately we have reached an impasse with Ubiquiti - they have been very helpful up until now, and thanks for that. But even though this only happens with their APs, and only happens with Apple products, they have now concluded it must be FireBrick and "So at this point I don't think it's fair for you to ask us to help you resolve this. In doing so your are asking us to help your company make a competing product, for free." and now "So I'm out. Refuse to interact under such disrespectful terms."

We'll continue to look for the issue. I suspect, when we find it, it will not be something where any finger of blame can be pointed at a single bit of kit. But nice to know the spirit of co-operation is alive and well, up to a point. Thanks for your help so far.

FYI, I don't care that Ubiquiti have a "competing product". As an ISP we work with competition all of the time for the greater good. I'd be happy to continue to work together to get to the bottom of this anyway - all of our customers would benefit from that. I will, of course, share our findings, even if we find a bug in something FireBrick is doing.

P.S. My next avenue of investigation is differences in configuration, no matter how small, to try and see if we can find a network set-up difference. It is very likely that a typical (mostly default) FireBrick network will have some notable differences to a typical (mostly default) non FireBrick set-up...

P.P.S. You gotta love it - Brandon has complained to FireBrick about one of their employees (me) swearing at him. This is from the country that actually believes in free speech.

43 comments:

John, Monday, 3 April 2017 at 10:02:00 BST
From years of working in tech support, I recognise Ubiquiti's current position. It's a very common, but rather foolish one - "We don't know what the problem is, so therefore it's the other guy's fault."

If you draw up a 2x2 matrix of the two attitudes ("We will investigate" and "Other guy's fault") and the two possible explanations ("Really our fault" and "Really their fault") then the company which takes Ubiquiti's position loses every time.

It's a mistake which has been made countless times before.

Unknown, Monday, 3 April 2017 at 11:07:00 BST
The language of competition is interesting in this context.

The FireBrick may be a competitor to one part of Ubiquiti's stack — its range of routers — but, at the same time, FireBrick users may well buy other Ubiquiti products, such as their WAPs, and hope for seamless interoperability. The more routers which work well with their WAPs, the better — but the more competition they have for their routers.

I might have used "Thanks for the opportunity of looking at this. It appears that this is something more to do with the FireBrick than our WAPs and, while we'd love this to be fixed, I'm afraid that we can't justify the cost of the engineering/support time spent on this, rather than on other issues, for a prolonged investigation." :)

Anonymous, Monday, 3 April 2017 at 12:36:00 BST
Seems to me they have the same attitude to making it work as to making it secure: https://www.theregister.co.uk/2017/03/16/ubiquiti_networking_php_hole/

You've far more patience with this than me - I'd have flashed OpenWrt months ago.

Anonymous, Monday, 3 April 2017 at 13:43:00 BST
So it's currently working with a FireBrick, yes?

If so then what happens if you restart the Unifi Controller? Does it still work?

The fact it "worked" when you changed subnet makes me wonder if it's something to do with reprovisioning.

You have a Unifi switch in there so changes on subnets should have resulted in reprovisioning Unifi WAPs/Switch via the controller.

Probably dumb and you've done it....

Tony Hoyle, Monday, 3 April 2017 at 14:03:00 BST
I once had an odd case where the FB wouldn't hand out DHCP leases to a device because it thought the device already had one.. and instead of replying with the same/new address just went silent.

That was a config quirk though - I'd just setup a reservation on a device and it had cached the dynamically assigned address previously.. fixed by clearing the cache when I make such changes.

I wonder if a similar oddity is happening with the apple devices.

Chad H, Monday, 3 April 2017 at 15:22:00 BST
Brandon, just a nickel's worth of free advice. I'd suggest completely ignoring the cursing thing when you deal with any future RevKs....

There are good and bad points in how you've both reacted. What is clear from both of you is that you are both passionate creators, and that your respective devices are both your babies. As such you're both going to be defensive about it and you're both going to lose perspective.

His cursing seems to be in response to what he seemed to see as your evasiveness. On one level he doesn't want to believe it's the firebrick any more than you want to believe it's your device.

Yes, I appreciate that the problem doesn't manifest when other devices are connected to yours... but remember RevK can say exactly the same in return. Your refusal to even consider there might be some quirk in your device that only manifests under these specific circumstances isn't really that different from his cursing.

Hopefully you'll both figure it out between you, but I'd suggest avoiding finger pointing until the problem is solved, only then can either of you know whether the "cause" is Apple, you, or the firebrick, or perhaps even none of the above or all of the above.

Technical Vault, Monday, 3 April 2017 at 15:38:00 BST
@revk how about putting up a public pcap of a session where roaming breaks (with WPA decryption if you could?). Do it in RF monitoring mode and induce a fault and let us all have a gander.

I'm not convinced by the argument that "other routers don't have this problem", mainly because I've seen similar roaming issues with Apple devices on other, non-Ubiquiti APs (aruba+infoblox I think, they're not under my control). If I had to guess I'd say it'll be down to either some kind of race condition or two equally valid interpretations of some spec point that makes Apple choke. What I would love is to see someone narrow it down to a specific service: DHCP, RADIUS, 802.11k, etc.

RevK, Monday, 3 April 2017 at 16:17:00 BST
OK so some guesses here... Firstly, if, like every internet connection in the UK and pretty much every country we have sold FireBricks, the connection is PPPoE, then the answers are pretty simple.

AAISP do native IPv6 on the internet connections over PPP which is presented either directly from FTTP NTE, or via a DSL modem as PPPoE.

The PPPoE interface will do IPV6CP and once that completes it will do a DHCPv6 client request asking for PD which is then assigned to the other interfaces. This is default, so odd asking how to turn it on. The PD can be constrained by setting the pd-interface to just be the interface(s) on which you want PD. If IPV6CP is rejected then no IPv6 is used on that interface. It is possible to set log-debug to track the PPP negotiations to confirm IPV6CP is working.

If using an Ethernet subnet/link as WAN, the FireBrick does not currently do a DHCPv6 client - it only does router solicitation / router announcements, so does not do PD. This is because we have not seen anyone that has asked for it as it is not how any internet connections are done anywhere we have sold FireBricks. We can add that if needed, but it may take a little while. It would be interesting if that is the case in the US. Even in China we see PPPoE as the norm.

As for shaping, it depends what you want. A simple shaping of all traffic to 10Mb/s each way means making a shaper with tx and rx set to 10M (i.e. 10000000) and a name like "WAN". This can then be used to shape the traffic. In the case of an ethernet WAN interface, set graph="WAN" in the interface definition. In the case of PPPoE you would need to set graph="WAN" on the PPPoE definition. This also makes a graph showing levels of usage. In PPPoE it will also show loss and latency based on LCP echoes. On an ethernet interface a ping="..." can be set to add loss and latency to the same graph based on ping responses.

There are options for much more fine tuned shaping using the firewalling rules.

Not sure what port reflection is. The firewalling rules allow any sessions to be matched and changes to source/target IP and port. You can map between IPv4 and IPv6 as well. If you want some sort of incoming port mapping to an internal IP on NAT you may want to make a rule-set with source interface of the WAN (e.g. PPPoE or the WAN interface name depending on what is being used), a target interface of "self", assuming for a moment that NAT is being used and the external address of the FireBrick is the only one you have to port map. You can set the no-match-action to continue to other rule sets, and create specific rules matching a target port, and setting a new target-ip and target-port to map to devices on the LAN. You may want to set these using specific protocol, e.g. 6 for TCP.

As for VoIP, the basic premise is that VoIP and NAT do not mix. However, the FireBrick does standard NAT at the IP/port level and a good SIP gateway can recognise that and work with it - we have VoIP servers that can. The UDP timeouts are set as per RFC recommendations to long enough to allow such a VoIP gateway to manage keep-alives. However, we know some gateways and phones do not work well with NAT. However, the FireBrick itself can work as a VoIP client and server and can be set to work as a full PABX or just simply using a back to back config, thus bypassing all NAT - talking private IPs to devices on the LAN and via its public IP to a server via the WAN. This is in the VoIP config. This allows mix of IPv4 and IPv6 operations as well for VoIP.

If the WAN works in some other way (and we have people using L2TP) that will need some different advice.

RevK, Monday, 3 April 2017 at 16:25:00 BST
Some posts may take a while to get approved as I have other work to do today as well!

jbsolutios, Monday, 3 April 2017 at 20:29:00 BST
Brandon, I really hope that you and RevK manage to work together on this and get it fixed.

We love both of your products, but this current issue is becoming increasingly embarrassing for us.

Although it is "his" router, a reasonable number of companies (including us) use them in our network. Our core routers are Ubnt, if you're interested, as RevK's ones with enough oomph to do full BGP are out of our price range and yours do a cracking job.

Please don't take it to heart when RevK gets upset. FireBrick is his baby and he puts his life and soul into this and AAISP. I am sure that he, like all of us, is just extremely frustrated that none of us can work out what the root cause of this is.

1, Monday, 3 April 2017 at 21:06:00 BST
AAISP Customer here with UAP and UAP-Pro. Also support two other sites with UAP-Pros. However, no Firebrick router (yet).

RevK, Tuesday, 4 April 2017 at 02:50:00 BST
I am glad you think you have found it but I am (again) a little confused. I just looked through email and could not find anything from you guys about this "DHCP throttling" of which you speak. Perhaps you can elaborate. What DHCP throttling thing are you talking about?

Anyway, as I have said before, we tested this with static config on the iPhone, DHCP not in use, and it still failed.

I'll go back to a set-up that does not work shortly, but now I am testing different ways to try and break the set-up here at the moment using a FireBrick. It is interesting that you immediately had problems with an FB2700 and right now I can't make it break with one! When I have exhausted that I'll put the APs back on my main LAN and confirm still broken. Then I can re-do the various tests from before, including a case where the iPhone is set up statically and not using DHCP at all.

Eliminating DHCP as the cause was one of the very first things we did, so it has been some time since I did those tests. If DHCP was the cause we'd see this when only using one AP and not roaming, which we don't.

Also, what puzzles me is that roaming should not cause any more DHCP traffic, surely? I am also pretty sure that when the roam has failed I have tried telling the iPhone to renew its DHCP, and it has failed, but turning WiFi off and on always works.

Alex, Tuesday, 4 April 2017 at 08:21:00 BST
Roaming should cause a dhcp request because you may have roamed to something with the same SSID but a different subnet. This is particularly likely with larger public networks like eduroam etc.

I know from experience that Android doesn't do this (we had two eduroam networks in close proximity such that devices would frequently roam one to the other and then break), but it seems iOS and most other OSs do as these all worked fine (as they'd get a NAK from the other network and redo the discover cycle etc).
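The reasoning here, that an address held before the roam may simply not belong to the network on the other side, can be shown with a small check. This is a minimal sketch using Python's standard `ipaddress` module; the function name is made up for illustration, and the addresses are documentation ranges, not the eduroam networks mentioned.

```python
import ipaddress

def needs_new_lease(current_addr: str, new_network: str) -> bool:
    """True if the address held before the roam is not inside the subnet
    of the network the client has just attached to.

    Hypothetical helper for illustration only; a real client cannot know
    the new subnet without asking, which is why it sends a DHCP request.
    """
    return ipaddress.ip_address(current_addr) not in ipaddress.ip_network(new_network)

if __name__ == "__main__":
    # Same SSID, same subnet: the old address is still usable
    print(needs_new_lease("192.0.2.10", "192.0.2.0/24"))
    # Same SSID, different subnet: the old address is useless here
    print(needs_new_lease("192.0.2.10", "198.51.100.0/24"))
```

In the second case a DHCP server on the new network would be expected to NAK the old address, which is the behaviour described above.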

RevK, Tuesday, 4 April 2017 at 08:27:00 BST
I believe (and as I say I am not the WiFi expert) that "proper" roaming does not, but simply changing to another AP on same SSID should sensibly do so.

Matthew Newton, Tuesday, 4 April 2017 at 13:52:00 BST
[University, eduroam, many APs]

We have had problems with Apple devices roaming in the past. I forget the exact details now, but IIRC Apple devices did some sort of weird ARP ping to the gateway to try and detect whether they were on the same network after a roam or wake-from-sleep, rather than a full DHCP DORA.

But it's been a year or two and I can't remember seeing this for a while. Might be worked around or fixed on the Apple side or the Cisco WLCs, but may be something to check. Without any sort of wireless fast reauth (which I think I've only seen Windows do correctly, sadly) you'll generally see a full reauth and DHCP on roam.

Might also be different with PSK. We've only got WPA2-Enterprise with RADIUS, so the whole associate/auth part is different/slower anyway.

I'd probably go for getting a packet capture from the AP to see what the client is doing, and compare Apple with something else.
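As a rough model of the two behaviours contrasted in this comment (probe the old gateway first, versus always doing a full DHCP exchange), here is a minimal sketch. Everything in it is an assumption for illustration: the `Lease` fields, the `after_roam` function and the `probe_gateway` callback are hypothetical names, and this is not a description of what iOS actually does. The "probe the gateway" idea is similar in spirit to RFC 4436 (Detecting Network Attachment in IPv4), but the sketch does not implement that RFC.

```python
from dataclasses import dataclass
from typing import Callable, Optional

FULL_DHCP = "full DHCP (discover/offer/request/ack)"
KEEP_LEASE = "keep existing lease"

@dataclass
class Lease:
    """What a client remembers from its last successful DHCP exchange."""
    address: str
    gateway_ip: str
    gateway_mac: str

def after_roam(lease: Optional[Lease],
               probe_gateway: Callable[[str, str], bool]) -> str:
    """Model of a client deciding what to do after re-associating.

    probe_gateway(ip, mac) stands in for an ARP probe sent to the
    previously known gateway; it returns True if the gateway answered.
    """
    if lease is None:
        return FULL_DHCP
    if probe_gateway(lease.gateway_ip, lease.gateway_mac):
        # Gateway answered, so assume we are still on the same network
        return KEEP_LEASE
    return FULL_DHCP

if __name__ == "__main__":
    lease = Lease("192.0.2.10", "192.0.2.1", "00:00:5e:00:53:01")
    print(after_roam(lease, lambda ip, mac: True))   # gateway answered
    print(after_roam(lease, lambda ip, mac: False))  # no answer
    print(after_roam(None, lambda ip, mac: True))    # no previous lease
```

In a packet capture, the first path would show an ARP exchange with the gateway and no DHCP at all, while the second would show a complete DHCP exchange after the roam, which is the difference worth comparing between Apple and other clients.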

wturrell, Tuesday, 4 April 2017 at 14:36:00 BST
Wi-fi Assist is turned off on the iPhone(s), right? (Also have you tried testing with an iPod Touch or an iPad that doesn't have cellular at all?)

wturrell, Tuesday, 4 April 2017 at 14:42:00 BST
@Brandon - further to what you said, don't iPhones also behave differently depending on whether they're plugged in or not? (e.g. I once tested one on battery and when charging - in the first case it stopped responding to pings after 30 seconds, until you sent a wake on LAN command).

Unknown, Tuesday, 4 April 2017 at 14:48:00 BST
As Adrian's usual testing ground appears to be "in the bath", I really do hope his phone isn't plugged in at the time!

Diogenes, Thursday, 20 April 2017 at 18:16:00 BST
Be careful dealing with Brandon, he is unpredictable

John Benson, Friday, 26 May 2017 at 23:00:00 BST
I found that turning off IPv6 fixed everything and had a month of no problems. Upgrading the UniFi APs to 3.7.55.6308 has brought the problem back even when using only IPv4.

Tearing my hair out! Are we any closer to a fix?

cscashby-me, Friday, 26 March 2021 at 16:35:00 GMT
Did you ever get a resolution on this? I have seemingly got the same kind of problem with Mikrotik router / Unifi APs, when roaming only and only with IPv6 enabled. I haven't done any additional or very in-depth / targeted testing as of yet and given it's 3 years later it may be completely different, but seems very similar.