The snag is that they keep falling off the internet! A power cycle fixes it, but it is very frustrating.
I have found the solution though, and I think it points a finger at the cause.
And it is all down to DHCP. Yep, not DNS this time. Not IPv6 even. DHCP!
So what's the problem?
First off, what's the kit?
- FireBrick doing DHCP and Internet gateway
- Aruba APs
- Apple HomePods
The failure did not seem to happen all the time, but could. Sandra has almost given up using them as they never work. It seems a HomePod can usually renew its DHCP without problems, but sometimes it gets stuck. The logs on the FireBrick showed we kept sending a DHCP "Offer" to the HomePod, but it kept asking.
I added lots of debug, and confirmed that the request being sent, the DHCP "Discover", does not request a broadcast reply, which is fine, so we send the reply to the MAC of the HomePod and its "new" IP address. This is normal.
On a whim, I decided to try fudging the code to treat the Discover as if it had asked for a broadcast reply. This then meant a Discover, Offer, Request, and Ack - but the HomePod did not see the Ack and so kept asking. I then forced the broadcast on the Ack as well, and bingo, it worked. So the issue is whether broadcast is used for the Offer and Ack: the HomePod sees the broadcast replies but not the unicast ones.
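To make the decision concrete, here is a rough sketch in Python of how a server might pick where to send an Offer or Ack for a client that has no address yet. This is not the FireBrick code; the function name `reply_destination` and the `force_broadcast` parameter are made up for illustration, and it ignores relayed requests (giaddr) and renewals from a client that already has an address (ciaddr).

```python
BROADCAST_FLAG = 0x8000  # the "B" flag, assumed to be the top bit of the 16 bit flags

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"
BROADCAST_IP = "255.255.255.255"


def reply_destination(flags, yiaddr, chaddr, force_broadcast=False):
    """Return (dest_mac, dest_ip) for an Offer or Ack.

    flags           -- 16 bit flags field from the client's request
    yiaddr          -- the "new" IP address being offered to the client
    chaddr          -- the client's MAC address
    force_broadcast -- hypothetical override, treat every request as if B was set
    """
    if force_broadcast or (flags & BROADCAST_FLAG):
        # Client asked for (or we are forcing) a broadcast reply
        return (BROADCAST_MAC, BROADCAST_IP)
    # Otherwise unicast to the client's MAC and its new IP address
    return (chaddr, yiaddr)


# A Discover with no broadcast flag normally gets a unicast reply...
print(reply_destination(0x0000, "192.0.2.10", "aa:bb:cc:dd:ee:ff"))
# ...but with the override it is broadcast instead
print(reply_destination(0x0000, "192.0.2.10", "aa:bb:cc:dd:ee:ff", force_broadcast=True))
```

The fudge described above amounts to passing `force_broadcast=True` for both the Offer and the Ack.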
This is a massive clue.
So more investigating.
The RFC says the broadcast request is in the left most bit of a 16 bit flag field.
PLEASE DO NOT DO SPECIFICATIONS LIKE THIS!
I fully understand that bits in a byte may be sent "on the wire" low bit first or high bit first. I fully understand that bytes in a word may be ordered big endian or little endian. The RFC's diagram is for a 16 bit "network byte order" value (i.e. big endian).
They number the bits from 0 to 15. Actually they number the gaps between the bits 0 to 15.
In my view there is only one way you should number bits - by their binary power of two value. I would always write that in the way we write numbers, most significant first, so would write that as bits 15 to 0, and it is bit 15 that is the B flag. I don't mind if showing as bits 15 to 8, and 7 to 0 (big endian) or even as 7 to 0, 15 to 8 (little endian), but number each bit by its power of two value, please!
Some people number as order on the wire, starting from 1. So 1 to 8 may be 0 to 7 or 7 to 0, who knows! Please do not do that. But at least if numbering bits 1 to 8, you have some clue that something is wrong.
So, to be quite frank, I actually do not know if this is bit 0 or 15 in a network byte order (big endian) 2 byte (16 bit) value. We assumed it is bit 15, i.e. bit 7 in the first byte. But seriously, from bits numbered 0 to 15 and a reference to "left most bit" I don't actually know for sure. I started to doubt we had read the RFC correctly!
Thankfully empirical testing shows the flags as 0x8000 from other devices, so either it is bit 7 of first byte, or other devices have the same fun reading the RFC.
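For anyone wanting to check the same thing, here is a small Python sketch of the mapping between the RFC's left-to-right bit numbering and power of two numbering, and of pulling the flags out of a packet. The helper names are mine, and it assumes the flags are the 2 bytes at offset 10 of the BOOTP/DHCP header (after op, htype, hlen, hops, xid and secs).

```python
import struct

FLAGS_OFFSET = 10  # op(1) + htype(1) + hlen(1) + hops(1) + xid(4) + secs(2)


def rfc_bit_to_mask(rfc_bit, width=16):
    """Convert a bit numbered left to right from 0 (RFC diagram style)
    into a mask, where the left most bit is the most significant."""
    return 1 << (width - 1 - rfc_bit)


def dhcp_flags(packet):
    """Read the 16 bit flags field as a network byte order (big endian) value."""
    (flags,) = struct.unpack_from("!H", packet, FLAGS_OFFSET)
    return flags


# Fake the start of a header with bit 7 of the first flags byte set
header = bytearray(12)
header[FLAGS_OFFSET] = 0x80

print(hex(rfc_bit_to_mask(0)))          # RFC "bit 0", i.e. power of two bit 15
print(hex(dhcp_flags(bytes(header))))   # flags as a big endian 16 bit value
```

Both lines print the same value if the "left most bit" reading is right, which matches the 0x8000 seen from other devices.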
So who is at fault here?
Well, my son has the same FireBrick and the same HomePods, but different APs. That all works. That is another clue.
My Aruba APs are set up to inject data in the DHCP, which is good. I get details of the AP and SSID, and can even tell the FireBrick to allocate based on SSID even if different SSIDs on the same physical network. All good.
It may be that it is stripping the broadcast bit, but that does not explain why it works after a power cycle. Interestingly the working DHCP renewals did not have the injected AP details, it seems. This points further to the AP being "special".
My son does have different network switches as well, so it is just remotely possible that it is a switch level issue, but that seems unlikely - the DHCP discovers are from the right MAC so all switch learning should be fine.
P.S. Yes, I had changed the filtering to disabled already.
The work around...
FireBricks now have an option to force broadcast reply. And it works. Alpha out soon.