Thursday, 27 July 2017

Say the words three times, stand on one leg, what else?

After a couple of months I tried the Apple TV again.

Still broken, so I hassled Apple again yesterday, and to my shock they called back today saying they need some more logs: can I make it go wrong three times and record the times?

This is sort of like saying "Bloody hell Apple!" three times, turning around, standing on one leg, sacrificing a chicken, burning special candles, throwing salt over your shoulder, fuck knows. Does that conjure up their engineering department? I hope so.

We will see. They say they will call back tomorrow for the details.

Wednesday, 26 July 2017

Nominet Domain Lock

Nominet have a new service called domain lock, and it is described here.

A little while ago I got an email about this and was puzzled, not only because it is a chargeable service with a disproportionately high cost, but also because it seems to be targeted at the registrars and not the registrants.

It looks like a lock that stops certain things, like change of DNS or registrant details, unless/until unlocked. It looks like the unlocking is done by the registrar. I would have expected locking to be tied to a 2FA that is known only to the registrant, but reading it, that does not seem to be the case.

The email explained, if I remember correctly, that the idea was to stop any risk of things changing without authorisation. That is odd, as surely any unauthorised change would mean someone was being negligent (possibly Nominet) and the change could be quickly corrected.

It also does not seem to protect against DNS injection attacks, etc. This is something DNSSEC should do, and is something Nominet do not charge for.

As a registrar, we have the ability (acting on our customer's authority) to make changes to a domain. There is the risk that our security checks are not good and we take instructions to make a change from someone that is not the registrant. We are careful, obviously. We have actually added two factor authentication to our systems (free of charge) to help give our customers the assurance that we would not fall for such scams. But having 2FA from us to Nominet seems like a pointless step: if we lacked good security, we'd take bogus instructions, unlock the domain, make the change, and lock it again.

Indeed, one of the assurances registrants have at present is that, if they fall out with their chosen registrar, they can go to Nominet to change registrar and details on the domain directly for a fee. This means rogue registrars cannot hold people to ransom in any way. The domain locking feature seems to undermine that, as there cannot be any way to bypass the registrar's domain lock by pleading to Nominet, obviously. If that were possible then it would make this service useless.

So I asked Nominet, listing some of the ways a domain could be changed without authority of the registrant or registrar... I really struggle to find many where Nominet would not already be negligent to allow such a change. But I asked about...
  • If the police ask for a domain to be shut down (I say "ask" as I am not sure proper legal authority to do so always exists or that they always present it in such cases)
  • If some copyright related notice requests a domain to be shut down
  • A court orders Nominet to change DNS or other details
  • If someone takes a case to DRS and the registrant loses the case and domain ownership is to be transferred
  • If the registrar does not pay Nominet fees and the domain becomes overdue
Of course, if the registrar stops paying the domain lock fees, does it automatically unlock too?

I have not even had an acknowledgement of my questions, let alone a reply. I assume none of those cases are in fact "protected". Yet sending dubious allegations to the police about a domain holder, or even hacking one of their pages and then sending allegations, or faking dodgy email from a domain, is one way for someone to "take down" a major domain if they want to, so it is something to protect against.

Can a company that has the responsibility for the integrity of a database really say "that's a nice domain, it would be a pity if someone was to make an unauthorised change to it, wouldn't it?" and start asking for such a large sum to do its job and protect the integrity of the database?

Have I missed the point of this "service" somehow? Maybe someone can explain the logic here...

Friday, 21 July 2017

warning: comparison between signed and unsigned integer expressions

This is one of the stupidities in the C language and it bugs me because it would be so simple for C to just code it correctly. I'd really like a gcc option to do this!

When you store whole numbers in binary you usually have a choice of signed or unsigned. The signed version allows negative numbers but at the cost of the range of positive values possible.

For example a signed char allows values -128 to +127, but an unsigned char allows values 0 to 255.

If you compare them, using ==, !=, >, or < for example, and the types are the same size (such as int and unsigned int), the signed value is converted to an unsigned value before the comparison is made.


#include <stdio.h>

int main(void)
{
   signed int a = -1;
   unsigned int b = 1;
   if (a > b)
      printf ("a>b\n");
   if (b > a)
      printf ("b>a\n");
   return 0;
}

This prints a>b even though a is -1 and b is 1!

This is because -1, converted to an unsigned value, is a big number, in fact the biggest an unsigned int can be.

What pisses me off is that, even when C was invented, the code to make the comparison work properly would have been one extra check of one bit. Basically, whatever the comparison, you just have to check whether the signed value is negative before making it. If it is negative, it cannot be equal to the unsigned value and must be smaller than it, so the result of whatever comparison you were doing is decided by the signed value being negative; otherwise you go on to do the comparison as normal.

To me this would have been a far more logical behaviour than changing the value of the signed variable by making it unsigned.

Thursday, 20 July 2017

More on pitfalls of redundancy...

Hindsight is a wonderful thing!

I have been having long discussions this week and today. Many have the benefit of hindsight.

For kit we have in Maidenhead we have two possible ways to connect to the world! One is via the local transit, a single point of failure link. The other is via multiple (well, soon to be) diverse fibre lines to different London data centres where we have multiple transit and peering.

Even before the second leg of our ring, one leg is a single transit and the other is several transit and peering links via multiple (pairs of) routers. And even that allows fallback via the single transit link, just in case.

The problem, as ever, is a partly ill link: one that seems to be valid for traffic but is not. We had that today.

Announcing customer routes primarily via the local transit could work, but having transit back out to the world locally is more complex. We would be offering a less redundant, and somewhat specialised, solution...

So we have the issue of hindsight versus reality. Ongoing, a link with more redundancy is better. The last few days it was not...

So do we offer knee-jerk services that are technically worse looking forward? Or do we say no, this is shit that happened, and was only wrong in hindsight?

It is almost like the good old days, err...

Today we (A&A) had another brief outage impacting broadband, ethernet and hosted customers, and VoIP. It was a bit complicated as it was one side of an LACP, so probably half of things were working and half not, though it looks like pretty much all broadband went down.

It was an error on our part - the ops team have been working hard all week, and working with a consultant, to help us investigate last week's issues with the Cisco switches. They have made a number of changes (adding more logging, etc.) and run diagnostics during the week. At each stage they have to assess the risk and decide if they can go ahead or wait until the evening or even overnight. A change today to bring back one of the links between the London data centres (one shut down on Friday), so we can test it independently of normal operation, resulted in breaking the switch links. Even the consultant thought it would be OK.

I think I can elaborate a tad more on things we know. I am sure the ops team will shout if I have misunderstood. At this stage there are aspects of what happened that are still unclear. This means we are adding some "defensive" config to try and address possible causes for the future.

The main issue, it now seems, was that the BGP links to all of our carriers from all of our switches failed at the same time. Yeh, so much for redundancy! These are private links on a separate VRF and not connected to other BGP. The BGP is with routers on the end of locally connected single fibre links (of which we have many), not LACP or anything complicated. So the failure has to be entirely within the Cisco switches. We can almost certainly rule out hardware impacting them all at once. Also, being on a separate VRF and not seeing Internet traffic at all, it seems unlikely to be some attack from outside. This leaves us with the possibility of some sort of unstable config on the switches, maybe something spanning tree related (I hate spanning tree), or maybe some BGP issue with routes received from carriers, which seems unlikely, but maybe not impossible. So there is a lot of careful review of things like BGP filters from carriers, spanning tree config, and so on.

The "fix" was rebooting half a dozen cisco switches. On Thursday this worked, but it took some time to conclude that was a sane thing to do, when other options were exhausted.

As I am sure you can appreciate, just "turning it off and back on again", or rebooting the switches, really is a last resort. We have highly skilled engineers who spent some time trying to diagnose the actual issue before taking such a step, and that is one reason these issues can take some time to fix. Sometimes a reboot can fail to solve anything but lose valuable clues.

On Friday that worked too; again we tried to understand the issue first, and got a lot more information. The reboots seem to have triggered a second issue with one of the switches being stupid (as per my other blog post) and coming up in a half broken state. Rebooting that one switch again sorted it. It is almost unheard of to have two different issues like this, one after the other, and that really threw us as well.

A lot of this week has been spent understanding the way the Cisco switches are set up in much more detail, adding more logging, and updating processes so we have a better idea what to do if it ever happens again - both fixing things more quickly, and finding more clues as to the cause. It may be that the changes being done have mitigated the risk of it happening again. We hope so.

Obviously this sort of thing is pretty devastating - I am really unhappy about this, and really sorry for the hassle it has caused customers.

As I say, it is not really like the "good old days" when BT would have a BRAS crash pretty much every day. These days we expect more, and our customers expect more.

So, please do accept my apologies for the ongoing issues, and my reassurance that they are being taken very seriously.

Director, A&A

Wednesday, 19 July 2017


I have now published a number of projects on GitHub... All under GPL.


They include the current build of my alarm panel.

Bricking it!

Well, pictures are out on Twitter now - the new FireBrick sort of exists - but don't go trying to order them yet. We have many more steps to take before we will have stock, some months away (I'll have a better idea tomorrow). There are little details like EMC testing for CE marking, and so on, some of which could cause delays. And still, it is made in the UK!

However, seeing as there are pictures, I think I should say a few words about the new FireBrick model, the FB2900. This should avoid speculation, at least. We are still selling the FB2700, and I am not in a position to say anything about FB2900 pricing yet, this is purely some technical comment.

That is not the whole box (there is no SFP screen or light pipes, etc.), but you can already see some of the changes.

SFP port

One of the most obvious changes is that we have moved back to a 5 port format, as we had on the older FB105 models. But the extra port is SFP. This means it will be able to take a normal copper Ethernet module, but also various types of direct fibre links. Apart from use in data centres, and one at each end of fibre links between buildings, this is thinking ahead to the days of true fibre internet services in the future.

Power supply

Anyone that has looked inside the existing FB2700 model will see we have a completely new PSU design. The change in design has allowed us to make a variety of different PSU options.

We have an option for automotive use (12V and 24V). This is far more complex than it sounds - really! Automotive supplies allow for something called "alternator load dump", high voltage spikes, and a wide range of voltages from the supply. There are a lot of safety aspects to consider as well. However, this allows FireBricks to run in cars, and trucks, and alarm panels, and all sorts of places where there are DC supplies.

We also have an option for higher DC voltages found in telecoms racks in data centres (-48V), an option we already have on the FB6000 series.

Even the mains voltage option is different: with the main board using 12V, we have a wide choice of suppliers for the PSU components. We have stuck with the "figure eight" power lead though.

Faster, better, stronger, we can rebuild it... etc, etc.

The new design has a faster processor, and removes a key limitation that stops the FB2700 doing much more than 350Mb/s. We have not got to the stage of benchmarking yet but expect it will be a lot faster. We are expecting faster crypto as well - we'll say more on that once we do have it all coded and benchmarked.


Rack mount

We have said this before so many times and not followed through, but this time it is real, honest. We have wall mount brackets and 19" rack mount brackets (for one or two FB2900s in 1U). I know, pics or it did not happen - just watch this space.

It's cool!

The power usage is lower, so the whole FireBrick will be a lot less on fire. The existing models are designed to cope with the heat, but in a confined space they can get warm. The new model uses less power in the first place, and so we expect it to be a lot cooler...


Obviously we are always adding more to the firmware and more features will come along for the FB2500, FB2700 and FB2900 models. Software upgrades are still free, as always.

P.S. (and it should not be a P.S.) there are some good people working on making this happen, like Cliff and Kev, and they need some credit for this all coming together.