Tuesday, 28 April 2015

Back doors

Obviously, especially since Snowden, we are all concerned about "back doors" in systems.

But we have had some interesting discussions in the office. Working on FireBrick we make hardware and software from scratch. But even then we are using standard parts, such as processors and Ethernet controllers, and so on.

One of the mind games we play is trying to work out how someone could infiltrate us, using social engineering or technologically or whatever. It is a fun game, but is worth considering, in case we find any defences.

So we pondered, what if the chips we use had back doors? What could those back doors be, and how could they work?

Well, I had two ideas. One was something that tries to pass information to "them", via Ethernet frames. But such a system would be spotted. If not by us during testing, then by millions of other people.

But a simpler idea is something passive - even in a simple Ethernet controller. These things have access to the memory of the system via the bus and DMA and so on. They need this to send and receive legitimate packets.

If I wanted to implant a back door, I would make an Ethernet controller able to respond to a specially crafted packet. Instead of passing that to the processor as normal, it would take some action and send a reply packet. The action could simply be to allow reading or writing of system memory using the same DMA and memory access needed to send and receive normal packets.

The upshot would be that nothing would be detectable unless targeted.

But if targeted, the packets would look like normal IP packets. The payloads could even be scrambled or encrypted in some way. It could be used to attack anything that is accessible on the Internet and provide a way to access the running memory of the system remotely.

This could allow access to private keys for encryption and allow patching of code live to add proper back doors.

Now, this could apply to an Ethernet controller chip, or even a library part included in a custom logic gate array. It could be in an Ethernet card or whatever. The back door itself could be tiny in terms of silicon if all it does is read and write memory in response to some simple packet. Even people making their own silicon could find they have a back door!

The only issue is the reputation of the manufacturer if caught out... Is that enough to protect us? If some large company making such devices caved in to pressure? What if a few key employees caved in to a bribe. Scary?

Manually keyed IPsec

My previous post explained a bit about IPsec. We have released it in phases, with manual keying first, then IKEv2 key negotiation, and now EAP with certificates and road warrior stuff.

As part of the work on this release we considered actually removing manual keying as a feature. Whilst it was a useful first stage of the release of the initial version, it is not as secure because the key itself never changes, so is more vulnerable to decoding intercepted traffic eventually. With IKEv2 the key regularly changes making the job a lot harder. We released IKEv2 18 months ago, soon after the initial manual keying release.

The latest change is not primarily about security but convenience. The EAP changes allow identity to be agreed using certificates and usernames and passwords, but the key exchange and updates are managed by IKEv2. EAP makes it a lot easier for mobile phones to connect in to a FireBrick, and get an IP allocation and so on.

So, removing the support for manual keying was a consideration. We looked at other devices and concluded that manual keying was not that commonly supported and that key exchange was more common. IKEv2 is being added to devices more recently, with some devices having supported IKEv1 (which we do not). But manual keying was pretty rare. Indeed, we did not find any examples.

We decided it was going a bit far to actually remove the feature just yet, but managed to leave a problem where the config from an old version using manual keying would not work on the newest release. This, in itself, was a mistake. Having decided to keep support for manual keying, at least for now, we should have made the config compatible (moving the old format to the new one automatically). But the fact we assumed nobody was using manual keying kind of put that on the bottom of the list, and it was left out.

Sadly, that was a bad call! Sorry.

It seems that we have a wiki that explains setting up FireBrick to FireBrick IPsec using manual keying, and even A&A staff were using that until recently. The upshot is that code updates broke some IPsec tunnels. The wiki should have been updated 18 months ago!

So, we have stopped the roll out of new code today, and we should have a new release tomorrow morning. It will automatically map the old config to the new. It seems we only had a handful of people affected, and support staff have been working to ensure that they are configured correctly.

The other good news is that anyone upgraded today, and using IPsec manual keying, and that has not touched their config, will be updated tomorrow with the config change to make it work again.

But having realised that some people are using manual keying, we need to find a way to try and get people to upgrade their config to IKEv2 with a pre-shared secret. This is not just because we'd like to remove manual keying, but because it is not as secure. The problem is how we know who is using IPsec with manual keying?

Now, we have some mailing lists, but we emailed asking if anyone was using manual keying as part of the research we did in this release, and had no replies. The problem is a lot of people actually using IPsec have had someone else help them set it up, and have no idea.

So, we have a plan. We are very careful about the way in which a FireBrick phones home. It is a security product and people are understandably concerned about security. We have no back-doors in the product. The closest to phoning home is the software and capability update check, based on a DNS lookup, that allows us to do upgrades - but people can control that in the config. The only other thing we have is the default fb-support log target which is there primarily to email us if a FireBrick crashes. All of this is under the control of the person configuring the FireBrick and can be turned off. Though we don't recommend it as crash logs are useful to help fix any problems, and software updates are important. The key thing here is that we are not secretive about this and we give people a choice.

So, the plan is that the release tomorrow will not only update any IPsec manual keying config, but also log the fact to the fb-support log, which means we'll get an email. It may mean a dozen or so FireBricks email us, and we can then contact people to help them with a config update to use IKEv2 pre-shared secrets instead. If we can be reasonably sure that nobody is using manually keyed IPsec we can then look to remove it from future releases.

This should give us a good mechanism for managing such things in the future. Ultimately, whilst this was a bad call by me on this, even if the number of people affected is tiny, we learn from such mistakes and aim to improve the product. Ironically, I suspect this blog will make more people aware of this issue than the mailing lists, which is a shame. If you use a FireBrick, please do sign up.

Monday, 27 April 2015


We have spent literally years working on IPsec in the FireBrick. It is a complex project but it is finally getting to an end, well, sort of - at least a major milestone and release.

IPsec is very much seen as an industry standard way to create Virtual Private Network (VPN) links, both point to point between offices, and "road warrior" roaming from mobile networks in to office networks.

The problem is that IPsec is a lot of layers, a lot of standards, and far from simple. There are the low level encryption and hashing algorithms. These take a lot of work to implement and test. They also need a lot of low level maths functions coding as well. There are then layers and layers of protocol on top.

For a long time the FireBrick has supported manual keying - that means the key (typically entered as a long hex string) is entered both ends and is fixed. This was the first layer and meant having all of the basic algorithms working. We recommend that anyone still using manual keying between FireBricks changes to using IKEv2 and a pre-shared key.

We then added IKEv2 (18 months ago), which is a key exchange protocol. This allowed keys to be negotiated dynamically rather than being fixed in the config at each end. This is a big improvement on the manual keying, and allows the key exchange based on a simple shared pass phrase. We did not implement IKEv1 as it was not quite such a clean standard and many devices are now doing IKEv2 (even Apple).

This final stage involves EAP, which is complicated by the fact that certificates are used to authenticate the server (FireBrick) end. This has meant implementing the whole system for managing checking and signing using certificates and keys. The client can then authenticate using a simple username and password. To add to the fun, iPhones do not allow a simple manually entered config, but a profile file that is loaded. In some ways this helps as end users can just click on it from an email, but it makes it more complex for the sysadmin to set up.

The upshot of this is that iPhones and Androids can connect in to an office LAN securely.

Now, when I say "we", I mean pretty much "Cliff". He has spent day and night working on this. The work we have done gives a good foundation using our own code (so no NSA/GCHQ back doors) and allows a lot more work to be done in encryption in the FireBrick. The next generation of hardware we are working on even has a true random number generator built in as well as some options for hardware encryption acceleration. We know people have asked about OpenVPN, and we are looking in to this as well.

Even so, the IPsec setup is still complex, and I have made a cheat sheet (here). But hey, if it was not complicated then my friends would not be able to sell consultancy :-)

The latest code is in beta now, and should be a factory release shortly. We suspect there will be a few more bits of work down the line on this and new releases in due course, but now we finally have it working with common mobile devices we can start working on some of the other new features in the FireBrick code. So, well done Cliff.

Friday, 24 April 2015

What is a broadband fault?

Exterminate the orc?
Continuing a theme here, but my mate Mike came over. One of the games he is quite good at is playing devil's advocate, rather well, and he actually made some really good arguments on the issue of SFI charges and faults. I'd like to thank him for that, oh, and for the dalek he brought over to surprise us!

The crux of his point is that the very definition of a broadband service is that it is rate adaptive and so it is a service that will try and get the best performance on the voice grade copper pair to which it is attached. It is possible for the characteristics of a voice grade copper pair to be so bad (very long line) that it cannot even do broadband, and that would be tough. Basically, the performance of the broadband line is kind of whatever it is. Indeed, a comment on my previous post about TalkTalk was raising this very point.

He has a very good point, and it has allowed me to hone the arguments I can use against such points with the likes of BT and TalkTalk.

Yes, the broadband is what it is, "as well as it can do" on the voice grade copper pair to which it is attached. We do accept that to some extent. There are caveats - in that the copper pair will have a forecast speed, and if the line is way out of that, then that probably means something is wrong. But in general, you get what you get. However, this relates to the rate of the broadband, as it is rate adaptive.

I would argue that there is a huge difference between a line that runs at a low rate because that is all the line can do, even if that means the rate has reduced a bit over time due to increases in cross talk from other lines and so on, and an actual fault.

The issue comes down to how you define a fault in the broadband. This is different to how you define a fault in the voice grade copper pair.

One of the key aspects in defining a fault - perhaps the key aspect - is that a fault can be fixed. It is a condition that is the result of some aspect of the copper pair which is not as good as it could be given the length and routing of the pair. Faults could be down to poor joints, corrosion, water ingress, poor insulation, electrical contacts, degraded materials. It could be fixed by using a different pair, or repairing joints or equipment.

So how do you spot a fault? Well, BT plc t/a BT Wholesale actually defined one way of doing this. They monitor the line in the first 10 days and find the maximum stable rate for the line, and set a fault threshold rate (FTR) for the line based on that (with a percentage taken off). This defines a sync speed below which the line is considered to have a broadband fault. That is excellent as it gives a crystal clear metric which we can all agree defines a fault. TalkTalk don't do this, but it would not be hard for us to do the same based on sync speed history and agree a metric with them.
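As a rough illustration of the FTR idea, here is a minimal sketch in Python. It assumes we already have a list of stable sync rates from the first 10 days. The 30% margin, the function names and the sample figures are my own placeholders, not BT's actual numbers or method.

```python
def fault_threshold_rate(stable_sync_kbps, margin_percent=30):
    """Derive a fault threshold rate (FTR) from the training period.

    stable_sync_kbps: stable sync rates (kbit/s) seen in the first 10 days.
    margin_percent: percentage taken off the maximum (assumed value).
    """
    if not stable_sync_kbps:
        raise ValueError("need at least one sync rate sample")
    max_stable = max(stable_sync_kbps)
    # Integer arithmetic keeps the threshold a whole number of kbit/s.
    return max_stable * (100 - margin_percent) // 100


def has_rate_fault(current_sync_kbps, ftr_kbps):
    """A sync speed below the FTR counts as a broadband fault."""
    return current_sync_kbps < ftr_kbps


# Hypothetical line: best stable rate in the first 10 days was 8128 kbit/s.
training = [7800, 8128, 8000, 7900]
ftr = fault_threshold_rate(training)
print(ftr, has_rate_fault(4000, ftr), has_rate_fault(6000, ftr))
```

The same calculation could be run over our own sync speed history for lines where the carrier does not set an FTR, as long as both sides agree the margin up front.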

Another way to measure a fault would be to consider that the line loses sync lots of times. To be honest, once a day is too often. Normal lines stay in sync indefinitely and would only lose sync due to some external interference, or a power blip or some such. So frequent loss of sync should be a fault. Bear in mind, even a long and highly lossy line should stay in sync, albeit at a slower speed.
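A re-sync metric could be as simple as counting sync losses over a monitoring window. The sketch below assumes we have a list of timestamps at which the line lost sync. The names and the "once a day on average" threshold are my own illustration of the point above, not a metric either carrier has agreed.

```python
from datetime import datetime, timedelta


def resyncs_per_day(resync_times, window_start, window_end):
    """Average number of sync losses per day within the window."""
    days = (window_end - window_start) / timedelta(days=1)
    if days <= 0:
        raise ValueError("window must have a positive length")
    count = sum(1 for t in resync_times if window_start <= t < window_end)
    return count / days


def has_resync_fault(resync_times, window_start, window_end, max_per_day=1.0):
    """Treat an average of once a day (or more) as too often."""
    return resyncs_per_day(resync_times, window_start, window_end) >= max_per_day


# Hypothetical week in which the line dropped sync once every day.
start = datetime(2015, 4, 1)
end = datetime(2015, 4, 8)
drops = [start + timedelta(days=i, hours=3) for i in range(7)]
print(has_resync_fault(drops, start, end))
```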

We also have metrics in terms of packet loss. Whilst we measure latency, that is not normally a clue to any sort of line/sync issue and more likely to be backhaul congestion (or a router moving enough traffic to build up a queue). I think packet loss (when no traffic) is a really good indicator of a fault. Also, if the line has any internal metrics (OAM frames, Header errors, FEC errors) that is an indication of a fault.
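To show what "packet loss when no traffic" might look like as a metric, here is a small sketch. It assumes the monitoring gives us samples of test packets sent and received, plus a flag saying whether the line was otherwise idle. The tuple layout and the 1% threshold are assumptions for illustration only.

```python
def idle_loss_fraction(samples):
    """Fraction of test packets lost while the line was idle.

    samples: iterable of (sent, received, line_idle) tuples.
    Returns None if there were no idle samples to judge by.
    """
    idle = [(sent, received) for sent, received, line_idle in samples if line_idle]
    total_sent = sum(sent for sent, _ in idle)
    if total_sent == 0:
        return None
    total_received = sum(received for _, received in idle)
    return (total_sent - total_received) / total_sent


def has_loss_fault(samples, max_idle_loss=0.01):
    """Flag a fault when idle loss exceeds the (assumed) agreed threshold."""
    loss = idle_loss_fraction(samples)
    return loss is not None and loss > max_idle_loss


# Hypothetical samples: loss during the busy period is ignored.
samples = [(100, 100, True), (100, 60, False), (100, 95, True)]
print(idle_loss_fraction(samples), has_loss_fault(samples))
```

Ignoring samples taken while the line is busy keeps congestion or a full router queue from being mistaken for a line fault.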

Indeed, the DSLAM can almost certainly report if there are lots of bit swaps, problem frequency bins, FEC errors, HEC or other error stats that indicate a fault, as opposed to simply being a poor/long line.

Unfortunately, neither BT nor TalkTalk define metrics for packet loss, or line errors or re-syncs as a fault metric. I think they should.

Fortunately, when there is a fault, the levels are pretty clear cut. It is rare for a line to have a "tiny bit of loss", though not impossible.

In practice, the fact that the broadband has a fault of some sort, based on our monitoring, or the DSLAM stats, is not itself usually the issue. Usually we can agree that there is an issue with the broadband. We can also usually agree that the copper pair meets the voice spec, when it does.

One of the key things that tells us we have a fault is when a line that used to just work fine (at whatever speed it had) now has problems which it did not have before by any of those metrics. That suggests it can be fixed and made to be back the way it used to be by some means. We know what is possible at that point.

I think we just need to pin down ground rules with providers for what is and is not a broadband fault, and pin down that they are responsible for broadband faults, not us.

So let's start the discussions with them, and try to make things better. Well done TalkTalk for being the ones wanting to talk at this stage. Well, either that or we send round the exterminator!

Did not fit through the door

Race to the bottom

Update: TalkTalk really are trying - and this looks like a case of a discussion point on this one line rather than a more general policy, so stand down the panic for now... Let's talk to them!

Latest from TalkTalk - if the broadband service they sell us gets a fault, their side of the agreed demarcation point (e.g. in the line itself), they will not even try and fix it, but will decide that they can no longer provide the service.

So get a fault - fix by ceasing the service.

I guess that is one way to stitch up your customer!

Well done TalkTalk - you are sinking lower than BT Wholesale now.

Update: We may be making some progress on this, so no need for everyone to ask us to move them back to BT just yet, but thanks to those that did ask. I do hope that we will soon have more sensible ways of working with both providers.

The plot thickens

To SFI or not to SFI?

BT plc t/a BT Wholesale have stated that "SFI2 is an Openreach service which is made available to BT Wholesale customers & charged for on a modular basis." and "The SFI2 visit simply checks whether a line is working within the specification of SIN 349." This is, of course, the basis of my various rants on the matter, and that by that definition there is no way to get a broadband fault fixed.

However, BT plc t/a Openreach, who (from the above) actually provide the SFI2 service state that "SFI2 is a chargeable investigation product that attempts to identify and resolve Digital Subscriber Line (DSL) Service affecting problems." and it goes on to explain that this service is used when the line "is apparently working within the LLU contractual specification of SIN349".

It goes on to explain the modules available, checking at the exchange, and checking the network, etc. It goes on to list all of the steps done by the engineer in the initial (base) module to identify the cause of the problem, such as checking the modem is connected and shows sync. It does say that the engineer does a pair quality test, and if that fails, he will work on basis of a line fault first, and then go back to trying to resolve any remaining broadband issues.

This is, basically, the way SFI2 engineers used to be defined by BT plc t/a BT Wholesale. So why have BT plc t/a BT Wholesale changed their definition of this service, whilst now claiming it is simply a service provided by BT plc t/a Openreach and offered to us? And why are they trying to charge us for it? It is defined as a service they can buy to fix the service they sell us (broadband) and even defined in such a way that it should only be used where the line meets SIN349 already, which means BT plc t/a BT Wholesale's charging rule (charge when the line meets SIN349) would always charge us. It would be a mistake for an SFI2 visit to not be chargeable as it should not have been requested in the first place if the line does not meet SIN349.

One of them, either BT plc t/a BT Wholesale, or BT plc t/a Openreach, must surely be lying to us? And the only possible reason we can think of for doing so is to charge us money, making such a lie in to criminal fraud, in my opinion.

We've asked questions about the apparent difference between the two descriptions of the service, and await answers. However, the previous email on this has gone weeks with no reply.

For now, we have to work on BT plc t/a BT Wholesale's statements as to this optional service which we would never want or need, and ask them to actually fix the broadband.

It is worth mentioning that this may impact how we deal with TalkTalk slightly, as they are not making the same claims about what the SFI2 service provides. They are, also, responding to us and asking to discuss how we can work together to come up with something better. Well done TalkTalk.

Thursday, 23 April 2015

Trying to do the right thing

These new migration rules for broadband are a bloody nightmare. We have already had to contact customers for explicit consent to email the notices (thanks to all of those that have confirmed so far). But the next challenge I am facing is the contents of that Notice of Transfer.

One of the things we have to include is details of the amount of the Early Termination Charge due at the expected Migration Date.

This may sound simple, but it is far from it. I'll explain some of my problems here, and what I am thinking of doing.

30 days notice

Our units based tariffs operate on a simple 30 days notice basis. This is simple as the current system of migration code (MAC) is valid for 30 days, so asking for one is giving us 30 days notice and we invoice (or credit) so you are paying up to 30 days time when you request the MAC.

With the new system the first we hear is when we get a notice of transfer, which has a 10 day lead time. Now, if we stick to 30 days notice, does that mean there is a 20 day "early termination charge"? Well no - it is not an early termination charge is it? It applies after one month or after 10 years of service, it is not a charge for terminating "early" in any way.

But we also run in to an issue that someone could "give us notice" on the 1st that they will be migrating at the end of the month, and then start a migrate on 20th (with 10 days lead time) and rightly expect not to pay any extra as they gave notice 30 days before. We are simply not geared up for that, so I expect we will change the logic so there is no "30 days notice" any more, and just a final bill adjusting to the termination date. We may make the units tariff have a 1 month min term in that case.
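The arithmetic behind that 20 day gap, and the awkward "gave notice earlier" case, can be sketched as below. This is only an illustration of the current 30 days notice logic against the 10 day lead time. The function name and the example dates are made up.

```python
from datetime import date, timedelta

NOTICE_DAYS = 30          # our current notice period
MIGRATION_LEAD_DAYS = 10  # lead time on a notice of transfer


def chargeable_days_after_migration(migration_requested, prior_notice=None):
    """Days of notice period still running after the migration date.

    migration_requested: date the notice of transfer reaches us.
    prior_notice: date the customer separately gave notice, if they did.
    """
    migration_date = migration_requested + timedelta(days=MIGRATION_LEAD_DAYS)
    notice_given = prior_notice or migration_requested
    notice_ends = notice_given + timedelta(days=NOTICE_DAYS)
    return max(0, (notice_ends - migration_date).days)


# No earlier notice: 30 days notice less 10 days lead time leaves 20 days.
print(chargeable_days_after_migration(date(2015, 4, 20)))
# Notice given 30 days before the migration date: nothing extra to pay.
print(chargeable_days_after_migration(date(2015, 4, 21), prior_notice=date(2015, 4, 1)))
```

Tracking that optional earlier notice date is exactly the bit we are not geared up for, which is why billing to the termination date looks simpler.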

Minimum term services

This should, surely, be simple? We have Home::1 with 6 months minimum term, Office::1 and FTTC with 12 months minimum term. So when you migrate there is the charge for the remainder of the minimum term.

The complication here is that if the service has already been billed to the end of the month, or quarter, is the "early termination charge" the value of the final invoice you get, extending that to the minimum term, or is it the charge from the migration date to the end of the minimum term which includes some you have already been billed?

Also, if you ask to migrate on say the 28th of the month, you get a regular month's invoice on 1st, and then migrate on say the 7th of next month, should the bill you have not yet had when you start the migration be considered part of the early termination charge?

My current plan is to specify that you are migrating on date A, and the min term is the later date B, and so the charges from A to B are X months at £Y making an "early termination charge" of £Z, and then noting that you have already been charged up to date C and so some of that charge has already been invoiced. Even that is complicated and may need a further note that you will get your regular invoice for services from date D as normal which may be before date A. How the hell we word this in a way that is clear to the end user is not easy.
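To make the A/B/C breakdown concrete, here is a minimal sketch of the calculation. It assumes all the dates fall on the same day of the month so that whole months can be counted, and the £25 monthly price is a made-up figure, not one of our tariffs.

```python
from datetime import date
from decimal import Decimal


def months_between(start, end):
    """Whole months from start to end (dates must share a day of the month)."""
    if start.day != end.day:
        raise ValueError("this sketch only handles dates on the same day of the month")
    return (end.year - start.year) * 12 + (end.month - start.month)


def early_termination_breakdown(migration_date, min_term_end, billed_to, monthly_price):
    """Split the charge from date A to date B into invoiced and not yet invoiced."""
    if min_term_end <= migration_date:
        months = 0  # already past the minimum term, nothing to charge
    else:
        months = months_between(migration_date, min_term_end)
    total = monthly_price * months
    invoiced_months = 0
    if months and billed_to > migration_date:
        # Only the part of the billed period inside A..B counts.
        invoiced_months = months_between(migration_date, min(billed_to, min_term_end))
    already = monthly_price * invoiced_months
    return {
        "months": months,
        "total": total,
        "already_invoiced": already,
        "still_to_invoice": total - already,
    }


# Hypothetical: migrate 1 May (A), min term ends 1 Sep (B), billed to 1 Jun (C).
print(early_termination_breakdown(
    date(2015, 5, 1), date(2015, 9, 1), date(2015, 6, 1), Decimal("25.00")))
```

In that example the notice would say 4 months at £25 making £100, of which £25 has already been invoiced and £75 is still to come.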

This is with what I consider a "simple" system that is just a minimum term.

Combined services

This is where it gets really complicated. We have PSTN lines / copper pairs provided only for broadband service, so you have two services. We could get migration notice for one or the other, or both.

For a start, if we get a notice for the broadband we have to warn the end user that we will cease the copper pair when they leave, killing their migrated broadband if they don't also arrange to migrate the copper pair.

But the "early termination" is complex. The amount we'll charge for minimum term will be for both copper pair and broadband to the end of the min term. So do we tell people that when we get the notice for the broadband? We'll have to explain that both services will stop early.

Of course migrating the copper pair could mean keeping the broadband with us, so maybe that just needs to give notice for the copper pair min term. Though we could possibly not have a minimum term on that aspect of the service - that would be one option. It complicates matters for customers when some parts of the service have a minimum term and some do not.

Multiple lines

Of course, it gets even more complex with multiple line services. Our Office::1 service is two lines minimum. We do not do a single line option, or rather the price for a single line service is the same as two lines. So if migrating one line, there is no early termination charge - you just keep paying for the service at full price on the remaining one line service. But do we say that (no termination charge, service continues on one line) for each notice we get? Or do we somehow work out that we have had notices for the other line as well or the PSTN for the other lines?

Office::1 requires our PSTN lines, so is a migrate for PSTN to be treated as ceasing that line for broadband too? Maybe migration of any PSTN or broadband on Office::1 should be taken as notice to cease the whole Office::1 package? That may be the simplest approach here.

We have similar issues with units tariffs where the units part applies to the lines collectively. Cancel one line of a set and the units charges carry on - cancelling the final line, and the units charges stop or have some early termination charge. Now this is where making units tariff simply bill/credit to the migrate date and not have an early termination charge is simpler, but making it one month minimum term would still leave a possibility of complications. Maybe simpler for the units not to be refunded - i.e. they are charged in advance and applied in full to the period, full stop. After all, we already have complications if someone has a month of units and uses all of that data and leaves mid month. But is that fair?

I think, simplest option will be, PSTN/copper pair lines have no minimum term, and can be migrated away independently of the broadband (except Office::1). That solves the migrates on that side. I think that any multiple line package with a migrate of broadband should be treated as ceasing the whole multiple line service. That should make the notice clearer and can explain the costs for minimum term on the whole service.

No refunds?

One system we used to have when we first started was a no refunds system. It meant you are billed in advance each month. If you leave by any means during that month, you are not refunded for what you have been billed. This is a much simpler system than 30 days notice, and makes a lot of this easy. It is not an early termination charge, so nothing special needs to be included in the NoT. It would possibly work for units billing and services beyond the minimum term, but those within minimum term could still be complex. It creates issues for people paying quarterly. It creates issues with all ceases and migrates being last day of the month. Whilst it is simpler in many ways I suspect it is not that viable.

Trying to be fair...

I wonder how else we can do this and be fair to customers. We do have minimum terms we pay carriers for things like FTTC and even one month on ADSL. The Home::1 and Office::1 are priced on the basis that we have the customer for a minimum term as we make a loss on the set up charges. I suspect we have to change some of the logic for what we do in terms of notice and invoices/credits on leaving. We also have to consider how these changes relate to people who cease services rather than migrate away. We could go for simple 10 day notice for cease to align with the migrate system in the way it works, perhaps.

Other ideas?

One idea was to simply have a fixed price early termination charge on the minimum term services. This would fit the OFCOM model perfectly, but it means OFCOM dictating the terms on which we do business, which is getting silly.

Why the hell do OFCOM have to meddle?!?!