Tuesday, 28 February 2012

The soul of a new rack...

Well, I am pleased to say that the old rack is now gone - literally - we wheeled it out of the building and loaded it on a van (well, Paul and Andrew did).

The new rack is in use for our broadband services and Ethernet from London, and DNS. It does our core connectivity.

There are 10 shiny new dual power fed FB6000 series FireBrick routers in the rack and a couple more to come. They use around an amp between them! There are lots of gigabit links from BT, LINX, LONAP, Level 3, and others.

BT have to finish the move of the old host links over the next couple of days, and then we will have four gigabit links to them in the new rack.

Room to expand.
Room to handle the Olympics demand?

Well done to all my team for their work on this. Dust will settle and all will be well.

Mind you - BT take some space!


Oh, and yes, every FB6000 has a custom engraved front panel.

Tighter timing on routing on PPP

When we get a PPP connection (over L2TP) one of the first things we do is a RADIUS request to authenticate. That provides the IP routing which may be single PPP link IPv4, blocks of IPv4, and blocks of IPv6. We even allow IPv6 tunneled to IPv4.

What we used to do is add the routes at that point in time, then negotiate PPP IPCP, and IPV6CP. When the session closed we did RADIUS STOP and then removed routing.

The RFC prohibits sending packets until IPCP or IPV6CP is complete and we had a check for that which dropped the packets.

We have made some changes, which should come in later this week, that tie the routing much more closely to the IPCP and IPV6CP state.

The main impact is bonded and fallback lines. The few seconds that can happen on IPCP and IPV6CP negotiation at the start, and for RADIUS STOP at the end, can mean routing is up but the link drops the traffic.

So the new plan is that immediately on IPCP ACK received we announce the IPv4 routes, and immediately on IPV6CP ACK received we announce IPv6 routes. IPv6 tunneled over IPv4 are on IPCP ACK as the PPP level traffic is IPv4.

We also drop the routes immediately on LCP TERM or L2TP CDN, before waiting for RADIUS STOP confirmation.

This should avoid even milliseconds of routing where there is no link.
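To make the ordering concrete, here is a minimal sketch in Python. It is not the FireBrick code; the class, method names and log strings are all made up for the example. It only models which routes are announced after each event described above.

```python
class PppSession:
    """Toy model of one PPP session's routes (hypothetical, for illustration)."""

    def __init__(self, ipv4_routes, ipv6_routes, tunneled_ipv6_routes=()):
        # Routes as supplied by the RADIUS reply: held, not yet announced.
        self.ipv4_routes = set(ipv4_routes)
        self.ipv6_routes = set(ipv6_routes)
        # IPv6 carried inside IPv4, so it follows IPCP rather than IPV6CP.
        self.tunneled_ipv6_routes = set(tunneled_ipv6_routes)
        self.announced = set()
        self.log = []

    def on_ipcp_ack(self):
        # IPv4 is usable now, and so is anything tunneled over it.
        self.announced |= self.ipv4_routes | self.tunneled_ipv6_routes
        self.log.append("announce-ipv4")

    def on_ipv6cp_ack(self):
        self.announced |= self.ipv6_routes
        self.log.append("announce-ipv6")

    def on_close(self):
        # LCP TERM or L2TP CDN: withdraw routes first, RADIUS STOP afterwards.
        self.announced.clear()
        self.log.append("withdraw-all")
        self.log.append("radius-stop")
```

The point of the sketch is simply that nothing is announced at RADIUS time, and that the withdrawal is logged before the RADIUS STOP rather than after it.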

Subtle, I know, but an improvement.

Source filtering L2TP

One of the things we do as standard is source filtering of L2TP connections. It means that our customers on broadband cannot spoof packets. This is standard on Internet services and Best Current Practice.

It is not filtering their Internet connection as the packets are not validly from them if they have the wrong source IP, IMHO, so let's not start that discussion.

We are quite comprehensive, catching IPv4, IPv6, tunneled IPv6 with us as the endpoint, and 2002::/16 prefix IPv6 using their IPv4 address space, and even when that is tunneled over IPv4. So quite generous!

The way it works is that when a packet arrives we look up where we would send that packet. If to the same L2TP session then it is allowed. If bonded to multiple sessions and one matches, then it is allowed.

We even go further, and if there are lower metric L2TP sessions for that target IP we check those too. The reason is that someone may have main and backup downlink on the same IPs but want to bond uplink - sending from those IPs from either of two (or more) lines. It works well.
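As a rough illustration of that check, here is a small Python sketch. It is an assumption-laden model, not the real forwarding code: the Route fields, the "preference" number (0 = best, larger = less preferred, which is how I am reading "lower metric" here) and the function names are all invented for the example.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str       # e.g. "192.0.2.0/29"
    session: str      # hypothetical L2TP session identifier
    preference: int   # hypothetical: 0 = best, larger = less preferred


def sessions_for(routes, source_ip):
    """Routes on the longest prefix covering source_ip, best first."""
    addr = ipaddress.ip_address(source_ip)
    matching = [r for r in routes if addr in ipaddress.ip_network(r.prefix)]
    if not matching:
        return []
    longest = max(ipaddress.ip_network(r.prefix).prefixlen for r in matching)
    on_longest = [r for r in matching
                  if ipaddress.ip_network(r.prefix).prefixlen == longest]
    return sorted(on_longest, key=lambda r: r.preference)


def source_allowed(routes, source_ip, arriving_session):
    """Allow the packet if we would route its source IP to the session it
    arrived on, including bonded or less-preferred sessions for that IP."""
    return any(r.session == arriving_session
               for r in sessions_for(routes, source_ip))
```

For a main and backup line sharing a block, a packet sourced from that block is accepted on either line, and rejected from any other session or with any other source address.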

The issue is when we have multiple LNSs. For bonded downlink we ensure lines go to the same LNS (by using a hash of the login to direct sessions), but there can be any number of reasons for this not to be the case including equipment failure and planned LNS changeovers.

We have a system to pick up split LNS line groups and bounce lines to fix this.

But still, there can be a problem - a window of opportunity to be broken. The issue is that the routes are shared by BGP, so if line 1 is high priority on LNS 1, and line 2 is low priority on LNS 2, then LNS 2 sees the best route to send traffic to LNS 1 (by BGP) and so does that.

For downlink routing that is fine - the split LNS sends data down one line not bonded. Not as fast but it works.

The uplink breaks though as LNS 2 does a route lookup and finds it would route to the other LNS not to an L2TP session, and so blocks the uplink traffic. Ooops.

So, much coding today, and much testing, and now the forwarding system has a list of L2TP sessions for source checking on any route type, even BGP. So it sees BGP to the other LNS as the best route but sees it also has one or more lower metric L2TP session routes, and so allows the uplink traffic if the session matches.
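The difference between the old and new behaviour can be sketched like this in Python. Again this is only a toy model under stated assumptions: the Route fields, the "preference" number (0 = best) and both function names are hypothetical, and "routes" stands for the candidate routes LNS 2 holds for the customer's source IP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    kind: str        # "l2tp" or "bgp" (hypothetical labels)
    target: str      # session id for l2tp, peer name for bgp
    preference: int  # hypothetical: 0 = best, larger = less preferred


def old_check(routes, arriving_session):
    """Old behaviour: only look at L2TP sessions if the best route is L2TP."""
    if not routes:
        return False
    best = min(routes, key=lambda r: r.preference)
    if best.kind != "l2tp":
        return False  # best route is BGP to the other LNS, so blocked
    return any(r.kind == "l2tp" and r.target == arriving_session
               for r in routes)


def new_check(routes, arriving_session):
    """New behaviour: check the L2TP session list whatever the best route is."""
    return any(r.kind == "l2tp" and r.target == arriving_session
               for r in routes)


# View from LNS 2 in the split case: best route is BGP to LNS 1,
# with the local low priority line as a less-preferred L2TP route.
lns2_routes = [Route("bgp", "lns1", 0), Route("l2tp", "line2", 1)]
```

With that table the old check refuses uplink traffic arriving on line2 and the new check allows it, while a session that has no route for those IPs is still refused.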

New code in place, more testing to do, but may be deployed this week for real.

PS I promise not to take AAISP to ADR!

Yes, we are getting support tickets saying this.

One says "PS I promise not to take AAISP to ADR!"
And another says "PS I promise not to use ADR if you can't find anything :-)"

I don't know what to say - thank you!

I am getting all emotional that I have such nice customers :-)

I think I owe a few pints at the next AAISPISSUP

Monday, 27 February 2012

Thanks for the support

I am really pleased at how much support I am getting on this whole ADR thing. Even people who have used ADR and won awards are giving sympathy and advice.

I do not know how it will end. It is causing more stress than anything before because it so fundamentally upsets my view of the world. I really thought I had a handle on contracts and liabilities.

I really hope the outcome is a change in the way ADR works. If we cannot win this argument we need to go higher - writing to MPs, complaining to OFCOM, judicial reviews, whatever. It simply cannot be allowed.

I really hope this does not change the way we work with our customers.

To be clear - this is a business that wanted to be a cheapskate and stream live video commercially on ADSL lines for the Royal Wedding. We provided the service they wanted in time for the event they wanted for the price they expected (well, according to them within £6.94, and we credited more than that). We bent over backwards to get the service installed and working in time even though we faced huge obstacles. The ombudsman agree we were not in breach of contract.

There is no way that any sane legal framework allows us to be penalised for that.

For journalistic purposes, criticism and review, I am quoting one paragraph from the initial decision. I believe this is valid under copyright law. It is the final paragraph addressed to the claimant.

"You complain that you have incurred direct losses as a result of the delays AAISP demonstrated. You have not provided me with any evidence of this, and in any event it is not appropriate to recommend that AAISP make any such payment to you because it has not breached its terms of contract, which also preclude such claims being made. However, I am mindful of the multiple shortfalls in AAISP’s service, and the likely additional problems that they have caused you. I am satisfied that warrant a fairly substantial goodwill award, of £500 in addition to the £200 referred to above."

In this one paragraph we see the contradiction. The fact we are not in breach of contract. The fact it would be inappropriate to award losses as we are not in breach of contract. Then the award for loss of convenience with some magic sum that they do not explain and call "goodwill".

That should simply not be valid.

First casualty

Someone would like us to try to configure a Juniper router for them to go on one of our lines.

I was inclined to say no anyway, but normally we would at least give it a try, even though we have never set these up, and he did not get it from us.

But this nagging feeling hit me - if we try - and we fail (quite likely as we have no experience of them) - we could be liable to ADR.

So I said no.

Maybe when this is all sorted I'll be less paranoid.

Sunday, 26 February 2012

Apology

One of the aspects of the (not yet final) decision of the ombudsman is that we have to make an apology.

This is slightly confusing to me - obviously in the various emails sent we have apologised for the time it is taking to provide various services, so we have already complied.

But ultimately they agree we are not in breach of contract, so I assume an apology along the lines of :-

"We hereby fully apologise for any inconvenience caused as a result of us providing goods and services to you in accordance with the agreed contract terms"

is what they need. That is what I plan to say if required to do so. It is the truth after all. We really are sorry that we ever agreed any contract with these people now. Very very sorry. Definitely won't happen again.

Quicker court action for non payment of bills?

Normally we will take a long time before taking someone that owes us money to court. We try and sort any misunderstandings or disagreements and get the money by amicable means. We even allow people to pay by installments. There are lots of things we try to do to be reasonable and not heavy handed.

However, one of the shocking things about this ADR case is that they took the case on even though we were taking the customer to court for unpaid invoices. They then went on to consider the unpaid invoices in their decision - requiring us to waive invoices for services that were provided and used but not paid for.

So it seems ADR can be used as a means to get out of paying your telco, even after the telco has started proceedings in the county court.

So, in future, given that they won't take a case until it is 8 weeks old (well, that is what they say, but who knows what they do - contracts and agreements mean little to these people), we will have to ensure any case where someone owes us money goes to court in enough time to get a decision by the court within 8 weeks.

We would hope that if a dispute is resolved by the court then it cannot go to ADR at that point. Again, this is a huge assumption.

That should be slightly easier on business to business disputes as the court allow a lot less time for a defence to be filed - but even so 8 weeks is pushing it.

It means we'll have to start court proceedings after a week or two at the most if someone owes us money and has not paid.

Otherwise the customer can blackmail us by threatening to take the matter to ADR, knowing that we not only have to pay £335 even if we are right, but we could have the invoices waived and substantial good will awards made against us, even if we are in the right too.

If that is what OFCOM want by insisting on ADR, then that is what they get. Sorry.

P.S. To qualify this - this is obviously something of a devil's advocate type of posting, and we are talking about people refusing to pay. Once the dispute is started there are 8 weeks in which we can get the dispute resolved by a real judge following normal contract law. If we fail then we are subject to some alternative reality legal system that charges us when we win and even awards penalties against us when they agree we are in the right. That is when we have to act quickly. We have no choice - as a director I have to act in the interests of the shareholders and the company - that is the law.

Coding

We are trying to get a FireBrick s/w release out today. It has been a while since the last one (over a month now) and we have made a number of small changes as well as adding some cool features.

We are about to embark on some major work - not just new features (hopefully a SIP gateway) but some associated rework of parts of the operating system. These will need some careful testing in alpha releases. So this is a perfect time to make a new factory release before we start that work.

Unfortunately, last night, we had a slight issue on one of the A&A LNSs. As we are waiting on BT moving one of the host links we are still running on two LNSs, one live and one backup, just that they are the new ones in the new rack. At around 8:03pm last night the live LNS stopped doing any RADIUS lookups.

It did not affect any existing connected sessions but anyone dropping and reconnecting for any reason could not get on line. The problem was not picked up by nagios and it is hard to work out what it could test that would reliably detect this - it is a new problem. Thankfully someone did send an MSO text and I was able to re-route new connections to the backup LNS. Before you ask why the backup was not being used anyway it is because the live LNS was still accepting the connections at the L2TP level - the behaviour from outside was the same as a duff login, and so not a reason to use the backup LNS. This meant the few people who could not reconnect were now back on line and everything was working. Overall almost all of our customers were totally unaffected (apart from not being metered for usage for a while).

This meant I could try and find the cause of the problem. Eventually at around 1:40am I moved all lines to the other LNS and reset the problem LNS. This meant everyone reconnecting, which happened really quickly (the recent changes to our RADIUS authentication and accounting servers worked well and kept up). A few lines somehow hit a BT default accept and tried 10 minutes later but I am not sure we can do anything about that, sadly.

So this leaves me investigating what happened, and potentially delaying the factory release. The analysis last night basically showed that a counter had gone negative when that should not be possible. It points to some unexpected race condition / interrupt case.

Now, before I get a lot of helpful posts from the coders out there, we have been doing this a while. In embedded coding you have to understand what happens at the processor instruction level and allow for interrupts and task switches at any possible point. It is easy to code something where there is a very very slim chance of an interrupt at the wrong moment causing problems. If you do that you can bet that it will happen, one day, and break stuff.

It is probably worth explaining this a bit as I know we have non coders reading this as well as people that have never done embedded coding.

In a high level language you might have a simple and innocuous-seeming instruction such as x++. This means adding one to the value of x. Simple. You consider it to be an atomic operation, but actually the processor will be reading x, changing it, and then writing it. It is possible to get a task switch or interrupt between these operations. Now suppose some other task or the interrupt does x-- (taking one off x) as well. Normally the two things result in the value not changing: one thing adds one, and another takes one away. But consider the bad timing of the interrupt as follows:-
  • x starts with a value of 10, and you are doing an x++ operation.
  • You read x in to a register, 10
  • You add one to the register, 11
    • There is an interrupt which reads x as 10, takes one from it making 9, and writes x as 9, and returns from interrupt.
  • Now, back in the main thread, you carry on and write the register, 11, back to x.
The end result is that x changed from 10 to 11, even though code to add one, and code to take one away, have both been run. x is no longer the value you expect.
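The steps above can be replayed deterministically. This is a plain Python simulation of the interleaving (not real interrupt code, and the function names are just for the example), with the "register" modelled as a local variable:

```python
def badly_timed(x):
    """Replay the interleaving above: x++ interrupted by an x-- handler."""
    register = x            # main code reads x into a register
    register += 1           # main code adds one in the register

    # --- interrupt arrives here ---
    irq_register = x        # handler reads x (still the old value)
    irq_register -= 1       # handler takes one away
    x = irq_register        # handler writes x and returns
    # --- back in the main code ---

    x = register            # main code writes its stale result over the top
    return x


def interrupts_inhibited(x):
    """Same two operations, but the x++ completes before the handler runs."""
    x = x + 1               # whole read-modify-write done without interruption
    x = x - 1               # handler runs afterwards
    return x
```

Starting from 10, badly_timed returns 11 (the decrement is lost), where interrupts_inhibited returns 10 as you would expect.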

The problem then is that the consequences of a wrong value can cause something to break much later on. If x is the number of items in a list, say, then things might not break until x gets to 1 but the list is now empty.

So, in embedded coding you have to be careful. Very careful. The fact that the compiler will hold variables in registers for extended periods is one issue. The fact some variables may take more than one memory location (e.g. 64 bit counters) is a problem too as the read or write of the two locations could have an interrupt in the middle! The issue is even more complex when you have two processors able to access the same memory. This is why we have memory management at a hardware level, schedule locking, mutexes, and interrupt locked code as appropriate.
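The 64 bit counter case can be simulated the same way. This is an illustrative Python model only, assuming a counter stored as two 32 bit words (low word first) on a machine that reads one word at a time; the helper names are made up for the example.

```python
MASK32 = 0xFFFFFFFF


def split(value):
    """Store a 64 bit value as two 32 bit words: [low, high]."""
    return [value & MASK32, (value >> 32) & MASK32]


def increment(words):
    """What an interrupt handler might do: add one, carrying into the high word."""
    words[0] = (words[0] + 1) & MASK32
    if words[0] == 0:
        words[1] = (words[1] + 1) & MASK32


def torn_read(words):
    """Read low word, get interrupted, then read high word."""
    low = words[0]
    increment(words)        # interrupt lands between the two reads
    high = words[1]
    return (high << 32) | low


def guarded_read(words):
    """Both words read with interrupts inhibited; the handler runs afterwards."""
    low = words[0]
    high = words[1]
    increment(words)
    return (high << 32) | low
```

With the counter at 0xFFFFFFFF, the torn read returns 0x1FFFFFFFF, a value the counter never actually held, because it pairs the old low word with the new high word. The guarded read returns 0xFFFFFFFF.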

My challenge this morning is finding the mistake. It is code we have not changed for some time. Usually bugs happen as a result of coding (or "enbugging"). Mistakes are made, and interactions are not thought out and that results in a bug. But this is not new code. In fact, the change that may have triggered it is likely to be the recent change on our RADIUS server making it a lot quicker to respond. This will have impacted the timing of the RADIUS requests and replies and the associated L2TP connections. It could be the explanation of "why now?".

The issue, so far, having looked at this until nearly 3am, is that the code is written with belt and braces - the only place the counter is touched is within one very carefully coded short function which inhibits interrupts around the relevant section of code. The counters are locked in the memory management to the one processor even. Basically, this can't have happened. They are the fun bugs to find!

Saturday, 25 February 2012

Not our fault?

This is really trying to spark some debate and not give a definite answer.

One of the things that has come out of this whole ADR fiasco is that there was a problem caused by BT. But our contract is with BT and our customer's contract is with us and not with BT.

So what happens when there is a problem like this?

Now, I fully understand that the customer's contract is with us and not with BT. We usually give details of what the cause of any problem is, and what we are doing to solve it. This means that if we are the cause we say so. If BT are the cause we say so. We are not trying to "excuse" the fact things are broken by passing the buck or blaming someone else - we are simply trying to provide truthful and honest details of what is happening and why.

In contract terms we take some care to mirror the terms we get where we can - where BT do not guarantee a delivery date, or experience tells us they will not meet targets, we don't try and offer any guarantee to our customer.

This means we can find that we let people down on some occasions, whether our fault or someone else's - it is *us* that has let the customer down.

We make sure we don't have liability in such cases - other than not charging for services until they are finally, possibly delayed, installed. We don't make money until they are installed, so we lose out as well.

We thought that was good enough - being up front on what we can and cannot guarantee and what remedy there is for delay, if any.

It seems however that when someone else causes us a delay, even though the delay on our part is not breach of contract and not going against what was agreed in any way, somehow we are liable to pay for the inconvenience caused. Yet we are not able to do the same to our suppliers. We have no way to insist our suppliers ignore the terms we agreed and compensate us.

So what should happen?

Is it right that the agreed contract is what matters? I thought so.

Should we take the hit on compensating someone for a delay that was not our fault when we never agreed to a date in the first place?

Should we be able to pass on this liability to our suppliers somehow? If that is not down to contracts, what legal framework should allow that?

What if the delay was our fault, and we did not guarantee an install date? Is that different, or just covering our arse as effectively?

Is it really fair that we can offer a service without guaranteeing a date, and the only compensation being that the customer does not start paying until the install finally happens? Is that fair and reasonable? I thought so.

I am interested in feedback on this as I have actually toned down our T&Cs to allow for delays that are not our fault more clearly, i.e. that there is no charge for service we have not installed because of a delay, even if the delay is not our fault. I think that is fair too.

What's the view here?

Calmer?

I have been coding this morning to take my mind off things - the FireBrick now has a load of "local DNS" functions which we're going to expand on later to allow more general wildcard DNS blocking and the like.

That has, slightly, calmed me down. To be honest I got no sleep, hence posting comments here at 4am.

I just cannot get my head around "we agree you are not in breach of contract, and the customer is in breach, but you still have to not only pay us £335 but pay the customer £1200 as well". It just "does not compute". Being not in breach of contract means we have won the case, surely. And they say they consider the law and contract. They say no award will be punitive. They say that an award for losses would be inappropriate as we were not in breach. Yet still they make an award.

However, the next step is to go through the decision (not yet a final decision) step by step - identify any factual errors. I have also asked them to send us the actual "claim" as it seems from their decision that it is far beyond the scope of what we have already dealt with from this customer. It is strange we are expected to defend ourselves without knowing the accusation! Just another aspect of the alternative reality dispute resolution process - all normal process and legal frameworks go out of the window.

They will review it and make a final decision. If they do make a final decision that includes this punitive "good will" payment, then that means getting some legal advice. Can we ignore them as the decision is not "according to their terms of reference", and what wrath of Otello or OFCOM would ensue if we did? We don't know.

This is meant to be an alternative to the courts - easier to use and less risk for the consumer or small business. But I had no idea it meant alternative law and logic.

I'll keep you posted.

Oh, and the new rack - yes, I know, pictures or it didn't happen... We are waiting on BT moving the old host links which I think is Monday or Tuesday. We then take a van in and remove the old rack and the kit. From what I can see everything is moved over now apart from that BT host link. We will have some tidying up next week, making sure the two host links are working as expected, etc, but we are nearly done.

Thank you all for your patience in this.

Friday, 24 February 2012

Below average

Well I know doing stats at school was rare, but surely everyone knows that "half the population are below average". It is kind of (one of) the definitions of "average"!

Apparently not, Julia Stent states on the BBC that :-

"Britain might be riding the wave of a super-fast broadband revolution, but for 49% who get less than the national average broadband speed, the wave isn't causing so much a splash as a ripple," said Julia Stent, director of telecoms at uSwitch.

Really how thick can you be?

New scam

Clearly ADR is (IMHO) a scam...

So we are missing a trick - we need to re-org A&A to be under 10 staff. Doable. Maybe make a new company for A&A (retail) ltd.

Then we can take our favourite telco to ADR for "customer service" issues pretty much every day.

Each dispute would not be a breach of T&Cs, no, but would be poor customer services justifying a £500 "good will" award to us.

We have to seriously consider this scam now.

P.S. Just asked the ombudsman how we can take advantage of this scam, as clearly T&Cs do not matter... We want good will payments for every "our favourite telco" cock up - i.e. several a day... No reply.

Changing terms

Well, the whole ADR thing is causing several small changes to T&Cs for A&A.

They are pretty small to be honest - the first is the "customer complaints code" which now says that to make an official complaint you have to tell us:-
  • The exact amount you are claiming
  • How you worked out this amount
  • What steps you have taken to minimise this amount
  • What exactly we did wrong
  • Why this was a breach of contract. i.e. which clauses exactly. If it was not a breach of contract you have no claim.
  • For any claim relating to an ongoing service that was not working for a period, then, for each specific service (i.e. broadband is separate from annex M and separate from email, etc)
    • In what way the service was not working
    • Why was the service not working our fault
    • When you reported the service not working
    • When the service was fixed (or ceased, if not fixed)
    • Exactly how much you were charged from when reported to when fixed or ceased. Please quote the invoice numbers.
  • If the amount you are claiming exceeds the limit of liability agreed in the contract (i.e. what was charged for each service for the period it was not working due to our fault), then explain why you believe the limits in the contract do not apply. You cannot claim more than is agreed in the contract.
That is meant to cover this case - the claimant never made any clear claim in the first place, and did not state the amount or why, or what we did wrong or why that was breach of contract.

So that change alone should help a lot. To be honest I think this is a small change. When shit goes wrong people get cross. You need to focus. You need to be able to say what you want and why. That helps massively. So doing this up front makes it easy for us, and if ever it goes to ADR it helps. We also added a whole section on "Step 3: Feedback and rants" on the basis people have some way to just complain without making a "formal claim".

To be clear, and we say this as well, we can make good will payments (if we decide, not if ADR decide) if things did not go well even if not our fault. We do this. Of course, if doing this makes us somehow more liable we'll stop, so I hope not. But we want to be fair. Problem is that ADR is by definition unfair, which is odd as their own terms say they should be fair. It is unfair as one side pays regardless for the case. Why are they even using the word "fair" in their terms when clearly they are not?

The big issue is that this would have meant there was no case - we did not breach contract. Sadly, having changed our complaints code we now realise that the fact we did not breach contract is irrelevant!!! WTF!

So other changes in the main terms. We state "We expect our staff to always be helpful and polite, and we expect customers to do likewise - however this is not part of the contract. If we, or you, are rude or unhelpful then there is no compensation for this either way in the contract. We may terminate a call or chat if we feel you are being rude, and we would be happy for you to do the same if staff are rude. We are happy for you to provide feedback on such cases and we will endeavor to address them (again that is not part of the contract). Just to be clear - we are not agreeing to compensate you for causing stress or inconvenience either."

Again, I was hoping this meant that any accusation of being rude or unhelpful would count as not being a breach of contract. Little did I know that not being in breach of contract did not count!

So finally we have added "Alternative Dispute Resolution (ADR) is a service that allows you to make a claim without going to court if you think we have done something wrong and we owe you compensation. It is important to realise that this contract has clear limits on our liability even if we do something wrong, and you have agreed to those limits. This includes the fact that nothing is due for being rude or poor customer services, only where we are in breach of contract. ADR can take complaints for many reasons, but not about our terms and conditions. However, as we are not confident that ADR will limit any awards to the limits agreed in this contract, it is a term of this contract that if you take an issue to ADR and are awarded an amount in excess of the limits strictly due under these contract terms, then you will immediately re-imburse us that excess or allow us to deduct it from the award. The arbitrator claim to consider the law and our terms in their decision, and so this clause should never be needed."

I have no idea if that holds water legally. It should, IMHO, because it says "this contract counts". It says "what we agreed was the limit of liability really is the limit". It should not need saying. It is normal contract law that if we agreed a penalty for breach then that is what we agreed. Saying we can enforce the contract over what an ADR says should be a non issue as they are meant to take in to account the contract and the law. Clearly they don't do this, so by putting this in, in theory we can sue someone that takes us to ADR and gets a silly award, or simpler we just withhold the award as agreed in the contract - something they cannot argue about.

This does not stop them using ADR though - they can - and the things ADR are good for like billing errors, are covered and no problem. Basically, anything ADR should cover, within the contract, they can. We would never expect any dispute to ever get that far as we will happily fix any mistake we genuinely make.

I can only hope we have another 15 years before another case like this and testing that contract clause.

I hate having to be "mean" to anyone, but really - contracts should rule. They say what we have agreed, and that should matter!

Rude ISP

We want to be nice and helpful to customers. I actually have pretty high standards for how my staff deal with people. I even expect them to be reasonably polite to "our supplier" if possible.

We want to offer a good service - we want to try our best to get things working for when you need it. We don't just do "computer says no!".

But I am thinking there is no point - whatever we do we will get stitched up - contract terms don't count.

So I was wondering - maybe we should set up as a "rude ISP". Right up front we say "we will say it how it is - we don't make any promise not to be rude - we are just straight up talking, in your face". Maybe state that we don't aim to avoid causing stress or inconvenience - all we aim to do is what we agreed in contract.

Would that help? If we were a rude ISP instead of a helpful ISP? Would that avoid the possibility of getting some random award against us?

I am thinking of writing to Otello and asking - ask if we state we are rude, could someone take us to ADR and get an award for us being rude. Perhaps ask if they could get an award if we were polite, i.e. not being rude?

I just don't know - arrrg! I don't know the rules by which the world works any more!

Business, Contracts, and law

I grew up over a shop - as my parents ran a business, and I have been running A&A for nearly 15 years now. I have been in business a long time. I have even been involved with a company that went under many years ago, costing me a lot of money. Lessons have been learned in many ways.

In all that time I have entered in to many contracts, as a consumer and a business. Personally I have argued contract terms with businesses and sued a few. As A&A we have taken people to court and won, and even lost in one case. Many more lessons learned.

I don't claim to be legally trained, but decades of experience, and a contract law book or two, mean I have a good grasp on contracts. To be honest, you have to in order to be in business.

The concept is simple - an enforceable agreement between two parties (well, sometimes more). Each party agrees to do something for the other - often one side providing goods or services and the other side providing money. Each side has to do what they agreed. If not, they have to compensate the other side for their losses. The courts will enforce the contract if there is a dispute and have power to order bailiffs and the like.

That is it in a nutshell - but there is a lot more. Just understanding the agreement and resolving ambiguities, for a start. Sometimes contracts are not written down. Sometimes there are implied terms by custom, or by law. Sometimes there are special nanny state provisions to protect consumers. It is not as simple as it sounds.

But that said - the whole idea of contract has been a cornerstone of all trade for most of the world for many centuries. It is enshrined in common law and legislation.

Of course a business to business contract is often easiest as businesses are expected to read the terms before agreeing, and at least understand the basics of contract law.

Contract law is one of the two main civil ways you can get a liability for something. Breach of contract means compensating for a loss - the compensation is not to punish you or benefit the other side - just put it back as it would be if you had not breached the contract. The other main case is tort, but where there is a contract in place then tort does not normally apply - your duty being to do what is agreed in contract. Also tort is complicated and means showing there was some duty of care that is breached. The only other ways to get a liability are criminal, and that is where the result is a punishment (money, prison, etc).

The county court small claims track is all about contracts. Mostly they are resolving a misunderstanding in a contract, and deciding on costs for a breach. They probably spend half their time explaining the principles of contracts to the parties, especially where consumers are involved.

I thought I had a good handle on it until ADR came along, or as we call it now "Alternative Reality Dispute Resolution". But no! My world is turned upside down by this. It undermines all I understood about doing business. It undermines centuries of common law.

They agree we are not in breach of contract. Normally that would be all of the hard bits of a contract dispute sorted. No breach of implied terms, no misunderstanding of terms, just not in breach. Yay!

But no - somehow, even though they agree that there are no defined losses to compensate for (and that as no breach of contract then compensating for losses is not appropriate), and even though the terms make it clear that no award they make is punitive (i.e. punishing us), they decide we have to pay £1200 (waive £700 of charges due and pay £500 good will).

A court would not have the power to do that - contract law does not allow it.

So how the hell can I do business, when the contract no longer matters - when someone can be awarded random large sums with no basis in law even if I have done what was agreed in contract. I am so flabbergasted by this I don't know what to say.

I have often thought of taking a course on law, specifically contract law, and may have to do that now.

ADR Arses

So, I was right to be cynical of ADR.

They clearly have not read the correspondence as their long explanation makes a number of statements that are just wrong, in my opinion.

However, they have stated that we are not in breach of contract - that should be a win for us. They also state that even though the claimant says he has losses, that as we are not in breach it is not appropriate to award compensation for such losses.

However, they have stated we have to waive charges for ongoing services, even though there has been no dispute of these from the claimant to us and the claimant had not asked for services to be stopped at any point.

And then, on top of that, they want us to make a good will award of £500.

This is plainly outside of the agreed contract terms. There is no law requiring us to pay anything if we did not breach contract terms. There is no law requiring a good will payment. So how the hell can they award this?

This means us being out of pocket by over £1500 - even though we did not breach our contract terms and the ombudsman service agree that point.

How can that be valid?

I think we have to fight this in court now.

Apart from all sorts of issues with this process - the few paragraphs we were sent when the claim was started said that the customer would be happy with a resolution that simply meant being released from contract without penalty. As we have already done this, we said that we agreed with their proposed resolution. In light of that we cannot see why they even took the claim on in the first place.

i.e. we had already resolved the dispute, already issued some good will credit (over £200), and already ceased the contract with no penalty. Yet, they take the claim costing us £335, and then decide we have to write off £700 owing and pay a further £500. Clearly this whole system is some sort of scam, in my opinion.

Thursday, 23 February 2012

Busy

Well, fun week - more work on moving to new rack and finally ceasing the old rack circuits and so on. Making some progress. Pictures soon, honest.

Two days training - some mates of mine - hard work - and a tad embarrassing with some things not playing. Looks like Zyxel DSLAM we have in office does not pass IPv6 PPP - WTF? We need to make things way more slick.

Several minor tweaks on the bricks as we expect - courses are damn good feedback. But this time far too many broken examples that need fixing. Sorry guys - but allowed me to show off a lot of the debug and diagnosis stuff :-)

I am still seriously planning SIP router stuff next week - but we have to be very clear on selling as a SIP router and not competing with asterisk. This is a packet level box we are making not some PC.

I have my doubts that I will manage it next week - insulin up to 20 units is getting better - but not the same as the film limitless just yet - I expect it to be like that as I get the dose right :-) I expect to be back to my old self - able to code in no time and not need sleep.... Dreaming, damn!

And on top of all of this - the race for a new toon to get to level 85 on WoW. I am getting behind (only 82). Gave up on SW:TOR as really just WoW with light sabres...

All things a tad disrupted this week by a friend being killed on a motorcycle. My daughter's best friend's father. Only 52. Very sad. We heard whilst at a meal on Sunday (for my birthday). Still sinking in, and the second friend I have known killed on a motorcycle. They clearly are dangerous things.

Tuesday, 21 February 2012

What are you all downloading!

I do not usually post any specific figures on usage and links to BT and so on, but I am quite surprised on this one.

On BT 21CN we have an (expensive) commit level. We generally try to upgrade by 10Mb/s or 20Mb/s or so every now and then as usage increases. We are trying not to be the bottleneck. So if usage is hitting the limit, even if only briefly in the evening, we order more capacity to BT.

This is something we are doing a couple of times a month.

This month we went for a big jump - from 500Mb/s to 600Mb/s of 21CN. We have 20CN and BE on top of that - hence upgrades to the network to allow over a gigabit. It is mainly due to the network upgrades I have felt happy to push things this far now.

That was a huge and very expensive upgrade - 20% more capacity because, some evenings, usage was hitting limits, a bit, maybe. It is hard to pin down as our stats are screwed up this month and last month by juggling lines between the LNSs.

Still, we are a small ISP and an extra 100Mb/s of WBC is not cheap, but we really do aim not to be the bottleneck, and if that is what we are telling you, we need to put our money where our mouth is, bite the bullet and have the bandwidth.

Previously the incremental upgrades worked well for a couple of weeks with no limits being trashed for a bit. I figured such a big step would really get us to zero dropped packets...

To my amazement, over the last few days, you have just grazed the limit at 600Mb/s. Not damping down the speeds of lines generally, but just hitting the limit.

I am shocked, and probably going to have to up the limit again by a big amount.

It really is quite a lot of extra usage. No - not planning on any more price increases just yet, honest. But still! just what the hell are you all downloading?!

Once this major upgrade work is all sorted I am really keen to be on zero packet loss for the links to BT for 21CN at least. 20CN is rarely an issue as we move people off it on an ongoing basis, and BE is not capped anyway.

Fun game this!

P.S. We started broadband services with 2Mb/s to BT and UUNET over 10 years ago now.

P.P.S Sorry, adding to my post - adding specific figures confuses the hell out of people. There is a huge difference between back-haul bandwidth and individual line speeds. An FTTH doing 100Mb/s can make a huge difference. What we are looking at is the sum of the small averages and statistics, so if this makes no sense to you - don't worry - you are not alone... :-)

TR069 progress

Well, having knocked up a TR069 server one morning last week, I have been trying to use it for real this week.

So far I am pulling my hair out. It seems there is no definition for the "user.ini" config file used on the router, so that is trial and error. The router config is not part of TR069, and down to router manufacturer.

However, main things we want to do are to be able to load a new config in to a router, and upgrade software if needed.

Well, first issue is the "isp.def" file. This is the factory reset default settings. We cannot make one file that works for FTTC/H and for DSL usage. So we are having to make routers that are specific to one or the other. Not too bad, but sometimes customers upgrade from one to the other.

That leads to the next problem, we cannot change the isp.def file. So if someone upgrades from DSL to FTTC, we can change the config (while on DSL) to work on FTTC, but we cannot change the factory default. If they ever reset the router it will go back to DSL. Arrrg. Please Technicolor make software that allows us to change the isp.def via TR069.

We may manage some work around with allowing ftp access by a config change and ftp of the new isp.def or something, but it is going to be a pain in the arse.

OK, fair enough, we'll cope. But what about config file updates. Well, it turns out there are two ways to do this it seems. We can either tell the router to download a new "user.ini" file from http/https server, or we can actually send the config (up to 32K) as a value in a "SetParameter" command. The latter simply does not work - nothing happens, no response! Hmmm.

OK, the Download option. Last week we tested download of new firmware and new config (user.ini) file. Both worked. I expected today to be plain sailing.
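For readers who have not seen one, a Download request from the server to the router is a small SOAP message. The Python sketch below builds one with the standard library. The element names, the FileType string and the cwmp namespace are written from memory of the TR-069 specification and should be checked against it; the URL and command key are made up for illustration, and the SOAP header (which normally carries a cwmp:ID) is left out for brevity.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"
CWMP_NS = "urn:dslforum-org:cwmp-1-0"  # assumed CWMP version; verify for your device


def build_download(url, file_type="3 Vendor Configuration File", command_key="cfg-1"):
    """Return a SOAP envelope asking the device to fetch a file from url."""
    ET.register_namespace("soap", SOAP_NS)
    ET.register_namespace("cwmp", CWMP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    download = ET.SubElement(body, f"{{{CWMP_NS}}}Download")
    # Argument order as recalled from the spec; empty strings mean "not used".
    arguments = [
        ("CommandKey", command_key),
        ("FileType", file_type),
        ("URL", url),
        ("Username", ""),
        ("Password", ""),
        ("FileSize", "0"),
        ("TargetFileName", ""),
        ("DelaySeconds", "0"),
        ("SuccessURL", ""),
        ("FailureURL", ""),
    ]
    for tag, value in arguments:
        ET.SubElement(download, tag).text = value
    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical config URL served over plain http
    print(build_download("http://acs.example.net/user.ini"))
```

A firmware upgrade would use the same message with a different FileType ("1 Firmware Upgrade Image" in the spec, as far as I recall).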

This week testing with real DSL lines and routers, not so good. The router is being thick! It will use the WAN and the assigned WAN IPv6 address to DNS lookup the TR069 server (A record); use the WAN IPv4 to talk to the TR069 server; Get the Download command; Use the WAN DNS again to look up the host part of that URL; and then screw up! It will send packets from a 192.168.x.x LAN NAT IP address (not even its own IP on the LAN?!?!) to try and fetch the file. Eventually it gives up and correctly uses the WAN to tell the TR069 server that it failed. Non NAT config works thankfully.

At one point it appeared to be talking from 192.168.1.153 to 127.0.0.1. We know because the RSTs it replied to itself were being sent down the DSL line (from 127.0.0.1 to 192.168.1.253). I mean, WTF!

We are hoping that it is somehow something we have done in the NAT user.ini file and can be fixed. We'll see what we can do. NAT strikes again!

Oh, and https does not work - possibly it does not like the certificate (cacert). That will need more testing! But http does work, so a workaround for now.

The good news on that is that we can tell it to FactoryReset, and if it has the right isp.def file, it will then talk to us in a default config, which does work to get the new config. That is a clue I think that we have messed up the config somehow for NAT.

Good news is A&A control pages now have a factory reset and config load button, yay! These only show if you are on our TR069 server. If you have one of these routers and are not, ask on irc during the day and we can switch you over.

Next steps after working around these quirks are to improve the config options - allowing people to set up the router as they want, and even basic firewall rules, from our control pages. Once we have the mechanisms working that should be simple and we can expand that to meet any customer demand for features if the router will do them.

Oh, and for me to publish the TR069 server - yes - still planning to do that.

That said I am training for next two days so not likely to be a lot of progress.

Are we nearly there yet?

Well, yes, I am hoping to get some nice pictures for you, but not yet.

We found that unbound was not playing properly so the new DNS resolvers were taken off line until fixed. All compiled from source now, and looks OK, so they are going back now.

We finally have the Ethernet hub in the new rack - so we can serve Etherflows from Maidenhead and London now. And we had our first Etherflow on line to London yesterday.

Yes, I said "hub", by which I mean the centre of spokes of Ethernet services from London to other sites around the country, and not what you thought!

We have a last couple of wholesale customers being a bit slow to move over to the new rack, but that has to happen this week or they go off line :-)

BT had a blip on one of the new host links, so we have everyone off that while they investigate.

The last big step is BT moving the old host links to the new rack, and we are just waiting on confirmation for that now.

We are nearly ready to turn off the old rack - much later than planned.

Friday, 17 February 2012

Take a small sip...

SIP will be fun...

For those that do not know, Voice over IP is mostly handled using a protocol called SIP. No, Skype is some strange shit and not the same at all :-)

There are other protocols but SIP has kind of won the day on this, and for a typical business SIP is about proper phone calls and phone numbers and using "Internet" as the connection medium for those calls.

Now we have done SIP and VoIP generally for some time. Oddly we kind of started in the telephony side selling mobiles and even ISDN switches 15 years ago. So this is turning full circle but with more modern technology than ISDN.

Now, we have used asterisk, and since then I did my own linux based SIP server. It was clever. It did not do media but passed on sdp negotiations between end points and worked as a proxy. Works well for our SIP VoIP services but we want to do more and better.

We have this really good hardware platform, well platforms in fact. The FireBricks. So the plan is to make them do SIP.

I started with the idea that the gigabit boxes (FB6000s) would be a core ITSP SIP gateway box and we would sell a few to larger SIP based telcos. But as the idea developed we realised that small businesses need SIP on their Internet gateway boxes.

So the idea is to put SIP in the whole range. The smaller FB2500 doing 100Mb/s Internet gateway and local SIP "server" for phones on the LAN. It would handle internal calls and route externally via a carrier (such as us, or many others).

Doing this bypasses all of the crap you get with NAT. NAT is evil! and screws up SIP in many ways. Those few with the right mix of SIP server, NAT gateway, STUN server and SIP phone that happen to play well enough are lucky. Mostly SIP and NAT is a nightmare, and even just SIP and firewalls is a challenge.

So making FireBricks do this is cool. It makes for a nice small office package.

Trick is, having done one SIP server from scratch (very much RFC based), I now know a lot of what really happens and how shit breaks. So the next version will be way better.

So, what can I say? - watch this space.

[now I have blogged I'll have to actually write this, but not for another week or two at the least]

Thursday, 16 February 2012

Refusing to fix an FTTC fault?

It is very frustrating when we are letting a customer down, and this is one of those occasions.

We have a customer with a Fibre To The Cabinet service (the same technology that BT Retail call "BT infinity"). His line has been broken since Sunday night - out of sync almost all of the time.

With FTTC, line faults are usually fixed very quickly and with no hassle. This is because the telco have their own kit (a VDSL modem) on the line. If that won't stay in sync then they can see it is broken, and as it is their modem it is clear that the fault is their responsibility. A fault like this normally has an engineer out right away, sometimes even same day, and it gets fixed. No arguments about whose fault it is or SFI charges or anything. Simples.

However, on this occasion, the telco are basically refusing to do anything. They have stated that they cannot take any further action, even though the line has been off for nearly a week and they have not taken any action so far. They have even taken to lying, stating the service has been restored which it has not been.


Clearly not acceptable. Clearly not working to their 40 hour fix targets. Yet nothing is happening. They even refused to take the case as a "High Level Escalation". They are simply not fixing the fault.

Apparently, their feeble excuse is a design flaw in their "systems" which stops them taking action when an order is open. Last week an order was put in to change the uplink speed on the line. Minor modification, not any physical change as far as we can see. But their system has the order "stuck" due to some mistake they made. It seems now that they think it acceptable to refuse to fix a fault in such cases. We have asked them where that is in the contract exactly.

In the mean time we have a customer with no Internet access for nearly a week and no way to get it fixed for him.

I am in Leeds until tomorrow, but I know my team are pursuing this as much as humanly possible - but what does one do when a company point blank refuses to fix a fault?

Wednesday, 15 February 2012

Insulin++

I have been injecting for a week now, once a day...

To be quite honest it is no more hassle than taking tablets every morning. The injector pen makes it simple and easy, and it is not painful. Mostly you cannot even feel the needle.

They start slowly, 10 units a day and increasing 2 units every 3 days, so I am on 14 units now. Blood sugar is not really improving much, but that is to be expected. Apparently I will not start to see much happening until on around 20 units.

However, already, I feel better. The tablets were not working well anyway, and just made me feel grotty, so simply being off them is a big help. I am also not feeling thirsty all the time, or as tired. So progress already.

In hindsight, I wish they had not bothered with tablets in the first place. It was just putting off the inevitable as diabetes runs in the family. I just did not really want to admit that I was getting old I guess. We all start falling apart eventually.

Thanks for the well wishes from friends and relatives.

Off to Leeds

We have the initial TR069 server, and are now working on integrating with the control pages. Overall, TR069 seems pretty simple, to be honest. Not sure what all the fuss has been about.

That said, I am off to Leeds until Friday, so probably not a lot of progress on this in my absence. I think everyone is far too busy mopping up the aaisp.net.uk to aa.net.uk change and are secretly hoping I don't have Internet access to break anything else this week.

Moving some more direct links over to new rack, and getting ready to close down the old rack - more mopping up...

Well done to my team for coping.

DNAME

In DNS there are many types of record, and one of the new ones is a DNAME.

It is a non-terminal substitution of one point in the DNS for another. Unlike CNAME which only relates to the specific record and not any below it.

As an example, we have a DNAME for aaisp.net.uk pointing to aa.net.uk

The principle is simple: anything.aaisp.net.uk is mapped to anything.aa.net.uk.

The name server serves the DNAME record, allowing a modern resolver to remember this mapping and not ask for other records knowing they are mapped. However the name server also serves a zero TTL transient CNAME mapping the specific entry being looked up. If it then knows that entry too it will return that record.

In theory any older resolver will see the CNAME and follow it. A newer one can cache and understand the DNAME. So it should be backwards compatible.
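To make the substitution concrete, here is a minimal Python sketch of the transient CNAME a name server could synthesise for a name that falls below a DNAME. It is an illustration of the mapping only, not how bind or any real server implements it; the function name and the tuple layout are made up for this example.

```python
DNAME_TYPE = 39  # numeric RR type of DNAME, the "type 39" an old resolver may complain about


def synthesise_cname(qname, dname_owner, dname_target):
    """Map qname through a DNAME, returning (name, type, ttl, target) or None.

    The DNAME only covers names *below* its owner, so the owner itself and
    unrelated names return None.
    """
    q = qname.rstrip(".").lower()
    owner = dname_owner.rstrip(".").lower()
    target = dname_target.rstrip(".").lower()
    suffix = "." + owner
    if not q.endswith(suffix):
        return None
    prefix = q[: -len(suffix)]  # the labels to the left of the DNAME owner
    # Zero TTL: the CNAME is transient, only the DNAME itself is worth caching
    return (q, "CNAME", 0, prefix + "." + target)


if __name__ == "__main__":
    print(synthesise_cname("wasteless.ec.aaisp.net.uk", "aaisp.net.uk", "aa.net.uk"))
```

Anything to the left of aaisp.net.uk is carried across unchanged, which is the difference from a plain CNAME on a single name.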

Interestingly we have found a snag. It seems some very old (and I mean very old) linux resolver libraries can't handle the DNAME. Specifically calls like gethostbyname barf at it. The clue is that the problem is logged in syslog. e.g.:-

Feb 15 08:37:26 a asterisk: gethostby*.getanswer: asked for "wasteless.ec.aaisp.net.uk IN A", got type "39"

Sadly there is no easy work around. Well, apart from the mopping up we are doing changing references to aaisp.net.uk to aa.net.uk. Maybe bind can be told to understand a DNAME but not serve it? Maybe it can be made to work out the resolver is old and not serve it. In fact, I think that is meant to be the case.

Anyway, we have cheated for now - pointing aaisp.net.uk at our own servers which now understand DNAME but don't serve it - they just serve the transient CNAME.

Annoying.

Tuesday, 14 February 2012

Starting TR-069

I am starting the TR-069 ACS project, at last.

We were somewhat thwarted by the fact we could not get the damn router to even try and connect. Turns out it won't follow a CNAME in the URL you give it - expects an A record. I have yet to see if an AAAA works. This is the main thing that led to us finally changing the domains around yesterday. I suspect we still have some mopping up on that this morning.

However, I have got as far as seeing that it is posting a SOAP encapsulated chunk of XML which looks very much like the specification says it should. I can also "poke" it using a specified port to convince it to do a "call back" to the ACS.

So, the plan at present is a single C code tool that runs as index.cgi under apache. I may add a server mode later that does listen and HTTP header stuff, but no need for now. Seems it will be fine under apache.

I will use (and hence publish) my XML library (a simple wrapper to parse, process and generate XML, using expat) and my SQL library (a simple wrapper around mysql, or in theory other SQL back ends). I use these in most of my tools.

My main aim is to handle the Technicolor routers and probably SNOM phones too. We may make FireBricks work as TR-069 clients too. There are loads of other things you can do with TR-069 it seems, but I suspect most people do not need more than the basic file transfer stuff.

Seems the first step is managing the parameters sent from the device. These will go in a database. Then I move on to sending and receiving files, such as config and code updates.

The plan is that the tool will have options allowing you to queue a file transfer, which will be stored in the database as a request, and do the necessary poke to the device so the transfer then happens. I'll decide more when I start coding this.
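As a rough idea of what "stored in the database as a request" could look like, here is a small Python sketch of a transfer queue. Everything in it is an assumption for illustration: the real tool is planned in C against MySQL, whereas this uses an in-memory SQLite database, and the table and column names are invented. The poke to the device is not shown.

```python
import sqlite3


def open_queue():
    """Create an in-memory database with a hypothetical transfer table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE transfer ("
        " id INTEGER PRIMARY KEY,"
        " device TEXT NOT NULL,"
        " url TEXT NOT NULL,"
        " state TEXT NOT NULL DEFAULT 'queued')"
    )
    return db


def queue_transfer(db, device, url):
    """Record a request; the caller would then poke the device to call back."""
    cur = db.execute("INSERT INTO transfer (device, url) VALUES (?, ?)", (device, url))
    db.commit()
    return cur.lastrowid


def next_transfer(db, device):
    """When the device connects, hand out its oldest queued request, if any."""
    row = db.execute(
        "SELECT id, url FROM transfer WHERE device = ? AND state = 'queued'"
        " ORDER BY id LIMIT 1",
        (device,),
    ).fetchone()
    if row is None:
        return None
    db.execute("UPDATE transfer SET state = 'sent' WHERE id = ?", (row[0],))
    db.commit()
    return row[1]


if __name__ == "__main__":
    db = open_queue()
    queue_transfer(db, "router-1", "http://acs.example.net/user.ini")
    print(next_transfer(db, "router-1"))
```

Keeping the request in the database means the web request that queues the transfer and the later session with the device do not need to share any state.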

So far zero lines of code, but watch this space.

Monday, 13 February 2012

Pondering ADR

Well, the case is being evaluated by the arbitrator, we assume. They do not acknowledge anything (emails, paperwork, etc) which is a tad rude. So we assume the case is in hand. We will see.

I would hope they detail what exactly is claimed, and allow us to defend that. If not, then something is very wrong. I have asked. I got no reply.

But I am left pondering how the outcome can be anything but bad.

1. They could come back and say they should not have taken the claim and give us back our £335+VAT claim fee. That would be the best outcome. Obviously we would invoice them the cost for paper, ink, postage and time in preparing a case file if that is the case, and take them to county court when they do not pay.

2. They could make an award against us for up to £272. I.e. what we have already credited. But if the claimant is claiming less than already given the case should not have been taken. That is clear. Obviously having credited £272 we would not have to pay the claimant, but if the arbitrator should not have taken the case that sounds like we need to take them to county court for our £335+VAT case fee and costs if that happens. After all, if the claim was settled, why did they accept the claim? It would be vexatious at the very least. Of course we will also amend our credit note to the claimant to match the lower amount of the award so they have to pay us more as well. The claimant owes us a number of later invoices not involved in the ADR claim.

3. They could make an award for more, well in fact any award. We did nothing wrong (in contract) and so under law and our contract terms we owe nothing. We never guaranteed any delivery date, and anyway we met what they asked for, so there is no liability at all. If they make an award then they have not taken in to account the law and our terms, as they are required to by their terms of reference.

This last one will be interesting as the contract we have with them requires us to adhere to their final decisions made in accordance with the terms of reference, as do the OFCOM general conditions. We cannot appeal the final decision.

But, if the final decision was not made in accordance with the terms of reference, as would be the case here since our terms are clear on limits of liability, then we don't have to accept it or pay. We don't have to appeal it, we simply do not have to pay.

That will be interesting. They would have to take us to court and show they had followed their terms of reference, which would be tricky.

So, overall, it is hard to see what possible outcome there can be to the ADR which does not involve us, and the arbitrators, in the county court!

I'll let you know how it goes!

Fun day

Well, first off we decided to do a long standing change to swap aaisp.net.uk and aa.net.uk. That did not go well. More at http://aa.net.uk/news-2012-02-dns.html.

But other than that things went well.

We worked out that normal BGP logic was not helping us on the route reflectors. If we had more than one external BGP box on the same subnet the next hop is on that subnet. If one box loses the link it still announces the subnet. That means a link failure could make a black hole. The answer is that all the edge routers need next-hop-self set to ensure we only route traffic to routers that have the working external BGP links. Fun!
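The black hole is easier to see with a toy model. The Python sketch below is a deliberate simplification, not a BGP implementation: it assumes two edge routers on one shared subnet with an upstream peer, that at least one external session is up, and that every edge keeps announcing the shared subnet internally whatever the state of its external link. All names and addresses are made up.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    name: str
    ebgp_up: bool  # is this router's external BGP link working?


def candidate_exits(edges, next_hop_self):
    """Return the edge routers an internal router might send external traffic to."""
    if not any(e.ebgp_up for e in edges):
        return []  # nobody has external routes to pass on
    if next_hop_self:
        # Each route names the advertising edge itself as next hop, and only
        # edges with a working external session advertise anything.
        return [e.name for e in edges if e.ebgp_up]
    # Next hop is the peer's address on the shared subnet. Every edge still
    # announces that subnet internally, so any of them can attract the
    # traffic, including one whose external link has failed.
    return [e.name for e in edges]


if __name__ == "__main__":
    edges = [Edge("edge1", ebgp_up=True), Edge("edge2", ebgp_up=False)]
    print("without next-hop-self:", candidate_exits(edges, next_hop_self=False))
    print("with next-hop-self:   ", candidate_exits(edges, next_hop_self=True))
```

In the model, edge2 stays a candidate without next-hop-self even though its link is down, which is the black hole; with next-hop-self it drops out.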

That change worked like a charm - no issues - no dropped packets.

We also moved over one of our special wholesale customers which involved complex routing table logic and RADIUS changes, and, well, just worked.

We now have all transit and all peering and at least one wholesale handover on the new rack! Yay!

Next are the remaining special interconnects and moving the old BT host link over. They have their ISDN pair in now, so should be this week.

Basically going well.

And I do apologise for the DNS issues today, but aa.net.uk rules now.

Friday, 10 February 2012

Taking the day off

Well, seems I forgot to book the night off too.

BE broke around 3am so I was up most of the night updating customers.

And right now, 6pm, I am coding changes to the FireBrick, having handled loads of emails and irc questions and all sorts.

I have not played WoW at all today, which was my plan. Arrrrg!

Bloody Barclays

Once again they cause havoc with their "fraud department". Drives me around the bend.

This time, having transferred a reasonably large sum, £X, in to the account using on-line banking, we try to spend £X using a chip'n'pin card, only to be "referred". This means the merchant calling up, and it took about half an hour in the end. Eventually they talk to us on the call and check a couple of Direct Debit payments and the code on the back of the card, and authorise it.

But what a waste of bloody time.

(a) If it was fraud, and the fraudster had access to on-line banking to do the transfer, then he has the details of Direct Debits and other transactions, so it is pointless to ask about Direct Debits we have.

(b) The card chip was used, so the real card, so checking the code on the back is pointless, it's not a copy.

What is worse, apparently, having authorised the payment by talking to us and confirming we were who we said we were, they block the card as a fraud prevention measure. WTF?

Anyway, Sandra went in and gave the manager of the local branch a bollocking, and rightly so. Not amused.

Wednesday, 8 February 2012

The needle

0.25mm x 5mm
Well, now I have something to put in the sharps bin...

Once a day now, but may need to have another one with meals. They start off slowly, increasing the dose every 3 days until I am happy it is working right.

I am impressed the needle is so fine that you cannot feel it.

Insulin injector pen

Pre-filled with insulin
Where the needle goes
The needle - in a nice safe sealed package
And a second safety cap inside
Dial up the dose
Looks harmless enough

Nearly there

Well, progress again today.

We now have all our transit and peering point connections on the new rack.

We have ADSL customers connected via the new rack, and platform RADIUS to carriers from it all working.

We have the RADIUS servers and DNS servers and syslog servers in there too.

We have our new Ethernet hub in there, but not yet tested a live connection.

The main thing left now is to get the direct connections to other ISPs moved, and we have the links in place ready to do that. The delay there is co-ordinating with each interconnect separately and moving at a convenient time with little or no disruption to their services.

Once the old rack is cleared out, we have to set up the other host link in the new rack (moving from old rack), and that will be it!

All good fun.

TR-069 ACS

Apart from all the upgrades we are doing, I do have other work on as well, and one of the projects that is quite urgent is making a TR-069 server.

I am reading through the spec now - and it is going to be fun.

Of course, there are existing servers, well, at least one, open source, but I do like to re-invent the wheel if I can.

The main thing is that it will be completely in C so should be quite efficient.

Why? Well, for a start, we have these TG582n technicolor routers, and they (a) support TR-069, and (b) have a fairly crap web interface for config. So the plan is to make our control pages for DSL have a whole router config page that allows customers to config their routers centrally and it update the router via TR-069.

Apart from allowing us to make a web config page for the router, and perhaps other routers, it will also allow for replacement routers to have the exact same config, and even for routers to get the config if factory reset.

Obviously we are all about giving our end users choice, and this will be optional. However, even for our more techie customers, this is likely to appeal I expect.

So, the question is, do I make this TR-069 server open source? I may. It does mean writing it slightly differently. If it was only an in-house tool it would be integrated in to our databases and systems very tightly. But it may be more fun to make it a general purpose tool.

If any of you are interested in this, do let me know, and the main features you are looking for.

We are looking for (a) ability to send a new config to the router (b) ability to upgrade firmware on the router, and that is pretty much it!

Tuesday, 7 February 2012

There's always one!

Lots of progress today.

We have LONAP (another peering point) connected on to the new rack.

We have BE wholesale connected as well. It did mean the FireBrick team working on changes in the TCP stack, BGP code, config and documentation to add a non standard forced TTL on BGP peers, but that done - it is working.

We should be able to move the direct connection customers from tomorrow.

Of course we found a fun one - with sessions spread over two LNSs there was, of course, a customer that did not work. It turns out they were the only remaining customer using an old feature called closed user group which allows restricted access between sites. It only works where the sites are on the same LNS else no connectivity at all. And they were not on the same LNS. It is simple to fix, and just means using a prefix on their user name, so sorted. But there had to be one exception, and I knew it would either be one of Mike's customers or one of Kev's. It was one of Mike's... Well done Mike. It is funny how you can guess who the unusual configuration lines are with :-)

Anyway, we are waiting on BT to put an ISDN line in for the old host link moving to the new rack (to monitor the fibres). We have one transit feed still to move, but that will be this week. There is a good chance that in a few days time we will have nothing actually using the old rack at all.

Then we'll turn the old rack off at least a day before we start pulling kit out, just to see who screams :-)

Getting there.

Well done team.

Monday, 6 February 2012

Silly set backs today

Well, lots of silly things.

One customer is logging in using upper case on one line and lower case on the other, so they are split between two LNSs, as the system uses a hash of the username to decide. Hence my script to check if we have any sites split over two LNSs wanted to kill their lines every 60 seconds. It is the only customer and they are changing their login. D'Oh!
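To illustrate the sort of thing going on here, a minimal sketch in Python. The hash function, the LNS labels and the session format are all assumptions for the example, not the real system. It shows why folding the case before hashing keeps both lines of a site together, and how a check for split sites might look.

```python
import hashlib
from collections import defaultdict

# Hypothetical labels for the two live LNSs.
LNS_NAMES = ["a", "b"]


def pick_lns(username, normalise=False):
    """Choose an LNS from a hash of the login (the hash here is an assumption)."""
    key = username.lower() if normalise else username
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return LNS_NAMES[digest[0] % len(LNS_NAMES)]


def split_sites(sessions):
    """Given (site, lns) pairs, return the sites seen on more than one LNS."""
    seen = defaultdict(set)
    for site, lns in sessions:
        seen[site].add(lns)
    return sorted(site for site, used in seen.items() if len(used) > 1)
```

Without case folding, "Office@Example" and "office@example" are different inputs to the hash and may or may not land on the same LNS. With `normalise=True` they always match.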

Our favourite telco have realised they did not put enough copper pairs to our rack! They need them for monitoring the fibres, but one lot is monitored using ADSL and one is monitored using ISDN. That will delay moving the old host links to the new rack.

We are getting our management LAN DSL installed, with another ISP, as you do. I won't say who, but how quaint: Paper order form. Paper DD form. So waiting for my signature in the office now. No IPv6. And this is not a small ISP...

We think we have sorted PI space customers now, but may need some more sophisticated changes to the source filtering on L2TP to manage it "neatly".

Of course, as blogged, arguments with one supplier wanting non standard BGP links. They are making an exception for now - thanks!

Sounds like most of the remaining jobs should be sorted in next few days, but nice to let things settle a bit in the mean time.

Behind schedule, as always, but we allowed lots of slack just in case.

Oh, and we are putting more bandwidth on 21CN. Lots of migrates from 20CN this month. By more I mean shit loads more... I am really trying to get my stats up to the full 100.0% no dropped packets, if I can.

So, a fun week to follow.

Non standard BGP

We have an interesting case with one of our carriers. Looks like we have worked around it for now, but it is rather odd as they are requiring non standard BGP TCP/IP in links to them.

They are requiring us to send all the BGP TCP packets with a TTL of 1.

What is interesting is that some of the big routers do indeed do this, but doing so is against the recommendations for TCP/IP which recommends a TTL of 64. BGP itself makes no mention of TTL. There are Internet standards that say the TTL must be at least the Internet diameter even. So naturally, our BGP does in fact follow this standard and uses a TTL of 64.

Of course, using TTL of 1 was a silly thing anyway as anyone could spoof BGP with a TTL of 1 by setting it to a suitable higher value, though if the reply was TTL of 1 they do not get far. The issue that came up is anyone can spoof a convincing TCP RST with a TTL of one and shut down BGP sessions. This problem is now recognised in other Internet standards which document TTL security where one sets a TTL of 255 and the far end checks it is still 255. Remotely spoofing a TTL of 255 is impossible without compromising the local routers somehow. So that works. Indeed our routers support TTL security.
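The difference between the two checks can be shown with a toy model in Python. This is only a sketch: it assumes each router on the path takes one off the TTL and that a packet whose TTL runs out is dropped, and the function names are made up for the example.

```python
MAX_TTL = 255  # largest value the 8-bit TTL field can hold


def ttl_on_arrival(sent_ttl, router_hops):
    """TTL seen by the receiver after router_hops routers; None if dropped."""
    remaining = sent_ttl - router_hops
    return remaining if remaining > 0 else None


def accept_ttl_1_rule(arrival_ttl):
    """The carrier's rule as described: only accept packets arriving with TTL 1."""
    return arrival_ttl == 1


def accept_ttl_security(arrival_ttl):
    """TTL security: only accept packets that still have a TTL of 255."""
    return arrival_ttl == MAX_TTL


def can_spoof(rule, router_hops):
    """Can a sender router_hops routers away pick a TTL that passes the rule?"""
    return any(
        rule(ttl_on_arrival(sent, router_hops))
        for sent in range(1, MAX_TTL + 1)
    )
```

In this model a sender five routers away passes the TTL 1 rule simply by sending with a TTL of 6, whereas nothing even one router away can arrive with 255.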

It seems this is some fire-walling rule, and to be honest I have never seen anyone fire-walling based on TTL. It is not clear what they are trying to protect against. They allow L2TP with normal TTLs.

A simpler firewall would be to do the same as other carriers and have access lists covering which of our IPs can talk to which of their IPs, and not fire-walling on TTL.

This is the same bunch that allowed MAC spoofing on PPPoE links to disrupt and even monitor other people's DSL lines. Thankfully they are fixing that.

It seems however they are insisting that any future services we buy must use this non standard BGP to connect to them.

It is very brave of them.

I guess following the standards is a key factor in deciding which suppliers we will use for new services.

P.S. FireBrick have, of course, modified the TCP stack and BGP and config on the FireBrick BGP routers to support this non standard mode of operation as well as standard TTL security.

Breaking new ground

There are a couple of completely new things that are being used in anger this week. We have done various tests last week, but this is for real now.

Firstly we are now running a dual live LNS system using two of the new LNSs. We have customers spread over the two LNSs based on their login. This means, in theory, bonded customers are on the same LNS. If not, then you get working service but not the bonded download. We are working on ways to pick up any that go wrong, and we think some BE lines ended up on the wrong LNS. A simple ppp-kill will move to the right one.

This is also the first time we have run the LNSs in the route-reflector so they see all other routes. They used to announce connections and use the core routers as a gateway - now they see all routes and can send via the right external gateways for outgoing traffic.

Both of these could have unexpected side effects. We have seen some with customers that have PI space (their own IP blocks).

So anyone with something odd, please let tech staff know and we can resolve the problems as they come up. We managed to sort one PI issue within minutes of being reported on a Sunday night.

In the mean time we'll get on moving the other transit, peering points and direct links over during this week.

Sunday, 5 February 2012

Upgrade progress (LINX and transit)

We have managed to move one transit and our LINX peering over now. We have all customers moved over on to the new LNSs, except the wholesale ones.

We even found why data SIMs were not showing graphs and sorted.

So has been a fun weekend.

Next week we get other transit feeds, other peering points, and a whole load of direct peering - which is going to take some co-ordinating.

The major jobs are sorted though, and all is looking very good.

You do then hit fiddly things like making sure nagios is watching the right boxes, and ensuring your cacti graphs are all running on the new boxes, and checking all the management LAN works, and the backup out-of-band access works, and the administration passwords are all set correctly with the right access lists. For the most part it is copy and paste, but you have to then test everything carefully just in case. A never ending set of silly little details.

At some point we want to go in there on a Sunday and check the dual power, which should be seamless. We also want to check that taking out a whole side of the network (turning off a switch) recovers. That will take some lines out for a few minutes we expect. We need to make a list of carefully defined tests and make sure people know we are buggering about.

Ideally, at some point, we should test turning off the whole power, and then back on, and seeing how quickly everything recovers. I am not sure if we will do that or not - it is a bit disruptive.

But if you don't test the contingencies they bite you when something does break.

We'll post details of what tests are being done when.

Abusing the system?

I blogged about abuse of MSO texts.
I blogged about people abusing ADR.

Did I ever expect to be woken by someone with an MSO text over an accounting error - paying me £3K by accident, and urgently needing it sent back so he can pay the right person. It wakes quite a few staff as it happens. It is about as far from a Major Service Outage as you can get.

And then apparently threatening that he could go to ADR* and cost me £350, twice, if I did not help. Vexatious? I very nearly ceased his bloody line on the spot when he said that.

I have sent the money back, before midnight (which was apparently his deadline to pay the right person).

I am at a loss for words to be honest. I don't know what to say. I am going back to bed now.

P.S. they are now talking on irc about using MSO texts to request pizza and taxis...

*The exact line was:-
23:35 < ydorg> having two ADR's could cost 700 quid

Saturday, 4 February 2012

Snow

Someone on irc says it is snowing. I say it is not. But to check I check the weather app on my iPad and look up where I am and it says "Light snow". Oh, apparently it is snowing.

Actually getting up and looking out of the window is very much an afterthought, and it confirms that there is indeed light snow here.

What has the world come to? If the weather app and the window had disagreed - which would I have believed? :-)

Friday, 3 February 2012

Sharps box

It is a bit of a sad day when you have to get one of these, but given that my mother has been diabetic for nearly 50 years now it is not surprising I now have one.

I also have a large box of needles and some "pens" of insulin. They are quite cunning - disposable pens with 300 units of insulin pre-filled. Dial up the dose, fit a disposable needle, close your eyes, grit your teeth, and push the button... Something like that anyway...

I get my training on this next week when I start daily injections.

Ho hum...

ADR

I have to post something - I am so stressed over this - if I post on here I will feel I have parked the problem, at least for now. Sorry it is long and boring...

This should be a case for putting on our web site as a major success story. It was a company wanting to stream live video from a site in London for coverage of the Royal Wedding last year. They wanted (rather adventurously) to do it on ADSL lines, needing something like 8Mb/s uplink. We managed to get the lines in with a FireBrick doing bonding so that they had around 9Mb/s upload, and all in time for the event. They streamed videos on the day. They even paid the bills (which is not cheap for 4 PSTN lines with ADSL2+ and annex M).

What made it even more of a story is that our favourite telco messed up the records on two of the lines which meant they could not accept an order for annex M (faster upload) as they did not have the line length details. It took Shaun a hell of a lot of work to get that sorted, and he knew he was working against the clock and he managed it. There are logs of him chasing our favourite telco at 2am in some cases. It was a real case of A&A staff going the extra mile, well over and above the contractual requirements to help a customer. All of the staff involved kept them informed the whole time and handled numerous questions and changes to dates and messing around very professionally and politely.

I was pleased, and the customer said they were happy with the service even.

Then we get a dispute from them - they think that they should not have to pay the install price for the lines because of the delay getting annex M on them.

What? That makes no sense. We installed the lines and got annex M and all in time. But they say they needed time to demonstrate to potential customers, and as they did not have time then did not sell as much streaming as they expected. It does not make a lot of sense as they say each stream was 2M, so they could have demonstrated as soon as they got the first 2 lines in long before the event. The delay was getting the final two annex M upgrades. Even so, them not getting customers is hardly our fault as we didn't agree an install date. In fact we make it a clear and explicit part of the contract that we don't guarantee an install date.

We explained that it made no sense. Was he saying he has losses that happened to be exactly the same as the line install costs? Anyway we were not in breach of contract, and anyway we exclude consequential losses even if we were. Sorry...

He starts spouting implied contract terms but cannot say what and where from, and that his claim is not entirely for breach of contract but cannot say what for. He even quoted Sale of Goods Act clauses relating to equipment sales, when he had not in fact complained about the equipment supplied in any way at any time, and clearly it worked as it should have. It really made no sense at all.

Even so, we did issue a good will credit for £272 which was various of the service costs before all 4 lines were working together. We did not have to, but we are nice like that.

He seemed confused by the credit. Anyway, the final email in the dispute was mine asking him to simply and clearly list exactly what he is claiming for, why, and how much. I hear no more.

Months later - The Ombudsman Service say they have a claim. The claim says we have been unhelpful and they lost business due to the delays, and states how much they have paid. It also says how much they calculate they should have paid which is £7 less. We told the Ombudsman that £7 was clearly a frivolous claim, and already more than settled, but they are going ahead anyway?!?!

Today we sent them the case file - something like 500 pages. Good luck!

ADR is Alternative Dispute Resolution. It is required for telcos like us to be part of such a scheme allowing people to take a dispute to an arbitrator instead of the county court.

This is our first case ever. No customer has taken us to ADR or court in nearly 15 years in business, and that alone is stressing me. We strive to provide good service, and to resolve disputes fairly ourselves. How can any case go as far as ADR?


The problem is, even if the arbitrator is sensible and sees we did not breach the contract and we actually bent over backwards to get the service they wanted in time in spite of serious problems (beyond our control) from our favourite telco, and as such there is no case to answer and no award... We pay £350 for that.

Yes, even a totally bonkers case and even if the arbitrator agrees it is a totally bonkers case... we pay £350.

What the hell!?!?!?!?!

We will have to see how the case pans out. In the mean time said company has paid none of their subsequent ongoing service bills, and so we are taking them to county court! Madness.


I think, certainly for business customers, we have to have a clause requiring them to pay us if they bring an invalid case to ADR. No idea if that is enforceable, but at least it is "fair". Maybe that would go to county court if I have that clause - I wonder how a county court judge would rule on the fairness of a clause forcing a "loser pays" arbitration system when, err, that is the system the county court operates... Hmmm.

P.S. had my annual diabetic review today (see other blog post) and for the first time my blood pressure was up - I was awake half the night just stressing on this whole thing being "wrong" so I am not surprised. It is not even the money - £350 is not an issue, obviously - it is the injustice that stresses me.

P.P.S. I forgot the "sound bite" type paragraph for ispreview to quote...

A&A have long had concerns over the whole ADR scheme, and this case just shows how it can be abused. A clear case of A&A falling over backwards to help a customer and go way beyond the agreed contract terms, and then having a £350 bill thrown in our faces. ADR is unfair - by definition - it is a "one-side pays regardless" arbitration scheme unlike the much cheaper county court small claims track where loser pays. We are even tempted to offer customers a scheme where we will pay their fees up front for taking us to court rather than ADR if they have a dispute, after all such a scheme would be a tenth of the cost in most cases.

Thursday, 2 February 2012

Two steps forward, one step back

Well, today has been interesting.

First thing was that we had a report of a /16 not routing to the Internet... The result was baffling and led to finding a rather obscure bug in BGP when using route reflectors (yes, Jon, OSPF OSPF OSPF, I know).

Basically, there are reasons to ignore a route - the RFC specifies these (cluster list showing our cluster, originid being us, etc). We do this. Good.

Sadly though we actually ignore the whole update, including the incidental withdraw prefixes in the same update. Bugger...
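A minimal sketch of the corrected ordering, in Python rather than the actual FireBrick code. The dictionary layout of the update and the routing table is an assumption made purely for illustration: the point is that withdrawals are applied first and unconditionally, and only the announced prefixes are skipped when the attributes show a reflection loop.

```python
def is_looped(attrs, my_router_id, my_cluster_id):
    """True if the route attributes show the route has come back round to us."""
    return (
        attrs.get("originator_id") == my_router_id
        or my_cluster_id in attrs.get("cluster_list", ())
    )


def apply_update(rib, update, my_router_id, my_cluster_id):
    """Apply one update to a toy routing table (a dict of prefix -> attrs)."""
    # Withdrawn prefixes do not depend on the path attributes, so they are
    # always honoured, even if the announcement below is ignored.
    for prefix in update.get("withdrawn", ()):
        rib.pop(prefix, None)

    attrs = update.get("attrs", {})
    if is_looped(attrs, my_router_id, my_cluster_id):
        return rib  # ignore only the announced prefixes

    for prefix in update.get("announced", ()):
        rib[prefix] = attrs
    return rib
```

The bug described above is equivalent to doing the `is_looped` test before the withdrawal loop and returning early.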

So upgrading around 15 boxes during the day, and I am pretty sure without losing a packet - win! - we have that fixed, and all seems fine.

Now to start seriously moving stuff over. Seems a visit to site needed - one cable showing unplugged?!?; A DSL router to install (backup management LAN); and some nice environmental sensors to install. That will be tomorrow.

DNS resolvers all working - linked in to route reflectors as local versions of our published resolvers. In fact everything now linked to two core route reflectors. Yay!

Tonight I started allowing lines to new LNSs as a test - i.e. any lines that reconnect were sent to new LNSs. We had tested a lot. We got Be, BT 20CN and BT 21CN on line and working... Good!

Then a snag - at least one wholesale L2TP customer did not route back to us on the new LNSs. Some worked, some did not. So job for tomorrow is chase them all to ensure routing all in place and allowing new LNS IP addresses through firewalls, etc. Fun!

So lines back to existing LNSs for now. If we can sort that tomorrow we can move everyone at the weekend.

We will probably set up at least one transit and one peering link on new kit tomorrow as well. Should be pretty simple and low risk (we always say that).

Still - progress...

Well, someone has to test it

I was thinking that the blog is not a bad way to explain a bit about how the network upgrade is going at A&A. We have the status pages, which are fine (well, maybe not, they need some work), but I can probably say a bit more here...

So, where are we?

The good news is that this is all just happening. The crew working on this are actually doing a good job planning and designing and, well, making it happen. They have some key deadlines they are working to, but so far everything is going pretty well. I almost feel like a director for a change, rather than an engineer. Not sure if that is scary or good.

The fun with the network this week was rather unfortunate, but yesterday we did set up the pair of route reflectors and connected almost everything up (couple of DNS servers to go). What we did do is connect the existing routers and LNS to them as well. This means we have one core network (over the old and new racks) and can start moving things.

One of the main things was testing the new link to our favourite telco. This is the primary reason for all of this - so that we can operate more than a gigabit of traffic. Initially we will be running with four gigabit fibres in to them and up to two gigabit of traffic. The load can use any of the four links in any combination which nicely allows us to run three live LNSs at well below capacity and have a fourth as a backup in case any fail. Right now we have less than a gigabit of traffic and everything can run through one LNS. The trick is to make sure that any one failure, and ideally even two failures, do not push any link or any router over capacity. The new rack can expand with more LNSs and routers to around 4 gigabit of traffic before we have to rethink things and that is probably quite a few years of expansion.
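The failure arithmetic can be checked with a few lines of Python. This is a rough sketch that assumes traffic spreads evenly over whichever links are still working, which real load sharing will only approximate. The function names are made up for the example.

```python
def load_per_link(total_gbit, links, failed):
    """Traffic per surviving link, assuming an even spread (a simplification)."""
    working = links - failed
    if working <= 0:
        raise ValueError("no working links left")
    return total_gbit / working


def within_capacity(total_gbit, links, failed, link_capacity_gbit=1.0):
    """True if no surviving link is pushed over its capacity."""
    return load_per_link(total_gbit, links, failed) <= link_capacity_gbit
```

With two gigabit of traffic over four gigabit links, one failure leaves each remaining link at about two thirds full, and two failures leave the remaining two exactly full with no headroom. A third failure is over capacity.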

So, new host link works, and Paul has been testing his home line last night. He found and fixed a few MTU issues, but yes, it works! Well done.

But right now he has a whole rack, ten FB6000 series gigabit routers, two gigabit fibre links to the telco, and his one FTTC home line using it.

Of course, whilst this does seem like overkill, it makes no actual difference. Things go as fast as his line, as normal... We do, after all, aim not to be the bottleneck. Just amusing to think of all of that infrastructure and capacity for one line.

But it means we can do a simple LNS switch to move customers over, and get the old host link moved to the new rack so we have all four gigabit feeds. It looks like we have managed to get links via different floors (above and below us) in to the telco as well, which is good for redundancy. We also have them via different technologies (WES and EAD) so different termination kit in the rack. All of the kit in the rack is dual power and there are separate incoming power feeds.

The other good news is that this new rack also has a link for Ethernet customers. That means we can offer Ethernet via London and Maidenhead for even more redundancy. That is a link to be tested soon as well.

I'll post more on here as we make progress - but this week is key. From now on it is basically plugging things in and moving things over, and testing testing testing.

As for host names, there are a few changes. We are keeping the telco links (LNSs) as gormless (can't think of a better name), though there is a/b/c/d of them now. We are changing the edge routers from armless to aimless, an old name we used to use for edge routers, and there are a/b/c/d of them too. We have Ethernet edge routers which are core route reflectors called weightless. We still have doubtless and careless used for testing, direct L2TP and data SIMs.

The good news is that FB6000's use under 30W when running flat out, so no issues with power usage and keeping things cool. It is a very "green" network that we run.

Watch this space...

Wednesday, 1 February 2012

Ooops

Well, what can I say - sorry to customers for the blip Tuesday evening. In fact there were a few "issues" in the late afternoon and then something of a more major "blip" lasting around 15 minutes just before 7pm.

So time to 'fess up as to what actually happened. It was us this time.

As per planned work notices we are in the middle of a major network upgrade. We have 10 shiny new routers/LNSs in the new rack and we are gradually moving things over. We want to ensure we are not the bottleneck and this means a bigger network that goes over a gigabit on various links.

One of the first steps is bringing these new routers on to our existing network. This means establishing some internal BGP links. Once this is done we can move various of the external links from one part of the network to the other in controlled steps. We are using IBGP not OSPF for various historical reasons and to date it has done what we want perfectly - we understand BGP quite well (or so we thought).

However, the main downside of IBGP is you have to mesh all of your routers. Not a problem when you have 4 of them, but when you have 10, and when connecting to the existing 4, that is a lot of BGP sessions. This is why internal routing protocols like OSPF win in such cases.

However, not a problem, we'll use route reflectors - they allow internal routing to be relayed within the internal network avoiding having to fully mesh the routers.
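To put numbers on that, a quick sketch in Python. The full mesh count is just n(n-1)/2. The reflector figure assumes a pair of reflectors with every other router a client of both, which is one possible layout rather than a statement of the exact design.

```python
def full_mesh_sessions(routers):
    """IBGP sessions needed to fully mesh the given number of routers."""
    return routers * (routers - 1) // 2


def reflector_sessions(routers, reflectors=2):
    """Sessions when every other router peers only with each reflector."""
    clients = routers - reflectors
    return clients * reflectors + full_mesh_sessions(reflectors)
```

So 4 routers need 6 sessions, but 14 routers (the 10 new ones plus the existing 4) would need 91 in a full mesh, against 25 with a pair of reflectors.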

This is where the fun starts. Even though we have people working on this that have used BGP before joining A&A, and even though I coded the BGP in the routers myself, carefully following the RFCs, including route reflector logic, we have not actually used route reflectors in anger before.

Well, now we know - the trick is not to make a loop of route reflectors. The problem you get is a route gets injected in to this loop and then it sticks. Even if you withdraw the original announcement the loop sees its own copy (reflection) of that announcement from another route reflector and so keeps announcing it. They also tell your edge routers about the route!

To add to the fun, if you have anything not set up right in setting the next hop, you can end up with routes that go to places that don't know what to do with them (black holes).

Re-reading the RFCs this is actually quite simple, and the next test will follow the guidelines somewhat better. We will not have a loop of route reflectors but a pair of them, and the edge routers will be normal IBGP to them. This will allow the redundancy, simplicity and scalability that we want. We have fixed the next hop set up as well.

To be honest, this was a silly mistake, and one we won't make again. The impact was the minor issues in the afternoon. The actual issues were very hard to pin down as they meant some routes were broken and some were iffy (taking the wrong path in some way) but over all traffic levels stayed the same so clearly not a major issue generally. Thanks to the customers that reported what they were seeing.

Then we come to the bigger outage of 15 minutes or so. This was part of simply tidying up after the earlier problems and making the routing configs consistent. Again, a very low risk activity. We are still trying to get to the bottom of that though as it should not have caused an issue. The fix was "have you tried turning it off and then back on again" in that we reset the LNS completely, clearing all of the BGP sessions and starting from scratch. One of the jobs we still have to do this morning is trawl the logs to find why things got messed up. I would love to have spent more time tracking the problem as it happened, but getting things working was somewhat more important. The symptoms were damn strange as sessions appeared to start up but have issues with RADIUS, even though RADIUS was apparently working and there was no apparent reason for the sessions to have gone down in the first place. The reset meant we lost graphs for the day, which is always a nuisance.

Anyway, today's job will be carefully planning the next stages and deploying them very carefully and slowly.

Of course, and I am sure some customers will be asking this, why the hell is this not done at 3am on a Sunday morning or something? Well, yes, if this was work that was going to take out service, it would be. This is, however, work that should not actually impact service at all - it is very routine low risk stuff. It is also a case where the impact of something not being right is hard to see. If we had done this over night it would not be until 9am when a few customers say there is some "odd routing" that we would find this issue - everything looked fine when we did it!

In general we find between 5pm and 6pm to be a good time for some of this "at risk" work as it is after most business customers are finished (not all, we know), but is before the home users start (mostly) and at a time when people are still around to tell us if they can see anything not quite right.

Over night work is ideally suited to cases that take out part of the network - where the work is simple mechanical stuff - moving cables and the like - where those working on it can see they have done the job right immediately and there is nothing new. Telco work that takes out network links is scheduled for over night for this very reason.

The end result of all of this will be much more capacity in our network, and some major increases in bandwidth to our favourite telco... So sorry for the inconvenience, and we really will try not to break it like this again. Thanks for your patience.

P.S. it does seem odd not blaming our favourite telco for something. After all, over the last week we have seen BRASs reset and take out services for hundreds of customers for similar periods, but we are all kind of used to that...

P.P.S. A simple loop of route reflectors is not enough to break things - you need an ordinary IBGP link in between to lose the cluster ID.
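A toy model of that last point, in Python. It takes the P.P.S. at face value and assumes the ordinary IBGP hop passes the route on without its cluster list. The names and data layout are invented for the example and this is not how any real router stores routes.

```python
def reflect(route, cluster_id):
    """A reflector drops a route carrying its own cluster ID, else prepends it."""
    if cluster_id in route["cluster_list"]:
        return None  # loop detected, route ignored
    return {**route, "cluster_list": [cluster_id] + route["cluster_list"]}


def plain_ibgp_hop(route):
    """Assumed behaviour of the ordinary hop: the cluster list is lost."""
    return {**route, "cluster_list": []}


def walk(route, hops):
    """Pass a route along a list of hops: a cluster ID, or "plain"."""
    for hop in hops:
        route = plain_ibgp_hop(route) if hop == "plain" else reflect(route, hop)
        if route is None:
            return None
    return route
```

Going rr1, rr2 and back to rr1 the route is dropped, as rr1 sees its own cluster ID. Put a "plain" hop before the return to rr1 and the route is accepted again, which is the sticky loop described above.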