We have had some issues, both last night and, on and off, for a few hours during the day today. It looked a bit like a denial of service (DoS) attack via LINX, but it seems it was not actually intentional!
Basically, somehow, a major content provider with which we peer suddenly thought we were a transit route for a chunk of their traffic and flooded our peering, hammering some of our transit to get the traffic to its actual destination (another country even!).
We've been working with them to try and understand and fix this. We are very sure we are not announcing someone else's blocks by mistake to them. They confirmed they could not see any layer 3 route to us for the traffic. So nothing looks wrong! But they are sending the traffic and it was enough to cause packet loss on LINX for us. Best guess is a hardware issue on their router.
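As an aside, the "are we announcing someone else's blocks" check is the sort of thing that can be scripted. What follows is only a minimal sketch, not what we actually run: the function name `foreign_announcements` and both prefix lists are made up for illustration (the prefixes are reserved documentation ranges), and a real check would take the announced list from the router's own configuration or BGP output.

```python
import ipaddress


def foreign_announcements(announced, allocated):
    """Return announced prefixes that are not inside any of our own allocations."""
    own = [ipaddress.ip_network(p) for p in allocated]
    stray = []
    for prefix in announced:
        net = ipaddress.ip_network(prefix)
        # subnet_of() raises TypeError across IP versions, so compare versions first.
        covered = any(
            net.version == block.version and net.subnet_of(block)
            for block in own
        )
        if not covered:
            stray.append(prefix)
    return stray


# Hypothetical data: documentation prefixes, not real allocations.
ALLOCATED = ["192.0.2.0/24", "2001:db8::/32"]
ANNOUNCED = ["192.0.2.0/25", "2001:db8:1::/48", "198.51.100.0/24"]

print(foreign_announcements(ANNOUNCED, ALLOCATED))
```

An empty result would support the "we are not announcing anything we should not" conclusion, though it says nothing about what the other side's router is actually doing with the traffic.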
I'll update more on here when we get to the bottom of it - we have had to shut down the peering, meaning no diagnostics can really be done to find the underlying cause, which is a nuisance. At some point we have to re-establish peering and do testing to confirm whether it is fixed.
Even with peering down, we are still seeing bursts of traffic, although dropping the peering has helped, oddly. Traffic still arriving with the peering down means it is some sort of layer 2 (Ethernet) issue rather than a routing one. At least, when there is traffic, we stand a chance of diagnosing the problem.
Whilst packet loss on a major external link is a problem, it does not usually have that much of an effect on normal access to web pages, email, etc. For a start, it is only the one link, and we have many. But it does have quite an impact on VoIP services, which mostly go over peering links. We immediately redirected some of our VoIP routing to try and avoid this, but some calls were still via LINX and suffering break-up in audio.
Obviously we need to have a serious look at ways we can cater for this sort of issue in future - ultimately we have very little control of things going wrong "in the Internet", and it is almost impossible to pre-plan every possible contingency. Nonetheless, we will try and learn from this.
I picked a good day to be off sick!
Update: I would like to thank my engineering team (Paul, Andrew, Jimi) for working on this all day and on into the evening in their own time, and the guys from LINX as well. It seems the exchange itself (LINX) is not at fault in any way, which is good news. Some additional steps with the route server do seem to have stopped the bursts of traffic as of around 5pm Friday, and we intend to leave things like this until the peer can investigate further.
Update: When another LINX peer suffering the same issue contacted the offending peer this morning (Saturday), they immediately reset the card facing LINX and fixed it. One wonders why they would not do that when we reported it yesterday, shame. It does however confirm this was not "just A&A" being affected.
This is why I don't have a default route or full routing table on my LINX routers, so at least the traffic would not go anywhere. I used to log traffic not to networks I announced so I could see if people were trying it on...
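A rough sketch of that kind of logging, assuming the destination addresses have already been pulled out of a packet capture or flow export. The function name `transit_attempts` and the sample data are hypothetical, and the addresses are documentation ranges rather than anything real.

```python
import ipaddress


def transit_attempts(destinations, announced):
    """Return destination addresses that fall outside every announced prefix."""
    nets = [ipaddress.ip_network(p) for p in announced]
    unexpected = []
    for dst in destinations:
        addr = ipaddress.ip_address(dst)
        # Membership is simply False when address and network versions differ.
        if not any(addr in net for net in nets):
            unexpected.append(dst)
    return unexpected


# Hypothetical data: prefixes we announce and destinations seen arriving on the peering port.
ANNOUNCED = ["192.0.2.0/24", "2001:db8::/32"]
SEEN = ["192.0.2.17", "203.0.113.9", "2001:db8::53", "198.51.100.200"]

for dst in transit_attempts(SEEN, ANNOUNCED):
    print("not ours:", dst)
```

Anything this flags is traffic a peer is sending for networks that were never announced to them, which is the pattern described in the post.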
Thanks for the detail, hoping this was the cause of my poor VoIP quality on Friday.
With regards to your second update, is it possible that you were the first to notice, and then another LINX peer also noticed on Saturday and contacted the offender?
When only one LINX peer is affected, it's believable that it's the peer at fault in some weird and wonderful fashion; when two peers are affected, you know it's you at fault.