We have had some issues, both last night and, on and off, for a few hours during the day today. It looked a bit like a denial of service (DoS) attack via LINX, but it seems it was not actually intentional!
Basically, somehow, a major content provider with which we peer suddenly thought we were a transit route for a chunk of their traffic. That flooded our peering and hammered some of our transit as the traffic made its way to its actual destination (another country even!).
We've been working with them to try and understand and fix this. We are very sure we are not announcing someone else's blocks by mistake to them. They confirmed they could not see any layer 3 route to us for the traffic. So nothing looks wrong! But they are sending the traffic and it was enough to cause packet loss on LINX for us. Best guess is a hardware issue on their router.
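The check that announcements stay within one's own address blocks can be sketched roughly as below. This is an illustration only, not the tooling actually used here: the function name `unexpected_announcements` is hypothetical, and the prefixes are reserved documentation ranges rather than real allocations.

```python
import ipaddress


def unexpected_announcements(announced, own_blocks):
    """Return announced prefixes that are not inside any of our own blocks.

    Both arguments are lists of prefix strings such as "192.0.2.0/24".
    (Hypothetical helper for illustration only.)
    """
    own = [ipaddress.ip_network(block) for block in own_blocks]
    suspicious = []
    for prefix in announced:
        net = ipaddress.ip_network(prefix)
        # subnet_of() raises TypeError on mixed IPv4/IPv6, so compare versions first.
        covered = any(
            net.version == block.version and net.subnet_of(block)
            for block in own
        )
        if not covered:
            suspicious.append(prefix)
    return suspicious


# Example data uses documentation prefixes only (assumed, not real allocations).
own_blocks = ["192.0.2.0/24", "2001:db8::/32"]
announced = ["192.0.2.0/25", "2001:db8:1::/48", "198.51.100.0/24"]

print(unexpected_announcements(announced, own_blocks))
```

An empty result from a check like this is what "not announcing someone else's blocks" would look like; anything listed would deserve a closer look at the export policy towards that peer.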
I'll update more on here when we get to the bottom of it - we have had to shut down the peering, meaning no diagnostics can really be done to find the underlying cause, which is a nuisance. At some point we will have to re-establish peering and do testing to confirm whether it is fixed.
Even with peering down, we are still seeing bursts of traffic, though dropping the peering has helped, oddly. Traffic still arriving with peering down suggests it is some sort of layer 2 (Ethernet) issue rather than a routing one. At least, when there is traffic, we stand a chance of diagnosing the problem.
Whilst packet loss on a major external link is a problem, it does not usually have that much of an impact on normal access to web pages, email, etc. For a start, it is only the one link, and we have many. But it does have quite an impact on VoIP services, which mostly go over peering links. We immediately redirected some of our VoIP routing to try and avoid this, but some calls were still going via LINX and suffering break up in audio.
Obviously we need to have a serious look at ways we can cater for this sort of issue in future - ultimately we have very little control of things going wrong "in the Internet", and it is almost impossible to pre-plan every possible contingency. None the less, we will try and learn from this.
I picked a good day to be off sick!
Update: I would like to thank my engineering team (Paul, Andrew, Jimi) for working on this all day and on into the evening in their own time, and the guys from LINX as well. It seems the exchange itself (LINX) is not at fault in any way, which is good news. Some additional steps with the route server do seem to have stopped the bursts of traffic as of around 5pm Friday, and we intend to leave things like this until the peer can investigate further.
Update: When another LINX peer suffering the same issue contacted the offending peer this morning (Saturday), they immediately reset the card facing LINX and fixed it. One wonders why they would not do that when we reported it yesterday, shame. It does however confirm this was not "just A&A" being affected.