Routing security incidents happen: for many network operators probably at least once a month, and perhaps close to 5% of those incidents have real, negative network impacts. Although the routing system as a whole tends to be pretty quiet, some networks occasionally have really bad days. These are some of the results from the Routing Resilience Survey Report, which we just released from our pilot project to analyze routing security incidents and their real-world impacts.
At the end of 2013 we launched, in partnership with BGPmon, a pilot project called the Routing Resilience Survey. As I explained in a November 2013 blog post announcing the project, the effort was focused on collecting incident data related to routing resilience from operators’ points of view. This approach allowed us not only to filter out false positives – for instance, legitimate configuration changes – but also to record the impact and severity of the incidents.
The pilot ran for a little more than six months, from November 2013 until June 2014, with 30 operators from Tier 1, Tier 2/3, cloud and content delivery networks, enterprises, and other types of ISPs from all around the world. Because we allowed the participants to also classify events related to their customers’ networks, the Survey represented 239 autonomous systems in total.
Over the course of the project, we collected more than 2000 potential routing security incidents!
Each week, BGPmon sent participants a summary of the potentially harmful events its monitoring system had detected for their networks, and participants were asked to respond. Were these legitimate incidents, and what were their impacts? Was it a monitoring system or a customer call that alerted the operator to the incident, or did it go unnoticed? These were some of the questions we asked in order to classify the events.
Full study findings are presented in the report at https://dev.internetsociety.org/resources/doc/2014/routing-resiliency-survey-report/
The results are not surprising, but they can help us understand some of the challenges to global deployment of routing security protocols. The main conclusions of the report are:
- Incidents with real impact are rare.
- There is a high percentage of false positives.
- Incidents are fixed quite quickly.
More specifically, we found that the routing security problem is not perceived as critical by many network operators. It is very interesting to see the difference in impact assessment between an operator’s and an external observer’s point of view. And quite frankly, neither is probably right – the truth is most likely somewhere in between. According to the report, “while network operators are aware of the vulnerabilities of the routing system, risks associated with them are perceived as low. In such circumstances, reactive measures seem to be more appropriate and proactive protection is deployed only if it has low operational costs associated with it.”
However, relying only on reactive measures definitely does not offer sufficient protection. Many, if not most, routing incidents happen outside the operator’s control, and resolving the incident often requires help from other network operators.
As we have stated plenty of times, deploying simple routing resilience measures like those outlined in the “Mutually Agreed Norms for Routing Security” (MANRS) document can help your network and many networks around you. If network operators around the globe deploy these measures collectively, they can help the entire Internet.
Please read the report and let us know what you think!