What caused today’s Internet hiccup
Like others, you may have noticed some instability and general sluggishness on the Internet today. In this post we’ll take a closer look at what happened, including some of the BGP details!
At around 8am UTC, Internet users on various mailing lists, forums and Twitter reported slow connectivity and intermittent outages. Examples can be found on the Outages mailing list, on company support sites such as Liquid Web, and of course on NANOG.
How stable is the Internet
So how do we know if the Internet was really unstable today? One way to answer this is to look at the outages visible in BGP over the last 12 months. On average we see outages for about 6,033 unique prefixes per day, affecting on average 1,470 unique Autonomous Systems. These numbers are global averages and it’s worth noting that certain networks or geographical areas are more stable than others.
If we look at the number of outages detected by BGPmon today, we see outages for 12,563 unique prefixes affecting 2,587 unique Autonomous Systems. This is well above the daily average, and indeed both the unique prefix count and the unique Autonomous System count are the highest we’ve seen in the last 12 months. Based on this data it is fair to say that the Internet did indeed suffer from more hiccups than normal.
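To put those figures side by side, here is a minimal sketch that divides today’s counts by the 12-month daily averages quoted above. The variable and function names are illustrative assumptions, not part of any BGPmon tooling.

```python
# Figures quoted in this post; the names below are illustrative only.
AVG_PREFIXES_PER_DAY = 6_033   # 12-month daily average, unique prefixes
AVG_ASNS_PER_DAY = 1_470       # 12-month daily average, unique ASNs
TODAY_PREFIXES = 12_563        # unique prefixes with outages today
TODAY_ASNS = 2_587             # unique ASNs affected today


def ratio_to_average(today: int, average: int) -> float:
    """Return how many times larger today's count is than the average."""
    return today / average


prefix_ratio = ratio_to_average(TODAY_PREFIXES, AVG_PREFIXES_PER_DAY)
asn_ratio = ratio_to_average(TODAY_ASNS, AVG_ASNS_PER_DAY)

print(f"prefixes: {prefix_ratio:.2f}x the daily average")
print(f"ASNs:     {asn_ratio:.2f}x the daily average")
```

In other words, roughly twice the usual number of prefixes and about 1.76 times the usual number of Autonomous Systems saw an outage.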
The graph below shows the number of Autonomous Systems that were affected by an outage over just the last month. Note that most outages occur on weekdays and that the Internet is in fact much more stable during weekends and holidays.
What caused the instability
Folks quickly started to speculate that this might be related to a known default limitation in older Cisco routers. These routers have a default limit of 512K routing entries in their TCAM. The BGP routing table size depends on a number of variables. The primary contributors are Internal, or local, routes (more-specific and VPN routes used only in the local AS) and External routes. The External routes are the routes of everyone else on the Internet; we sometimes refer to this as the global routing table or Default Free Zone (DFZ). For most networks on the Internet today the global routing table is the major contributor, while local routes vary from a few dozen to a few hundred. Only a small percentage of networks have a few thousand Internal routes.
The size of the global routing table today (August 12, 2014) is about 500k prefixes; this number varies marginally per provider. Let’s compare a few major ISPs. The table below shows the number of IPv4 prefixes received per provider on a full-feed BGP session.
| Provider | Location | Number of IPv4 routes on full BGP feed |
The table above shows that the numbers differ slightly depending on the provider and location, but are indeed very close. It also shows that right now the number of prefixes is still several thousand below the 512,000 limit, so it shouldn’t be an issue. However, when we take a closer look at our BGP telemetry, we see that starting at 07:48 UTC about 15,000 new prefixes were introduced into the global routing table.
Note that the timestamps of the introduction of these new prefixes overlap with the reports of hiccups on the Internet today. The picture below shows the number of prefixes received by a Level3 customer in Chicago: the prefix count peaked at about 515,000 prefixes at around 8am UTC and then dropped back to 500,000 prefixes. It is also clearly visible that the 15,000 new prefixes were only present for a few minutes.
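The arithmetic behind that peak is simple, and a rough sketch makes the margin explicit. It uses the approximate figures from this post and takes the limit as 512,000 entries, as quoted above; the constant and function names are assumptions made for illustration.

```python
# Approximate figures from this post; names are illustrative only.
TCAM_LIMIT = 512_000          # default routing-entry limit as quoted here
BASELINE_PREFIXES = 500_000   # approximate global table size before the event
NEW_PREFIXES = 15_000         # approximate burst of new prefixes at 07:48 UTC


def overflow(table_size: int, limit: int = TCAM_LIMIT) -> int:
    """Return how many routes exceed the limit (0 if the table fits)."""
    return max(0, table_size - limit)


headroom_before = TCAM_LIMIT - BASELINE_PREFIXES
peak = BASELINE_PREFIXES + NEW_PREFIXES

print(f"headroom before the event: {headroom_before}")
print(f"peak table size:           {peak}")
print(f"routes over the limit:     {overflow(peak)}")
```

With about 12,000 entries of headroom, a burst of 15,000 prefixes is enough to push a full-feed table past the default limit, even before any Internal routes are counted.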
What caused the increase in prefixes
Now that we know that there was indeed an increase in prefixes, it is time to look into where these prefixes came from. Looking at our data, we quickly see that the new prefixes being announced at that time were almost all originated by the Verizon Autonomous Systems 701 and 705. All of the new routing entries appear to be more-specific announcements for their larger aggregate blocks. For example, BGPmon detected 170 more-specific /24 routes for the larger 18.104.22.168/16 block.
So whatever happened internally at Verizon caused aggregation for these prefixes to fail, which resulted in the introduction of thousands of new /24 routes into the global routing table. This caused the routing table to temporarily reach 515,000 prefixes, and that caused issues for older Cisco routers.
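To illustrate what a more-specific announcement looks like in practice, the following sketch uses Python’s standard `ipaddress` module to pick out prefixes that fall inside a larger aggregate. The prefixes are made-up private and documentation addresses, not the Verizon blocks, and `more_specifics` is a hypothetical helper rather than anything BGPmon actually runs.

```python
import ipaddress


def more_specifics(aggregate: str, announced: list[str]) -> list:
    """Return announced prefixes that are strictly inside the aggregate."""
    agg = ipaddress.ip_network(aggregate)
    found = []
    for prefix in announced:
        net = ipaddress.ip_network(prefix)
        # subnet_of() raises TypeError across IP versions, so check first.
        if net.version == agg.version and net != agg and net.subnet_of(agg):
            found.append(net)
    return found


# Hypothetical example data: one aggregate plus a mix of announcements.
aggregate = "10.20.0.0/16"
announced = [
    "10.20.0.0/16",   # the aggregate itself, not a more-specific
    "10.20.1.0/24",   # more-specific of the aggregate
    "10.20.2.0/24",   # more-specific of the aggregate
    "10.21.0.0/24",   # outside the aggregate
    "192.0.2.0/24",   # unrelated documentation prefix
]

found = more_specifics(aggregate, announced)
print([str(net) for net in found])

# A single /16 can be de-aggregated into at most 2**(24 - 16) /24 routes.
max_24s = 2 ** (24 - 16)
print(max_24s)
```

Since one /16 can break up into as many as 256 /24 routes, failed aggregation across a few dozen large blocks is enough to add thousands of entries to the global table.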
Luckily, Verizon quickly solved the de-aggregation problem, so we’re good for now. However, the Internet routing table will continue to grow organically and we will reach the 512,000 limit again soon. The good news is that there’s a workaround for those operating these older Cisco routers. The 512,000 route limit can be increased to a higher number; for details see this Cisco doc.