Yesterday some Internet users would have seen issues with their Internet connectivity, experiencing slowness or finding parts of the Internet unreachable. This incident hit users in Japan particularly hard, and it caused the Internal Affairs and Communications Ministry of Japan to start an investigation into the large-scale Internet disruption that slowed or blocked access to websites and online services for dozens of Japanese companies.
In this blog post we will take a look at the root cause of these outages, who was affected and what networks were involved.
Starting at 03:22 UTC yesterday (Aug 25), followers of @BGPstream would have seen an increase in alerts involving Google. The BGPstream alerts informed us that Google was announcing the peering LAN prefixes of a few well-known Internet exchanges. This in itself is a fairly common type of incident, and it typically indicates that something isn’t quite right within the network hijacking those prefixes. These alerts were the first clue that something was wrong with Google’s BGP advertisements.
A closer look at our data shows not only BGP hijack incidents but also a high number of BGP leak events. A random example is this one: 126.96.36.199/17 announced by AS45629 (Jastel out of Thailand), which all of a sudden became reachable with Google as a provider for Jastel. To demonstrate this, let’s look at some example paths (not an exhaustive list):
1103 286 701 15169 45629
13335 9498 5511 701 15169 45629
202140 29075 5511 701 15169 45629
52342 20299 262206 701 15169 45629
If we take a closer look at the AS paths involved, starting at the right side, we see the prefix was announced by 45629 (Jastel) as expected. Since Jastel peers with Google (15169), that’s the next AS we see. The next AS in the path is 701 (Verizon), and this is where it gets interesting: Verizon has now started to provide transit for Jastel via Google. Verizon (701) then announced that to several of its customers, some of them very large, such as KPN (286) and Orange (5511). So just by looking at 4 example paths we can see it hit large networks in Europe, Latin America, the US, and India (9498 Airtel).
In the example above we can see how Google accidentally became a transit provider for Jastel by announcing peer prefixes to Verizon. Since Verizon would select this path to Jastel, it would have sent traffic for this network towards Google. This happened not only for Jastel, but for thousands of other networks as well.
Google is not a transit provider and traffic for 3rd party networks should never go through the Google network. Jastel has a few upstream providers, and with the addition of Google and Verizon to the path, it’s likely that only Verizon customers (which is still significant) would have chosen this path, and only those that had no other alternative or specifically preferred Verizon over shorter paths. However, this is just the start.
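To make the path reading above concrete, here is a minimal Python sketch that flags a network we consider non-transit when it shows up in the middle of an AS path. The helper names and the NON_TRANSIT_ASNS set are our own illustration, not part of any BGPmon tooling; the four paths are the examples listed earlier.

```python
# ASNs we assume should never carry third-party traffic (Google, per the text).
NON_TRANSIT_ASNS = {15169}


def parse_path(text):
    """Turn a space-separated AS path string into a list of ints."""
    return [int(asn) for asn in text.split()]


def transit_violations(path, non_transit=NON_TRANSIT_ASNS):
    """Return the non-transit ASNs that sit in a transit position.

    An AS is in a transit position when it is neither the first AS in the
    path (the vantage point's neighbour) nor the last one (the origin).
    """
    return [asn for asn in path[1:-1] if asn in non_transit]


paths = [
    "1103 286 701 15169 45629",
    "13335 9498 5511 701 15169 45629",
    "202140 29075 5511 701 15169 45629",
    "52342 20299 262206 701 15169 45629",
]

for text in paths:
    path = parse_path(text)
    origin, upstream = path[-1], path[-2]
    leaked_via = transit_violations(path)
    print(f"origin AS{origin}, upstream AS{upstream}, leaked via {leaked_via}")
```

Every example path reports AS15169 as the upstream of AS45629 and as an AS in a transit position, which is the signature of the leak.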
A word about traffic engineering
Google is one of the largest (CDN) networks in the world. It has an open peering policy and is extremely well connected with many peers. It’s also the source of a large amount of traffic, with popular services such as YouTube, Google search, Google Drive, Google Compute, etc. As a result, many networks exchange a significant volume of traffic with Google alone, and those with direct peering with Google will want to make sure Google picks the right peering link with them. Large networks therefore deploy traffic engineering tricks to make sure traffic flows over the correct peering links with Google. The most powerful trick in the book is to de-aggregate and announce more specifics. This means that no matter the AS path length or whatever local-pref Google sets locally, the more specific prefixes are always preferred.
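The reason more specifics always win is longest-prefix matching: a router picks the most specific matching prefix before it ever compares BGP attributes. The sketch below illustrates this with Python's standard ipaddress module. The prefixes come from the 198.18.0.0/15 benchmarking range and the route labels are made up for the example.

```python
import ipaddress


def best_route(routes, destination):
    """Pick the route whose prefix is the longest match for `destination`.

    `routes` maps prefix strings to a free-form next-hop label. Prefix
    length is the only thing compared here, mirroring the fact that
    AS-path length and local-pref never decide between an aggregate and
    one of its more specifics.
    """
    addr = ipaddress.ip_address(destination)
    matching = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in routes.items()
        if addr in ipaddress.ip_network(prefix)
    ]
    if not matching:
        return None
    network, next_hop = max(matching, key=lambda item: item[0].prefixlen)
    return str(network), next_hop


# Hypothetical routing table: an aggregate and one de-aggregated /24.
routes = {
    "198.18.0.0/15": "short path via transit",
    "198.18.5.0/24": "direct peering link",
}

print(best_route(routes, "198.18.5.10"))  # covered by both, the /24 wins
print(best_route(routes, "198.19.0.1"))   # only the /15 matches
```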
A unique insight into Google’s network
Since Google essentially leaked a full table towards Verizon, we get to peek into what Google’s peering relationships look like and how their peers traffic engineer towards Google. Analyzing this data set we find many more specific prefixes, meaning prefixes that are not normally seen in the global Internet routing table (DFZ) and are only made visible to Google for traffic engineering purposes. Let’s take a look at an example.
The prefix 188.8.131.52/24 is not normally seen on the Internet, instead it is announced as the larger aggregate 184.108.40.206/12 by AS4713 NTT OCN, the largest service provider in Japan.
During the time of the incident we see over 20,000 new OCN prefixes, all more specifics of their larger aggregate blocks (mainly their /11’s, /12’s, /13’s, /14’s and /15’s). In this case OCN announced these more specific prefixes primarily to control how traffic comes in from Google. Now that Google leaked these prefixes to Verizon as well, everyone seeing announcements for these prefixes would have sent traffic for them towards Verizon and Google, essentially turning a local traffic engineering trick into a much more global traffic engineering setup.
Verizon customers and peers that would have seen this announcement would have preferred this over any other path since more specifics always win.
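Finding such leaked more specifics boils down to comparing newly observed prefixes against the aggregates already in the table. The following is a rough sketch of that comparison, not the actual BGPmon analysis; the function name and all prefixes are illustrative.

```python
import ipaddress


def new_more_specifics(baseline, observed):
    """Map each new more-specific prefix to the baseline aggregate covering it.

    A prefix counts as a new more specific when it is not in the baseline
    itself but falls inside at least one baseline prefix.
    """
    base_nets = [ipaddress.ip_network(p) for p in baseline]
    known = set(base_nets)
    found = {}
    for prefix in observed:
        net = ipaddress.ip_network(prefix)
        if net in known:
            continue  # already part of the normal table
        covering = [
            agg for agg in base_nets
            if agg.version == net.version and net.subnet_of(agg)
        ]
        if covering:
            # Report the tightest aggregate that contains the new prefix.
            tightest = max(covering, key=lambda agg: agg.prefixlen)
            found[str(net)] = str(tightest)
    return found


# Hypothetical data: what the table normally holds vs. what showed up.
baseline = ["198.18.0.0/15", "203.0.113.0/24"]
observed = ["198.18.0.0/15", "198.18.5.0/24", "198.19.128.0/20", "192.0.2.0/24"]

for specific, aggregate in new_more_specifics(baseline, observed).items():
    print(f"{specific} is a new more specific of {aggregate}")
```

Note that 192.0.2.0/24 is new but not covered by any baseline aggregate, so it is not reported as a more specific.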
Size and impact of this incident
If we look at which networks were impacted the most, we can see that AS4713 NTT OCN, the largest service provider in Japan, was impacted most severely. Our data shows over 24,000 new more specific prefixes for OCN were visible via Google and Verizon during the time of the incident.
We also saw over 7,000 new more specifics for AS7029 (Windstream). The total number of new prefixes (mostly more specifics) is around 50,000. For those interested, the top 30 affected networks can be found below.
All of these leaks were visible between 03:22 UTC and 03:33 UTC, with some peers seeing the leaked paths till about 04:00 UTC, or in local Japan time, between 12:22 PM and 1:01 PM.
Networks with new prefixes via Google and Verizon:
OCN - NTT Communications Corporation
Windstream Communications Inc
Uninet S.A. de C.V.
Taiwan Academic Network (TANet) Information Center
Vodafone GmbH
ARTERIA Networks Corporation
CLARO S.A.
China TieTong Telecommunications Corporation
Orange Espagne S.A.U.
Telecentro S.A.
NSS S.A.
TELCOINABOX PTY LTD
Instituto Costarricense de Electricidad y Telecom.
Com Hem AB
Compañía Dominicana de Teléfonos, C. por A. - CODETEL
CABLEVISION S.A.
KPN B.V.
TDS TELECOM
Bulsatcom EAD
Tata Communications
Syscon Infoway Pvt. Ltd.
SaveCom Internation Inc.
Wideband Networks Pty Ltd, Transit AS
Viewqwest Pte Ltd
china tietong Shandong net
Prima S.A.
Cisco Webex LLC
Cabovisao, televisao por cabovisao, sa
In total we saw over 135,000 prefixes visible via the Google - Verizon path. The widespread outages, particularly in Japan (OCN), were caused by the more specifics: they led many networks to reroute traffic toward Verizon and Google, which likely congested that path or perhaps hit some kind of ACL, resulting in the outages.
Many BGPmon users would have seen an alert similar to the one below, informing them that new prefixes were being originated and were visible globally.
New prefix for AS14061 (Code: 60)
Detected new prefix: 220.127.116.11/19
Update time: 2017-08-25 03:25 (UTC)
Detected by #peers: 18
Announced by: AS14061 (Digital Ocean, Inc.)
Upstream AS: AS15169 (Google Inc.)
ASpath: 18356 38794 45796 2516 701 15169 14061
Monitoring is one simple thing operators can do to quickly detect this and take action. In this case the recommended course of action would have been to shut down the peering sessions with Google.
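As a rough idea of what such a check can look like, the sketch below compares the upstream AS seen in a path with a list of upstreams the operator expects. This is not how BGPmon implements its alerts; the function and the expected-upstream set are hypothetical, and only the AS path is taken from the alert above.

```python
def check_upstream(as_path, expected_upstreams):
    """Return an alert string if the origin's upstream is unexpected, else None."""
    hops = [int(asn) for asn in as_path.split()]
    # Collapse AS-path prepending at the origin (e.g. "... 14061 14061").
    while len(hops) > 1 and hops[-2] == hops[-1]:
        hops.pop()
    if len(hops) < 2:
        return None  # origin seen directly, there is no upstream to check
    origin, upstream = hops[-1], hops[-2]
    if upstream in expected_upstreams:
        return None
    return f"AS{origin} seen behind unexpected upstream AS{upstream}"


# Hypothetical allowlist; a real one would come from the operator's own contracts.
expected = {2914, 3356}

print(check_upstream("18356 38794 45796 2516 701 15169 14061", expected))
```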
BGP leaks continue to be a great risk to the Internet's stability. It’s easy to make configuration mistakes that can lead to incidents like this. In this case it appears a configuration error or software problem in Google's network led to inadvertently announcing thousands of prefixes to Verizon, who in turn propagated the leak to many of its peers.
Since it is easy to make configuration errors, it clearly is a necessity to have filters on both sides of an EBGP session. In this case it appears Verizon had few or no filters and accepted most if not all BGP announcements from Google, which led to widespread service disruptions. At a minimum, Verizon should probably have had a maximum-prefix limit on their side and perhaps some as-path filters, which would have prevented the widespread impact.
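To illustrate how these two safeguards interact, here is a toy Python model of inbound policy on a single session. It is a simplification under our own assumptions and not router configuration: the PeerSession class, its fields, and the example prefixes are invented for the sketch, and real implementations differ in details such as whether the limit counts prefixes before or after filtering.

```python
class PeerSession:
    """Toy model of inbound policy on one eBGP session."""

    def __init__(self, peer_asn, allowed_origins, max_prefixes):
        self.peer_asn = peer_asn
        self.allowed_origins = set(allowed_origins)
        self.max_prefixes = max_prefixes
        self.accepted = {}
        self.up = True

    def receive(self, prefix, as_path):
        """Apply the filters to one announcement; return True if accepted."""
        if not self.up:
            return False
        # AS-path filter: the path must start at the peer and end at an
        # origin we expect to hear from this peer.
        if not as_path or as_path[0] != self.peer_asn:
            return False
        if as_path[-1] not in self.allowed_origins:
            return False
        self.accepted[prefix] = as_path
        # Maximum-prefix limit: too many routes tears the session down.
        if len(self.accepted) > self.max_prefixes:
            self.up = False
            self.accepted.clear()
            return False
        return True


# With an as-path filter, a route originated by a third party is dropped.
filtered = PeerSession(peer_asn=15169, allowed_origins={15169}, max_prefixes=3)
print(filtered.receive("192.0.2.0/24", [15169]))            # accepted
print(filtered.receive("198.51.100.0/24", [15169, 45629]))  # rejected

# Without a tight origin filter, the prefix limit is the last line of defence.
loose = PeerSession(peer_asn=15169, allowed_origins={15169, 45629}, max_prefixes=2)
for prefix in ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]:
    loose.receive(prefix, [15169, 45629])
print(loose.up)  # the third prefix exceeded the limit
```

Either safeguard alone limits the damage here: the filter drops the leaked route outright, while the prefix limit shuts the session once the leak grows past the configured threshold.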