How the Internet in Australia went down under
This Wednesday, for about 30 minutes, many Australians found themselves without Internet access. All of these users relied, either directly or indirectly, on the Telstra network, which at that point was isolated from the rest of the Internet. The story quickly hit the local headlines. In this blog post we’ll look at the technical details of the event and at the likely cause of the outage.

Telstra is one of Australia’s major Internet providers. It normally originates approximately 500 IPv4 prefixes and 3 IPv6 prefixes. Telstra also provides transit for many ISPs and enterprises, for example AS38285 ‘Dodo’, an Australian ISP, and AS10235 ‘National Australia Bank’.

So how could such a large provider go down? Surely it has lots of redundant hardware and multiple connections in and out of the country. As it turns out, Wednesday’s outage was caused by a routing error that many network engineers have first-hand experience with: a simple routing leak.

A routing leak can happen when a small ISP X buys transit from both ISP A and ISP B. ISP X receives a full BGP routing table from A and, because of incorrect filtering, relays these announcements to ISP B. As a result, ISP B now learns all Internet routes via ISP X, and ISP X (the customer) has effectively become an upstream provider for ISP B.

This is likely what happened last Wednesday between Telstra and Dodo (AS38285). Dodo, a Telstra customer, re-announced all Internet routes to Telstra, which prefers customer routes and therefore concluded that the best way to the Internet was through Dodo. This post on the AusNOG mailing list shows how Telstra was using Dodo (a customer) as transit to reach a network in India.

This is not a new zero-day attack scenario or anything like it. Instead, it’s probably the number one mistake made when configuring BGP routing. I remember that when I was just learning about BGP, my mentor always used to tell me: filter, filter, filter, filter! Which is exactly what didn’t happen here.
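
To make the "prefers customer routes" step concrete, here is a minimal Python sketch of that part of BGP route selection. It is a toy model, not Telstra's actual policy: the local-preference values, the neighbor name `upstream-A`, the documentation prefix `203.0.113.0/24`, and every AS number other than Dodo's AS38285 are hypothetical placeholders. It only models the first two tie-breakers (local preference, then AS path length).

```python
from dataclasses import dataclass

# Hypothetical local-preference values; real networks choose their own.
LOCAL_PREF = {"customer": 200, "peer": 150, "upstream": 100}


@dataclass(frozen=True)
class Route:
    prefix: str
    neighbor: str       # who announced the route to us
    relationship: str   # "customer", "peer" or "upstream"
    as_path: tuple      # AS numbers, nearest first


def best_route(routes):
    """Pick the highest local preference, then the shortest AS path."""
    return max(routes, key=lambda r: (LOCAL_PREF[r.relationship], -len(r.as_path)))


prefix = "203.0.113.0/24"  # documentation prefix, stands in for a remote network

# Normal situation: the remote network is reached through an upstream provider.
via_upstream = Route(prefix, "upstream-A", "upstream", (64496, 64500))

# After the leak: the customer re-announces the same prefix with a longer path.
leaked_via_customer = Route(prefix, "Dodo", "customer", (38285, 64497, 64496, 64500))

before = best_route([via_upstream])
after = best_route([via_upstream, leaked_via_customer])

print("before the leak, traffic leaves via:", before.neighbor)
print("after the leak, traffic leaves via:", after.neighbor)
```

Even though the leaked path is twice as long, it wins because local preference is compared before AS path length. That is why a single unfiltered customer session can attract traffic for the whole routing table.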
Because it is so easy to accidentally leak routes in BGP, you have to explicitly define filters that prevent it. Dodo should have had filters to make sure it announced only its own prefixes, and Telstra should have had matching filters on its side, both to prevent hijacks and, more importantly, to protect its own infrastructure. These filters did not seem to be in place, which allowed the leak to happen.

However, this alone should not have brought down all of Telstra’s international connections. So what happened? It’s likely that Telstra tagged all routes learned from Dodo (all 400,000 of them) as customer routes and faithfully announced them to all of its peers and upstream providers.

Because keeping large filters up to date can be tedious, we often see large providers use a mechanism known as max prefix limits. Instead of explicitly defining which prefixes to allow, the provider sets a maximum number of prefixes for the session: the number it expects plus some headroom. This guards against a sudden spike in announcements, which is often caused by a leak. When the limit is reached, the BGP session is brought down to prevent the leak from spreading. This is what likely happened with Telstra’s upstream providers: Telstra announced a full Internet routing table to them, which triggered the max prefix limit, took the BGP sessions down, and as a result isolated Telstra and its customers from the rest of the Internet.

This outage is clearly visible when we look at the Telstra routes in the Internet BGP routing tables. The graph on the left shows how all of Telstra’s prefixes suddenly disappeared at 2:40 UTC and came back approximately 30 minutes later, when the leak was resolved. Our analysis shows that approximately 1400 prefixes were affected. This includes Telstra and Telstra customer networks, and accounts for about 10% of all Australian IPv4 prefixes.

It is worth pointing out that none of the IPv6 prefixes were affected. This is because it’s common to use different BGP sessions for IPv4 and IPv6. In this case only the IPv4 session went down, leaving the IPv6 networks unaffected.

And that is how one simple customer leak can bring down a complete carrier network. So today’s lesson is: filter, filter, filter!
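
For readers who want to see the two safeguards side by side, the following Python sketch models an inbound BGP session with an explicit prefix filter and with a max prefix limit. It is a simplified illustration under stated assumptions, not router configuration: the `BgpSession` class, the documentation prefixes, the size of the simulated leak, and the limit of 10 are all hypothetical, and the filter only does exact matches, whereas real filters usually also handle more-specific prefixes.

```python
import ipaddress


class BgpSession:
    """Toy model of an inbound BGP session with two optional safeguards."""

    def __init__(self, neighbor, allowed_prefixes=None, max_prefixes=None):
        self.neighbor = neighbor
        # Explicit filter: the exact set of prefixes the neighbor may announce.
        self.allowed = (
            None
            if allowed_prefixes is None
            else {ipaddress.ip_network(p) for p in allowed_prefixes}
        )
        # Max prefix limit: expected number of prefixes plus some headroom.
        self.max_prefixes = max_prefixes
        self.accepted = set()
        self.established = True

    def receive(self, prefix):
        """Process one announcement; return True if the route was accepted."""
        if not self.established:
            return False
        network = ipaddress.ip_network(prefix)
        if self.allowed is not None and network not in self.allowed:
            return False  # filtered: dropped, session stays up
        self.accepted.add(network)
        if self.max_prefixes is not None and len(self.accepted) > self.max_prefixes:
            # Limit exceeded: tear the session down and drop all its routes.
            self.established = False
            self.accepted.clear()
            return False
        return True


# Hypothetical customer with two registered prefixes (documentation ranges).
own_prefixes = ["198.51.100.0/24", "203.0.113.0/24"]
# Simulated leak: 256 prefixes the customer has no business announcing.
leak = [f"10.{i}.0.0/16" for i in range(256)]

filtered = BgpSession("customer", allowed_prefixes=own_prefixes)
limited = BgpSession("customer", max_prefixes=10)

for announcement in own_prefixes + leak:
    filtered.receive(announcement)
    limited.receive(announcement)

print("filter:     accepted", len(filtered.accepted), "session up:", filtered.established)
print("max prefix: accepted", len(limited.accepted), "session up:", limited.established)
```

The difference in failure mode is the point. The explicit filter drops the leaked routes and keeps the session, and the customer's legitimate prefixes, alive. The max prefix limit stops the leak from spreading but does so by dropping the whole session, which matches the isolation described above.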