This post is cross-posted from BGPmon, which we recently acquired.
Earlier today a massive route leak initiated by Telekom Malaysia (AS4788) caused significant network problems for the global routing system. Primarily affected were Level3 (AS3549, formerly known as Global Crossing) and its customers. Below are some of the details as we know them now.
Starting at 08:43 UTC today, June 12th, Telekom Malaysia (AS4788) began announcing about 179,000 prefixes to Level3 (AS3549, the Global Crossing AS), which in turn accepted them and propagated them to its peers and customers. Because Telekom Malaysia had inserted itself into the path between Level3 and these thousands of prefixes, it was now responsible for delivering these packets to the intended destinations.
This event resulted in significant packet loss and Internet slowdowns in all parts of the world. The Level3 network in particular suffered from severe service degradation between the Asia Pacific region and the rest of its network. The graph below, for example, shows the packet loss as measured by OpenDNS between London and Hong Kong over Level3. The same loss patterns were visible from other Level3 locations around the world to, for example, Singapore, Hong Kong and Sydney.
At the same time, the round-trip time between these destinations went up significantly, as can be seen in the graph below.
The number of BGP messages that BGPmon processed over time, shown in the graph below, has a clear starting point at which the number of BGP updates suddenly increased. A closer look at the data shows that this increase in BGP messages starts at 08:43 UTC and aligns exactly with the start of the leak and the start of the packet loss issues. At around 10:40 UTC we observed slow improvements, and at around 11:15 UTC things started to clear up.
Let’s look at an example.
One affected prefix is 188.8.131.52/24, which is one of the Facebook prefixes. The AS path looked like this:
1103 286 3549 4788 32934
If we look at this path we see that AS32934, Facebook, is the originator of the prefix. Facebook peers with Telekom Malaysia (AS4788) and announced the prefix to it. Telekom Malaysia in turn announced it to Level3 (AS3549), which announced it to all of its peers and customers. In effect Telekom Malaysia was giving the prefix transit, which caused a major routing leak.
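The reasoning above can be sketched in a few lines of Python. This is a minimal illustration, not how BGPmon detects leaks: it applies the common "valley-free" rule of thumb, under which a route learned from a peer or a provider should only be passed on to customers. The relationship table is hypothetical. Only the peering between AS4788 and AS32934 is stated above; treating Level3 as a provider of Telekom Malaysia is an assumption made for the example, and the function name find_leaker is ours.

```python
from typing import Dict, List, Optional, Tuple

# How the second AS relates to the first one, from the first AS's view.
# Only the 4788/32934 peering is stated in the post. The 4788 -> 3549
# entry is an ASSUMPTION made purely for illustration.
RELATIONSHIPS: Dict[Tuple[int, int], str] = {
    (4788, 32934): "peer",     # Telekom Malaysia peers with Facebook
    (4788, 3549): "provider",  # assumed: Level3 acts as upstream of 4788
}


def parse_as_path(text: str) -> List[int]:
    """Turn '1103 286 3549 4788 32934' into a list of AS numbers."""
    return [int(token) for token in text.split()]


def find_leaker(
    path: List[int], relationships: Dict[Tuple[int, int], str]
) -> Optional[int]:
    """Return the first AS that re-exports a peer/provider route upward.

    The path is in BGP order: nearest AS first, originating AS last.
    ASes with unknown relationships are skipped rather than guessed.
    """
    for i in range(1, len(path) - 1):
        exported_to, asn, learned_from = path[i - 1], path[i], path[i + 1]
        source = relationships.get((asn, learned_from))
        target = relationships.get((asn, exported_to))
        # Valley-free violation: learned from a non-customer and
        # exported to a non-customer.
        if source in ("peer", "provider") and target in ("peer", "provider"):
            return asn
    return None


path = parse_as_path("1103 286 3549 4788 32934")
origin = path[-1]                           # the rightmost AS originates the prefix
leaker = find_leaker(path, RELATIONSHIPS)   # 4788 under the assumptions above
```

With real data the relationship table would have to come from an external source of inferred AS relationships. The check is only as good as that table, which is why unknown pairs are ignored here.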
Because Telekom Malaysia did this for about 176,000 prefixes, they essentially signalled to the world that they could provide connectivity for all these prefixes, and as a result they attracted significantly more traffic than normal. All this traffic, now routed via Level3, had to be squeezed through Telekom Malaysia's interconnects with Level3. Telekom Malaysia therefore likely hit capacity limits, which resulted in the severe packet loss that users reported on Twitter and that we've shown with the data above.
The 176,000 leaked prefixes are likely all Telekom Malaysia’s customer prefixes combined with routes they learned from peers. This would explain another curious increase in the number of routes Level3 announced during the leak time frame.
The graph below shows the number of prefixes announced by Level3 to its customers. Normally Level3 announces ~534,000 prefixes on a full BGP feed. These are essentially all the IP networks on the Internet today. Interestingly, during the leak an additional 10,000 prefixes were observed. One explanation could be that these are more-specific prefixes that peers of Telekom Malaysia announce to Telekom Malaysia, and that are normally supposed to stay regional and not be visible via transit.
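To make the "more specific prefix" idea concrete, here is a small sketch using Python's standard ipaddress module. It compares a baseline table with a table seen during a leak and reports prefixes that are new and fall inside a prefix that was already present. The prefixes are documentation ranges chosen for illustration, not data from this event, and the function name new_more_specifics is our own.

```python
import ipaddress
from typing import List


def new_more_specifics(baseline: List[str], during_leak: List[str]) -> List[str]:
    """Return prefixes that are new and covered by a baseline prefix."""
    base = [ipaddress.ip_network(p) for p in baseline]
    found = []
    for text in during_leak:
        net = ipaddress.ip_network(text)
        if net in base:
            continue  # already visible before the leak
        # subnet_of() requires both networks to share an IP version.
        if any(net.version == b.version and net.subnet_of(b) for b in base):
            found.append(text)
    return found


# Illustrative documentation prefixes, NOT measurements from the leak.
baseline = ["203.0.113.0/24", "198.51.100.0/24"]
during_leak = [
    "203.0.113.0/24",    # unchanged
    "203.0.113.0/25",    # new more-specific of 203.0.113.0/24
    "203.0.113.128/25",  # new more-specific of 203.0.113.0/24
    "192.0.2.0/24",      # new, but not covered by any baseline prefix
]
extra = new_more_specifics(baseline, during_leak)
```

The linear scan over the baseline is fine for a handful of prefixes. A full routing table would call for a prefix trie instead.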
Since Level3 was now announcing many more prefixes than normal, it would have hit max-prefix limits on BGP sessions with its peers. These peering sessions with other large tier 1 networks carry a significant portion of the world's Internet traffic, and the shutdown of these sessions would cause traffic to shift around even more, exacerbating the performance problems and causing even more BGP churn.
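The effect of a max-prefix limit can be shown with a toy model. This is a sketch under simplifying assumptions, not router behaviour in detail: the class BgpSession, the limit of 3 and the private-range prefixes are all invented for illustration. Real implementations typically also support warning thresholds and restart timers.

```python
class BgpSession:
    """Toy model of a BGP session protected by a maximum-prefix limit."""

    def __init__(self, peer_asn: int, max_prefixes: int) -> None:
        self.peer_asn = peer_asn
        self.max_prefixes = max_prefixes
        self.prefixes = set()
        self.established = True

    def receive(self, prefix: str) -> bool:
        """Accept an announcement; tear the session down past the limit.

        Returns True while the session is still up after the update.
        """
        if not self.established:
            return False  # session is down, nothing is accepted
        self.prefixes.add(prefix)
        if len(self.prefixes) > self.max_prefixes:
            # Limit exceeded: drop the session and every route learned
            # over it, which forces traffic onto other paths.
            self.established = False
            self.prefixes.clear()
        return self.established


# Hypothetical limit and prefixes, chosen only to trigger the teardown.
session = BgpSession(peer_asn=3549, max_prefixes=3)
announcements = ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24",
                 "10.0.3.0/24", "10.0.4.0/24"]
results = [session.receive(p) for p in announcements]
```

The fourth announcement pushes the session over its limit, so every route learned from that neighbor disappears at once. At the scale of a tier 1 peering session, that is the kind of sudden traffic shift described above.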
So in conclusion, what we saw this morning was a major BGP leak of 176,000 prefixes by Telekom Malaysia to Level3. Level3 erroneously accepted these prefixes and announced them to its peers and customers. Starting at 08:43 UTC and lasting for about two hours, traffic was redirected toward Telekom Malaysia, which in many cases meant a longer route and also caused Telekom Malaysia to be overwhelmed with traffic. As a result, significant portions of traffic were dropped, latency increased and users worldwide experienced slower Internet service.