In today’s always-on, work-from-anywhere world, connectivity and performance are everything. The network is the foundation of getting work done. When the network experiences performance issues, businesses and their customers, partners, and employees all suffer. A dropped video call can stall a sale. An error message when loading a small business’s storefront web page can hurt the customer experience and the brand’s image. Whatever the industry, connectivity is a crucial component of success.
On October 14, 2021, one of our main transit providers suffered a severe network issue, which impaired its transatlantic connectivity across the globe. Although the transit provider disruption lasted approximately 12 hours, Cisco Umbrella customers experienced virtually no interruption. Just a few minutes after the problem started, Umbrella’s systems automatically mitigated the internal data packet loss by rerouting traffic over different providers to avoid trouble spots. After that, our automation allowed us to completely remove that transit provider from the path between our customers and our Umbrella security services.
Umbrella’s cloud-native architecture was built for moments like this. Here’s how we mitigated disaster for our customers and kept them from experiencing significant downtime, disruption, and data loss.
Mitigating Outages With Cisco Umbrella’s Self-Healing, Highly Automated Architecture
We noticed the transit provider outage on our monitoring systems right away. Over two dozen Cisco Umbrella data centers, along with other internet service providers (ISPs) with whom we partner, were using that transit provider to connect to the internet. Immediately, almost all of those sites started seeing full packet loss: traffic sent through the provider was not reaching its intended destinations.
But almost as quickly, after that initial spike, packet loss across the Umbrella global cloud architecture dropped back to normal levels.
Sophisticated automation, built and run by expert engineers, saved the day. We designed the system to have complete visibility into all the combinations of ISPs available to route internal traffic. When it detects that the current combination is no longer the best (in terms of either packet loss or latency), it picks another combination and changes the routing accordingly. This agile and flexible architecture enables us to continuously deliver new capabilities seamlessly to our customers, without business downtime, even in the face of connectivity crises across interconnected transit providers, ISPs, content delivery networks, and more.
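To make the selection logic described above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not Umbrella's actual implementation: the `PathStats` type, the function names, and the ranking rule (lowest packet loss first, latency as the tie-breaker) are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PathStats:
    """Measured health of one candidate ISP combination (hypothetical model)."""
    providers: Tuple[str, ...]  # ISPs traversed by this path
    loss_pct: float             # observed packet loss, in percent
    latency_ms: float           # observed latency, in milliseconds


def rank(path: PathStats) -> Tuple[float, float]:
    # Assumed ranking: compare packet loss first, then latency.
    return (path.loss_pct, path.latency_ms)


def choose_path(current: PathStats, candidates: List[PathStats]) -> PathStats:
    """Keep the current path unless another candidate ranks strictly better."""
    best = min(candidates + [current], key=rank)
    return best if rank(best) < rank(current) else current


if __name__ == "__main__":
    failing = PathStats(("transit-a",), loss_pct=100.0, latency_ms=80.0)
    healthy = PathStats(("transit-b",), loss_pct=0.1, latency_ms=95.0)
    print(choose_path(failing, [healthy]).providers)
```

A production system would likely also damp route changes so that two near-equal paths do not cause flapping; that detail is left out here to keep the sketch short.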
Designing Solutions to Keep Network Performance Disruptions From Turning Into Disasters
Even though the connectivity and network performance problems with the transit provider remained unresolved for 12 more hours, customer traffic pointed to the Cisco Umbrella IP address got right back on track. This occurred because our engineering teams have developed a variety of tools to ensure extraordinary resilience and performance, and two of them are pivotal to keeping our customers up and running.
In most cases, when a transit provider or other ISP has a disruption or network performance issue, we automatically reroute traffic away from the affected sites. An automated system (dubbed the “Transit Terminator”) detects the issue and shuts down the Border Gateway Protocol (BGP) session at each affected site. This is ideal for a scenario where the disruptions are confined to a relatively small number of sites.
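The per-site behavior described above can be sketched as a small watchdog that decides when a given BGP session should be shut down. This is a hedged illustration only: the class name, the loss threshold, and the "several consecutive bad samples" rule are assumptions, and the sketch only makes the decision; it does not talk to any router.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

SessionKey = Tuple[str, str]  # (site, provider)


class SessionWatchdog:
    """Flags a (site, provider) BGP session for shutdown after sustained loss."""

    def __init__(self, loss_threshold_pct: float = 5.0, bad_samples_required: int = 3):
        # Both defaults are hypothetical values chosen for illustration.
        self.loss_threshold_pct = loss_threshold_pct
        self.bad_samples_required = bad_samples_required
        self._bad_streak: Dict[SessionKey, int] = defaultdict(int)
        self.shut: Set[SessionKey] = set()

    def observe(self, site: str, provider: str, loss_pct: float) -> bool:
        """Record one loss sample; return True when the session should be shut now."""
        key = (site, provider)
        if key in self.shut:
            return False  # already shut, nothing more to do
        if loss_pct >= self.loss_threshold_pct:
            self._bad_streak[key] += 1
        else:
            self._bad_streak[key] = 0  # a healthy sample resets the streak
        if self._bad_streak[key] >= self.bad_samples_required:
            self.shut.add(key)
            return True
        return False
```

Requiring a short streak of bad samples, rather than reacting to a single one, is a common way to avoid tearing down a session over a momentary blip.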
However, in scenarios like this one, where a very large number of Umbrella sites are impacted by a transit provider or other ISP problem, the Transit Terminator can lead to site overload and degradation on the remaining Umbrella sites, so it’s not the best long-term solution. For wide-scale disruptions like this, we built a different tool, the “ISP global shutdown tool.” In this case, we needed to completely remove the faulty provider from our Umbrella customers’ paths. To do this, we needed to shut down the BGP sessions at exactly the same time on all the sites where the transit provider was present. By doing this, the entire network for that provider would lose all the direct routes towards our IP prefixes at the same time, and the traffic would get spread across all Umbrella sites, without overloading any specific one.
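The "shut everything down at once" idea above can be sketched as follows. This is a simplified illustration, not the actual ISP global shutdown tool: `shut_session` is a hypothetical callable standing in for whatever mechanism disables a BGP session at a site, and the provider-to-sites mapping is assumed to be available as a plain dictionary. A barrier releases all workers together so the shutdowns start as close to simultaneously as threads allow.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def shutdown_provider_everywhere(
    sites_by_provider: Dict[str, Iterable[str]],
    provider: str,
    shut_session: Callable[[str, str], None],  # hypothetical: (site, provider) -> None
) -> Dict[str, bool]:
    """Shut the provider's session on every site at once; map site -> success."""
    sites = sorted(sites_by_provider.get(provider, ()))
    if not sites:
        return {}

    # One party per site: no worker proceeds until all of them are ready.
    barrier = threading.Barrier(len(sites))

    def worker(site: str) -> None:
        barrier.wait()
        shut_session(site, provider)

    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        futures = {site: pool.submit(worker, site) for site in sites}
        # exception() blocks until the worker finishes; None means it succeeded.
        return {site: fut.exception() is None for site, fut in futures.items()}
```

Returning a per-site success map lets the operator see at a glance whether any site still has a live session with the faulty provider.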
Building an automated tool to handle this exact circumstance saved time and manpower and reduced the risk of errors. Most importantly, it prevented our Umbrella customers from directly experiencing network performance issues related to the transit provider meltdown. Within minutes of the event, the on-call engineer diagnosed the scope of the issue and used the ISP global shutdown tool to select the specific transit provider network for which we needed to stop all sessions. The traffic through that ISP went to zero immediately, keeping traffic flowing through the Umbrella network on providers that were up and operational.
Cisco Umbrella’s battle-hardened architecture is built and run by a team with decades of experience spanning security, networking, cloud-native architecture, threat research, data science, and more. We applaud their dedication and determination to prevent chaos for our customers.
Want to Learn More?
We have resources aplenty discussing how Cisco’s global cloud architecture delivers network resiliency and reliability. We’ve also written an article that outlines another instance where Cisco Umbrella protected customers from outages.
We also provide a deep dive, with all the technical details, into our engineers’ analysis of how they mitigated the disruption. You can access it after a quick registration.
And if you’re ready to see what Cisco Umbrella can do for your organization, sign up for a free 14-day trial today!