Products & Services

The Network Resiliency Strategy for the Umbrella Infrastructure is Key to Our Business Growth and World-Class Innovation

Pier Carlo Chiodi
Updated — February 24, 2023 • 6 minute read
At Umbrella, we choose to expect the unexpected and continue to improve from the lessons learned. We plan for failures as an unavoidable natural occurrence. This is critical in allowing us to build a resilient infrastructure to guarantee the highest uptime and user experience for our customers.

To achieve these goals, we invest time and effort into a global resiliency strategy founded on multiple layers: a worldwide footprint of more than 40 hyperconnected data centers, a network that can automatically detect issues and work around them, and services designed to fail over gracefully, all without impairing the end-user experience.

In this blog, we’ll cover some aspects related to how resiliency is implemented at the infrastructure level, focusing on the network.

Anycast, multiple data centers and service discovery

At a high level, the entire infrastructure and service stack is based on the key principle that any single datacenter may go down at any time. This concept of an “ephemeral infrastructure” allows Umbrella teams to design services that have resiliency hard coded in their DNA.

The extensive use of anycast, backed up by a worldwide presence of data centers, allows the design of applications that leverage dynamic service discovery mechanisms. This helps us determine the best datacenter to use for a given customer among those that are available at that time. When a datacenter is removed from production (either for maintenance or because of an unplanned event), applications perform a new service discovery and land the user at a different datacenter.
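
To make the idea concrete, here is a minimal Python sketch of the selection logic only. It is an illustration under stated assumptions, not a description of how Umbrella implements service discovery: with anycast, much of this choice is made by Internet routing itself. The `Datacenter` class, the site names and the latency figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Datacenter:
    name: str             # hypothetical site name
    latency_ms: float     # hypothetical latency measured from the client
    in_production: bool   # False when drained for maintenance or an outage


def discover(datacenters: List[Datacenter]) -> Datacenter:
    """Return the lowest-latency datacenter that is currently in production."""
    available = [dc for dc in datacenters if dc.in_production]
    if not available:
        raise RuntimeError("no datacenter is available")
    return min(available, key=lambda dc: dc.latency_ms)


sites = [
    Datacenter("fra", 12.0, True),
    Datacenter("ams", 15.0, True),
    Datacenter("lon", 21.0, True),
]
first = discover(sites)

# Simulate "fra" being removed from production, then discover again.
sites[0] = Datacenter("fra", 12.0, False)
second = discover(sites)
```

In this sketch the first discovery lands on `fra`; once `fra` is marked as out of production, a new discovery lands on `ams`, the next best site that is still available.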

The combination of dynamic service discovery mechanisms on the application side and smart orchestrators on the backend allows us to implement services that can quickly detect the unavailability of a specific datacenter and fail over to one of our other data centers. Here’s a more detailed description of the key elements of our network resilience strategy.

Upstream connections

Our data centers are connected to the rest of the Internet via multiple upstream connections. At most of our data centers, we connect to at least three different Internet Service Providers (ISPs) separately, along with one or more connections to local Internet Exchanges (IXes).

ISP selection

We select ISPs from a global menu of tier-1 providers that are present at the facility where our datacenter is located. Tier-1 providers are the big players in the Internet ecosystem: they are very well connected to all the other networks and do not have to buy transit to have a complete Internet routing table.

We put meticulous care into the selection of the ISPs when we build a datacenter. For example, to meet our resiliency criteria, we perform a thorough analysis of the physical paths the fibers would need to go through to connect our devices to an ISP. We eventually select the combination of ISPs that guarantees us the best diversity of paths. For the network, the first layer of resiliency is path diversity on the ground.

Another important aspect of the ISP selection process is the analysis of how well connected each provider's network is. This analysis is not based on public data points, but rather on internal tracking and tooling, as well as inferences and estimates made by our senior network engineers. With it, we strive to understand how well an ISP is connected to the local access network providers that our customers use in the region.

This preliminary analysis helps us understand how many different potential paths we would be able to offer to the end-users to connect to our network and the quality of these paths. If we detect an issue in one of them, we can re-route the traffic over a different path with no impact to the users’ experience. The outcome of this analysis is a picture of how each ISP among those available in a facility would help to improve the diversity of the connectivity between our datacenter and our customers.

Last but not least is the capacity we provide for our upstream connections. For instance, we want to be able to absorb DDoS attacks without our customers even noticing them. Hence, we buy capacity that allows the network to handle those situations transparently. We monitor the utilization of our links constantly and proactively start upgrading them when safety thresholds are reached. This continuous focus on expanding our upstream circuits as demand grows allows us to operate a network with many terabits per second of global capacity. We also perform regular fire drills to test our mitigation procedures and to be prepared to react quickly when such events happen.
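
As an illustration of threshold-based capacity planning, the short Python sketch below flags links whose peak utilization has reached a safety threshold. The 50% threshold, the link names and the traffic figures are assumptions chosen for the example; the post does not state the actual thresholds in use.

```python
from typing import Dict, List, Tuple


def utilization(peak_bps: float, capacity_bps: float) -> float:
    """Fraction of a link's capacity used at peak."""
    if capacity_bps <= 0:
        raise ValueError("capacity must be positive")
    return peak_bps / capacity_bps


def links_to_upgrade(
    links: Dict[str, Tuple[float, float]],
    safety_threshold: float = 0.5,  # assumed value, for illustration only
) -> List[str]:
    """Return the names of links whose peak utilization reached the threshold."""
    return [
        name
        for name, (peak_bps, capacity_bps) in links.items()
        if utilization(peak_bps, capacity_bps) >= safety_threshold
    ]


# Hypothetical links: name -> (peak traffic in bit/s, capacity in bit/s)
links = {
    "isp-a": (30e9, 100e9),
    "isp-b": (62e9, 100e9),
    "ix-1": (55e9, 100e9),
}
flagged = links_to_upgrade(links)
```

Keeping the threshold well below 100% is what leaves headroom for sudden spikes such as a DDoS attack; here `isp-b` and `ix-1` would be flagged for an upgrade.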

Internet exchanges (IXes)

Another key component of our network resiliency model is the wide presence of peering relationships across the IXes we join. We adopt an extensive, high-quality peering policy that allows us to dramatically increase the diversity and stability of Internet paths towards our end-users and towards the major content providers and SaaS applications. We spend considerable time analyzing the quality that every peering relationship would bring to us before we accept or propose it. The outcome is a mesh of high-quality paths between our datacenters and customers, which guarantees the ability to perform smooth fail-over of traffic with no impact on the customer experience.

Routing security

We also evaluate resilience based on routing security. Every IP network connected to the Internet may be subject to routing incidents that can potentially lead to traffic disruption. Malicious activities or even unintentional human errors on the wider Internet lead, almost on a daily basis, to situations where traffic destined for a specific destination (a customer or a content provider) is hijacked and routed somewhere else. The Umbrella network implements several best practices to reduce this risk to a minimum, such as RPKI Origin Validation, which helps to ensure packets flow to and from our customers as intended.
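
For readers unfamiliar with RPKI Origin Validation, the Python sketch below shows the basic classification described in RFC 6811: a route is "valid" if a covering ROA matches its origin AS and prefix length, "invalid" if covering ROAs exist but none match, and "not-found" if no ROA covers it. This is a simplified illustration, not router code; the `ROA` tuple and `validate_origin` function are hypothetical names, and the prefixes and AS numbers come from the ranges reserved for documentation.

```python
import ipaddress
from typing import List, NamedTuple


class ROA(NamedTuple):
    prefix: str       # prefix authorized by the ROA
    max_length: int   # longest prefix length the origin may announce
    asn: int          # AS number authorized to originate the prefix


def validate_origin(route_prefix: str, origin_asn: int, roas: List[ROA]) -> str:
    """Classify a route as 'valid', 'invalid' or 'not-found'."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # A ROA covers the route if the route sits inside the ROA prefix.
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue
        covered = True
        if roa.asn == origin_asn and route.prefixlen <= roa.max_length:
            return "valid"
    return "invalid" if covered else "not-found"


roas = [ROA("192.0.2.0/24", 24, 64500)]

ok = validate_origin("192.0.2.0/24", 64500, roas)
wrong_origin = validate_origin("192.0.2.0/24", 64501, roas)
too_specific = validate_origin("192.0.2.0/25", 64500, roas)
unknown = validate_origin("198.51.100.0/24", 64500, roas)
```

A network that enforces origin validation typically rejects the "invalid" routes, which is how a hijacked announcement from the wrong AS, or a more specific prefix than the owner authorized, can be kept out of the routing table.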

Tools and automation

We also rely on automation tools that we have built and continue to enhance to detect and react to events that may be disruptive for our customers. Even a small level of packet loss on a link could impair the user experience of our customers, so it is imperative for us to detect it quickly and apply a mitigation.

The Network Reliability Engineering team has built a system that continuously monitors the health of our upstream circuits from multiple vantage points. This system can automatically deploy changes to the network devices so that traffic is steered away from unhealthy circuits and moved smoothly over to the healthy datacenters.

We created distributed software to actively measure all the circuits from each datacenter towards all the ISP circuits on all the other datacenters. This software allows us to have full visibility of the packet loss level and latency over the full-mesh matrix of the circuits we operate.
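
One way to picture the full-mesh matrix is to enumerate every measurement it contains. The Python sketch below does that under one interpretation of the description above: each circuit at a source datacenter probes each ISP circuit at every other datacenter. The function name, site names and circuit names are hypothetical.

```python
from typing import Dict, List, Tuple

# (source datacenter, source circuit, destination datacenter, destination circuit)
Probe = Tuple[str, str, str, str]


def full_mesh_probes(circuits_by_dc: Dict[str, List[str]]) -> List[Probe]:
    """List every probe in the full mesh, skipping probes within a datacenter."""
    probes = []
    for src_dc in sorted(circuits_by_dc):
        for src_circuit in circuits_by_dc[src_dc]:
            for dst_dc in sorted(circuits_by_dc):
                if dst_dc == src_dc:
                    continue
                for dst_circuit in circuits_by_dc[dst_dc]:
                    probes.append((src_dc, src_circuit, dst_dc, dst_circuit))
    return probes


# Hypothetical footprint: three sites with one or two ISP circuits each.
circuits_by_dc = {
    "ams": ["isp-a", "isp-b"],
    "fra": ["isp-a"],
    "lon": ["isp-c", "isp-d"],
}
probes = full_mesh_probes(circuits_by_dc)
probe_count = len(probes)
```

Even this tiny footprint yields 16 probes, and the number grows roughly with the square of the number of circuits, which is one reason such measurements lend themselves to distributed software.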

The data points gathered by this component are used by two other components of the system. The first is an optimizer that automatically steers the site-to-site traffic over the best path between two datacenters. As soon as a path experiences packet loss or has a latency that is not optimal, the tool announces a signal to the network and overrides that path, forcing the traffic over a better-performing circuit.
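
The decision the optimizer makes can be sketched in a few lines of Python. This is an assumed policy for illustration (prefer loss-free circuits, then the lowest latency), not the actual logic of the tool; `PathSample`, `best_path` and `override_for` are hypothetical names, and the step of signaling the override to the network is left out.

```python
from typing import List, NamedTuple, Optional


class PathSample(NamedTuple):
    circuit: str       # hypothetical circuit name
    loss_pct: float    # measured packet loss, in percent
    latency_ms: float  # measured latency, in milliseconds


def best_path(samples: List[PathSample], max_loss_pct: float = 0.0) -> PathSample:
    """Pick the lowest-latency circuit among those within the loss budget.

    If every circuit exceeds the budget, fall back to the least lossy one.
    """
    if not samples:
        raise ValueError("no measurements available")
    healthy = [s for s in samples if s.loss_pct <= max_loss_pct]
    if healthy:
        return min(healthy, key=lambda s: s.latency_ms)
    return min(samples, key=lambda s: (s.loss_pct, s.latency_ms))


def override_for(current: str, samples: List[PathSample]) -> Optional[str]:
    """Return the circuit to force traffic onto, or None if no change is needed."""
    best = best_path(samples).circuit
    return best if best != current else None


samples = [
    PathSample("isp-a", loss_pct=2.0, latency_ms=18.0),
    PathSample("isp-b", loss_pct=0.0, latency_ms=24.0),
    PathSample("isp-c", loss_pct=0.0, latency_ms=21.0),
]
override = override_for("isp-a", samples)
no_change = override_for("isp-c", samples)
```

Here traffic currently on `isp-a` would be moved to `isp-c`: although `isp-a` has the lowest latency, it is losing packets, and `isp-c` is the fastest of the loss-free circuits.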

The second is the Transit Terminator. Unlike the optimizer, which is focused on site-to-site traffic, the Transit Terminator takes care of the traffic to and from the outside of our network, e.g., customers and content/SaaS providers. When we detect packet loss on an ISP link, the Transit Terminator removes it from production so that the traffic can be re-balanced over the remaining links. All these checks and mitigations are performed automatically, in a continuous fashion, without human intervention. In the event of an ISP outage that impacts multiple data centers at the same time, we rely on the Global Shutdown Tool, which we have built and continue to enhance over the years.
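
The kind of decision described here can be sketched as follows. The Python example is a hedged illustration, not the Transit Terminator itself: the loss threshold, the requirement of several consecutive lossy samples and the rule of always keeping at least one link in production are assumptions added to make the example robust, and all names are hypothetical.

```python
from typing import Dict, List


def links_to_drain(
    loss_history: Dict[str, List[float]],
    loss_threshold_pct: float = 1.0,  # assumed threshold
    consecutive: int = 3,             # assumed number of lossy samples required
    min_links: int = 1,               # assumed safety floor of links kept in service
) -> List[str]:
    """Return the ISP links to remove from production, worst first."""
    lossy = [
        name
        for name, samples in loss_history.items()
        if len(samples) >= consecutive
        and all(s > loss_threshold_pct for s in samples[-consecutive:])
    ]
    # Drain the links with the highest most recent loss first.
    lossy.sort(key=lambda name: loss_history[name][-1], reverse=True)
    max_removals = max(len(loss_history) - min_links, 0)
    return lossy[:max_removals]


# Hypothetical packet loss samples (percent), oldest first.
history = {
    "isp-a": [0.0, 0.0, 0.0],
    "isp-b": [2.0, 3.5, 4.0],
    "isp-c": [0.0, 5.0, 0.0],
}
drained = links_to_drain(history)
```

In this example only `isp-b` is drained: `isp-c` showed a single lossy sample and recovered, so it stays in production and absorbs its share of the re-balanced traffic.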

Criticality of user experience

While the aforementioned tools provide great value in the mitigation of wide infrastructure-level issues, another area of focus is on how to improve the end user’s experience and network resiliency on a more granular basis. To achieve that, we built additional tooling that detects and mitigates issues on a smaller scale. This tool, still under active development, already offers an automatic outbound congestion detection and mitigation mechanism.

No matter how over-provisioned the upstream circuits are, unpredictable social or tech-related events can lead to spikes of traffic being pushed through those links. When that happens, the utilization level of those circuits might reach warning thresholds, and the circuits could become congested, causing contention and packet loss. This is where the traffic engineering tool kicks in: it automatically detects when the usage level of a circuit goes above a given warning threshold. The tool then performs an analysis that leads to a re-routing proposal for certain IP prefixes. After the proposal is reviewed and approved by one of our engineers, the tool automatically injects traffic engineering routes into the network, which reduces the traffic on the original circuit and helps avoid saturation and potential degradation of the end-user experience.
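
The flow above (detect, propose, approve, apply) can be sketched in Python as shown below. The greedy choice of the largest prefixes, the 80% warning threshold and the 70% target are assumptions made for the example, since the post does not describe how the proposal is computed; the prefixes are taken from the documentation ranges.

```python
from typing import Dict, List


def propose_reroute(
    capacity_bps: float,
    prefix_traffic_bps: Dict[str, float],
    warning_threshold: float = 0.8,   # assumed warning level
    target_utilization: float = 0.7,  # assumed level to return to
) -> List[str]:
    """Propose prefixes to move off a circuit whose usage crossed the threshold."""
    total = sum(prefix_traffic_bps.values())
    if total / capacity_bps < warning_threshold:
        return []  # below the warning threshold: nothing to propose
    excess = total - target_utilization * capacity_bps
    proposal, moved = [], 0.0
    # Greedy choice: move the largest prefixes first to touch as few as possible.
    by_volume = sorted(prefix_traffic_bps.items(), key=lambda kv: kv[1], reverse=True)
    for prefix, bps in by_volume:
        if moved >= excess:
            break
        proposal.append(prefix)
        moved += bps
    return proposal


def routes_to_inject(proposal: List[str], approved: bool) -> List[str]:
    """Nothing is injected unless an engineer has approved the proposal."""
    return list(proposal) if approved else []


# Hypothetical outbound traffic per prefix on a 10 Gbit/s circuit.
traffic = {
    "203.0.113.0/24": 4e9,
    "198.51.100.0/24": 3e9,
    "192.0.2.0/24": 2e9,
}
proposal = propose_reroute(10e9, traffic)
pending = routes_to_inject(proposal, approved=False)
applied = routes_to_inject(proposal, approved=True)
```

With the circuit at 90% utilization, moving the single largest prefix is enough to bring it back under the target, and the route is only injected once the proposal has been approved.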

Future enhancements

We are always looking for ways to improve network resiliency and the quality of the Umbrella infrastructure overall. One area we are evaluating is the ability to detect sub-optimal performance for traffic destined to specific hosts. While other tools can detect issues on our transit provider links, this tool will be able to determine which ISP is the best one to use to push traffic toward a customer and enforce its usage on the network. This tool will help us go even deeper into the resiliency offered by our network so that we can uncover and mitigate issues that are specific to just a subset of destinations.

Stay tuned for more details in future blogs.
