At Umbrella, we choose to expect the unexpected and continue to improve from the lessons learned. We plan for failures as an unavoidable natural occurrence. This mindset is critical in allowing us to build a resilient infrastructure that guarantees the highest uptime and the best experience for our customers.
To achieve these goals, we invest time and effort into a global resiliency strategy founded on multiple layers: a worldwide footprint of more than 40 hyperconnected data centers, a network that can automatically detect issues and work around them, and services designed to fail over gracefully – all without impairing the end-user experience.
In this blog, we’ll cover some aspects related to how resiliency is implemented at the infrastructure level, focusing on the network.
Anycast, multiple data centers and service discovery
At a high level, the entire infrastructure and service stack is based on the key principle that any single datacenter may go down at any time. This concept of an “ephemeral infrastructure” allows Umbrella teams to design services that have resiliency hard-coded into their DNA.
The extensive use of anycast, backed by a worldwide presence of data centers, allows the design of applications that leverage dynamic service discovery mechanisms. This helps us determine the best datacenter to use for a given customer, among those available at that time. When a datacenter is removed from production (either for maintenance or due to an unplanned event), applications perform service discovery again and land the user at a different datacenter.
This combination of dynamic service discovery mechanisms on the application side and smart orchestrators on the backend allows the implementation of services that can quickly detect the unavailability of a specific datacenter and fail over to one of our other data centers. Here’s a more detailed description of the key elements of our network resilience strategy.
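As an illustration, the failover behavior described above can be sketched as a minimal client-side routine. The hostnames, port, and plain TCP health probe below are hypothetical placeholders, not Umbrella’s actual discovery protocol:

```python
# Minimal sketch of application-side service discovery with failover.
# Endpoint names and the TCP-connect health check are illustrative
# assumptions, not Umbrella's real implementation.
import socket

DATACENTERS = ["dc1.example.net", "dc2.example.net", "dc3.example.net"]

def is_healthy(host: str, port: int = 443, timeout: float = 1.0) -> bool:
    """Probe an endpoint with a plain TCP connect as a minimal health check."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def discover(datacenters=DATACENTERS, healthy=is_healthy) -> str:
    """Return the first healthy datacenter in preference order.

    Clients re-run this whenever the datacenter they are using becomes
    unreachable, which is what makes the failover transparent.
    """
    for dc in datacenters:
        if healthy(dc):
            return dc
    raise RuntimeError("no datacenter available")
```

The key property is that `discover()` is cheap to re-run on every failure, so a datacenter disappearing simply moves the client to the next available site.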
Our data centers are connected to the rest of the Internet via multiple upstream connections. At most of our data centers, we connect to at least three different Internet Service Providers (ISPs) separately, along with one or more connections to local Internet Exchanges (IXes).
We select ISPs from a menu of global tier-1 providers present at the facility where our datacenter is located. Tier-1 providers are the large players in the Internet ecosystem that are well connected to all other networks and do not have to buy transit to obtain a complete Internet routing table.
We put meticulous care into the selection of ISPs when we build a datacenter. For example, to meet our resiliency criteria, we perform a thorough analysis of the physical paths the fibers would need to traverse to connect our devices to an ISP. We then select the combination of ISPs that guarantees us the best diversity of paths. For the network, the first layer of resiliency is on-the-ground path diversity.
Another important aspect of the ISP selection process is the analysis of how well connected each provider’s network is. This process is not based on public data points, but rather on internal tracking and tooling, as well as inferences and estimates made by our expert senior network engineers. Through it, we strive to understand how well an ISP is connected to the local access network providers that our customers use in the region.
This preliminary analysis helps us understand how many different potential paths we would be able to offer end-users to connect to our network, and the quality of those paths. If we detect an issue in one of them, we can re-route the traffic over a different path with no impact on the users’ experience. The outcome of this analysis is a picture of how each ISP available at a facility would improve the diversity of the connectivity between our datacenter and our customers.
Last, but not least, is the capacity we provision for our upstream connections. For instance, we want to be able to absorb DDoS attacks without our customers even noticing them. Hence, we buy capacity that allows the network to handle those situations transparently. We monitor the utilization of our links constantly and proactively start upgrading them when safety thresholds are reached. This continuous focus on expanding our upstream circuits as demand grows allows us to operate a network with many terabits per second of global capacity. We also perform regular fire drills to test our mitigation procedures and stay prepared to react quickly when such events happen.
Another key component of our network resiliency model is the wide presence of peering relationships across the IXes we join. We adopt an extensive high-quality peering policy that allows us to dramatically increase the diversity and stability of Internet paths towards our end-users and towards the major content providers and SaaS applications. We spend considerable time analyzing the quality that every peering relationship would bring us before we accept or propose one. The outcome is a mesh of high-quality paths between our data centers and customers, which guarantees the ability to perform smooth fail-over of traffic with no impact on the customer experience.
We also evaluate resilience based on routing security. Every IP network connected to the Internet may be subject to routing incidents that can potentially lead to traffic disruption. Malicious activities or even unintentional human errors on the wider Internet lead, almost on a daily basis, to situations where traffic destined for a specific destination (a customer, or a content provider) is hijacked and routed somewhere else. The Umbrella network implements several best practices to reduce this eventuality to a minimum, such as RPKI Origin Validation, which helps ensure packets flow to and from our customers as intended.
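To make the mechanism concrete, the sketch below implements the RFC 6811 origin-validation decision (valid / invalid / not-found) against a hypothetical ROA table. The example prefix and ASN are documentation placeholders, not real routing data:

```python
# Sketch of RFC 6811-style RPKI Origin Validation. The ROA table below is
# a hypothetical example; real validators fetch ROAs from RPKI repositories.
import ipaddress

# Each ROA: (authorized prefix, maxLength, authorized origin ASN)
ROAS = [
    ("192.0.2.0/24", 24, 64500),
]

def rpki_validate(prefix: str, origin_asn: int, roas=ROAS) -> str:
    """Classify a BGP announcement as 'valid', 'invalid', or 'not-found'."""
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_len, roa_asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        if net.version == roa_net.version and net.subnet_of(roa_net):
            covered = True  # at least one ROA covers this prefix
            if net.prefixlen <= max_len and origin_asn == roa_asn:
                return "valid"
    # Covered by a ROA but matching none -> likely hijack or misconfiguration
    return "invalid" if covered else "not-found"
```

A network dropping routes that validate as `invalid` is what prevents a hijacked announcement from pulling customer traffic away from its legitimate origin.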
Tools and automation
We also rely on automation tools that we have built and continue to enhance to detect and react to events which may be potentially disruptive for our customers. Even a small level of packet loss on a link could impair the user experience of our customers, so it is imperative for us to quickly detect and apply a mitigation accordingly.
The Network Reliability Engineering team has built a system that continuously monitors the health of our upstream circuits from multiple vantage points. This system can automatically deploy changes to the network devices so that traffic is steered away from unhealthy circuits and smoothly moved to healthy data centers.
We created distributed software to actively measure all the circuits from each datacenter towards all the ISP circuits on all the other datacenters. This software allows us to have full visibility of the packet loss level and latency over the full-mesh matrix of the circuits we operate.
The data points gathered by this component are used by two other components of the system, one being an optimizer that automatically steers the site-to-site traffic over the best path between two data centers. As soon as a path experiences packet loss, or has a latency that is non-optimal, the tool injects a signal into the network that overrides that path, forcing the traffic over a better-performing circuit.
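A minimal version of that selection logic could look like the following sketch. The loss threshold and the preference for loss-free, lowest-latency circuits are illustrative assumptions, not the actual optimizer’s policy:

```python
# Illustrative site-to-site path selection: prefer circuits with no
# meaningful packet loss, then the lowest latency among those.
# The threshold value is an assumption for the example.
LOSS_THRESHOLD = 0.001  # 0.1% packet loss

def pick_circuit(measurements: dict) -> str:
    """measurements maps circuit name -> (loss_fraction, latency_ms).

    Returns the circuit the optimizer would steer traffic onto.
    """
    clean = {c: m for c, m in measurements.items() if m[0] <= LOSS_THRESHOLD}
    # If every circuit is lossy, fall back to the least-bad one.
    candidates = clean or measurements
    return min(candidates, key=lambda c: (candidates[c][0], candidates[c][1]))
```

Run continuously against the full-mesh measurement data, this kind of decision is what lets traffic move off a degraded path before users notice.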
The second is the Transit Terminator. Unlike the optimizer, which is focused on site-to-site traffic, the Transit Terminator takes care of the traffic to and from the outside of our network, e.g., customers and content/SaaS providers. When we detect packet loss on an ISP link, the Transit Terminator removes it from production, so that the traffic can be re-balanced over the remaining links. All these checks and mitigations are performed automatically, in a continuous fashion, without human intervention. In the event of an ISP outage that impacts multiple data centers at the same time, we rely on the Global Shutdown Tool, which we have built and continue to enhance over the years.
Criticality of user experience
While the aforementioned tools provide great value in the mitigation of wide infrastructure-level issues, another area of focus is on how to improve the end user’s experience and network resiliency on a more granular basis. To achieve that, we built additional tooling that detects and mitigates issues on a smaller scale. This tool, still under active development, already offers an automatic outbound congestion detection and mitigation mechanism.
Regardless of how much our upstream circuits are over-provisioned, unpredictable social or tech-related events can lead to spikes of traffic being pushed through those links. When that happens, the utilization of those circuits might reach warning thresholds, and they could potentially become congested, causing contention and packet loss. This is when the traffic engineering tool kicks in: it automatically detects when the usage level of a circuit goes above a given warning threshold and performs an analysis that leads to a re-routing proposal for certain IP prefixes. After the proposal is reviewed and approved by one of our engineers, the tool automatically injects traffic engineering routes into the network, which reduces the traffic on the original circuit and helps avoid saturation and potential degradation of the end-user experience.
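A simplified version of that detection-and-proposal step might look like this. The 80% warning threshold and the heaviest-prefixes-first heuristic are assumptions for illustration, and the returned list models the proposal that goes to an engineer for review:

```python
# Illustrative congestion detection: if a circuit's utilization exceeds a
# warning threshold, propose moving the heaviest prefixes elsewhere until
# utilization would fall back under it. Threshold and heuristic are assumed.
WARNING_UTILIZATION = 0.80

def reroute_proposal(capacity_bps: float, prefix_traffic: dict) -> list:
    """prefix_traffic maps IP prefix -> current bps on the circuit.

    Returns the prefixes proposed for re-routing (empty if no congestion).
    """
    total = sum(prefix_traffic.values())
    if total / capacity_bps <= WARNING_UTILIZATION:
        return []  # circuit is within its safety margin
    excess = total - WARNING_UTILIZATION * capacity_bps
    proposal, moved = [], 0.0
    # Move the heaviest prefixes first, so few routes need to change.
    for prefix, bps in sorted(prefix_traffic.items(), key=lambda kv: -kv[1]):
        if moved >= excess:
            break
        proposal.append(prefix)
        moved += bps
    return proposal
```

Keeping the human approval step between detection and route injection is a deliberate design choice for a tool still under active development.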
We are always looking for ways to improve network resiliency and the quality of the Umbrella infrastructure overall. One area we are evaluating is the ability to detect sub-optimal performance of traffic destined for specific hosts. While our other tools can detect issues on our transit provider links, this tool will be able to determine which ISP is the best to use to push traffic toward a customer and enforce its use on the network. It will push even deeper into the resiliency offered by our network, so that we can uncover and mitigate issues that are specific to just a subset of destinations.
Stay tuned for more details in future blogs.