One of the key challenges for OpenDNS (now part of Cisco) is handling a massive volume of DNS queries and simultaneously running classification models on them as fast as possible. Today, we’re going to talk about Avalanche, a real-time data processing framework currently used in our research cluster.
First, we have to run some numbers to evaluate the scale of our requirements and make smart architecture design decisions. Second, we will look at other technical fields (such as quantitative trading in finance) that face very similar problems, and see whether we can find common ground. Finally, we will present some details and key elements of the Avalanche project.
Evaluating the traffic
Before jumping into any design or implementation, we need to take a look at the amount of traffic that OpenDNS sees during slow and peak hours every day. I decided to look at the log collection from one of our resolvers, located in Amsterdam. This resolver is composed of several machines that handle DNS queries, of which we’re going to consider only one (m1). It also sees various types of traffic, but here we will focus only on authoritative traffic (Authlogs) and recursive traffic (Querylogs). It is important to mention that the Amsterdam resolver, despite its significant amount of traffic, is neither the biggest nor the smallest of our resolvers.
Let’s take a look:
Amsterdam being at GMT+1, noon is the peak moment of the day, whereas the traffic around midnight is typically very slow. The upper part of the table shows the amount of data contained in each log chunk. Important to note here: each resolver machine produces one chunk every 10 minutes. The lower part of the table shows the same information converted to queries per second.
- For Authlogs, we observe a variation of 686.75 to 941.25 queries per second.
- For Querylogs, we observe a variation of 5525.26 to 10246.66 queries per second.
Put differently, authlogs peak at one message every 1.062 milliseconds, whereas querylogs peak at one message every 97 microseconds. These figures give a better understanding of the traffic volume and the precision level we’re dealing with here. Again, keep in mind that we have many more resolvers around the world, each composed of several machines averaging around these numbers.
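For the curious, here is a minimal back-of-the-envelope sketch of that conversion in Python. The per-chunk line counts are simply back-derived from the peak rates quoted above (one chunk covers 600 seconds) and are only illustrative:

```python
# Back-of-the-envelope conversion from 10-minute log chunks to rates.
CHUNK_SECONDS = 600  # each resolver machine produces one chunk every 10 minutes

def rates(lines_per_chunk):
    qps = lines_per_chunk / CHUNK_SECONDS
    interval_us = 1_000_000 / qps  # average gap between two consecutive messages
    return qps, interval_us

# Illustrative peak chunk sizes, back-derived from the rates quoted above.
for name, lines in [("authlogs peak", 564_750), ("querylogs peak", 6_147_996)]:
    qps, gap = rates(lines)
    print(f"{name}: {qps:.2f} q/s, one message every {gap:.1f} µs")
```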
Analogy with the finance world
Knowing now that we have to build a log processor that will handle hundreds of data centers and apply active classification/decision techniques at the microsecond level on each one, it’s pretty hard not to see a parallel with quantitative trading in finance (also known as algorithmic trading).
In fact, the similarity is pretty striking. We see volumes of queries; financial analyst firms see volumes of trades. We have to deal with time series; they do too. Our data is centered around domain names (strings), and they have ticker symbols. Our logs contain additional data about the queries (Authlogs and Querylogs have different information); they see additional info about each trade. Our job consists of analyzing traffic patterns to detect anomalies and apply enforcement decisions and actions to Internet traffic; they do the same thing to compute investment strategies. They run backtesting on simulations to measure the efficacy of their strategies; we replay historical logs to do pretty much the same (confusion matrix, hit rates, etc.). And finally, financial firms use external indicators (e.g., sentiment analysis on the news), while we take into account third-party APIs to cross-check our results.
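To make that replay/backtesting parallel a bit more concrete, here is a minimal sketch of streaming labeled historical log entries through a candidate classifier and tallying a confusion matrix. The `classify` function and the labeled tuples are hypothetical placeholders, not our actual models or log format:

```python
from collections import Counter

def classify(domain):
    # Toy stand-in for a real model: flag suspiciously long first labels.
    return "malicious" if len(domain.split(".")[0]) > 20 else "benign"

def replay(labeled_log):
    # Tally (truth, prediction) pairs over the historical log.
    counts = Counter()
    for domain, truth in labeled_log:
        counts[(truth, classify(domain))] += 1
    tp = counts[("malicious", "malicious")]
    fn = counts[("malicious", "benign")]
    hit_rate = tp / (tp + fn) if (tp + fn) else 0.0
    return counts, hit_rate

log = [("example.com", "benign"),
       ("xkjqhzpwretyunvmlodfa.biz", "malicious")]
matrix, hits = replay(log)
print(matrix, hits)
```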
Obviously, there are also differences. We see a virtually infinite number of domains; they have a limited number of stock symbols. We apply classification algorithms to the domain name itself (e.g., DGAs, typo-squatting…), whereas for financial institutions it’s not really relevant to do so. They also don’t really have visibility over the originators of a trade; we see the client IPs and ASNs in our logs. However, despite these differences, it’s important to realize that many modern companies managing heavy traffic loads have to solve very similar problems, involving very similar infrastructure, and will therefore make comparable design decisions.
Choosing a robust messaging system
The foundation of our architecture relies on choosing the right messaging library. This will be a key factor in determining how well our data processing pipeline performs. I will spare you the details of a descriptive comparison of all the available technologies, but ZeroMQ is one of the best messaging middleware options available, and it is very well known in the finance world. It also provides an excellent paradigm for building a distributed system with different message-passing patterns.
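As an illustration of those message-passing patterns, here is a minimal PUSH/PULL pipeline stage using pyzmq. The addresses and the in-process producer/consumer layout are just a sketch, not Avalanche’s actual topology:

```python
import zmq

ctx = zmq.Context()

# A log source fans work out over a PUSH socket...
producer = ctx.socket(zmq.PUSH)
producer.bind("tcp://127.0.0.1:5557")

# ...and a worker pulls log lines to classify over a PULL socket.
consumer = ctx.socket(zmq.PULL)
consumer.connect("tcp://127.0.0.1:5557")

producer.send_string("example.com A 192.0.2.1")
print(consumer.recv_string())
```

With multiple PULL workers connected to the same endpoint, ZeroMQ load-balances messages across them, which is a natural fit for spreading log chunks over classification workers.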
During my analysis, some important metrics caught my attention: