At OpenDNS, terabytes of data flow in and out everyday. It takes creativity and solid data science skills to innovate ways to transform an enormous amount of data into security discoveries.
In a series of blogs, we will share algorithmic details of our home-grown discovery algorithms that leverage big data, machine learning and graph theory to make predictive detections.
We start off with SecureRank, which has been presented at a couple security community meet-ups recently. If you haven’t seen it, or prefer video to text, you can watch the SecureRank presentation at this YouTube link.
Conceptually, a good abstraction of DNS traffic is a bipartite graph. It’s huge, even if you are looking at a small fraction of the Internet. The left set of vertex consists of the client machines (requestors) making DNS requests; the right set of vertex are the hostnames (requestees) that are requested.
So how does a DNS bipartite graph give us an edge in making security discoveries? Consider, for example, the websites Google.com and example.evil.com (a made-up botnet C&C hostname). Imagine that Google.com is much more popular than example.evil.com, as the Google.com vertex appears more densely connected on the graph. The popularity rank can be inferred using page rank-like algorithms. However, popularity alone is a not a good indicator of maliciousness.
What about layering the graph with existing knowledge? There are known bad actors on the Internet, as well as websites we know to be clean. Imagine that we color our graph with different shades of “red” to indicate the bad ones, while “green” for the good ones. We’re likely to see clusters of ‘red’ zones and they are quite separate from “green” zones with few intra links.
We generalize the above rationale and observations as one hypothesis “hostnames requested by known infected clients but never requested by clean clients are most likely to be bad.” Based on this hypothesis, SecureRank ranks the security risks of all hostnames by applying an iterative process similar to the page rank algorithm in the following steps.
Link analysis: This involves marching through global DNS query data and mapping the requestor-requestee pairs as a graph.
Initialization: This gives negative ranks to known blacklisted domains and assigns positive ranks to known whitelisted domains.
SecureRank iterations: This concept is illustrated in the simple example below.
Convergence and output final ranks: Final ranks are generated when the ranks converge after a number of iterations.
Implementation: The algorithm is implemented in Java following Hadoop MapReduce framework. Details can be found in the YouTube video mentioned earlier.
Empirical results evaluation
Two metrics were used to evaluate the results: false positive rate and false negative rate. The data sets we used (as the closest thing to gold standard positive set and negative set) are composed of a list of 670,199 known malicious sites and the Quantcast top million sites.
Our hypothesis has stated that domains that receive low SecureRank, in particular, in the negative value range, are more likely to be malicious. At the same time, malicious sites should more likely receive a negative rank, but not necessarily (consider compromised legitimate sites).
The resulting SecureRanks of 83% of the blacklisted sites are negative.
530 out of the Quantcast top million sites (potential false positive rate = 0.053%) received negative ranks in a particular time window. Further scrutiny of these potential false positives showed that: 1) these sites are on the left end axis of negative values, all in the range of [-0.099,0). 2) Forty five sites were also flagged by Google safe browsing. 3) The majority of these sites are parked domains or spam/scam suspects. None of the sites can be deemed as significant false positives.
SecureRank in action!
SecureRank is integrated into the Umbrella Security Graph. When a query has an alarmingly low SecureRank, an alert bar is shown.
Using SecureRank in our back-end classification engine, we’re alerted to 200,000+ likely malicious domains each day. The majority of these sites exist in the gray to black spectrum of the Internet, meaning they include spam, scams, botnets or malware. The large-scale discovery technique is used in combination with our other intelligence and predictive algorithms to preventively protect OpenDNS customer from known and unknown security risks.