DNS is an essential infrastructure service on the Internet, and is used by both benign and malicious applications. As security products shift from signature-based detection to predictive models, DNS traffic has become an important indicator for discovering infected clients as well as servers used to send spam, spread malware and control botnets.
DNS traffic is especially well-suited for botnet detection; this has been the subject of many research papers featuring case studies with impressive results.
One of the very first models we built was a simple classifier using a dozen network and lexical features inspired by these papers. Its performance was impressive out of the box: on our test data, false positives were nonexistent.
A paper that was presented recently at IFIP TC11 caught our attention. Called “Mentor: Positive DNS Reputation to Skim-Off Benign Domains in Botnet C&C Blacklists” by Nizar Kheir, Frederic Tran, Pierre Caron, and Nicolas Deschamps, the paper describes creating a model to identify false positives in a list of command and control domains, using the whois data, the popularity and the HTML code on the first page of the website. In the specific case study described in the paper, this classifier performs extremely well — perhaps too well.
However, when our research team applied this model to our global, real-world data set, it did not perform well. We discovered that the model was overfitting its training data: the authors built their benign set from the Alexa Top Sites list, which contains only the most popular websites, so the reported results were skewed toward that sample.
To understand why such models do not perform well on real-world data, we have to look at how they were trained and tested.
Training and testing a classifier requires a set of domain names assumed to be malicious, as well as a set of domain names assumed to be benign. Public feeds, lists and services are typically used to build the malicious data set.
However, building a list of domains believed to be benign is more challenging. A common practice is to assume that popular websites are likely to be benign, and thus many researchers use Alexa top sites to constitute the benign training set. Alexa is a fantastic service, and offering a list of the top 1,000,000 websites for free has proven to be very valuable to many studies and projects. However, it might not be the best choice for building classifiers based on DNS data, because domain names are used for more than websites. Mobile applications, APIs, and mail servers are not necessarily present in the Alexa list, yet they can’t be ignored.
For example, we observe a massive number of DNS queries for these domain names even though they are missing from the Alexa list:
- windowsupdate.com (Microsoft Windows update service)
- trafficmanager.net (Microsoft Azure Traffic Manager)
- adobetag.com (Adobe Analytics platform)
- samsungosp.com (Samsung push server API)
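A check like this one is easy to script. The following is only a minimal sketch, assuming the observed names and the top sites list are already loaded as plain lists of strings; the function names are hypothetical and the tiny top sites list in the usage note is made up. A queried name is treated as covered when it, or any parent domain above the bare TLD, appears in the list.

```python
def covered_by_top_sites(name, top_sites):
    """Return True if `name` or one of its parent domains is in `top_sites`.

    `top_sites` is assumed to be a set of lowercase domain names.
    """
    labels = name.lower().rstrip(".").split(".")
    # Try "a.b.c", then "b.c"; stop before the bare TLD ("c").
    return any(".".join(labels[i:]) in top_sites for i in range(len(labels) - 1))


def missing_from_top_sites(observed, top_sites):
    """Return the observed names that the top sites list does not cover."""
    top = {d.lower().rstrip(".") for d in top_sites}
    return sorted(d for d in observed if not covered_by_top_sites(d, top))
```

For instance, with a made-up top sites list of `["google.com", "example.org"]`, calling `missing_from_top_sites(["www.google.com", "windowsupdate.com"], ...)` keeps only `windowsupdate.com`, since `www.google.com` is covered by its parent domain.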
Domains from the Top Alexa list also tend to have very distinctive features. They typically use a dedicated hosting infrastructure, services are well-separated, domain names have been registered for a long time, and they are popular by different metrics (e.g. traffic, client diversity, back links, page rank).
In contrast, traditional command and control servers tend to use hosting services or run different services on the same host. They frequently use newly registered domains, and very few, if any, references to traditional command and control servers are observed on other websites.
In this context, a classifier leveraging some of these features is bound to perform extremely well until it is tested with non-Alexa domains or with compromised domains. Domain names not present in the Alexa list of most popular domains are not necessarily malicious. Fortunately, the vast majority are not.
Thus, training and testing classifiers and reputation systems using non-Alexa domains is critical to performing a real-world evaluation of a model.
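As a minimal sketch of such an evaluation, the false positive rate can be measured separately on a popular benign set and on a benign set drawn from elsewhere. Everything here is an assumption made for illustration: `toy_predict` is a hypothetical stand-in for a real model, and the `age_days` records are made up rather than taken from our data.

```python
def false_positive_rate(predict, benign_samples):
    """Fraction of benign samples that `predict` flags as malicious."""
    if not benign_samples:
        raise ValueError("benign_samples must not be empty")
    flagged = sum(1 for sample in benign_samples if predict(sample))
    return flagged / len(benign_samples)


def toy_predict(sample):
    """Hypothetical model: flag any domain registered less than a year ago."""
    return sample["age_days"] < 365


# Made-up records: long-registered popular sites vs. other benign domains.
popular_benign = [
    {"domain": "a.example", "age_days": 5000},
    {"domain": "b.example", "age_days": 4200},
]
other_benign = [
    {"domain": "c.example", "age_days": 30},
    {"domain": "d.example", "age_days": 2000},
]

fpr_popular = false_positive_rate(toy_predict, popular_benign)
fpr_other = false_positive_rate(toy_predict, other_benign)
```

In this toy setup the model looks perfect on the popular set and flags half of the other benign set, which is the kind of gap a test limited to popular domains would hide.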
Unfortunately for the research community, there are not many public alternatives to Alexa. Quantcast also offers a high-quality list of top sites for free, but the same caveats of selection bias apply.
Since 2009, OpenDNS has been providing a high-quality list of phishing sites for free: the PhishTank database. It is constantly being updated, can be downloaded in different formats, and is trusted by many research projects, products, and services.
Today, we are thrilled to announce the public availability of two new lists of domain names for researchers.
The OpenDNS Top Domains List contains the 10,000 domain names that our resolvers all over the globe receive the most queries for, sorted by popularity. Popularity is defined as the number of unique client IPs that looked up a domain over a one-hour period. Domain names that we flagged as being used to serve or control malware are removed from the list.
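The popularity metric can be sketched in a few lines. This is only an illustration, assuming query logs are available as `(timestamp, client_ip, domain)` tuples with timestamps in seconds; the function name, the log format and the tie-breaking rule are assumptions, not a description of our production pipeline.

```python
from collections import defaultdict


def top_domains(queries, window_start, window_seconds=3600,
                flagged=frozenset(), limit=10000):
    """Rank domains by the number of unique client IPs seen in one window.

    `queries` is an iterable of (timestamp, client_ip, domain) tuples.
    Domains in `flagged` are left out of the ranking entirely.
    """
    window_end = window_start + window_seconds
    clients = defaultdict(set)
    for timestamp, client_ip, domain in queries:
        if window_start <= timestamp < window_end and domain not in flagged:
            # A set counts each client once, however often it repeats a query.
            clients[domain].add(client_ip)
    # Most unique clients first; ties are broken alphabetically (an assumption).
    ranked = sorted(clients, key=lambda d: (-len(clients[d]), d))
    return [(d, len(clients[d])) for d in ranked[:limit]]
```

Counting unique client IPs rather than raw queries keeps a single noisy client from pushing a domain up the ranking.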
The OpenDNS Random Sample List is a random sample of 10,000 domain names. As with the OpenDNS Top Domains List, domains that we flagged as suspicious are not present in the list, so it can be used as a benign data set.
Both lists are in the public domain, available on GitHub, and automatically updated weekly.
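A sample of this kind could be drawn as follows. This is a minimal sketch under stated assumptions: the observed domains and the flagged domains are available as plain collections of strings, and the function name and `seed` parameter are hypothetical additions that make a run reproducible.

```python
import random


def random_benign_sample(domains, flagged, size=10000, seed=None):
    """Draw up to `size` distinct domains, skipping any flagged ones."""
    # Sort so that a given seed yields the same sample on every run.
    candidates = sorted(set(domains) - set(flagged))
    rng = random.Random(seed)
    return rng.sample(candidates, min(size, len(candidates)))
```

Sampling uniformly over distinct domains, rather than over queries, gives rarely queried names the same chance of being selected as popular ones.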
They are not meant to replace other public lists in any way. The Top Domains List, in particular, is solely based on DNS queries and doesn’t reflect the popularity of websites.
However, these lists have been useful to train, test and improve our own models, and by sharing them, we hope that other researchers will find them a useful addition as well.