[load-javascript slug=”mathjax”]
The infection chain for serving a single piece of malware is frequently made of many, constantly-changing domains. The security community finds thousands of new sites serving malware or acting as intermediaries every day. Hosts used to control botnets are also constantly changing in order to be resilient to takedowns.
In this context, we need to discover and block new suspicious domains as soon as possible. In order to do so, we use different models, each of them capturing different sets of domains. Once we have evidence of a server distributing malware or acting as a command-and-control server, the first thing we usually do is try to find other domains used, or soon-to-be-used, by the same malware.
From DNS queries to a discovery algorithm
Let’s take a look at the data we have, and how we use it, from a given domain name, to discover new and possibly related domains.
The log files we get from DNS resolvers are unstructured text files, containing the responses we send to clients (in this snippet, IPs have been made up):
2013-07-18 13:59:26.397060500 9.2 74.234.12.159 74.234.12.159 208.67.222.222 normal 0 - www.youtube.com. 1 0 - 0 800000 0 com.youtube m1.wrw 2013-07-18 13:59:26.397062500 9.2 74.234.14.200 74.234.14.200 208.67.222.222 normal 0 - dnl-01.geo.kaspersky.com. 1 0 - 0 102000000 0 com.kaspersky m1.wrw 2013-07-18 13:59:26.397063500 9.2 74.231.231.2 74.231.231.2 208.67.222.222 normal 0 - jelcz.sexibl.com. 1 0 - 0 0 0 com.sexibl m1.wrw 2013-07-18 13:59:26.397065500 9.2 47.136.196.21 47.136.196.21 208.67.222.222 normal 0 - outerlink.com. 15 0 - 0 0 0 com.outerlink m1.wrw 2013-07-18 13:59:26.397066500 9.2 93.47.212.8 93.47.212.8 208.67.222.222 normal 0 - aol.com. 15 0 - 0 80A40000 0 com.aol m1.wrw 2013-07-18 13:59:26.397085500 9.2 91.74.34.126 91.74.34.126 208.67.222.222 normal 0 - img.zszywka.pl. 1 0 - 0 0 0 pl.zszywka m1.wrw 2013-07-18 13:59:26.397086500 9.2 122.162.60.1 122.162.60.1 208.67.222.222 normal 0 - rtb.metrigo.com. 1 0 - 0 0 0 com.metrigo m1.wrw 2013-07-18 13:59:26.397087500 9.2 213.199.198.182 213.199.198.182 208.67.220.220 normal 0 - www.ogame.com.ar. 1 0 - 0 800 0 ar.com.ogame m1.wrw 2013-07-18 13:59:26.397088500 9.2 213.199.198.182 213.199.198.182 208.67.222.222 normal 0 - www.ogame.com.ar. 1 0 - 0 800 0 ar.com.ogame m1.wrw 2013-07-18 13:59:26.397092500 9.2 176.111.228.151 176.111.228.151 208.67.222.222 normal 0 - osne.ws. 1 0 - 0 0 0 ws.osne m1.wrw 2013-07-18 13:59:26.397093500 9.2 92.61.45.119 92.61.45.119 208.67.222.222 normal 0 - alt3.gmail-smtp-in.l.google.com. 1 0 - 0 800000 0 com.google m1.wrw 2013-07-18 13:59:26.397094500 9.2 47.136.211.87 47.136.211.87 208.67.220.220 normal 0 - mx.dca.untd.com. 1 0 - 0 102000000 0 com.untd m1.wrw 2013-07-18 13:59:26.397099500 9.2 47.136.211.87 47.136.211.87 208.67.222.222 normal 0 - yahoo.com. 15 0 - 0 800000 0 com.yahoo m1.wrw 2013-07-18 13:59:26.397100500 9.2 93.127.85.16 93.127.85.16 208.67.222.222 normal 0 - mx-1.naver.com. 1 0 - 0 1000840000 0 com.naver m1.wrw 2013-07-18 13:59:26.397101500 9.2 91.253.53.128 91.253.53.128 208.67.220.220 normal 0 - 2.s.dziennik.pl. 1 0 - 0 40000 0 pl.dziennik m1.wrw 2013-07-18 13:59:26.397102500 9.2 74.234.15.41 74.234.15.41 208.67.220.220 normal 0 - ssl.gstatic.com. 1 0 - 0 0 0 com.gstatic m1.wrw 2013-07-18 13:59:26.397105500 9.2 37.47.206.18 37.47.206.18 208.67.222.222 normal 0 - nickel.champlain.edu. 1 0 - 0 200000000 0 edu.champlain m1.wrw
Basically, the only information we have for each query is an approximate timestamp, a client IP address, a query type and the name.
Unlike web sites, there are no explicit links between resources. And unlike the HTTP protocol, the DNS protocol doesn’t provide any tracking information like refer(r)ers or cookies.
An intuition, though, is that temporal proximity can be used to predict how related two domain names might be.
Cleaning the data
While data preparation might not sound like the most exciting step of an algorithm, it actually plays a critical role in this case.
Remember that we don’t have any reliable way to identify the device that sent a query. All we see is an IP address. But home users are frequently assigned a dynamic IP address.
In addition, many devices sitting behind a router can share a single external IP address. In this case, we can see many queries within a short time window, that are totally unrelated to each other.
We are also observing client IP addresses sending a lot of queries in a row for a small number of domains, which can introduce bias.
The first thing we do to mitigate these cases is remove entries from client IPs that have sent more queries than 99.9% of clients IPs. The assertion here is these client IPs are likely to be used by many devices at the same time. They might also be the target of a DNS amplification attack. Ignoring them improves the performance of our algorithm.
We then remove duplicate (client ip, name) tuples, keeping only the most recent entries.
This ensures that a pair of names can’t be seen more than once for a given IP, and reduces the number of queries for a single hour from 4 billion down to around 400 million.
We also remove queries for invalid domain names, accounting for 2% of the unique (client ip, query) tuples.
Temporal proximity of related malicious domains
We need to validate our intuition that DNS queries for related domains might be frequently observed in a small time window.
Let (M) be the set of domain names that we already flagged as malicious, and (t_i(c)) the timestamp of the first query sent by a client (c) for the domain (i).
For each client, we find all the unique pairs of domains ((i,j)) so that (i,j in M^2) and (i < j). We then compute the time difference (left| t_i(c)-t_j(c) right|) which can’t be zero due to the architecture of our DNS resolvers.
The histogram of (left| t_i(c)-t_j(c) right|) for all clients and all pairs of malicious domains appears to be gamma distributed, with a shape of 0.56 and a scale of 134.65, for timestamps expressed in seconds.
The probability for two malicious domains looked up by a client (c) to be related can thus be expressed as [f_c(i,j) = frac{0.04 e^{-0.007 left| t_c(i)-t_c(j)right| }}{left| t_c(i)-t_c(j)right| {}^{0.436}} ]
The co-occurrence score
Let (D_c) be the set of domain names looked up by a client (c).
In order to compute a global co-occurrence score, we define (g(i,j)) as the summation of (f_c(i,j)) for every client (c) having sent queries both for (i) and for (j). If no clients have sent a query for both (i) and (j), (g(i,j) = 0).
For any given malicious domain, we could use this function to find other domains sharing the same web pages, the same infection chain, or serving the needs of the same malware sample.
Unfortunately, this doesn’t work very well in practice.
It’s not uncommon for malicious web sites to load resources from third-parties, like banners and social network widgets.
Furthermore, users that are infected or in the process of being infected keep reading their email, and browsing common web services.
For example, queries for google-analytics.com, yieldmanager.com, scorecardresearch.com, msftncsi.com and apple.com are frequently seen around the same time as queries for unrelated domains, including malicious ones.
And the high co-occurrence score of these domains paired with others doesn’t help us discover new domains names that we should take a closer look at.
We thus refine the function to lower the score for ((i,j)) if (j) happens to be a domain requested by a lot of client IPs.
Let (D) be the set of all domain names observed.
[ s(i,j) = frac{g(i,j)}{sum _{kin D} g(k,j)} ]
The actual score we use is normalized:
[ s'(i,j) = frac{g(i,j)}{left(sum _{kin D} g(k,j)right) sum _{kin D} s(i,k)} ]
Two case studies
A few months ago, suspicious queries for www1.hsbc.ca caught our attention.
This name didn’t resolve, never did, and we suddently saw a spike of traffic for it.
Like many .ca domain names, hsbc.ca usually gets traffic coming from clients in Canada, US and Great Britain, but what we observed for www1.hsbc.ca was also unusual:
The highest co-occurrence scores for domains paired with www.hsbc.ca were:
A new DGA pattern was clearly emerging here.
Diving into the co-occurrences for these DGA domains unveiled many more domains following the same pattern.
These domains happened to be C&C domains for the W32.Xpiro.D malware.
More recently, a slew of weight loss spam hit our mailboxes. Given only one domain name, we were able to figure out more, even if they happen to be hosted on different IPs:
By looking further at the co-occurrences, we also found the SEO/Social Media Marketing company which is very likely to be responsible for this campain.
More use for the co-occurrence score
The co-occurrence score has proven to be very useful in order to discover new domain names from domain names that we already know to be malicious.
But it is also the foundation for another algorithm that can lead to new malicious domains without having to explicitly provide a starting point. This algorithm will be described in detail in a forthcoming blog post.