Last week we examined whether the country where a domain is hosted makes it more likely to be malicious. We confirmed that some countries host a significantly higher ratio of malicious sites to clean sites. But rather than rest on a superficial assumption based on where a domain is hosted, we wanted to explore more deeply the relationship between geography, ccTLDs and malicious domains.
Unlike generic top-level domains (.com, .net) that almost anyone can buy, an Internet country code top-level domain (.fr, .tw) is generally reserved for a country or a dependent territory. If a website uses a specific ccTLD, it suggests that its operator intends to target a local audience. That said, registrars have largely relaxed the rules, and many ccTLDs can now be registered by non-local businesses and individuals, possibly rendering ccTLDs less relevant.
The co-occurrence matrix between ccTLDs and the actual countries clients are connecting from shows a strong correlation. Most often, websites opting for a country-specific domain are actually serving content for a local audience.
Looking deeper, we observe very different frequency distributions from one ccTLD to another, which can be explained by linguistic and cultural factors. Building ccTLD-specific models is thus critical to help us decide whether to classify a domain as benign or malicious. Below, I’ll discuss some of the specific models we use.
The servers’ physical geographic diversity
The number of IP addresses and the stability of the set of IP addresses are important signals when determining whether a domain is likely to be malicious. But it’s also very common for totally benign domains to use multiple IPs. This is a common practice for load balancing, redundancy, optimizing latency for a large country, or taking advantage of “elastic” infrastructures.
In the following experiment, we use two training sets of .RU domain names, containing only domain names resolving to more than one IP address. All IP addresses seen over a one-week period were considered. One list contains domains known to be benign and the other list contains domains known to be currently used as infection vectors.
On these two sets, we computed the mean distance between the country’s geographic median and all the physical locations of servers hosting a name.
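As a rough illustration of this measurement, here is a minimal Python sketch that averages the great-circle distance between a reference point and each server location. The haversine formula, the spherical Earth radius and the function names are assumptions made for the example; the post does not say how the distances were actually computed or how the country's reference point was chosen.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def mean_distance_km(reference, server_locations):
    """Mean distance between a reference point and every server location.

    `reference` stands in for the country's geographic reference point and
    `server_locations` for the geolocated hosts of one domain name; both
    are hypothetical inputs for this sketch.
    """
    locations = list(server_locations)
    if not locations:
        raise ValueError("at least one server location is required")
    return sum(haversine_km(reference, loc) for loc in locations) / len(locations)
```

Computing this value for every name in the benign and malicious sets gives two samples whose shapes can then be compared.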
We observed a significantly different skewness. Hosts serving a non-malicious domain tend to be geographically close, whereas a domain serving malware can be served by hosts spread all around the globe.
Looking at the number of distinct physical locations also shows how malware can use a fast flux pattern. Fast Flux is a specific category of domains that take advantage of the fact that the set of IP addresses returned for a domain name is only valid for a limited period of time, over which the domain owner has full control. A botnet operator can leverage this feature to very quickly switch to a different set of hosts in order to serve a malicious payload.
While 98% of benign domains with multiple IP addresses are served by at most 3 datacenters, with a negligible number of outliers, malicious domains can hop very quickly from one host to another. One of them even scored 867 physical locations!
Locations | Domain |
---|---|
867 | lafdamow.ru |
505 | girwysca.ru |
443 | wascadux.ru |
418 | ajgijuap.ru |
374 | jilvoqsi.ru |
326 | enhawcus.ru |
289 | taosiram.ru |
253 | hevlehaw.ru |
242 | diteqciq.ru |
200 | vehyfgor.ru |
196 | zurgovod.ru |
185 | sepsiqbo.ru |
147 | nuzejviz.ru |
145 | etujaqhe.ru |
119 | marsotrip.ru |
103 | zazzeqan.ru |
103 | azvaebyn.ru |
Having hopped to 92 countries, 665 ASNs, 1486 network prefixes and 2780 IP addresses in 7 days, the lafdamow.ru domain name is an obvious outlier that we quickly blocked as malware. As a researcher, I find outliers like this almost impressive in their ability to change and move around so rapidly.
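Counts like these can be derived from a week of resolution records. The sketch below is only an illustration: the record layout (`domain`, `ip`, `prefix`, `asn`, `country`, `location`) and the cut-off of 3 locations are assumptions, the latter loosely borrowed from the 98% figure above rather than from any threshold we actually use.

```python
from collections import defaultdict

FIELDS = ("ip", "prefix", "asn", "country", "location")


def diversity_by_domain(observations):
    """Count distinct values of each field per domain.

    `observations` is an iterable of dicts with a "domain" key plus one key
    per entry in FIELDS (a hypothetical record layout for this sketch).
    """
    seen = defaultdict(lambda: {field: set() for field in FIELDS})
    for obs in observations:
        for field in FIELDS:
            seen[obs["domain"]][field].add(obs[field])
    return {
        domain: {field: len(values) for field, values in sets.items()}
        for domain, sets in seen.items()
    }


def flag_outliers(diversity, max_locations=3):
    """Return domains seen in more than `max_locations` places, widest first."""
    flagged = [d for d, counts in diversity.items()
               if counts["location"] > max_locations]
    return sorted(flagged, key=lambda d: -diversity[d]["location"])
```

A domain flagged this way is only a candidate for review, since a few benign domains also show up as outliers.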
The requester’s geographic profile
Our intuition, confirmed by the co-occurrence matrix above, is that the frequency distribution of countries from which traffic is sent to ccTLD domains is predictable with good accuracy.
The .RU ccTLD, for example, shows this expected distribution for benign sites:
The vast majority of queries to .RU domains come from Russia, followed by Ukraine and the US, with other countries almost uniformly distributed.
However, we observe that malicious .RU domains show a totally different distribution of requester countries. They receive few queries from Russia, Ukraine, Belarus and Kazakhstan, and a vast majority of the queries are coming from the US.
We use the Kolmogorov-Smirnov test to compare these distributions, after discarding the countries presenting a high variance and the countries seen in the expected distribution but not in our observations. In our experiments, the result of this test turned out to be a pretty unreliable feature for labeling a domain as malicious. However, it is an extremely important feature for labeling a domain as benign, with only 0.02% false positives.
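To make the comparison concrete, here is a minimal Python sketch of a Kolmogorov-Smirnov style statistic over per-country query counts. It is an illustration under several assumptions, not our production code: countries are ordered by decreasing expected share (the statistic depends on that ordering for categorical data), the high-variance countries are passed in as a hypothetical `excluded` set, only countries present in both distributions are kept, and no p-value or decision threshold is computed.

```python
def ks_statistic(expected, observed, excluded=()):
    """Largest gap between two cumulative country distributions.

    `expected` and `observed` map a country code to a query count.
    Countries missing from `observed` or listed in `excluded` are dropped
    before both distributions are renormalized.
    """
    countries = [c for c in expected if c in observed and c not in excluded]
    if not countries:
        raise ValueError("no country in common")
    # Fixed ordering: most expected country first, ties broken by code.
    countries.sort(key=lambda c: (-expected[c], c))

    exp_total = sum(expected[c] for c in countries)
    obs_total = sum(observed[c] for c in countries)
    if exp_total <= 0 or obs_total <= 0:
        raise ValueError("distributions must contain positive counts")

    exp_cum = obs_cum = largest_gap = 0.0
    for country in countries:
        exp_cum += expected[country] / exp_total
        obs_cum += observed[country] / obs_total
        largest_gap = max(largest_gap, abs(exp_cum - obs_cum))
    return largest_gap
```

A value near 0 means the requesters of a domain look like the expected audience of the ccTLD, which matches the way the feature is used here: as evidence that a domain is benign rather than as proof that it is malicious.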
The lexical features
Domains used to serve malicious content don’t need to have any meaningful content. In fact, not having any meaningful content could even be a strategy to avoid being seen by search engines.
Our intuition is that ccTLDs and languages are tightly coupled, so domain names from a specific ccTLD show predictable lexical features. A .RU domain is likely to contain a lot of Russian-specific sequences of characters, which are unlikely to occur in a .CN domain. And sequences of characters that don’t match anything we expect in English can be very frequent in Russian.
For this reason, we built a training set of .RU domain names known to be benign, from which we computed the unigrams to quadrigrams frequency distribution.
We then defined a “DGA score” function, whose output represents how wrong our guess for the next character of a domain name is, considering the 1 to 4 previous characters, based on our reference frequency distribution.
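One possible reading of such a score, sketched in Python: count which character follows each context of 1 to 4 characters in the benign training names, then report the average surprise (in bits) of each next character in a candidate name. The add-one smoothing, the `^` start marker, the alphabet size and the function names are assumptions made for this example, and the names are expected without their `.ru` suffix.

```python
import math
from collections import Counter, defaultdict

MAX_CONTEXT = 4      # use the 1 to 4 previous characters
ALPHABET_SIZE = 38   # assumed: a-z, 0-9, hyphen, start marker


def train(names):
    """Count next-character frequencies for every context of 1 to 4 chars."""
    counts = defaultdict(Counter)
    for name in names:
        padded = "^" + name.lower()  # "^" marks the start of the name
        for i in range(1, len(padded)):
            for k in range(1, min(MAX_CONTEXT, i) + 1):
                counts[padded[i - k:i]][padded[i]] += 1
    return counts


def dga_score(name, counts):
    """Average surprise, in bits, of each next character of `name`.

    Higher values mean the name looks less like the training names.
    """
    padded = "^" + name.lower()
    total, terms = 0.0, 0
    for i in range(1, len(padded)):
        for k in range(1, min(MAX_CONTEXT, i) + 1):
            followers = counts.get(padded[i - k:i], Counter())
            # Add-one smoothing so unseen characters keep a non-zero probability.
            p = (followers[padded[i]] + 1) / (sum(followers.values()) + ALPHABET_SIZE)
            total += -math.log2(p)
            terms += 1
    return total / terms if terms else 0.0
```

With a model trained on a large benign list, names that follow the usual letter patterns of the ccTLD should score lower than names that do not.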
Pseudorandomly-generated domain names are usually easy to distinguish from human-generated, meaningful names. Thus, in the following experiment, we build a set of malicious names known for serving malware, but not part of a botnet leveraging algorithmically-generated names.
This DGA score is computed for a distinct set of domain names known to be clean, and for the list of malicious names.
While the lexical properties of a name are far from sufficient for classifying a domain as malicious or not, we observe that they are still a significant feature to use.
The Umbrella Security Labs is now blocking 80,000,000 malicious, botnet or phishing requests each day. Given the huge variety of malware, it’s clear that there’s no one-size-fits-all model. Our team uses the three models described above to detect ccTLD-specific anomalies. While there is certainly much to gain from the use of these models, we’re relentless in our quest to identify new models and algorithms that can inform us about the likelihood of a domain’s classification. Those models vary from general to specific, but they’re all contributing to greater protection for our customers.