Whois data is often difficult to work with for three reasons: its plethora of unstandardized free-text formats, the fact that much of it is registrant-provided (and therefore often untrustworthy), and the privacy protection services that mask the real whois record. Because whois data has many inconsistencies and anomalies, mining it directly in bulk is challenging. Instead of mining whois data, the OpenDNS research team often uses whois record values as auxiliary features of suspect domain names. Whois data enriches our findings and helps weight the decisions made during both manual investigation and automatic classification. This post outlines a method, incorporating whois data, that the OpenDNS labs team is using to locate suspicious and malicious infrastructure on the Internet.
Domain Registration Notification System
Internally, we have built a basic proof of concept system which monitors whois records of newly created domain names for select TLDs. The system is configured with four basic inputs:
- field type – this is the name of a normalized field in a whois record (e.g. registrant email, creation date, registrant fax number)
- field value – this is the value of the field in a whois record
- recipient – the research team member requesting the notifications
- action – what is to be done when a whois record contains the field value:
  - notify – send an email to the recipient
  - blocklist – automatically blocklist the domain matching the whois record criteria
  - sinkhole – automatically sinkhole the domain matching the whois record criteria
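To make the four inputs concrete, here is a minimal sketch of how a monitor and its matching step could be modeled. It is not our production code: the names `Monitor`, `Action`, `normalize`, and `matching_monitors`, and the idea of representing a whois record as a dictionary of normalized field names, are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NOTIFY = "notify"        # send an email to the recipient
    BLOCKLIST = "blocklist"  # automatically blocklist the domain
    SINKHOLE = "sinkhole"    # automatically sinkhole the domain


@dataclass(frozen=True)
class Monitor:
    field_type: str   # normalized whois field name, e.g. "registrant_email"
    field_value: str  # value to look for in that field
    recipient: str    # research team member requesting the notifications
    action: Action


def normalize(value: str) -> str:
    """Case-fold and collapse whitespace so trivial variations still match."""
    return " ".join(value.split()).casefold()


def matching_monitors(record: dict, monitors: list) -> list:
    """Return every monitor whose field value equals the record's value.

    `record` is assumed to map normalized field names to string values.
    """
    hits = []
    for monitor in monitors:
        value = record.get(monitor.field_type)
        if value is not None and normalize(value) == normalize(monitor.field_value):
            hits.append(monitor)
    return hits
```

Each newly observed whois record would be passed through `matching_monitors`, and the action of every returned monitor would then be carried out for that domain.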
We’ve been using the system for a little over a month now and have had some interesting results.
System Results: Whois Blocklisting
The ability to automatically blocklist a domain based on a field in its whois record is very powerful, but doing so requires a high level of confidence in the whois field’s value. For example, automatically blocklisting a domain based on the registrar listed in its whois record will very likely result in a high rate of false positives. Blocklisting on a value that is likely to be unique to the domain’s owner, such as an email address or physical address, can still introduce false positives, but is much less likely to do so.
By mining the blocklist our research team moderates internally and incorporating open reports from other security researchers, we have been able to identify, with strong confidence, email addresses that are either compromised or dedicated to malicious registrations. Using our notification system, we are able to blocklist a domain shortly after it is registered, and in some cases before it is used maliciously.
Another technique the labs team has begun using to identify suspicious domains is to monitor domain registrant email addresses for domains we currently consider malicious. The intuition here is that if OpenDNS considers example.com malicious, then user accounts at example.com, e.g. user@example.com, should at least be considered suspicious. Domain names registered by user@example.com are then also suspect.
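The intuition above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, the set of malicious domains is assumed to be available in memory, and the decision to also check parent domains (so that user@mail.example.com is caught when example.com is listed) is our own assumption rather than a description of the production system.

```python
def email_domain(email: str) -> str:
    """Return the lower-cased domain part of an email address ('' if malformed)."""
    local, sep, domain = email.strip().rpartition("@")
    if not sep or not local or not domain:
        return ""
    return domain.lower().rstrip(".")


def is_suspect_registrant(email: str, malicious_domains: set) -> bool:
    """True if the email's domain, or any parent domain, is considered malicious."""
    domain = email_domain(email)
    labels = domain.split(".") if domain else []
    # Check the full domain and each parent (mail.example.com, then example.com),
    # stopping before the bare TLD.
    for i in range(len(labels) - 1):
        if ".".join(labels[i:]) in malicious_domains:
            return True
    return False
```

A registrant email flagged this way is only a reason for suspicion, so it is better suited to the notify action than to automatic blocklisting.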
System Results: Whois Anomalies
The final classification technique outlined in this blog also relies on monitoring newly created domain whois information and is rather similar to brand monitoring. This technique identifies whois records which have similarities to the whois records used by extremely popular domain names.
For this classification technique, we first created the idea of a brand. Brands are entities which own a number of popular domain names, typically used for different purposes. Brands in this classifier are modeled by gathering the whois information from OpenDNS’s top domain names (similar to the Alexa top domains) and intelligently merging the domains’ records. Many of the brands we monitor are commonly known large Internet companies. For example, OpenDNS owns opendns.com and internetbadguys.com. OpenDNS would be considered the brand in this case and the whois records for both domains would be combined.
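As a rough illustration of what a brand model might look like, the sketch below merges several whois records into a single profile that keeps, for each field, the set of values seen across the brand's domains. The merge rule (a per-field union of normalized values) and the names `build_brand` and `normalize` are assumptions for this example; the actual merging logic is more involved and is not described here.

```python
from collections import defaultdict


def normalize(value: str) -> str:
    """Case-fold and collapse whitespace so near-identical values merge."""
    return " ".join(value.split()).casefold()


def build_brand(records: list) -> dict:
    """Merge whois records (dicts of field -> value) into field -> set of values."""
    brand = defaultdict(set)
    for record in records:
        for field, value in record.items():
            # The domain name itself identifies the record, not the brand.
            if field == "domain" or not value:
                continue
            brand[field].add(normalize(value))
    return dict(brand)
```

For the example above, the records for opendns.com and internetbadguys.com would be passed in together, and the resulting profile would list every registrant email, organization name, and phone number that appears in either record.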
With our brands modeled, we began monitoring for their whois information with the previously described proof of concept notification system. Initially we found mostly legitimate domains registered by our brands, including quite a few unicode domains (which surprised us), likely registered to protect each brand’s reputation. We also found a few suspicious domain names that seemed to be completely unrelated to the brands we modeled, but we were not yet confident enough to blocklist them. To filter out the legitimate registrations, we identified heuristics that allow us to differentiate between these two types of registrations:
- Brands registering domains will often use a registrant email address which is somehow related to the brand and not a privacy protected email address.
- Domains registered by brands often resolve to IP netblocks owned by the brand.
- Within a short time period, brands will often register the same domain name with different TLDs (e.g. example.com, example.org, and example.net) as well as typo variations of the domain (e.g. example.com, exanple.com, ecample.com).
- Domains registered by a brand typically match the majority of a brand’s whois record and not just a single field.
- Brands typically register domains through a preferred registrar, often one providing brand protection services such as MarkMonitor or Corporation Service Company (CSC).
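Four of the heuristics above (related email address, brand netblocks, majority field match, and preferred registrar) can be sketched as simple signals computed per registration. Everything here is illustrative: the function names, the brand profile format (field name mapped to a set of normalized values), and the thresholds in `looks_legitimate` are assumptions, not values we have tuned. The multi-TLD and typo-variation heuristic is left out because it needs a view across several registrations over time.

```python
import ipaddress


def _norm(value: str) -> str:
    return " ".join(value.split()).casefold()


def brand_signals(record, brand, resolved_ips, brand_netblocks, brand_email_domains):
    """Compare one new registration against a brand profile.

    record: dict of whois field -> value for the new domain
    brand: dict of whois field -> set of normalized values (hypothetical format)
    resolved_ips: IP address strings the new domain resolves to
    brand_netblocks: CIDR strings owned by the brand
    brand_email_domains: email domains considered related to the brand
    """
    matched = [f for f in brand if record.get(f) and _norm(record[f]) in brand[f]]
    match_ratio = len(matched) / len(brand) if brand else 0.0

    email_dom = record.get("registrant_email", "").rpartition("@")[2].lower()
    nets = [ipaddress.ip_network(n) for n in brand_netblocks]
    in_netblock = any(
        ipaddress.ip_address(ip) in net for ip in resolved_ips for net in nets
    )
    return {
        "match_ratio": match_ratio,  # share of brand fields matched
        "brand_email": email_dom in brand_email_domains,
        "in_brand_netblock": in_netblock,
        "preferred_registrar": _norm(record.get("registrar", ""))
        in brand.get("registrar", set()),
    }


def looks_legitimate(signals, min_ratio=0.5, min_signals=2):
    """Arbitrary example rule: majority field match plus two corroborating signals."""
    corroborating = sum(
        [signals["brand_email"], signals["in_brand_netblock"], signals["preferred_registrar"]]
    )
    return signals["match_ratio"] >= min_ratio and corroborating >= min_signals
```

A registration that matches only a single brand field, uses an unrelated email address, and resolves outside the brand's netblocks would fail this check and be surfaced for review.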
Applying these heuristics to the whois notifications matching our brands’ records has provided us with a system that identifies suspicious domains registered with whois information mimicking popular websites. According to policy set in place by ICANN, providing inaccurate whois records is grounds for suspension or deletion of a domain. Getting registrars to take action on inaccurate whois records is a different story, and is often a case-by-case effort.
On the research team we’ve seen brand mimicking techniques before within the domain name itself. Some of the domains identified by Jeremiah O’Conner’s NLPRank, which applied natural language processing to domain name strings and often caught domains related to phishing, have also been identified with this brand monitoring technique. We on the research team love when multiple classifiers developed by multiple individuals converge on the suspiciousness of domains. It gives us warm fuzzy feelings.
Below is a screenshot of two of the domains identified as suspicious by both of our classifiers:
The following is a screenshot showing query volumes for another domain the brand monitoring system was able to identify. Note the timeline: the domain was registered on June 14, OpenDNS blocklisted it for our customers on June 15, it was first seen resolving on June 17, and an immediate spike in query volume (often indicating a malicious domain “going live”) occurred on June 20.
Below is another domain identified by the brand monitoring system. In this case, note that the email address which registered siginin-users|.|com has also registered other highly suspect domain names. If we felt confident this email address was malicious, we could add it to the whois monitoring system and automatically blocklist every domain that lists it as the registrant contact email in its whois record.
Also note that, at the time of writing, signin-users|.|com responded with HTTP 302 redirects with the Location header set to “https://google.com”. This is a rather suspicious thing to do, as requests to google.com often also result in 302 redirects to www.google.com.
Future Work
One expansion of the whois notification system these classifiers are built upon is approximate whois record matching. By including a similarity score with each requested monitor, our research team will be able to identify whois records that approximately match. I’ve been testing the Jaro-Winkler distance as a matching function and have found good results thus far.
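For readers unfamiliar with the measure, below is a small pure-Python sketch of the standard Jaro-Winkler similarity (1.0 means identical, 0.0 means no characters in common; the distance is one minus this value). It is a textbook implementation written for this post, not the code used in our testing, and the prefix scale of 0.1 and maximum prefix length of 4 are the conventional defaults rather than tuned parameters.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in [0.0, 1.0]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if equal and no farther apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1 - sim)
```

With this in place, a monitor could match a whois field whenever `jaro_winkler(field_value, record_value)` meets the similarity score requested with that monitor, instead of requiring exact equality.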
So What?
Whois data can be a pain to work with given the lack of standardization and the range of cruft registrars accept. However, when combined with passive DNS and behavioral patterns, whois data can be used to identify suspect and malicious domains within a short time after registration. This shortening of the detection period is key to a strong security strategy.