DNS data, unlike webpage data, contains very little lexical content. The only lexical content available is the domain name itself. Domain names are limited to 253 characters, which makes it challenging to infer anything about the content of the webpage behind them. Consequently, one cannot rely solely on the domain name to detect new mass spam domains. Luckily, other signals present in DNS data can help flag possible spam domains. Today’s blog post is a brief introduction to a set of methods that can be used to identify possible spam domains in your DNS data.
In previous blog posts we discussed using DNS traffic patterns to categorize different types of sites. The pattern examined was a spike in traffic. Many classes of spam domains also exhibit spikes in traffic, either when people click on links contained in the spam email or when a domain is used as an exploited mail relay. To identify categories of spam domains, we turned to URIBL.
URIBL is a realtime URI blacklist used by many mail servers to determine whether an incoming piece of mail is associated with a URI on the blacklist. Luckily for us, queries to URIBL are made via DNS. This allows us to collect information about the domains queried by various mail servers over the course of the day. After collecting a couple hundred URIBL queries and parsing them for the embedded URIs, we have a list of domains potentially associated with spam.
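The parsing step can be sketched roughly as follows. This is a minimal illustration, not the pipeline used for this post: it assumes each logged query name is the looked-up domain followed by a URIBL zone suffix, and the zone list, the function name `extract_embedded_domain`, and the sample query names are all hypothetical.

```python
# Assumed zone suffixes for illustration; adjust to the zones seen in your own logs.
URIBL_ZONES = ("multi.uribl.com", "black.uribl.com")


def extract_embedded_domain(qname, zones=URIBL_ZONES):
    """Return the domain embedded in a URIBL lookup name, or None if it is not one."""
    name = qname.rstrip(".").lower()  # drop the trailing root dot, normalize case
    for zone in zones:
        suffix = "." + zone
        if name.endswith(suffix):
            embedded = name[: -len(suffix)]
            return embedded or None
    return None


# Made-up query names standing in for a DNS log.
queries = [
    "example-pills.com.multi.uribl.com.",
    "Cheap-Watches.net.black.uribl.com",
    "www.example.org",  # ordinary lookup, not a URIBL query
]

# Deduplicated set of domains potentially associated with spam.
candidates = {d for d in map(extract_embedded_domain, queries) if d}
```

Using a set keeps each domain once no matter how many mail servers looked it up.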
We are most interested in signals in the query patterns of these domains. Our hypothesis is that a certain group of these domains all have very similar query patterns. One query pattern that we have studied is the spike. Consequently, we check which domains from the URIBL list also appear on our spike list. Interestingly, about 65% of the domains that appear in our URIBL list also appear in our list of spiked domains. This is a strong indicator that we have found a useful signal for identifying potentially spammy domains.
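The overlap check itself is a set intersection. Here is a small sketch on made-up data; the function name `overlap_fraction` and the example domains are hypothetical, and the numbers are not the ones behind the 65% figure.

```python
def overlap_fraction(uribl_domains, spiked_domains):
    """Fraction of URIBL-derived domains that also appear in the spike list."""
    uribl = set(uribl_domains)
    if not uribl:
        return 0.0
    return len(uribl & set(spiked_domains)) / len(uribl)


# Toy inputs for illustration only.
uribl_list = {"a.example", "b.example", "c.example", "d.example"}
spike_list = {"a.example", "c.example", "d.example", "z.example"}

fraction = overlap_fraction(uribl_list, spike_list)
```

Note that the denominator is the URIBL list, not the spike list, so the result answers "how many URIBL domains spiked" rather than "how many spikes are on URIBL".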
Spikes that fall into the spammy category have two special characteristics that distinguish them from other domains that have spiked. First, the height of the spike for these domains falls in the 98th percentile of spike heights: these spikes are substantially larger than those of other domains that have experienced a spike. This could be caused by the volume of spam emails sent out. Second, they usually include queries with a qtype of 15, which is a request for an MX (mail exchange) record.
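Both characteristics can be combined into a simple filter. The sketch below is only an illustration under assumptions: the `Spike` record, its `height` field (for example, peak queries per hour), and the nearest-rank percentile are choices made here, not a description of how the spike list is actually built.

```python
import math
from dataclasses import dataclass

MX_QTYPE = 15  # DNS record type 15 is MX (mail exchange)


@dataclass(frozen=True)
class Spike:
    domain: str
    height: float      # hypothetical measure of spike size
    qtypes: frozenset  # query types observed during the spike


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def spam_like_spikes(spikes, pct=98):
    """Keep spikes at or above the pct-th percentile height that include MX queries."""
    if not spikes:
        return []
    threshold = nearest_rank_percentile([s.height for s in spikes], pct)
    return [s for s in spikes if s.height >= threshold and MX_QTYPE in s.qtypes]
```

The percentile is computed over all spiked domains, so the threshold moves with the data rather than being a fixed query count.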
Here is an example of how one such spammy spiked domain appears on Investigate:
Following up on the IP address that hosted this domain was also revealing: it hosted a set of similarly patterned domain names that all happened to spike within the same 2-3 hour window.
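One way to look for this kind of cluster is to group spiked domains by hosting IP and keep the IPs where several domains spiked close together in time. The sketch below assumes records of the form `(domain, ip, spike_time)`; the function name `co_spiking_groups`, the 3-hour default, and the sample records (which use documentation-reserved IP ranges) are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def co_spiking_groups(records, window=timedelta(hours=3), min_size=2):
    """Map each IP to its largest set of hosted domains that spiked within `window`."""
    by_ip = defaultdict(list)
    for domain, ip, spike_time in records:
        by_ip[ip].append((spike_time, domain))

    groups = {}
    for ip, items in by_ip.items():
        items.sort()  # chronological order
        best, start = [], 0
        for end in range(len(items)):
            # Shrink the window from the left until it spans at most `window`.
            while items[end][0] - items[start][0] > window:
                start += 1
            if end - start + 1 > len(best):
                best = items[start:end + 1]
        if len(best) >= min_size:
            groups[ip] = [domain for _, domain in best]
    return groups


# Made-up records for illustration.
day = datetime(2024, 1, 1)
records = [
    ("a1.example", "192.0.2.1", day + timedelta(hours=10)),
    ("a2.example", "192.0.2.1", day + timedelta(hours=11, minutes=30)),
    ("a3.example", "192.0.2.1", day + timedelta(hours=12, minutes=45)),
    ("a4.example", "192.0.2.1", day + timedelta(hours=20)),
    ("b1.example", "198.51.100.7", day + timedelta(hours=9)),
]
groups = co_spiking_groups(records)
```

This only captures the timing side; checking whether the names are "similarly patterned" would need a separate comparison of the domain strings.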
This is further evidence that this domain was potentially involved in a spam run. Unfortunately, these domains either have WhoisGuard protection or have no Whois information available, which leaves little to go on for follow-up research. A future challenge for researchers is to determine where these domains sit in the overall spam chain. The majority of these domains no longer resolve or load blank pages. Identifying the actual content on the page might help us discover their role in the future. DNS data might not contain the content of webpages, but in conjunction with other sources of data it can be used to identify potential new threats.