In 2013, email spam accounted for approximately 69% of all internet email traffic [Kaspersky]. Economists estimate that spam costs American businesses and consumers roughly $20 billion annually, while spammers earn a return of roughly $200 million per year [Rao, Reiley]. Rao and Reiley also note that spam offers nefarious actors one of the cheapest returns on investment, because producing spam is incredibly cheap.
Today’s blog post investigates a spam network and identifies behavioral patterns that could be useful for further research. Spam is interesting because it can be a highly networked activity, requiring coordination between multiple end computers distributed around the globe.
The Spam Run
In late August, the OpenDNS Security Labs team noticed an unusual surge in DNS queries for a mail server. The figure below shows the rate of DNS queries per minute over a span of 24 hours. Visual inspection shows a definite surge toward the end of the day. The steepness of the slope made us suspicious about the reason behind the spike. For comparison, we created a queries-per-minute graph for the previous day to establish a baseline. The second graph shows that traffic follows the expected diurnal cycle, with traffic easing off in the early morning and evening.
Query spikes are a good first predictor of suspicious behavior but do not usually provide enough evidence to label the behavior as malicious. To get a better picture of the attack, we therefore captured 68 IPs with unusually high query rates and mapped them geographically. The image below shows their geographic distribution:
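The flagging step above can be sketched in a few lines. This is a minimal illustration, not the production pipeline: the log format, IP addresses, and the baseline multiplier are all hypothetical.

```python
from collections import Counter

def flag_high_query_ips(query_log, baseline_rate, multiplier=10):
    """Flag source IPs whose query volume far exceeds the baseline.

    query_log: iterable of (source_ip, queried_domain) pairs (assumed schema).
    baseline_rate: expected queries per IP over the observation window.
    """
    counts = Counter(ip for ip, _ in query_log)
    return {ip for ip, n in counts.items() if n > multiplier * baseline_rate}

# Toy data: one IP querying far above baseline, one behaving normally.
log = [("203.0.113.5", "mx.example.com")] * 120 + \
      [("198.51.100.9", "mail.example.org")] * 3
print(sorted(flag_high_query_ips(log, baseline_rate=5)))  # ['203.0.113.5']
```

In practice the threshold would be derived from the previous day's diurnal baseline rather than a fixed multiplier.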
The IPs were widely distributed geographically, with a small cluster in Ukraine. Furthermore, each of these IPs belonged to a residential ISP's network. This new information validated our initial suspicion that the uptick in queries came from a spam run. It was now time to investigate the nature and scope of that run.
The Network
To get a better picture of the spam network, we ran a query that collected all of the DNS queries made by the 68 suspicious IPs over the current and previous day. These logs were then filtered down to only MX records and A records for mail servers. With the data collected, we needed a structure that would highlight the interconnections between the different agents.
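The filtering step might look like the following sketch. The record schema and the mail-server naming heuristic are assumptions for illustration; the actual filtering criteria were likely more involved.

```python
def filter_mail_records(records):
    """Keep MX lookups, plus A lookups whose hostname looks like a mail server.

    records: iterable of dicts with 'qtype' and 'qname' keys (assumed schema).
    """
    mail_prefixes = ("mx", "mail", "smtp")  # heuristic, not exhaustive
    return [
        r for r in records
        if r["qtype"] == "MX"
        or (r["qtype"] == "A" and r["qname"].split(".")[0].startswith(mail_prefixes))
    ]

records = [
    {"qtype": "MX", "qname": "example.com"},
    {"qtype": "A", "qname": "mail.example.org"},
    {"qtype": "A", "qname": "www.example.net"},  # not a mail server; dropped
]
print(len(filter_mail_records(records)))  # 2
```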
Structuring the data as a bipartite graph between IPs and mail servers/domains is an easy way to understand it. A bipartite graph is a graph whose vertices split into two independent sets, with edges running only between the sets. It allows us to easily calculate which IPs contact the same or similar mail servers. The figure below is an example of the spam network, consisting of 1,000 mail servers and six suspected spam IPs. Mail servers are the blue nodes, and IPs are the red nodes. Each IP has a cluster of mail servers that only it contacts, plus a few mail servers contacted by the whole group. The same structure holds in the larger version of the graph (68 IPs, 40k domains, and 281k edges).
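A bipartite graph of this kind can be represented with two adjacency maps, which makes the "shared mail server" question a one-line filter. This is a stdlib sketch of the idea; the edge data and node names below are invented.

```python
from collections import defaultdict

def build_bipartite(edges):
    """Build both directions of a bipartite IP <-> mail-server graph
    from (ip, server) pairs."""
    ip_to_servers = defaultdict(set)
    server_to_ips = defaultdict(set)
    for ip, server in edges:
        ip_to_servers[ip].add(server)
        server_to_ips[server].add(ip)
    return ip_to_servers, server_to_ips

def shared_servers(server_to_ips, min_ips=2):
    """Mail servers contacted by at least min_ips distinct IPs."""
    return {s for s, ips in server_to_ips.items() if len(ips) >= min_ips}

# Two IPs share one big provider and each have their own small target.
edges = [
    ("ip1", "mx.yahoo.com"), ("ip2", "mx.yahoo.com"),
    ("ip1", "mail.small-shop.example"), ("ip2", "mail.other-shop.example"),
]
ips, servers = build_bipartite(edges)
print(shared_servers(servers))  # {'mx.yahoo.com'}
```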
Analyzing the types of mail servers contacted reveals an interesting hierarchy. The most frequently contacted servers belonged to large, well-known mail providers such as Yahoo, Google, and AOL. A second tier consisted mainly of academic institutions. This was followed by a mid-tier of various proxy websites. Finally, at the bottom was a set of low-profile websites. The low-profile websites contacted by the suspicious IPs had two features in common: they all received sporadic traffic (under 500 queries per day), and they were located in the same geographic zone as the IP sending them email. These low-profile websites are the degree-one nodes in the graph.
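Given the adjacency map from the bipartite graph, picking out those degree-one nodes is a simple filter. A minimal sketch, with invented node names:

```python
def degree_one_servers(server_to_ips):
    """Mail servers contacted by exactly one suspicious IP: the
    low-profile, degree-one nodes of the bipartite graph."""
    return {s for s, ips in server_to_ips.items() if len(ips) == 1}

s2i = {
    "mx.yahoo.com": {"ip1", "ip2"},          # contacted by the group
    "mail.tiny-shop.example": {"ip1"},       # degree one: low-profile target
}
print(degree_one_servers(s2i))  # {'mail.tiny-shop.example'}
```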
With such geographic variance among the IPs, it comes as no surprise that domains under 19 different country-code TLDs received spam. The .jp TLD received the most queries, at roughly 800k.
Now that we had a better idea of how the IPs related to the domains, we wanted to understand how they might be sending mail or communicating with one another. To do this, we went back to the bipartite graph and performed some edge analysis, looking for domains in the ‘sweet spot’ of degree counts. We knew that nodes contacted by a high percentage of the 68 IPs might in some way be connected to the general spam network.
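That degree-based filter can be sketched as follows. The threshold and the exclusion list for well-known providers are illustrative assumptions, as are the node names.

```python
def candidate_relays(server_to_ips, total_ips, threshold=0.9, known=()):
    """Servers contacted by at least `threshold` of the suspicious IPs,
    excluding well-known mail providers (threshold and exclusion list
    are illustrative, not the values used in the investigation)."""
    return [
        s for s, ips in server_to_ips.items()
        if len(ips) / total_ips >= threshold and s not in known
    ]

s2i = {
    "mx.yahoo.com": set(range(68)),       # everyone talks to Yahoo; excluded
    "relay.example.mx": set(range(63)),   # 63 of 68 IPs: a candidate
    "mail.small-shop.example": {1},       # degree one; below threshold
}
print(candidate_relays(s2i, 68, known={"mx.yahoo.com"}))  # ['relay.example.mx']
```

Excluding the large providers matters: they sit at the same high-degree end of the distribution as a shared relay, but for the benign reason that every spam bot targets their users.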
Results
This general technique led us to identify four domains of interest. The first was a domain receiving periodic requests that resolved to an IP in Mexico. Further investigation of this IP produced some interesting results: it served as the mail server for a set of Russian-hosted domains. That alone was suspicious: a domain hosted in Russia relying on a mail server located in Mexico. Also suspicious was the fact that this mail server was associated with a constantly changing set of domain names and IPs; in the span of one month there were 3 IP changes and 10 domain name changes. Below is a graph showing the query frequency for this domain over the course of the day:
Only five of the 68 IPs made no attempt to contact this dial-up IP during the course of the day. We suspect this IP is used as an email forwarder or open relay for the spam campaign.
Of the remaining three suspicious domains, two shared a similar structure with the IP in Mexico (multiple domain names associated with changing IP addresses). The last was a mail server misconfigured so that it could serve as an open relay.
Conclusion
Spam is interesting because it often serves as the infection vector for botnets or banking trojans. The majority of websites targeted by this spam run were small e-commerce businesses. It would not be surprising if these campaigns lure small e-commerce sites with promises of boosting their SEO rankings: an operator unknowingly clicks a fraudulent SEO link and becomes compromised. A point of further research, therefore, is to determine how many of the websites targeted by spam later become compromised or join a botnet. This type of analysis could be useful in developing models that predict relationships between receiving spam and future infection of domains or users.