As described in previous blog posts by our very own Dhia, the Gameover Zeus malware has had two known variants, commonly referred to as oldGOZ and newGOZ. Both versions use a time-seeded algorithm to dynamically generate domains to contact for instructions. In this post, we dig deeper into newGOZ’s domain registration history and query volumes to identify patterns.
Through query co-occurrence and the predictability of the malware’s DGA, we were able to collect a sample data set of newGOZ domains registered between August 1 and September 20. For each domain in the set, we collected the registration time and the query spike time. The registration time is when the domain was registered, while the query spike time is when OpenDNS first saw an increase in the rate of queries for the domain being sent to our resolvers. Below is an image that demonstrates the query spike of one newGOZ domain, i7p1t11cgk79x1nl3qbrbrydhm|.|com.
This domain’s query spike time was at 22:00 on 9/9/2014, while the registration time was at 13:44 on 9/9/2014. Subtracting the query spike time from the registration time gives a delta. In this case, the registration to query spike delta is -8.266389 hours; the negative sign means the domain was registered before its queries spiked. Each domain and delta in the data set was also given an artificial feature of one to simulate a two-dimensional space appropriate for graphing. Whois data was then added to the data set, including the email address used to register the domain, the name of the registrant, the physical address of the registrant, and the registrar used to register the domain. Unfortunately, not all registration information was available and some values had to be set to ‘none’.
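The delta calculation can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the function name is hypothetical, and the timestamps are the minute-level times quoted above, so the result is about -8.2667 hours. The small difference from the quoted -8.266389 presumably comes from seconds that the rounded times omit.

```python
from datetime import datetime


def registration_to_spike_delta_hours(registered_at, spike_at):
    """Registration time minus query spike time, in hours.

    A negative value means the domain was registered before its
    queries spiked. The function name is hypothetical.
    """
    return (registered_at - spike_at).total_seconds() / 3600.0


# Minute-level times quoted in the post for the example domain.
registered = datetime(2014, 9, 9, 13, 44)
spike = datetime(2014, 9, 9, 22, 0)

delta = registration_to_spike_delta_hours(registered, spike)

# Pair the delta with the artificial feature of one so that every
# domain becomes a point in a two-dimensional space.
point = (delta, 1)
print(round(delta, 4))
```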
Exploring the Data
While manually inspecting the email addresses used to register the domains (using unique, sorts, etc.), we observed many addresses known to belong to members of the security research community. Domains corresponding to these researchers were removed from the data. Many of the remaining email addresses belonged to proxy registration services. These may also belong to members of the research community, but that cannot be confirmed, so they were left in the data set (if you registered a newGOZ domain through a privacy protection service, feel free to let me know). The newGOZ domain deltas were then plotted in R and are shown below.
The majority of domains had negative delta values, indicating they were registered before a query spike occurred. This makes sense, as registrants are likely interested in owning the domain during the spike and thus receiving the largest amount of traffic to it. Many of the domains also had a delta very close to zero, meaning the registration time and query spike time were very close to each other. The following graph shows the same data as figure 2, but with the data points replaced by the email address used to register the domain.
A window of plus and minus 25 hours was created around the zero delta value, and the domains inside it were removed. All domains which fell within this window were registered within 25 hours before or after queries for them spiked. The following graph shows the same data as figure 3 with the +/-25 hour window of domains removed, leaving only domains registered more than 25 hours before or more than 25 hours after the domain’s queries spiked.
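As a rough sketch of this thresholding step, the filter below keeps only domains whose delta lies more than 25 hours from zero. The domain names and delta values are made up for illustration, and the function name is hypothetical.

```python
WINDOW_HOURS = 25.0  # half-width of the window around a zero delta


def outside_window(deltas, window=WINDOW_HOURS):
    """Keep only domains whose delta is more than `window` hours from zero."""
    return {domain: d for domain, d in deltas.items() if abs(d) > window}


# Invented sample data: domain -> registration to query spike delta (hours).
sample = {
    "early.example": -70.5,  # registered long before the spike
    "near.example": -8.27,   # inside the window, removed
    "edge.example": 25.0,    # exactly on the boundary, removed
    "late.example": 31.0,    # registered well after the spike
}

remaining = outside_window(sample)
print(sorted(remaining))
```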
Clustering the Data
Instead of relying on an arbitrary threshold of 25 hours to categorize the domains’ deltas, we applied the DBSCAN algorithm (using the fpc package for R) to the deltas and the artificial feature. This unsupervised machine-learning algorithm was able to identify clusters of similar deltas based on their density. The graph below shows the same data as figure 3, with colors representing the clusters DBSCAN identified when using a Euclidean distance of 75 minutes and a minimum cluster size of 3.
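To make the clustering step concrete, here is a minimal pure-Python sketch of DBSCAN. It is not the fpc implementation used for the analysis, and the deltas are invented. It assumes deltas are in hours (so 75 minutes becomes 1.25), that a point counts itself toward the minimum of 3, and it labels noise as 0 to mirror the cluster numbering in this post. Because every point carries the same artificial feature of one, the Euclidean distance reduces to the absolute difference between deltas.

```python
import math

NOISE = 0  # mirrors the post, where cluster 0 is noise


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0 for noise, 1..k for clusters."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not None:
                continue  # already assigned
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


# Invented deltas in hours, each paired with the artificial feature of one.
deltas = [-60.0, -59.5, -59.0, -0.5, 0.0, 0.4, 30.0, 90.0]
points = [(d, 1.0) for d in deltas]

labels = dbscan(points, eps=75 / 60, min_pts=3)
print(labels)
```

With this toy input, the two dense groups of deltas each form a cluster, while the two isolated deltas are labeled as noise.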
Cluster 0 is what DBSCAN identified as noise. Eliminating the clusters and plotting only the noise would result in a graph very similar to that of figure 4. Using DBSCAN instead of trusting my analyst’s intuition made me feel better because it involved math.
Exploring the Clusters
After removing the noise from figure 5, a more focused plot can be created. The graph below shows the clusters DBSCAN identified more closely.
The clusters identified seem to represent domains registered before the query spikes, just as the query spikes are occurring, and after the query spikes. Mapping these clusters to the registrars used by the domains in each cluster produced the graph below.
Enriching the Clusters
Using the recently released OpenGraphiti engine, we built and visualized a semantic network of registration information, both for domains belonging to clusters and for domains considered noise. Domains using a proxy registration service were not considered. First, the newGOZ domain registrant email addresses were taken from the data set. Then, all the domains registered with those email addresses as registrant email addresses (including ones that were not part of the newGOZ DGA) were identified. Lastly, the registrant name and physical address were added to the data set. The first graph below shows the registrants of the noise domains not using a proxy registration service.
From left to right, the node types are registrar, domain name, registrant email address, registrant name, and registrant physical address. The registrant email addresses of these noise domains each map to a single registrant name, which in turn maps to a single registrant physical address. In other words, each registrant has one name and each email address belongs to one person.
Inspecting the registration information of the domains belonging to clusters identified by DBSCAN (again ignoring proxy registrations), it seems these registrants have much less consistent identities. While the relationship between registrant name and physical address remained 1:1, name to email address relationships were found to be 1:1, many-to-one, and one-to-many. These email and name sharing relationships between the registrants of cluster domains are shown below.
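One way to check for these relationship types is to count how many distinct names appear with each email address and how many distinct email addresses appear with each name. The sketch below does this over invented (email, name) pairs. The function name and the labels are assumptions for illustration, not part of the original analysis.

```python
from collections import defaultdict


def classify_name_email_links(records):
    """Label each (email, name) pair by its name-to-email relationship."""
    names_by_email = defaultdict(set)
    emails_by_name = defaultdict(set)
    for email, name in records:
        names_by_email[email].add(name)
        emails_by_name[name].add(email)

    links = {}
    for email, name in set(records):
        shared_email = len(names_by_email[email]) > 1  # several names, one email
        many_emails = len(emails_by_name[name]) > 1    # one name, several emails
        if shared_email and many_emails:
            links[(email, name)] = "many-to-many"
        elif shared_email:
            links[(email, name)] = "many-to-one"
        elif many_emails:
            links[(email, name)] = "one-to-many"
        else:
            links[(email, name)] = "1:1"
    return links


# Invented Whois pairs: (registrant email address, registrant name).
records = [
    ("a@example.com", "Alice"),
    ("shared@example.com", "Bob"),
    ("shared@example.com", "Carol"),
    ("d1@example.com", "Dave"),
    ("d2@example.com", "Dave"),
]

links = classify_name_email_links(records)
for pair, kind in sorted(links.items()):
    print(pair, kind)
```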
The node types depicted above take the same structure as figure 8 where nodes (from left to right) represent the registrar used, the newGOZ domain, the email address used by the registrant, the registrant’s name and the physical address of the registrant.
Putting It All Together
My intuition tells me that patterns in registration to query spike time do not generally occur in non-DGA co-occurring domains, but we have yet to prove this theory. In the case of newGOZ, we were able to identify a distinct pattern in the registration information of domains whose registration to query spike deltas fell into clusters, a pattern that was not present in the registration information of domains whose deltas were considered noise. We will keep you posted as we further investigate this pattern.