Evolutionary data is a record of past events and circumstances. Understanding it can be extremely valuable, because it reveals history, brings insight into the present, and often helps forecast the future. In this post we’ll outline some useful techniques for visualizing evolutionary data and provide tips for making these visualizations more effective.
At OpenDNS, we possess a huge amount of evolutionary data for domains. First of all, we see every single query our customers make, which reveals the query volume and the changes of infrastructure for a domain over time. We also keep a record of key timestamps, for example, the time when a domain first appears in the query logs. Malicious domains usually have more interesting data, e.g., the time when they get blocked by us. Lastly, we keep a complete record of whois data for each domain, from the moment it is registered to its expected expiration date.
For continuous evolutionary data, such as a domain’s query volume over time, a simple line chart is effective enough. In this post, we are particularly interested in the other type of evolutionary data: time-to-event data, for instance, the time when a domain is created, or the time when a domain is flagged by us or another source. Security researchers at OpenDNS have already been using this data in their work.
We’ve previously published a post about visualizing the life of a single domain, but it’s also important to look at things at a larger scale. By visualizing the time-to-event data for a group of domains, we are able to find patterns and outliers, and re-examine our models.
There are two types of graphs in this visualization tool. The first one is a group timeline (see figure 1). The idea is simple: first we draw a timeline for every event type; then, for each domain, we mark the timestamp of each event on the corresponding timeline with a circle; finally, we connect all of that domain’s events (circles) with a line. Each line on this graph therefore represents one domain’s evolution, and the slope of the line indicates the chronological order of events. In fact, when you only have two events, it’s a slopegraph, as invented by Edward Tufte.
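The construction above can be sketched in a few lines of Python. This is only an illustration, not the tool’s actual code: the function names, the row order of the event types, and the "forward"/"vertical"/"backward" labels are all assumptions made for the example. Each domain becomes a list of (timestamp, row) points, one per event it has, and each segment between adjacent timelines is labeled by its direction.

```python
from datetime import datetime

# Assumed row order of the parallel timelines (top to bottom).
# The labels follow the event names used in this post.
EVENT_ORDER = ["Registered", "ODNS first_seen", "ODNS first_tag"]

def domain_polyline(events, event_order=EVENT_ORDER):
    """Return (timestamp, row) points for one domain.

    `events` maps an event type to a datetime. Event types the domain
    does not have are skipped, so the line connects only existing circles.
    """
    return [(events[name], row)
            for row, name in enumerate(event_order)
            if name in events]

def segment_directions(points):
    """Label each segment between consecutive points of a polyline."""
    labels = []
    for (t0, _), (t1, _) in zip(points, points[1:]):
        if t1 == t0:
            labels.append("vertical")   # both events at the same time
        elif t1 > t0:
            labels.append("forward")    # lower timeline's event is later
        else:
            labels.append("backward")   # lower timeline's event is earlier
    return labels

# Hypothetical domain, for illustration only.
example = {
    "Registered": datetime(2015, 5, 1),
    "ODNS first_seen": datetime(2015, 6, 1),
    "ODNS first_tag": datetime(2015, 6, 1),
}
points = domain_polyline(example)
directions = segment_directions(points)
```

A plotting library would then draw one circle per point and one line per domain; the geometry stays the same regardless of the rendering layer.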
The second type of graph, the box plot, is a standard statistical way of showing the distribution of numerical data. To draw a box plot you need to calculate five statistics of your data: minimum, first quartile, median, third quartile, and maximum (see figure 2 below). Here the minimum and maximum are the ends of the whiskers, so data points that fall outside the range from minimum to maximum are drawn as outliers.
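As a rough sketch of how those five statistics can be computed, the Python below uses the standard library’s `statistics.quantiles`. The post does not say which outlier rule the tool uses, so the common Tukey convention (whiskers reach at most 1.5 times the interquartile range beyond the box) is an assumption here, as is the function name `box_plot_stats`.

```python
from statistics import quantiles

def box_plot_stats(values, whisker=1.5):
    """Five box plot statistics plus outliers (needs at least two values).

    Assumption: Tukey-style fences at `whisker` * IQR beyond the quartiles.
    "minimum" and "maximum" are the most extreme values inside the fences.
    """
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "minimum": inside[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "maximum": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical intervals in days: mostly zero, with one negative value.
stats = box_plot_stats([0, 0, 0, 0, 0, 0, 0, -30])
```

With data like this, the box collapses to a single line at zero and the one negative value is reported as an outlier.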
A box plot shows rich information within limited space, and is particularly useful for comparing multiple sets of data. On the other hand, it takes some effort for first-time viewers to interpret.
Using this tool, you can choose any pair of event types from your dataset; it will calculate the time interval (in days) between them for each domain and draw the box plot.
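The interval calculation might look something like the following Python sketch. The function name, the data layout (a dict of event timestamps per domain), and the sign convention (end event minus start event) are assumptions for illustration; the tool itself may do this differently.

```python
from datetime import datetime

def interval_days(domains, start_event, end_event):
    """Days from `start_event` to `end_event` for each domain with both.

    Assumed sign convention: a negative value means `end_event`
    happened before `start_event`. Domains missing either event are skipped.
    """
    result = {}
    for name, events in domains.items():
        if start_event in events and end_event in events:
            delta = events[end_event] - events[start_event]
            result[name] = delta.total_seconds() / 86400  # seconds per day
    return result

# Hypothetical domains and timestamps, for illustration only.
sample = {
    "a.example": {"first_seen": datetime(2015, 6, 1), "first_tag": datetime(2015, 6, 1)},
    "b.example": {"first_seen": datetime(2015, 6, 1), "first_tag": datetime(2015, 6, 3, 12)},
    "c.example": {"first_seen": datetime(2015, 6, 1)},  # never tagged
}
intervals = interval_days(sample, "first_seen", "first_tag")
```

The resulting values are exactly what a box plot routine needs as input, one number per domain.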
Model Efficacy Analysis
Our security researcher Jeremiah implemented NLP-Rank, a model that is good at catching phishing domains. Let’s visualize a sample of malicious domains NLP-Rank has caught recently (see figure 3).
On the timeline, you will notice that most of the lines are vertical, indicating no latency between when OpenDNS observes a domain for the first time (ODNS first_seen) and when we block it (ODNS first_tag), because NLP-Rank is able to flag a phishing domain the moment it appears in our logs. The box plot echoes this result: all of its statistics are equal to 0 days.
However, the lines with irregular slopes on the timeline (highlighted in orange) stand out; they correspond to the negative outliers on the box plot. There are a few reasons behind their existence. The main one is that a phishing domain is sometimes observed for the first time before it is serving content, so we only catch it later, once it starts a phishing attack. The second reason is that processing large logs can introduce some latency: sometimes a phishing domain is live for just a couple of hours, and by the time we try to retrieve its content, it’s already down. Lastly, we might also miss things during the gaps when researchers maintain and upgrade the model. You can read more about NLP-Rank.
Now let’s cross-check OpenDNS’s response against VirusTotal’s feed by adding another event type from VirusTotal: the first time VirusTotal sees a malicious URL with positive detections (VT_FirstFlag).
First of all, out of 267 domains, VirusTotal returns first_flag results for 165 of them.
In figure 4, the box of the box plot (spanning the lower to the upper quartile) is located on the negative side of the scale, along with more negative outliers, which indicates that, statistically, NLP-Rank caught these domains earlier than VirusTotal did. Accordingly, these domains have lines with a negative slope on the group timeline. However, there are also positive outliers, which are worth further investigation.
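To make the reading of figure 4 concrete, here is a small Python sketch of the two numbers involved: how many domains VirusTotal returned a result for, and how the sign of each interval is interpreted. The function names are hypothetical, and the sign convention (our first tag minus VirusTotal’s first flag, so negative means we were earlier) is an assumption inferred from the description above. Only the counts 165 and 267 come from this post; the interval list is made up.

```python
def coverage(matched, total):
    """Fraction of domains for which the other feed returned a result."""
    return matched / total

def lead_lag_counts(intervals_days):
    """Count domains by the sign of their interval.

    Assumed convention: interval = (our first tag) - (their first flag),
    so a negative interval means we flagged the domain earlier.
    """
    return {
        "earlier": sum(1 for d in intervals_days if d < 0),
        "same_time": sum(1 for d in intervals_days if d == 0),
        "later": sum(1 for d in intervals_days if d > 0),
    }

# Counts taken from this post: 165 of 267 domains had a VirusTotal result.
vt_coverage = coverage(165, 267)

# Hypothetical intervals in days, for illustration only.
counts = lead_lag_counts([-12.0, -3.5, 0.0, -1.0, 4.0])
```

Here `vt_coverage` comes out at roughly 62%, which is worth keeping in mind: the box plot only describes the domains that both sources know about.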
Exploit Kits Analysis
Following the same method, let’s analyze some recent exploit kit (EK) domains. In figure 5, we can quickly identify a few things:
- We usually block an EK domain a few hours after we first see it.
- Our results are largely aligned with VirusTotal’s: in the second box plot, most of the statistics are zero.
Now if we add another ingredient, the time a domain was registered (Registered), the correlation becomes interesting. In figure 6, most domains were only “put into use” some time after they were registered, from a few days up to almost a year. This happens because attackers hijack the accounts of benign domain registrants and then create subdomains for their malicious content, a technique also referred to as domain shadowing. However, a small family of domains does not use domain shadowing; instead, these domains are freshly created for the dedicated delivery of exploit kits. They are highlighted in orange in figure 6, and listed below if you are interested:
Through the use cases above, we demonstrated how visualizations such as box plots and slopegraphs can be used in security research. To make them even more useful, these visualizations can be extended with a few interaction techniques that help people explore their own datasets.
Filtering & Sorting
Visualizations often provide the big picture, and ideally point you in the direction of your next step. In this tool, filtering allows you to remove events that are not relevant and focus on the “real meat” of the analysis.
In addition, the ability to sort the timelines helps users find interesting correlations between events on two adjacent timelines, which would be difficult to find in a text-only format.
Details on Demand
Instead of calculating a box plot for every pair of events up front, the tool provides this extra information only when you ask for it, and you are free to select any pair of events that matters to you.
Hovering to highlight and view domain detail is also supported.
Although we didn’t apply colors in the examples above, coloring can give users more useful information. You can use color to represent different sets of domains, so they are easy to compare. You can even use a linear color scale to represent quantitative data. In figure 1, the coloring encodes the creation date: the older the domain, the redder its line.
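A linear color scale of this kind can be sketched in Python as a simple interpolation between two RGB colors. The post only says that older domains are redder; the function name, the choice of gray for the newest end, and the clamping behavior are assumptions made for this example.

```python
from datetime import datetime

def linear_color(created, oldest, newest,
                 old_rgb=(255, 0, 0), new_rgb=(200, 200, 200)):
    """Map a creation date to an RGB color on a linear scale.

    Assumption: the oldest date maps to red and the newest to light gray.
    Dates outside [oldest, newest] are clamped to the nearest end.
    """
    span = (newest - oldest).total_seconds()
    t = 0.0 if span == 0 else (created - oldest).total_seconds() / span
    t = min(max(t, 0.0), 1.0)
    # Interpolate each channel between the two end colors.
    return tuple(round(o + (n - o) * t) for o, n in zip(old_rgb, new_rgb))

# Hypothetical date range, for illustration only.
oldest, newest = datetime(2014, 1, 1), datetime(2015, 1, 1)
old_color = linear_color(oldest, oldest, newest)
new_color = linear_color(newest, oldest, newest)
```

Interpolating in RGB is the simplest option; a perceptually uniform color space would give a more even gradient, at the cost of a little more code.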
In this post, we analyzed phishing domains and exploit kit domains. Another interesting analysis would be to compare the evolution patterns of different sets of domains, such as domains grouped by attack, threat type, or classifier.
Furthermore, if we could put this specific slice of data into a larger context and combine it with other types of data, for example the query volume over time and the neighborhood connections within a large graph, it would reveal a clearer picture of the domains in question. However, doing so will certainly bring more design challenges, as we want to display more information while maintaining simplicity and a great user experience.