Intro
Data-ink maximization, a concept popularized by Edward Tufte, means making every keystroke count (including the delete character). One famous example is his redesign of the scatterplot into what is known as a rugplot.
Simplify, then minimize. Add lines, that’s key. So let’s try visualizing time-dependent graphs with Tufte as inspiration, and with a twist: let’s visualize rotating infrastructure. That is, let’s capture new hostnames (for example, mail.google.com) that resolve to a hosting IP from hour to hour.
One additional restriction is that the solution must use Matplotlib and NetworkX, so that we can write something quickly. The source code to do this yourself is pasted below.
One Hosting IP
Given fictitious data on which hostnames resolve to each hosting IP in one hour and then the next, we can build a graph for each hour. The challenge is to visualize the evolution. In other words, the challenge is to compare two time-dependent graphs.
Here’s our simple answer: draw lines from one hour to the next. Draw a line from hosting IP A in the first time window to the same IP A in the second. Below is an example of doing just this:
FIGURE 1: Following hosting IP A from one hour to the next.
With the guideline following hosting IP A from hour to hour, we see the density of connected hostnames change. The change is due to more hostnames resolving to A in the second hour.
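To make the guideline idea concrete, here is a minimal sketch of one way to draw it, not necessarily how the figure above was produced. It assumes one NetworkX graph per hour drawn in side-by-side Matplotlib axes, and it uses Matplotlib's ConnectionPatch to connect the position of A in one axes to its position in the other. The hostnames and the function name plot_hours are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove for interactive use
import matplotlib.pyplot as plt
from matplotlib.patches import ConnectionPatch
import networkx as nx


def plot_hours(hour1, hour2, tracked):
    """Draw two hourly graphs side by side and link each tracked node across them."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    pos1 = nx.spring_layout(hour1, seed=1)
    pos2 = nx.spring_layout(hour2, seed=1)
    for ax, graph, pos, title in [(ax1, hour1, pos1, "hour 1"),
                                  (ax2, hour2, pos2, "hour 2")]:
        nx.draw_networkx(graph, pos=pos, ax=ax, node_size=40, with_labels=False)
        ax.set_title(title)
        ax.set_axis_off()
    for node in tracked:
        # Only nodes present in both hours can be followed.
        if node in pos1 and node in pos2:
            guide = ConnectionPatch(
                xyA=tuple(pos1[node]), coordsA="data", axesA=ax1,
                xyB=tuple(pos2[node]), coordsB="data", axesB=ax2,
                color="red", linewidth=1,
            )
            fig.add_artist(guide)
    return fig


# Made-up data: hosting IP "A" and the hostnames resolving to it in each hour.
hour1 = nx.Graph([("A", "x0vr8wn.net"), ("A", "52mt2pm.org")])
hour2 = nx.Graph([("A", "x0vr8wn.net"), ("A", "52mt2pm.org"),
                  ("A", "h3.example"), ("A", "h4.example")])
fig = plot_hours(hour1, hour2, tracked=["A"])
```

Calling fig.savefig("guideline.png") afterwards writes the picture to disk. The layouts of the two hours are computed independently here, so A may land at different heights in each panel; that is what makes the guideline useful.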
From a security perspective, an increase in the number of hostnames resolving to a hosting IP may indicate malicious or unintended behavior. For example, if we assume that a constant number of hostnames resolves to a hosting IP from one hour to the next (obviously a huge assumption), an increase in that number may be due to the IP starting to host a series of exploit kit [1] or phishing domains.
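The same comparison can be done numerically. The sketch below assumes each hour is a NetworkX graph whose edges join a hosting IP to a hostname, so the degree of an IP node is the number of hostnames resolving to it. The helper name hostname_growth and the sample data are hypothetical.

```python
import networkx as nx


def hostname_growth(hour1, hour2, hosting_ips):
    """Return {ip: (count_hour1, count_hour2)} for IPs whose hostname count rose."""
    growth = {}
    for ip in hosting_ips:
        before = hour1.degree(ip) if ip in hour1 else 0
        after = hour2.degree(ip) if ip in hour2 else 0
        if after > before:
            growth[ip] = (before, after)
    return growth


# Made-up data: A gains two hostnames in the second hour, B stays constant.
hour1 = nx.Graph([("A", "a1.example"), ("A", "a2.example"), ("B", "b1.example")])
hour2 = nx.Graph([("A", "a1.example"), ("A", "a2.example"),
                  ("A", "a3.example"), ("A", "a4.example"),
                  ("B", "b1.example")])
growth = hostname_growth(hour1, hour2, ["A", "B"])
print(growth)
```

Under the constant-count assumption stated above, any IP that appears in the result is worth a closer look; in practice a threshold or a per-IP baseline would probably be needed.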
Multiple Hosting IPs
Our next example builds on the first by overlaying more lines. Notice how the lines begin to convey information about the complexity and density of the clusters in the graph.
FIGURE 2: Following hosting IPs A, B, C, H, and S from one hour to the next.
By increasing the number of guidelines, we now trace multiple hosting IPs from one hour to the next. We can compare the density of connected hostnames per hosting IP. In addition, we can begin to identify any connections from hosting IP to hostname to hosting IP.
That is, hosting IP A and H in the first hour had nothing in common while in the second hour they had two hostnames in common. With the guidelines we can quickly re-trace two time-dependent subgraphs and map their evolution.
From a security perspective, if hosting IPs A and H had something in common in both hours, the resulting guidelines would have completed a rectangle, a cycle, between the two time-dependent graphs. In this case, they form a tree-like structure instead. What makes this interesting is that while A and H obviously have something in common in the later hour, it is not clear that they did in the earlier one. The guidelines suggest that the hostnames in the earlier hour may nonetheless be related.
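Whether two hosting IPs have something in common in a given hour can also be checked directly. This sketch assumes the same IP-to-hostname graphs as before and treats "in common" as sharing at least one neighboring hostname; the function name shared_hostnames and the sample hostnames are made up.

```python
import networkx as nx


def shared_hostnames(graph, ip_a, ip_b):
    """Return the hostnames adjacent to both hosting IPs in one hourly graph."""
    if ip_a not in graph or ip_b not in graph:
        return set()
    return set(graph[ip_a]) & set(graph[ip_b])


# Made-up data: A and H share nothing in hour 1 and two hostnames in hour 2.
hour1 = nx.Graph([("A", "a1.example"), ("A", "a2.example"), ("H", "h1.example")])
hour2 = nx.Graph([("A", "a1.example"), ("A", "a2.example"),
                  ("A", "s1.example"), ("A", "s2.example"),
                  ("H", "h1.example"),
                  ("H", "s1.example"), ("H", "s2.example")])

shared_before = shared_hostnames(hour1, "A", "H")
shared_after = shared_hostnames(hour2, "A", "H")
# A cycle across both hours would need a non-empty set in each hour.
completes_cycle = bool(shared_before) and bool(shared_after)
```

With this data the pair is connected only in the later hour, which corresponds to the tree-like case rather than the rectangle.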
We may therefore proceed, perhaps, from a known malicious hostname and begin to test whether other hostnames within the weakly drawn cluster (of the hostnames resolving to A and H) in the previous hour are also malicious.
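One way to turn that idea into a list of hostnames to test is sketched below. It assumes the cluster is simply every hostname resolving to any of the chosen hosting IPs in the previous hour, minus the hostnames already known to be malicious. The function name candidate_hostnames and all hostnames are hypothetical, and the actual maliciousness check is left out.

```python
import networkx as nx


def candidate_hostnames(previous_hour, hosting_ips, known_malicious):
    """Collect previous-hour hostnames on the given IPs that are not yet flagged."""
    candidates = set()
    for ip in hosting_ips:
        if ip in previous_hour:
            # Neighbors of a hosting IP are the hostnames resolving to it.
            candidates.update(previous_hour[ip])
    return candidates - set(known_malicious)


# Made-up previous hour: A and H have no hostname in common yet.
previous_hour = nx.Graph([("A", "a1.example"), ("A", "a2.example"),
                          ("H", "h1.example")])
# Suppose s1.example, seen on both A and H in the later hour, is known bad.
to_test = candidate_hostnames(previous_hour, ["A", "H"], {"s1.example"})
print(sorted(to_test))
```

The result is only a starting set for investigation; nothing in it is implied to be malicious.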
Next
The examples above simply traced one or several hosting IPs from one hour to the next. But notice that we could vary this. We could trace domains just as easily, or a combination of users and domains.
If you’re interested in graph analytics on time-dependent graphs, definitely check out this paper authored by folks at AMPLab, Databricks, and Uber.
Source
You’ll want your data stored in files like g1.txt and g2.txt, one JSON record per line, looking like this:
{"domsuf":"jriugrkbfdkjhg.com","client":"D","count":3}
{"domsuf":"jriugrkbfdkjhg.com","client":"E","count":3}
{"domsuf":"x0vr8wn.net","client":"A","count":3}
{"domsuf":"52mt2pm.org","client":"A","count":3}
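For reference, records in this shape could be loaded into a NetworkX graph roughly as follows. This is a sketch under assumptions: it treats "client" as the hosting IP label and "domsuf" as the hostname (the sample suggests this, since A appears with two different hostnames), and it stores "count" as an edge attribute. The function name load_graph is made up.

```python
import json
import networkx as nx


def load_graph(lines):
    """Build an IP-to-hostname graph from an iterable of JSON-lines records."""
    graph = nx.Graph()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Assumption: "client" is the hosting IP, "domsuf" the hostname.
        graph.add_node(record["client"], kind="hosting_ip")
        graph.add_node(record["domsuf"], kind="hostname")
        graph.add_edge(record["client"], record["domsuf"], count=record["count"])
    return graph


sample = [
    '{"domsuf":"jriugrkbfdkjhg.com","client":"D","count":3}',
    '{"domsuf":"jriugrkbfdkjhg.com","client":"E","count":3}',
    '{"domsuf":"x0vr8wn.net","client":"A","count":3}',
    '{"domsuf":"52mt2pm.org","client":"A","count":3}',
]
g1 = load_graph(sample)
```

Because the function accepts any iterable of lines, an open file works as well, for example `with open("g1.txt") as f: g1 = load_graph(f)`.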
Then you can run: