Introduction
Suppose you were watching a stream of numbers… and, given the previous number you saw, you had to predict what the next number should be. For example, suppose you saw 1,2,1,2,… You might guess: 1. Or what if you saw 0,0,1,2,3…? Should the next number be 4? It almost feels like a domino effect. In this post we walk through predicting a specific type of spike in domain queries associated with some botnets. These domains spike like falling dominoes: 0,0,0,1,2,3….
In this post we will
- Explore our ability to make predictions about the number of future queries to a given domain.
- Provide a recipe that can be run on an hourly basis against a given watch-list of domains.
- Keep the analysis to something you can reproduce if you have Python, PyBrain, and some data from the OpenDNS investigate UI.
The Problem
Below we track the number of queries per hour made to 8 different domains over the last 5 days.
Here the queries per hour in a domain appear to follow a sinusoidal pattern (with minor hiccups). These patterns are fairly predictable. On the other hand, consider the following domains:
Here we see 5 domains spiking around the same time, in contrast to two domains that follow the characteristic sinusoidal pattern. Obviously, we would like to predict when a domain will spike.
If you are already familiar with recurrent neural networks and related problems, you might want to skip down to the GitHub gist and just check out the code. Otherwise, carry on.
Recurrent Neural Networks and LSTMs
To do this we will train a recurrent neural network with long short-term memory (LSTM) using the PyBrain module. A recurrent neural network (RNN) is a popular choice when trying to classify or predict elements from sequential data (think of data from the stock market, speech, and even tweets). The most notable characteristic of RNNs is that they contain directed cycles, so the output at one time step feeds back into the network at the next step. LSTM networks, in turn, are a special kind of RNN whose units contain gated memory cells, which let the network carry information across many time steps rather than only between adjacent ones. For more check out this overview.
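To make the idea of a gated memory cell a little more concrete, here is a minimal sketch of a single scalar LSTM step in plain Python. It follows the standard textbook formulation and is not PyBrain's internal implementation; the function names and the hand-picked weights in `params` are assumptions for illustration only, not trained values.

```python
import math


def sigmoid(z):
    """Logistic function, squashing z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """Run one time step of a scalar LSTM cell.

    params maps each gate name to (w_x, w_h, b): the weight on the input,
    the weight on the previous hidden state, and the bias.
    Returns the new hidden state h and the new cell state c.
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre_activation("input"))        # how much new information to write
    f = sigmoid(pre_activation("forget"))       # how much old cell state to keep
    o = sigmoid(pre_activation("output"))       # how much cell state to expose
    g = math.tanh(pre_activation("candidate"))  # candidate value to write

    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# Hypothetical, hand-picked weights (illustrative only, not trained).
params = {
    "input": (1.0, 0.0, -2.0),
    "forget": (0.0, 0.0, 4.0),   # forget gate stays close to 1: keep memory
    "output": (0.0, 0.0, 2.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 0.0
states = []
for x in [0, 0, 0, 1, 2, 3]:
    h, c = lstm_step(x, h, c, params)
    states.append((h, c))
```

With these weights the cell state stays at zero while the input is zero and then accumulates as the counts start to climb, because the forget gate keeps most of the previous state at every step.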
Training
Like most neural networks, ours is trained to minimize a loss function that compares the network's output with a target. In this case, we wish to input one number and predict the next number in the sequence. Therefore we wish to minimize our errors in predicting the next number.
Given about 100 domains and the last 5 days of queries per hour we train the LSTM network… over 5 epochs and 100 cycles. We see the error rate on the training set fall as the number of epochs increases and as more data is used to train the network.
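As a minimal sketch of this setup, the snippet below pairs each value in a series with the value that follows it and scores a set of predictions with mean squared error. The helper names are hypothetical, and mean squared error is assumed here as the loss purely for illustration.

```python
def next_step_pairs(series):
    """Pair each value with the value that follows it: (input, target)."""
    return list(zip(series[:-1], series[1:]))


def mean_squared_error(predictions, targets):
    """Average squared difference between predictions and targets."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


queries = [0, 0, 0, 1, 2, 3]
pairs = next_step_pairs(queries)

inputs = [x for x, _ in pairs]
targets = [y for _, y in pairs]

# Loss of a trivial predictor that simply repeats its input.
loss = mean_squared_error(inputs, targets)
```

A trained network should aim for a lower loss than this trivial "repeat the input" predictor on the same pairs.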
Here we feed in one sequence of queries from one domain at a time and then calculate the error over a subset of the training data per epoch.
Predictions
The first result we will study is a spiking domain. This domain, of course, is what we are mainly interested in, but it also serves as a good place to discuss the performance of the LSTM networks. In Figure 3 you will see the spiking domain with the actual query traffic in black and the predicted in blue.
We need to break down this graphic a little. First, notice how the black curve shows sharp transitions while the blue tends to lag and almost follows a rolling average. Second, notice how the blue curve appears slightly shifted to the right of the black.
With respect to the blue curve softening transitions and appearing shifted or translated to the right, we actually can identify the error in our LSTM. In fact, to more clearly see this I interleaved black diamonds on the black curve for every hour in which the domain was just about to increase. Similarly, I interleaved blue dots on the blue line for every hour in which the domain was just about to have a predicted increase in queries.
In the figure to the left, we have zoomed into the moment prior to the largest spike. Here, we see the black diamond (circled in red) representing one of the largest numbers of queries per hour for this domain within the time window. The blue dot (circled in red) shows what the LSTM network predicted the next value to be. Unfortunately, the LSTM network predicted the volume of queries in the next hour to decrease, while in reality the anomaly of the black diamond was more of an indicator that the domain was about to have a spike in the volume of queries in the next hour.
In this case, we see the limits of the currently trained LSTM network. Specifically, from one hour to the next, the LSTM tends to predict values that are relatively similar to the values it was given as input.
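One way to quantify this "echoing the input" behavior is to compare the model against a persistence forecast, which predicts each next value as a copy of the current one. The sketch below is a hedged illustration: the helper names are hypothetical, and both the series and the model predictions are made-up numbers, not results from our network.

```python
def mean_absolute_error(predictions, targets):
    """Average absolute difference between predictions and targets."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


def persistence_forecast(series):
    """Predict each next value as a copy of the current one."""
    return list(series[:-1])


def error_ratio_vs_persistence(model_predictions, series):
    """Model error divided by persistence error for the targets series[1:].

    A ratio near 1.0 suggests the model does little more than echo its
    input; a ratio well below 1.0 suggests it adds predictive power.
    """
    targets = series[1:]
    baseline = mean_absolute_error(persistence_forecast(series), targets)
    if baseline == 0:
        raise ValueError("series is constant; persistence error is zero")
    return mean_absolute_error(model_predictions, targets) / baseline


# Made-up queries per hour and made-up model predictions for hours 2..6.
series = [10, 12, 11, 50, 90, 40]
model_predictions = [11, 12, 12, 45, 80]

ratio = error_ratio_vs_persistence(model_predictions, series)
```

In this made-up example the ratio comes out just under 1.0, which is the signature of a model that mostly tracks the previous value.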
In fact, check out the following examples. These examples are like the one just described where the real domain traffic is in black and the predicted in blue.
As you can see from the figures above, the LSTM network we trained with PyBrain seems relatively limited in its ability to predict spiking queries in subsequent hours for a domain. So where do we go from here?
Well, notice, one of the upshots of our analysis: we are exploring our ability to make predictions about the number of future queries to a given domain. This is slightly different than detecting anomalies more generally, for the simple reason that the methods we have been describing are proactive, not reactive.
If predictive ability is what you’re after, then there are a couple of choices. One choice is to begin playing around with the LSTM network structure, for example by trying out different numbers of hidden layers and units. In addition, given our initial results here, we might be encouraged to take this model and train it on more data. If that’s the case, we may start to explore the relative payoff of using either Theano or TensorFlow.
In addition, we have tried to keep this analysis small and repeatable for anyone who has access to Python, PyBrain, and some of the data one can get from the OpenDNS investigate UI. We tried to keep this brief and are indebted to the PyBrain developers and StackExchange contributors.
Lastly, as we iterate and improve on this model, we have explored a recipe which can be incorporated into a watch-list of domains: every hour one retrieves the number of queries to each domain, predicts the subsequent hour, and with fairly simple methods detects when a domain will deviate substantially from its previous queries per hour. A simple test for anomalies is outlined in the code below.
Code Review
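One possible shape for such an hourly anomaly test, offered as a hedged sketch rather than the exact code from the gist: flag a domain when the predicted next-hour count sits more than a few standard deviations away from its recent history. The function name, the `threshold` and `min_std` parameters, and the example domains and counts are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def is_anomalous(history, predicted_next, threshold=3.0, min_std=1.0):
    """Return True if predicted_next deviates substantially from history.

    history: recent queries-per-hour counts for one domain.
    predicted_next: the model's prediction for the coming hour.
    threshold: number of standard deviations that counts as substantial.
    min_std: floor on the standard deviation, so that a flat history
             (for example all zeros) does not flag tiny changes.
    """
    if len(history) < 2:
        raise ValueError("need at least two hours of history")
    center = mean(history)
    spread = max(pstdev(history), min_std)
    return abs(predicted_next - center) > threshold * spread


# Hypothetical watch-list: domain -> (recent hourly counts, predicted next hour).
watch_list = {
    "example-a.test": ([100, 110, 95, 105, 102], 108),
    "example-b.test": ([0, 0, 0, 1, 2], 30),
}

flagged = [
    domain
    for domain, (history, predicted) in watch_list.items()
    if is_anomalous(history, predicted)
]
```

Here only the second domain is flagged: its predicted count is far outside anything in its recent history, while the first domain's prediction falls within its normal hour-to-hour variation.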
Thanks for reading.