Cisco Umbrella

Research

Dominos, Botnets, and a little LSTM

By David Rodriguez
Posted on September 6, 2016
Updated on March 4, 2020

Introduction

Suppose you were watching a stream of numbers and, given the number you just saw, had to predict the next one. For example, if you saw 1,2,1,2,… you might guess 1. What if you saw 0,0,1,2,3…? Should the next number be 4? It almost feels like a domino effect. In this post we walk through predicting a specific type of spike in domain queries associated with some botnets. These domains spike like falling dominos: 0,0,0,1,2,3….
In this post we will

  • Explore our ability to make predictions about the number of future queries to a given domain.
  • Provide a recipe that can be applied on an hourly basis to a given watch-list of domains.
  • Keep everything reproducible for anyone with Python, PyBrain, and some data from the OpenDNS Investigate UI.

The Problem

Below we track the number of queries made to 8 different domains (per hour) over the last 5 days.
[Figure: queries per hour to 8 domains over the last 5 days]
Here the queries per hour to each domain appear to follow a sinusoidal pattern (with minor hiccups). These patterns are fairly predictable. On the other hand, consider the following domains:
[Figure: queries per hour to 7 domains, 5 of which spike around the same time]
Here we see 5 domains spiking around the same time, in contrast to two domains that follow the characteristic sinusoidal pattern. Naturally, we would like to predict when a domain will spike.
If you are already familiar with recurrent neural networks and related problems, you might want to skip down to the GitHub gist and just check out the code. Otherwise, carry on.

Recurrent Neural Networks and LSTMs

To do this we will train a recurrent neural network with long short-term memory (LSTM) using the PyBrain module. A recurrent neural network (RNN) is a popular choice when trying to classify or predict elements of sequential data (think of data from the stock market, speech, and even tweets). The most notable characteristic of RNNs is that they contain directed cycles, so the output at one step feeds back into the network at the next step. LSTM networks are a special kind of RNN whose gated memory cells let information persist across many time steps rather than only between adjacent ones. For more, check out this overview.
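To make the idea of memory concrete, here is a minimal sketch of a single LSTM cell step with scalar state, written in plain Python. It is not PyBrain's implementation: the `lstm_step` helper, the gate names, and the hand-picked weights are all hypothetical and exist only for illustration.

```python
import math


def sigmoid(x):
    """Logistic function, squashing x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell.

    w maps each gate name to (input weight, recurrent weight, bias).
    Returns the new hidden output h and the new cell state c.
    """
    def pre(name):
        wx, wh, b = w[name]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much of the new candidate to write
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # candidate value to write
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c


# Hand-picked (untrained) weights: forget gate open, input gate closed,
# so the cell should hold on to its initial state whatever the input is.
weights = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "output": (0.0, 0.0, 10.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.0
for x in [5.0, -3.0, 8.0]:
    h, c = lstm_step(x, h, c, weights)
```

With these weights the cell state stays close to its starting value of 1.0 across all three inputs, which is the kind of persistence a trained LSTM can learn to switch on and off.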

Training

Like most neural networks, ours is trained to minimize a loss function that compares the network's output with a target. In this case, we input one number and predict the next number in the sequence. Training therefore minimizes our error in predicting the next number.
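As a small illustration of this setup, the sketch below turns a series into (current, next) pairs and scores a set of predictions. Mean squared error is an assumption here, since the post does not name its loss function, and the helper names are hypothetical.

```python
def next_step_pairs(series):
    """Pair each value with the value that follows it."""
    return list(zip(series[:-1], series[1:]))


def mean_squared_error(predicted, actual):
    """Average squared difference between predictions and targets."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


queries = [0, 0, 0, 1, 2, 3]  # the "falling dominos" pattern from the introduction
pairs = next_step_pairs(queries)
inputs = [current for current, _ in pairs]
targets = [following for _, following in pairs]

# A model that simply repeats its input gives a baseline error to beat.
baseline_error = mean_squared_error(inputs, targets)
```

Any trained network is only useful for this task if its error on the same pairs falls below that repeat-the-input baseline.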
[Figure: training error by epoch]
Given about 100 domains and the last 5 days of queries per hour, we train the LSTM network over 5 epochs and 100 cycles. The error over the training set falls as the number of epochs increases and as more data is used to train the network.
Here we feed in one sequence of queries from one domain at a time and then calculate the error over a subset of the training data per epoch.
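A training loop along these lines might look like the sketch below. It assumes PyBrain is installed and is not the code from the post: the number of hidden units, the choice of `RPropMinusTrainer`, and the reading of "5 epochs and 100 cycles" as 100 cycles of 5 epochs each are all assumptions. It also measures error over the whole training set rather than a subset.

```python
def next_step_pairs(series):
    """Pair each value with the value that follows it."""
    return list(zip(series[:-1], series[1:]))


def build_dataset(sequences):
    """Build a PyBrain SequentialDataSet with one sequence per domain."""
    from pybrain.datasets import SequentialDataSet

    ds = SequentialDataSet(1, 1)  # one input value, one target value
    for series in sequences:
        ds.newSequence()  # keep each domain's history separate
        for current, following in next_step_pairs(series):
            ds.addSample(current, following)
    return ds


def train_lstm(sequences, hidden_units=5, epochs_per_cycle=5, cycles=100):
    """Train a small LSTM network and return it with the error per cycle."""
    from pybrain.structure.modules import LSTMLayer
    from pybrain.supervised import RPropMinusTrainer
    from pybrain.tools.shortcuts import buildNetwork

    ds = build_dataset(sequences)
    net = buildNetwork(1, hidden_units, 1,
                       hiddenclass=LSTMLayer, outputbias=False, recurrent=True)
    trainer = RPropMinusTrainer(net, dataset=ds)
    errors = []
    for _ in range(cycles):
        trainer.trainEpochs(epochs_per_cycle)
        errors.append(trainer.testOnData())  # error over the training data
    return net, errors
```

Calling `train_lstm` with a list of hourly query counts per domain returns the trained network and one error value per cycle, which can be plotted to check that training is converging.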
 

Predictions

The first result we will study is a spiking domain. This is, of course, the kind of domain we are mainly interested in, but it also serves as a good place to discuss the performance of the LSTM network. In Figure 3 you will see the spiking domain with the actual query traffic in black and the predicted traffic in blue.
[Figure 3: actual (black) and predicted (blue) queries per hour for the spiking domain]
We need to break down this graphic a little. First, notice how the black curve shows sharp transitions while the blue curve tends to lag and almost follows a rolling average. Second, notice how the blue curve appears slightly shifted to the right of the black one.
The softened transitions and the shift to the right both point to the error in our LSTM. To see this more clearly, I placed black diamonds on the black curve at every hour in which the domain's queries were just about to increase. Similarly, I placed blue dots on the blue curve at every hour in which the predicted queries were just about to increase.
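The markers described above can be computed with a one-line comparison of each hour against the next. The sketch below is illustrative only: the helper name is hypothetical and both series are made-up numbers, not data from the figure.

```python
def hours_before_increase(series):
    """Indices t where the value at hour t + 1 is larger than at hour t."""
    return [t for t in range(len(series) - 1) if series[t + 1] > series[t]]


# Made-up hourly counts for illustration.
actual = [3, 3, 9, 4, 20, 18]
predicted = [3, 4, 5, 7, 6, 15]

diamonds = hours_before_increase(actual)    # markers for the black curve
dots = hours_before_increase(predicted)     # markers for the blue curve
```

Hours that appear in `diamonds` but not in `dots` are the ones where the real traffic was about to rise and the forecast did not see it coming.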
[Figure: zoomed view of the moment before the largest spike]
In the zoomed figure, we look at the moment just before the largest spike. Here, the black diamond (circled in red) represents one of the largest numbers of queries per hour for this domain within the time window. The blue dot (circled in red) shows what the LSTM network predicted the next value to be. Unfortunately, the network predicted that the volume of queries would decrease in the next hour, while in reality the anomalous black diamond was a sign that the domain was about to spike.
In this case, we see the limits of the currently trained LSTM network. In particular, from one hour to the next, the LSTM tends to predict values that are close to the value it was just given as input.
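One way to check for this behavior is to compare the forecast against the actual series at the same hour and against the actual series one hour earlier. The sketch below uses made-up numbers and hypothetical helper names. If the error against the previous hour is much smaller, the model is mostly echoing its input.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def lag_check(actual, predicted):
    """Score a forecast against the current hour and the previous hour.

    predicted[t] is the forecast for hour t. Returns a tuple of
    (error against actual[t], error against actual[t - 1]).
    """
    aligned = mse(predicted[1:], actual[1:])
    shifted = mse(predicted[1:], actual[:-1])
    return aligned, shifted


# Made-up series: the forecast trails the real traffic by about an hour.
actual = [2, 2, 3, 12, 30, 9, 4]
predicted = [2, 2, 2, 3, 11, 28, 10]

aligned_error, shifted_error = lag_check(actual, predicted)
```

Here the forecast matches the previous hour far better than the hour it was meant to predict, which is the shifted, smoothed behavior visible in the blue curve.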
In fact, check out the following examples. As in the one just described, the real domain traffic is in black and the predicted traffic is in blue.
[Figures: four more domains, actual (black) and predicted (blue) queries per hour]
As you can see from the figures above, the LSTM network we trained with PyBrain seems relatively limited in its ability to predict spiking queries in subsequent hours for a domain. So where do we go from here?
Well, notice one of the upshots of our analysis: we are exploring our ability to make predictions about the number of future queries to a given domain. This is slightly different from detecting anomalies more generally, for the simple reason that the methods we have been describing are proactive, not reactive.
If predictive ability is what you’re after, there are a couple of choices. One is to begin experimenting with the LSTM network structure, for example by trying different numbers of hidden layers and units. In addition, given our initial results, we might be encouraged to keep this model but train it on more data. In that case, we may start to explore the relative payoff of using either Theano or TensorFlow.
In addition, we have tried to keep this analysis small and repeatable for anyone who has access to Python, PyBrain, and some of the data one can get from the OpenDNS Investigate UI. We tried to keep this brief and are indebted to the PyBrain developers and StackExchange contributors.
Lastly, as we iterate and improve on this model, we have explored a recipe that can be applied to a watch-list of domains: every hour, one retrieves the number of queries to each domain, predicts the subsequent hour, and with fairly simple methods detects when a domain will deviate substantially from its previous queries per hour. A simple test for anomalies is outlined in the code below.

Code Review
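One simple test along these lines can be sketched as follows. This is an illustrative assumption, not the code from the post's gist: the threshold `k`, the `min_std` floor, the helper names, the example domains, and the stand-in `extrapolate` predictor (used in place of the trained LSTM) are all hypothetical.

```python
import statistics


def is_anomalous(history, predicted_next, k=2.0, min_std=1.0):
    """Flag a forecast that sits more than k standard deviations
    away from the mean of the recent hourly counts."""
    mean = statistics.mean(history)
    # Floor the spread so a perfectly flat history does not flag everything.
    std = max(statistics.pstdev(history), min_std)
    return abs(predicted_next - mean) > k * std


def check_watch_list(watch_list, predict, k=2.0):
    """Return the domains whose predicted next hour looks anomalous.

    watch_list maps a domain name to its recent hourly query counts.
    predict is any callable that turns a history into a forecast.
    """
    return sorted(domain for domain, history in watch_list.items()
                  if is_anomalous(history, predict(history), k))


def extrapolate(history):
    """Stand-in predictor: continue the change seen in the last hour."""
    return history[-1] + (history[-1] - history[-2])


watch_list = {
    "steady.example": [10, 12, 11, 13, 12, 11],
    "spiking.example": [0, 0, 0, 1, 2, 3],
}
flagged = check_watch_list(watch_list, extrapolate)
```

Run hourly, a check like this only needs the latest counts for each watched domain and a forecast for the next hour, so the predictor can be swapped for a trained network without changing the test.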


Thanks for reading.
