In November 2012 at the SF Data Mining Meetup, OpenDNS CTO Dan Hubbard shared the stage with data scientists from Pandora. On the surface, the two companies couldn’t have appeared more different that night — one an Internet security company, the other an early pioneer in Internet radio. But when it comes to data science the companies are similar in that they both rely on having unique sets of data no other company can match.
During this particular meetup, Hubbard outlined a new approach to using data science for network security. He explained the unique view OpenDNS had of the world’s constantly-shifting Internet addresses and how this vantage point resulted in powerful insight into the traffic patterns of millions of daily users. Now, almost three years later under Hubbard’s leadership, OpenDNS is unveiling two new detection models that have been developed by the company’s data science team. The first new model, called Spike Rank (SPRank), is a detection system that uses mathematical concepts more commonly applied to analyzing sound waves in real time, much like the techniques used for Pandora’s Music Genome Project. The second model, Predictive IP Space Monitoring, uses the clues uncovered by SPRank to anticipate attacks before they take place. Together the two models expand the company’s applied artificial intelligence system for blocking online attacks.
Taking a Data Science Approach to Security
From the beginning, Hubbard realized that OpenDNS’s millions of users around the world gave his team a dataset that was unlike any other, even among security companies. The company sees over 80 billion Internet requests daily and has access to years’ worth of network traffic data. Starting in 2012, Hubbard began to build the OpenDNS Security Labs research team, making sure to include a mix of specialized security researchers and hardcore data scientists. He was looking to adopt a hybrid approach that would use data-centric techniques to revolutionize traditional security research. One of the first researchers Hubbard brought on board was Dhia Mahjoub, Ph.D., a data scientist with specific expertise in distributed sensor networks.
During that same data mining meetup in 2012, OpenDNS’s research team outlined a model that already could predict domain names generated by the notorious Cryptolocker ransomware and some botnets. By late 2013, this predictive system could identify and block connections to the domain names used by malware — days or even weeks before attacks were launched.
As a next step, Mahjoub shifted focus to study the more complex attacks based on exploit kits, one of the most popular forms of malicious software used to infect computers. Once an unsuspecting user’s machine has been compromised, bad actors can add the infected machine to a botnet, steal online banking information, or install further malware. For example, one recent campaign taken down by Cisco’s Talos research team was estimated to generate more than $30 million annually for criminals.
Through his work, Mahjoub discovered that the criminals behind the latest wave of attacks were using an entirely different set of approaches to keep the servers powering their campaigns hidden from detection. They would automate the process of using legitimate websites and domains that they had already hacked to make malicious web traffic look legitimate. Other criminals would launch entirely new domains hosted in the dark corners of the Internet, or use a network of proxy servers to hide their online activities. These evasion techniques made it very difficult for an automated system to determine whether a new subdomain should be blocked until a significant amount of time (weeks or months) had passed after an attack had been launched.
Finding “Ghost Noises”
To tackle the problem of identifying criminals online, Mahjoub enlisted the help of fellow OpenDNS data scientist Thomas Mathew. Together, the two began looking for patterns in network requests for compromised websites — trying to find a method that would reliably detect these attacks. “We started asking ourselves questions such as ‘What are a set of features that are hard for criminals to change? What is something that people don’t really think about?’” Mathew said. “Then I realized…if you’re thinking about network traffic, it’s really nothing more than a waveform.”
By looking at the network traffic in OpenDNS’s dataset as patterns, Mahjoub and Mathew could see that some domains (like gmail.com or amazon.com) have consistent high-volume incoming traffic. Others might have sudden spikes in traffic at regular intervals or follow some other pattern entirely. Mathew began cross-referencing these newly-discovered traffic patterns with the data already in Security Graph, OpenDNS’s database of “good” and “bad” Internet addresses. By examining how a domain’s traffic pattern changed after the domain became malicious, Mathew realized that these patterns closely echoed the sound waves that companies like Pandora classify every day.
“There’s already lots of mathematical theory that exists to describe sounds,” Mathew said. “Domains like Google and Yahoo! will have a similar ‘sound wave,’ because they get lots of regular traffic. The domains used in these attacks are only alive for a certain amount of time, so their patterns are much faster and shorter. To continue the analogy, these attacks sound like ghost noises — short beeps or chirps. Imagine a sound that appears for just a second and then is gone. You need to build a system that can match that pattern and identify those sounds as quickly as possible.”
Mahjoub and Mathew soon discovered that this new system functioned as a kind of sonar for network security: it was able to quickly locate these transient patterns in the more than half a terabyte of traffic data that OpenDNS processes on an hourly basis. As they cross-referenced the domains tagged by SPRank (this new, patent-pending recognition system) with other systems, they found that it could identify these malware attack patterns with a high degree of accuracy. Now in production, the model identifies hundreds of compromised domains every hour, over a third of which are not detected by any other antivirus or antimalware scanner, according to VirusTotal. But even as they implemented a new system for identifying attacks in progress, the two wondered: how could they zero in on these attacks before they occurred?
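To make the waveform analogy concrete, here is a minimal Python sketch that separates steady traffic from a short “chirp.” It is not SPRank, whose internals are not described in this article. It borrows one standard audio measure, the crest factor (peak level divided by root-mean-square level), and every function name, threshold, and hourly count below is an illustrative assumption.

```python
import math


def crest_factor(counts):
    """Peak value divided by the RMS value of a traffic series."""
    if not counts:
        raise ValueError("counts must not be empty")
    rms = math.sqrt(sum(c * c for c in counts) / len(counts))
    if rms == 0:
        return 0.0  # no traffic at all, so nothing to measure
    return max(counts) / rms


def looks_like_chirp(counts, min_crest=3.0, max_active_fraction=0.2):
    """Flag a series whose traffic is packed into a few short bursts.

    Both thresholds are hypothetical values chosen for this example.
    """
    active_hours = sum(1 for c in counts if c > 0)
    return (crest_factor(counts) >= min_crest
            and active_hours / len(counts) <= max_active_fraction)


# Made-up hourly query counts over one day (24 samples each).
steady_domain = [1000, 1100, 950, 1050] * 6      # constant, heavy traffic
short_lived_domain = [0] * 20 + [400, 900, 150, 0]  # brief burst, then silence

steady_flag = looks_like_chirp(steady_domain)
burst_flag = looks_like_chirp(short_lived_domain)
print("steady domain flagged:", steady_flag)
print("short-lived domain flagged:", burst_flag)
```

A steady series has a crest factor close to 1, while a single brief burst pushes it well above that, which is why the short-lived series is the only one flagged. A production system would need far more than one statistic, since legitimate domains also see sudden spikes.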
Anticipating the Next Attack
After spending nearly two years studying how criminals were hiding “bad” domain names among “good” ones online, Mahjoub concluded that another model would be needed to predict attacks before they occur. Mahjoub used a more granular approach that catalogued the “fingerprints” left by bad actors’ collective infrastructure. He found that this method could identify over 300 new domains every hour that would be used to host malware in the future, allowing him to block these domains before an attack is ever launched.
Called Predictive IP Space Monitoring, this new model starts with the initial “clues” found by SPRank. It uses eight major patterns in how servers are hosted to determine which domains will be the source of future malicious activity. For example, Mahjoub uncovered a new technique called domain shadowing: using a compromised subdomain of a legitimate website (like “bad.opendns.com” instead of “opendns.com”) as the base for launching an attack. He also discovered how attackers could hide server addresses in a legitimate hosting provider’s infrastructure by manipulating the connections between networks. While no one indicator predicts an attack by itself, Mahjoub has been able to train his model over time by cross-referencing the list of predicted malicious servers with those that were actually involved in attacks.
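The cross-referencing step described above, comparing predicted servers with those later seen in attacks, can be sketched with two standard measures: precision (how many predictions were right) and recall (how many real attackers were caught). This is only an illustration of that comparison, not OpenDNS’s training procedure; the function name and the addresses, which come from reserved documentation ranges, are assumptions.

```python
def precision_recall(predicted, confirmed):
    """Compare predicted malicious servers with confirmed ones.

    Returns (precision, recall) as floats between 0 and 1.
    """
    predicted, confirmed = set(predicted), set(confirmed)
    hits = predicted & confirmed  # servers that were predicted and later confirmed
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(confirmed) if confirmed else 0.0
    return precision, recall


# Made-up example data using documentation-only IP addresses.
predicted_servers = {"192.0.2.10", "192.0.2.11", "198.51.100.7", "203.0.113.5"}
confirmed_servers = {"192.0.2.10", "198.51.100.7", "203.0.113.99"}

precision, recall = precision_recall(predicted_servers, confirmed_servers)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Tracking both numbers over time shows whether a model is blocking too aggressively (falling precision) or missing new infrastructure (falling recall).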
This new model essentially scores every step in the process that a criminal goes through to set up their own infrastructure (from choosing a hosting provider to deploying server images) to determine whether an attack is going to take place. By focusing on these hard-to-change characteristics, Mahjoub’s model is able to ignore the individual evasion techniques that criminals employ and focus on identifying the overall pattern that precedes malicious activity.
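One simple way to picture scoring several weak signals together is a point system in which no single indicator reaches the blocking threshold on its own. The sketch below assumes exactly that. The article does not list the model’s eight patterns or how they are combined, so the indicator names, point values, and threshold here are entirely hypothetical.

```python
# Hypothetical indicators and point values, invented for this illustration.
INDICATOR_POINTS = {
    "new_subdomain_on_established_domain": 30,
    "hosting_provider_previously_abused": 25,
    "neighboring_ips_already_flagged": 25,
    "server_image_matches_known_campaign": 20,
}


def infrastructure_score(indicators):
    """Add up the points for every indicator that is present."""
    return sum(points for name, points in INDICATOR_POINTS.items()
               if indicators.get(name))


def should_block(indicators, threshold=50):
    """Block only when the combined evidence reaches the threshold."""
    return infrastructure_score(indicators) >= threshold


one_signal = {"new_subdomain_on_established_domain": True}
two_signals = {
    "new_subdomain_on_established_domain": True,
    "hosting_provider_previously_abused": True,
}

print(infrastructure_score(one_signal), should_block(one_signal))
print(infrastructure_score(two_signals), should_block(two_signals))
```

Because the largest single value is below the threshold, one indicator alone never triggers a block, which mirrors the point that no one indicator predicts an attack by itself.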
“With this system, SPRank finds the clues, but analyzing the overall hosting infrastructure with Predictive IP Space Monitoring cracks open the case,” Mahjoub said.
Mahjoub is the first to say that this new artificial intelligence system in use at OpenDNS isn’t static. The point, he says, is that the bad guys are constantly finding new ways to hide online, which forces the team’s own models to constantly adapt and learn from new data.
“When we build a model, it’s not like we can just build it and then go to bed,” Mahjoub said. “You need to constantly update it, because the bad guys are doing the same thing on their end.”