
Last Wednesday, when he first saw Kaspersky Lab’s report on the Carbanak attack campaign, OpenDNS security researcher Jeremiah O’Connor had one reaction–get me the data. For the past three months, O’Connor has been working on developing a new model for advanced threat detection that applies algorithms most commonly used in fields such as bioinformatics and data mining–not information security. O’Connor’s model, named “NLPRank,” is based on natural language processing (NLP) techniques that, combined with OpenDNS’s data, have led O’Connor to build an advanced threat detection lexicon–essentially a “malicious language of the Internet” that detects threatening domains in real-time.
Disclosed today in a post on the OpenDNS Security Labs blog, NLPRank was designed by O’Connor specifically to predict both opportunistic phishing campaigns and attacks directed at high-value targets, such as financial institutions.
O’Connor first envisioned NLPRank shortly after the DarkHotel attack campaign was revealed in November 2014. Looking at data related to these attacks, he could intuitively see that the domain names followed patterns similar to those associated with the Mandiant APT1 espionage group.
“The way that attackers ‘sell’ a spear phishing attack is by spoofing a domain so that it looks like it comes from a legitimate company,” O’Connor said. “After running detailed analytics on the data from these types of campaigns, I found that these domain names were predictable.”
Using data from these two campaigns as a test set, O’Connor was able to start building NLPRank. This new model compiles a dictionary of popular, legitimate domain names used in spear phishing (such as “java,” “gmail” or “adobe”) and compares it with a list of the most common English words used in targeted phishing campaigns (“install,” “update,” “download,” etc.). NLPRank then uses alignment techniques from computational biology to grade permutations of these domain names, such as “install-ad0be,” and judge the likelihood that they will be used in spear phishing.
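To make the idea concrete, here is a minimal sketch of how a dictionary of brands, a list of action words and a sequence-alignment score could be combined to grade a domain name. It is an illustration under stated assumptions, not OpenDNS's implementation: the Smith-Waterman local alignment, the scoring weights, the look-alike character table, the 0.8 threshold and all function names are hypothetical choices, and the dictionaries contain only the examples named above.

```python
import re

# Tiny dictionaries built only from the examples named in the article.
BRANDS = ["java", "gmail", "adobe"]
ACTION_WORDS = ["install", "update", "download"]

# Assumed look-alike substitutions (hypothetical, not from the article).
LOOKALIKES = str.maketrans({"0": "o", "1": "l", "3": "e", "5": "s"})


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score between two strings."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + step, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def brand_similarity(token, brand, match=2):
    """Alignment score divided by the score of a perfect match on the brand."""
    return local_alignment_score(token, brand, match=match) / (match * len(brand))


def score_domain(domain, threshold=0.8):
    """Grade the leftmost label of a domain against both dictionaries."""
    label = domain.lower().split(".")[0]
    tokens = [t for t in re.split(r"[^a-z0-9]+", label) if t]
    has_action = any(t in ACTION_WORDS for t in tokens)

    best_brand, best_sim = None, 0.0
    for token in tokens:
        normalised = token.translate(LOOKALIKES)  # "ad0be" becomes "adobe"
        for brand in BRANDS:
            sim = brand_similarity(normalised, brand)
            if sim > best_sim:
                best_brand, best_sim = brand, sim

    return {
        "domain": domain,
        "brand": best_brand,
        "similarity": best_sim,
        "has_action_word": has_action,
        # Flag only when an action word appears next to a close brand match.
        "suspicious": has_action and best_sim >= threshold,
    }


if __name__ == "__main__":
    print(score_domain("install-ad0be.com"))
```

Calling `score_domain("install-ad0be.com")` flags the name because it pairs an action word with a token that aligns closely to a brand, while a plain name such as `adobe.com` is not flagged because it carries no action word. A real system would need far larger dictionaries and a tuned scoring scheme.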
The next step is to apply a variety of techniques, such as ASN mappings, WHOIS data and HTML analysis, to classify the type of attack being delivered. NLPRank applies this process across the billions of DNS records that OpenDNS observes every day.
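The classification step might look something like the following sketch, which takes evidence that has already been collected for a candidate domain and turns it into a coarse label. Everything here is assumed for illustration: the `DomainEvidence` fields, the signal names, the output labels, the 30-day registration window and the brand-to-ASN table are hypothetical, and the ASN values come from the range reserved for documentation rather than from any real network.

```python
from dataclasses import dataclass
from datetime import date

# Placeholder table: AS64496-AS64511 are reserved for documentation use,
# so these numbers do not describe any real brand's network.
BRAND_ASNS = {"adobe": {64500}}


@dataclass
class DomainEvidence:
    domain: str
    spoofed_brand: str   # brand the domain name appears to imitate
    asn: int             # ASN that hosts the domain's IP address
    whois_created: date  # registration date taken from WHOIS
    html: str            # HTML fetched from the domain's landing page


def classify(evidence, today, max_age_days=30):
    """Return a (label, signals) pair from simple, hand-written rules."""
    signals = []

    # ASN mapping: a look-alike hosted outside the brand's own networks.
    if evidence.asn not in BRAND_ASNS.get(evidence.spoofed_brand, set()):
        signals.append("asn_mismatch")

    # WHOIS data: recently registered domains are treated as a weak signal.
    if (today - evidence.whois_created).days <= max_age_days:
        signals.append("recently_registered")

    # HTML analysis: a password field suggests credential harvesting.
    if 'type="password"' in evidence.html.lower():
        signals.append("login_form")

    if "asn_mismatch" in signals and "login_form" in signals:
        label = "credential_phishing"
    elif "asn_mismatch" in signals:
        label = "suspicious_infrastructure"
    else:
        label = "unclassified"
    return label, signals
```

Keeping the raw signals alongside the label lets an analyst see why a domain was placed in a category. In practice the rules would more likely be learned from labelled campaigns than written by hand.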
This first iteration of NLPRank successfully built a lexicon that understood the malicious language behind these domains. It has already been used to detect a number of sophisticated phishing campaigns in the wild, such as the PayPal phishing sites reported by OpenDNS Security Labs earlier this month. The only thing missing was a list of domains from a recent, sophisticated and highly successful phishing campaign to validate the list of domains that O’Connor was collecting every single day.
In the Carbanak campaign, attackers used techniques and malware borrowed from commodity banking trojans to pull off one of the most successful bank robberies in history–making it an ideal test case for NLPRank. After researchers from Kaspersky shared their data with the OpenDNS Security Labs team, O’Connor knew that his new model was effective. “When I looked at Kaspersky’s data, I could see the command-and-control domains,” he said. “NLPRank had caught them weeks before.”
According to OpenDNS senior security researcher Dhia Mahjoub, Ph.D., bad actors borrow tricks from commodity phishing attacks to hide the command-and-control traffic for targeted attacks. “The great thing about NLPRank is that the model uses the bad guys’ tricks against them and is generic enough to detect both types of attacks,” Mahjoub said. “This result is a great example of how the work we do at OpenDNS Security Labs allows researchers to use their security knowledge, technical background and innate creativity to make intuitive leaps.”
O’Connor says another thing that makes NLPRank unique is that it is applied in real-time to actual domains detected by OpenDNS’s worldwide data infrastructure. “Other methods of detecting malicious domains generate massive data sets that can miss combinations of domain names not predicted by their models,” he said. “NLPRank looks at every domain name in the context of not only the combinations of letters and numbers involved, but also a domain’s location on the Internet. We can see correlations between attack campaigns and use that knowledge to understand if they are connected.”
Earlier this week at a private research conference, O’Connor unveiled NLPRank for the first time. “Companies that have to deal with spear phishing attacks on a regular basis are very excited about this model,” he said. “It’s one of the only ways to detect spear phishing campaigns without analyzing malware in detail.”
Next steps for O’Connor and his work on NLPRank? He plans to continue building out the dictionaries that define the malicious language and sharing his work more broadly with the information security community.
NLPRank is one of several new models currently under development by OpenDNS Security Labs to detect anomalous online activity and malicious attack campaigns. The research team plans to unveil new methods and research in the coming months.