Here at OpenDNS, we have handle a large amount of data. This comes in very handy for us researchers as we hunt for new threats, and bring to life new ways to protect our customers. Although I often describe my job as “playing with malware all day,” I neglect to mention that I follow the scientific method and engineer solutions out of the results. Python is a great tool, in that I can write programs to test my hypotheses and everyone can still read the program like well-written notes. While testing one of my most recent hypotheses however, I found that the solution although existed, it was incredibly slow in Python. Our goal is to find threats as early as possible and protect our customers from them before they are infected. Otherwise, we might as well be an antivirus company.
Hypothetical DGA
The majority of my recent work has revolved around something called a Domain Generation Algorithm (DGA). The fundamental concept of a DGA is that malware will use this algorithm to generate a series of domain names in a deterministic fashion. Computer science buzzwords aside, this means when malware tries to contact aaaa1111.com, then tries to contact aaab1112.com, it will likely continue by trying aaac1113.com, then aaad1114.com, etc. Assume each new domain is generated once per week and we want to block everything this DGA creates proactively. What domain is generated after aaaj1119.com has been used? If only all DGAs were this easy. It could be aaba1120.com, maybe aaak1120.com, maybe even aaaj111a.com if the last four digits are hexadecimal. It may not have even been safe to assume aaaj1119.com would ever be generated. What if, after aaaf1116.com it generated aab01117.com?
Well this is a fairly nice case, where time is not completely against me, and I can find a sample of malware that reaches out to these domains, reverse engineer it, find the DGA, rip it out, and make it talk. Hypothetically, lets say this DGA took a seed consisting of a number and a sequence of characters incrementing the number by one and the sequence of characters by one character until rolling over at 9 and z respectively for 100 domains before starting over. Good, aaaj1119.com does happen, the next is aaak1120.com, and aadw1211.com is the last domain generated. Block them all and job well done, right? Unfortunately not.
As is usually the case, this hypothetical malware uses a seed to start its DGA (aaaa and 1111). Every time an updated version of the malware is released, there’s probably a new seed. If this was malware built from a kit, you can bet on every buyer having their own seed. Not a problem, we can block everything that begins with letters and ends with numbers… until nobody can reach office365.com or see San Francisco news on KRON4.com or any other legitimate domain that potentially crosses paths with the DGA. This occurs frequently in DGA research, so it is important to be more targeted in attempts to thwart DGA implementations.
Continuing with this hypothetical DGA, I know I have massive amounts of data to assist me. I can programmatically find every sequence of domain requests for each user for a domain beginning with characters and ending with numbers, and the following domain requested beginning with characters and ending with numbers. If the second one would be next in sequence from the first using that DGA, then success, but that’s just a single pair. Is it the beginning of a sequence, the middle, or the end? Assuming the generation of 100 domains is never going to change, this one is relatively easy. I already know which IP is infected, and which domains are being requested, so it is nothing to lookup other users infected and full histories to find the first domain in the sequence. The first domain tells me the seed (new configuration) and I can then generate 100 in the future and be done with it. I can leave the program running and identify any new configurations almost immediately.
Python’s limitations
Let’s start with the fact the OpenDNS currently handles about 80 Billion DNS requests a day. It is very difficult to keep up with almost a million anything per second, especially with a language that is not compiled and optimized. Add the complexity of a real DGA and you have to be very conservative with your clock cycles per test. Once you remind yourself that there are in fact many DGAs each with different characteristics, you realize quickly that you might be a day behind by the first minute you start analyzing. The ideal solution is to use C wherever possible (C++ for you object people).
C is for Python
There are many ways to squeeze performance out of Python: Cython, PyPy, SWIG, etc., but my preferred method is to write the C as efficiently as possible for a performance dependent function (those related to the DGA especially), and wrap it manually for Python to import as a module. Why do I continue to use Python? Because it is easy for others to read and quick to modify; remember, the scientific method is an ongoing process, and modifications inevitable. The DGA however, does not change, and it is the most computationally complex, repetitive portion of my domain tests. So there is every reason to promote it to C as a Python module.
A simple real example
A banking trojan by the name Banjori had a pretty simple DGA. It took a seed domain like internetbadguys.com and only modified the first four characters to generate a series of domains. An excellent resource on this particular DGA can be found here, including a very in-depth analysis on the algorithm itself. For each equation to generate the next letter, one of the characters is represented by its index in the alphabet (a=0, b=1, etc.), and the other characters involved are their ASCII integer. Those numbers are passed through the algorithm loosely represented below, and each new character is modulus 26 of the result added to 97 to get an ASCII character.

Using the example internetbadguys[.]com, the next five domains generated are:
- frqxrnetbadguys[.]com
- vpdbrnetbadguys[.]com
- pfkcrnetbadguys[.]com
- kgmgrnetbadguys[.]com
- jhnmrnetbadguys[.]com
If you’d like to follow along in C, this should help:
void nextBanjori(unsigned char* domain) {
domain[0]=97+((domain[0]+domain[3]-97)%26);
domain[1]=97+((domain[0]+(domain[1]<<1)-97)%26);
domain[2]=97+((domain[0]+domain[2]-1-97)%26);
domain[3]=97+((domain[1]+domain[2]+domain[3]-97)%26);
}
As mentioned above in Johannes Bader’s blog, the DGA is cyclical, meaning at some point the next domain generated will have already been generated and it will loop from that point ad infinitum. Bader also makes mention of “tail words,” which lead into loops, but never enter them. That means there are always two four letter words that can generate any four letter word that are part of a loop, but no four letter word can ever generate a tail word. This also means about half all possible four letter words are tail words, and the remaining are part of a loop.
Efficiency now matters
I need a targeted approach to block the entire loop of domains, including the tail word used. That is, if the sequence started with a tail word (remember, there is about a 50 percent chance the seed domain is a tail word). From the sequential domain set that I have captured from a user, I need to quickly determine if it could be the result of this DGA.
- Discard any domain shorter than [four letters] [‘.’] [top level domain] from the capture
- Is the length of the first domain equal to the length of the second domain?
- Are the substrings from the fifth element to the end for each domain equivalent?
- If these are both true, they are candidates, and the DGA should be applied to the first domain.
- If the first four letters of the domain post-DGA are now equivalent to the first four of the second domain, found one!
Step one can be taken care of from whatever code monitors the domains from the stream of choice. Steps two and three can be joined as one. Step four is already written, and step five can be a return true or false.
int banjoriTest(unsigned char* domain1, unsigned char* domain2) { unsigned char[4] d1 = {domain1[0],domain1[1],domain1[2],domain1[3]};
int index = 4;
while(domain1[index]) {
if (!(domain1[index]==domain2[index])) {
return 0;
} index++;
}
nextBanjori(d1); for (index=0;index<4;index++) { if (!(d1[index]==domain2[index])) { return 0; } } return 1;
}
That looks fairly efficient, but no major complexity. So we’ll keep that on the back-burner and just use Python for the time being.
Assume this program ran with the functions above and the following pair passed the test: vfxlsatformalisticirekb[.]com and zvogsatformalisticirekb[.]com.
Sweet, I have two of a chain which could be up to 15,373 domains long! Success? Not yet.
I can block the entire loop right now. That sounds reasonable. Just continue with nextBanjori(domain2) repeatedly until I see domain2 again. Why not the first domain? Think about it, what if it was a tail domain. That loop could ruin your day. Now of course, you could alter it to break on the first or second domain, or serve yourself a nice heaping dish of the halting problem. You could even search the entire loop to see if the first domain was present. And if so, determine if it is or is not a tail word.
If the first domain is in the loop, is there a tail domain missing or was one not used?
Help has arrived!
Introducing, sliced bread!
Pattern searching abilities are now available!
Ok, not quite sliced bread, but an incredibly useful tool available through investigate and the investigate API. I’ll just take the regular expression of any four characters followed by the rest of our first domain.
Paydirt.
Now I can cross-reference the the list I generated against the list from Investigate, and find the domain from Investigate that wasn’t generated, right? Guys? Right…? All 15,000 of them against the 16 found…?
I know I’ve heard “memory is free” before, but there are other things running on these machines and I’m not the patient type. Milliseconds count.
Reversing the DGA
Not reverse engineering the DGA, as that’s already been done. If I have the generated domain, can I find the domain that generated it? Of course! There are always two ways to try to do this.
- Reverse the algorithm itself
- Iterate through all possible inputs and push them through the algorithm until the correct answer pops out
One method will be slower, and reversing the algorithm isn’t always possible. But in this case, it is. I have managed to reverse it to a linear problem, rather than a search of size 26*26*26*26 (or 456,976). However, for the purpose of this example, we’re going to brute force this and pretend there is no other way, as is often the case.
int isBanjoriTail(unsigned char* new) {
int c0,c1,c2,c3;
char[4] d;
for (c0=97;c0<123;c0++) {
for (c1=97;c1<123;c1++) {
for (c2=97;c2<123;c2++) {
for (c3=97;c3<123;c3++) {
d[0]=c0
d[1]=c1;
d[2]=c2;
d[3]=c3;
nextBanjori(d);
if (d[0]==new[0] && d[1]==new[1] && d[2]==new[2] && d[3]==new[3]) {
return 1;
}
}
}
}
}
return 0;
}
Remember, if anything can create the domain, it is not a tail. Otherwise, it is.
To use and modify the code provided by Johannes Bader:
def map_to_lowercase_letter(s):
return ord('a') + ((s - ord('a')) % 26)
def next_domain(domain):
dl = [ord(x) for x in list(domain)]
dl[0] = map_to_lowercase_letter(dl[0] + dl[3])
dl[1] = map_to_lowercase_letter(dl[0] + 2*dl[1])
dl[2] = map_to_lowercase_letter(dl[0] + dl[2] - 1)
dl[3] = map_to_lowercase_letter(dl[1] + dl[2] + dl[3])
return ''.join([chr(x) for x in dl])
def isBanjoriTail(seed):
for c0 in xrange(97,123):
for c1 in xrange(97,123):
for c2 in xrange(97,123):
for c3 in xrange(97,123):
domain = chr(c0)+chr(c1)+chr(c2)+chr(c3)
domain = next_domain(domain)
if seed.startswith(domain):
return False
return True
seeds = {
"nhcisatformalisticirekb.com",
"egfesatformalisticirekb.com",
"qwfusatformalisticirekb.com",
"eijhsatformalisticirekb.com",
"siowsatformalisticirekb.com",
"dhansatformalisticirekb.com",
"zvogsatformalisticirekb.com",
"yaewsatformalisticirekb.com",
"wgxfsatformalisticirekb.com",
"vfxlsatformalisticirekb.com",
"usjssatformalisticirekb.com",
"selzsatformalisticirekb.com",
"nzjqsatformalisticirekb.com",
"kencsatformalisticirekb.com",
"fzkxsatformalisticirekb.com",
"babysatformalisticirekb.com"
}
for seed in seeds:
print seed,isBanjoriTail(seed)
Using this Python script, I averaged around 15 seconds to check the 16 domains and found babysatformalisticirekb.com to be the only tail in the list. Finally! Success. But at what cost?
On with the C
To witness the benefits of using C in Python, I copied the above functions “nextBanjori” and “isBanjoriTail” into banjoriTest.c, and created a corresponding banjoriTest.h header file.
To use the C in Python, another C file must be created that imports python.h to create the Python wrapper for the C code. The Python docs explain how to do this in the extending and embedding section, with all the help building objects in the Python/C API Reference Manual. The wrapper must start the same way, but can return anything as long as it is a Python object.
Following the wrapper
There must be a method definition followed by a initialization function, both for Python to use. I saved the following as banjoriTestPy.c:
#ifndef PYTHON_H
#include<Python.h>
#endif
#ifndef BANJORITEST_H
#include "banjoriTest.h"
#endif
static PyObject* banjori_tail(PyObject* self, PyObject* args) {
const char* domain;
if (!PyArg_ParseTuple(args, "s", &domain)) {
return NULL;
}
int tf = isBanjoriTail((unsigned char*)domain);
if (tf) {
return Py_True;
} else {
return Py_False;
}
}
static PyMethodDef BanjoriMethods[] =
{
{"banjori_tail", banjori_tail, METH_VARARGS, "True or False is this a tail?"},
{NULL, NULL, 0, NULL}
};
PyMODINIT_FUNC
initbanjori(void)
{
(void) Py_InitModule("banjori", BanjoriMethods);
}
The last requirement is a simple Python program that directs Python to compile the C code into a library (I called mine setup_banjoriTestPy.py).
from distutils.core import setup, Extension
module1 = Extension('banjori', sources = ['banjoriTest.c','banjoriTestPy.c'])
setup (name = 'BanjoriTest',
version = '1.0',
description = 'Test to see if a domain is a tail of banjori',
ext_modules = [module1])
Run that program with Python with the argument “build” like this:
python setup_banjoriTestPy.py build
Python compiles the module and places it in a build/lib directory. Copy that file in the same directory as the Python program to use it and it can be imported immediately.
import banjori
seeds = {
"nhcisatformalisticirekb.com",
"egfesatformalisticirekb.com",
"qwfusatformalisticirekb.com",
"eijhsatformalisticirekb.com",
"siowsatformalisticirekb.com",
"dhansatformalisticirekb.com",
"zvogsatformalisticirekb.com",
"yaewsatformalisticirekb.com",
"wgxfsatformalisticirekb.com",
"vfxlsatformalisticirekb.com",
"usjssatformalisticirekb.com",
"selzsatformalisticirekb.com",
"nzjqsatformalisticirekb.com",
"kencsatformalisticirekb.com",
"fzkxsatformalisticirekb.com",
"babysatformalisticirekb.com"
}
for seed in seeds:
print seed,banjori.banjori_tail(seed)
In testing, it took 0.05 seconds to get through all 16 domains. That’s about 300 times faster than the Python version.
An allowable exception
Python is used because of its portability and forgiving nature. Reading someone else’s code is only slightly more difficult than writing it yourself. So is it acceptable to put C code in the mix? Absolutely.
If the C function being wrapped for Python is cohesive, consistent, and could benefit from a speed boost, it is a perfect candidate. Speed is C’s best asset, and modularity is Python’s. High speed C modules for Python? YES. Always yes.
When given the choice between improving our customer’s online security 16 million DNS requests after a threat has been discovered, or 50,000 DNS requests after a threat has been discovered, I only see one option.