Every day, we process terabytes of data in order to spot malicious domains based on their network features and how they are accessed.

Our live dataset comes from two major sources:
– The log files of queries sent by users to our resolvers
– The log files of queries sent by our resolvers to authoritative servers.

In this blog post, we will focus on the latter, highlighting how we use it to block suspicious domains immediately after they show up in our log files.

A not-quite-passive DNS database

Every time a resolver needs to answer a question that it hasn’t seen before, or that is no longer present in the cache, it must recursively query authoritative DNS servers before eventually being able to forward the final response to a client.

We are logging every single packet received by our resolvers from authoritative servers. This lets us keep historical data on all the domain names that we received queries for.

This kind of database is often referred to as a “Passive DNS database,” but our system works in a different way, and it’s not quite passive. We are not running sensors: the resolvers themselves are directly logging the responses they receive from authoritative servers.

Such a database lets us answer questions like, “What are all the IP addresses that example.com has resolved to in the past 90 days?” and, “What are all the domain names that are using ns1.example.com as an authoritative server?”
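
To make those two questions concrete, here is a minimal in-memory sketch of the lookups involved. It is purely illustrative: the class name, method names, and data layout are hypothetical, and the real data lives in HDFS and HBase rather than in Python dictionaries.

```python
from collections import defaultdict


class PassiveDnsIndex:
    """Toy index (hypothetical layout) answering the two example questions."""

    def __init__(self):
        # name -> {ip: timestamp of the most recent sighting}
        self.a_records = defaultdict(dict)
        # nameserver -> set of domain names seen delegated to it
        self.ns_to_domains = defaultdict(set)

    def add_a(self, name, ip, seen_at):
        # Keep only the latest sighting of each (name, ip) pair.
        prev = self.a_records[name].get(ip, 0)
        self.a_records[name][ip] = max(prev, seen_at)

    def add_ns(self, domain, nameserver):
        self.ns_to_domains[nameserver].add(domain)

    def ips_since(self, name, now, days=90):
        """IPs that `name` resolved to within the last `days` days."""
        cutoff = now - days * 86400
        seen = self.a_records.get(name, {})
        return sorted(ip for ip, ts in seen.items() if ts >= cutoff)

    def domains_using(self, nameserver):
        """Domain names seen using `nameserver` as an authoritative server."""
        return sorted(self.ns_to_domains.get(nameserver, ()))
```

The forward map answers the "past 90 days" question and the reverse map answers the name server question. Both are filled from the same deduplicated records.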

Building the database

Records received from authoritative servers are highly redundant, because the majority of records have a short TTL even though their content doesn’t actually change much. The edns-clientsubnet extension also triggers a new upstream query for each client subnet, even though many different subnets are actually going to share the same response. Removing duplicate records is thus an essential preliminary step to build our DNS database. This step drastically reduces the amount of data to store: on average, out of 241 log records, we actually only store one.

We use a Bloom filter to remove duplicate records without having to sort them. The hash functions for this Bloom filter are built from SipHash-2-4 with pseudorandomly generated keys, and the keys are rotated after each batch of data. Thanks to this trick, we can use very small bitmaps without having to worry too much about false positives: if a name is wrongly reported as already present in the set, the same false positive is very unlikely to recur after a key rotation, and because the same records keep coming back, a record dropped by mistake in one batch will most likely be stored from a later one. Using a secure pseudorandom function also prevents attackers from intentionally triggering false positives.
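
The idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production filter: Python's standard library has no SipHash-2-4, so keyed BLAKE2b stands in as the pseudorandom function, and the class name, bitmap size, number of hash functions, and the choice to clear the bitmap on rotation are all assumptions.

```python
import hashlib
import secrets


class KeyedBloomFilter:
    """Bloom filter whose hash functions are keyed PRFs with rotating keys.

    Assumption: keyed BLAKE2b replaces SipHash-2-4, which the standard
    library does not provide.
    """

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.rotate()

    def rotate(self):
        """Start a new batch: clear the bitmap and draw fresh random keys."""
        self.bits = bytearray(self.num_bits // 8 + 1)
        self.keys = [secrets.token_bytes(16) for _ in range(self.num_hashes)]

    def _positions(self, item):
        # One bit position per key; `item` must be a bytes object.
        for key in self.keys:
            digest = hashlib.blake2b(item, key=key, digest_size=8).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def check_and_add(self, item):
        """Return True if `item` was (probably) already present, then add it."""
        seen = True
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen


def deduplicate(records, bloom):
    """Yield each record the first time it shows up in the current batch."""
    for record in records:
        if not bloom.check_and_add(record):
            yield record
```

Calling `rotate()` between batches draws new keys, so two records that collide in one batch map to unrelated bit positions in the next one.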

The output of this deduplication filter is first stored as files in Hadoop HDFS, and then loaded into HBase for ad-hoc queries. We sequentially run a dozen Hadoop jobs every day on this data in order to compute different reputation scores for IP addresses and domain names. This lets us find domain names that need to be manually reviewed or that, when combined with the output of other models and third-party services, can be automatically blocked.

The need for real-time processing

Running algorithms once a day on the data is clearly suboptimal. If nytimes.com DNS records are hijacked, we need to spot this as soon as possible to protect our customers, not the next day. Furthermore, domain names that serve exploits are also typically only in use for a short period of time. We want to block them while they are still active, and as soon as possible, not after the baton has been passed to another domain.

Enter ZeroMQ

ZeroMQ is a popular, battle-tested message transport protocol and networking library, designed for very low and predictable latency, high throughput, and high reliability.

The ZeroMQ library implements, among other things, the traditional pub/sub pattern: a “producer” generates a stream of data that any number of local or remote clients can simultaneously connect to, in order to receive live updates.

After the deduplication process, and in addition to storing the output into HDFS and HBase, we are now streaming this data to a ZeroMQ socket.

This brings a lot of benefits:
– Any authorized machine can join and leave the feed, anytime. This allows for instant testing and parallel processing without any setup. Need to quickly look for domain names matching a specific pattern? That can be done directly on a researcher’s laptop.
– Security: ZeroMQ supports strong encryption and certificate-based authentication out of the box, thanks to libsodium.
– Low CPU impact: a single machine can effortlessly consume our stream of preprocessed authoritative log data.
– Low latency: the data is immediately available for consumption.
– No API required: all it takes is a host name and a port number. Just connect to the socket, and you will start receiving formatted data. The ZeroMQ protocol is widely used, and there are readily available client libraries for more than 40 programming languages, as well as plugins for common tools like Splunk.

As soon as a client connects, it gets a live stream of JSON objects that can be processed right away.
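
As an example of the "pattern matching on a laptop" use case, the sketch below filters a stream of JSON messages by domain name. It is only a sketch: the transport is left out (the `messages` iterable would be fed by a subscriber socket from whichever ZeroMQ binding is in use), and the `name` field is a hypothetical field name, since the exact JSON layout is not described here.

```python
import json
import re


def matching_names(messages, pattern, field="name"):
    """Yield the names of JSON records whose `field` matches `pattern`.

    `messages` is any iterable of JSON strings. `field` is a hypothetical
    key; adjust it to the actual layout of the feed.
    """
    regex = re.compile(pattern)
    for raw in messages:
        try:
            record = json.loads(raw)
        except ValueError:
            continue  # skip malformed messages instead of stopping the stream
        name = record.get(field) if isinstance(record, dict) else None
        if isinstance(name, str) and regex.search(name):
            yield name
```

Because the function is a generator, it handles records one at a time as they arrive and never needs to buffer the stream.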


Building streams out of streams

A simple use case of this stream is keeping track of new domain names, or rather domain names that we didn’t see traffic for before, or that didn’t resolve to any IP address until now. To do so, we once again use Bloom filters that keep track of unique domain names. To provide a sliding window, we use a ring buffer of seven Bloom filters, which we shift once the most recent filter is more than one day old or holds more than 25 million entries. The output of this consumer is another ZeroMQ stream, which we can use to inspect new web sites as soon as they are discovered.
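
The rotation logic can be sketched as follows. To keep the example short, plain Python sets stand in for the Bloom filters, so this shows only the ring-buffer behavior, not the memory savings. The class and method names are hypothetical, and recording a name in the newest slot on every sighting is an assumption.

```python
from collections import deque


class SlidingNewNameTracker:
    """Ring buffer of membership filters forming a sliding window.

    Assumption: Python sets stand in for Bloom filters in every slot.
    """

    def __init__(self, slots=7, max_age=86400, max_entries=25_000_000):
        self.max_age = max_age          # seconds before the newest slot is shifted
        self.max_entries = max_entries  # entries before the newest slot is shifted
        # deque(maxlen=...) drops the oldest slot automatically on append.
        self.ring = deque(maxlen=slots)
        self.newest_created = None

    def _shift_if_needed(self, now):
        if (not self.ring
                or now - self.newest_created > self.max_age
                or len(self.ring[-1]) > self.max_entries):
            self.ring.append(set())
            self.newest_created = now

    def is_new(self, name, now):
        """Return True if `name` is absent from every slot, then record it."""
        self._shift_if_needed(now)
        seen = any(name in slot for slot in self.ring)
        # Always record in the newest slot so active names stay in the window.
        self.ring[-1].add(name)
        return not seen
```

With the defaults, a name is reported as new again only after it has been absent from all seven slots, which is roughly a week without traffic.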

In addition to tracking new domain names, we simultaneously run another consumer tracking new (domain name, IP) tuples.

Our IP reputation systems

Every day, we run three Hadoop jobs to assign reputation scores to IP addresses. The first is a Bayesian average of the number of known malicious domains found on a given IP address. The second is the secure rank score. The third is based on the number of “disposable” domains that an IP address has been hosting compared to the number of stable domains seen on the same IP. (We’ll discuss this score in more detail in an upcoming post.)
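
For readers unfamiliar with Bayesian averages, here is one common form, shown as a sketch. The exact formula and parameters behind the first score are not given in this post, so the function below, its parameter names, and the sample values are assumptions chosen for illustration.

```python
def bayesian_average(malicious, total, prior_mean, prior_weight):
    """Malicious fraction for one IP, shrunk toward a global prior.

    malicious    -- known malicious domains seen on the IP
    total        -- all domains seen on the IP
    prior_mean   -- assumed global malicious fraction (hypothetical value)
    prior_weight -- how many "virtual" domains the prior counts for
    """
    return (prior_weight * prior_mean + malicious) / (prior_weight + total)


# An IP with a single (malicious) domain barely moves away from the prior,
# while an IP hosting hundreds of malicious domains scores close to 1.
one_bad_domain = bayesian_average(1, 1, prior_mean=0.01, prior_weight=20)
many_bad_domains = bayesian_average(500, 500, prior_mean=0.01, prior_weight=20)
```

The point of the prior is to keep an IP address with very few observed domains from getting an extreme score based on one or two samples.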

We use these IP reputation systems to build lists of IP addresses that have been serving many domain names, all of which are known to be controlled by cybercriminals.

Putting the pieces together

Since we already have a stream of new (domain name, IP) tuples, new domain names resolving to one of the highly suspicious IP addresses can be immediately blocked. While the ZeroMQ library itself is fast and designed for low, predictable latency, the producer and consumer code must be equally efficient in order to process the data at the same rate as it is received.
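
Conceptually, this last step is a join between the stream of new tuples and the list of suspicious IP addresses. The sketch below shows that join in Python for readability. The function name and data shapes are hypothetical, and it says nothing about how the block itself is enforced.

```python
def domains_to_block(new_tuples, suspicious_ips):
    """Yield each new domain name that resolves to a suspicious IP address.

    new_tuples     -- iterable of (domain name, IP) pairs from the stream
    suspicious_ips -- set of IP addresses from the daily reputation jobs
    """
    already_blocked = set()
    for domain, ip in new_tuples:
        if ip in suspicious_ips and domain not in already_blocked:
            already_blocked.add(domain)
            yield domain
```

Set membership tests are constant time on average, so the cost per tuple stays flat however long the list of suspicious IP addresses grows.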

This was a good opportunity to try Rust, a modern programming language by Mozilla Research that aims to be a safe replacement for C++. We had to make minor changes to the ZeroMQ bindings to make them compatible with the latest Rust version. But overall, our experience with Rust has been absolutely amazing.

We contributed our changes to the ZeroMQ bindings and open sourced our Bloom filter implementation on GitHub.

Towards a more real-time architecture

When it comes to blocking malware, every second counts. A speedy resolution is the only way to limit the number of compromised machines, so models based on stream processing are insanely useful.

We are not going to get rid of our pink friend any time soon, but he just got a new buddy.
