Two years ago, when we started studying how DNS records could be used to discover malicious activity, we needed a way to quickly access current and historical data.
Third-party solutions were too slow and too limited to implement Kopis, the very first model we wanted to try.
The logs collected by the resolvers were already copied to HDFS for long-term storage, so a simple way to quickly retrieve records based on a DNS name or associated data was to store additional copies in HBase.
The system turned out to be still too slow to automatically label new domains, but it was good enough to compute an informative reputation score for a specific domain when a user requested it. Thus, the “Security Graph” was born.
A closer look at how the log files are processed
Unlike passive DNS systems, packets are not captured by a dedicated process, but directly written by the resolvers to local log files.
This architecture was easy to build over the existing stack. The number of hops and the late parsing/verification/deduplication stage do not add too much latency: 92% of the records are visible in the ZeroMQ stream less than 30 seconds after related packets have been logged by a resolver.
The purpose of the first staging area is to unload the log files from the resolvers as soon as possible, if only because their local storage space is quite limited. This first staging area is also used to export multiple copies of the log files for different purposes.
The second staging area is actually our main research server, where raw DNS packets are decoded, parsed, verified and deduplicated.
Can a minimal change to the system drastically reduce the latency?
As an alternative to the logfile/rsync/logfile/rsync/logfile pipe, we implemented the ability for the resolvers to serve the logs over a simple ZeroMQ socket. Clients can directly connect to the resolvers and retrieve the data as a feed. This greatly simplifies the architecture, lifts the limitations of on-disk storage and cuts the latency down to a few milliseconds.
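As a rough sketch, a consumer of such a feed could look like the following in Rust with the zmq crate; the endpoint, the PUB/SUB socket pattern, and the message handling are illustrative assumptions rather than a description of the actual resolver code.

```rust
// Minimal consumer sketch for a resolver log feed over ZeroMQ.
// The endpoint and the PUB/SUB pattern are assumptions for illustration.
fn main() -> Result<(), zmq::Error> {
    let ctx = zmq::Context::new();
    let socket = ctx.socket(zmq::SUB)?;
    socket.connect("tcp://resolver.example.internal:5555")?; // hypothetical endpoint
    socket.set_subscribe(b"")?; // subscribe to everything

    loop {
        // Each message is one logged DNS record (the wire format is site-specific).
        let msg = socket.recv_bytes(0)?;
        process_record(&msg);
    }
}

fn process_record(raw: &[u8]) {
    // Placeholder: parse the record and hand it off to downstream consumers.
    let _ = raw;
}
```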
This is a big win for our models using only the real-time feed. Not so much for manual investigations and for running graph algorithms.
HBase is a great solution to many problems, but it is not a silver bullet. On our constantly overloaded research platform, insertions have to happen in batches, and can only keep up with the data from a small subset of our resolvers. The latency of read operations is also too high to interactively run graph algorithms.
In addition, our original experiments with generic reputation systems (RIP score, J-Score, SecureRank, C-Rank, Dorothy, Co-occurrences) were implemented as Map/Reduce jobs. These can only run on a small subset of old data, and take nearly 24 hours to complete.
For this reason, they provide little value for discovering and blocking malicious domains before it’s too late. DNS names used by malware are often active only for a few hours or even just for a few minutes.
Our typical workflow involves looking for specific patterns in the authoritative logs stream, monitoring suspicious IP addresses, and milking compromised sites and dedicated rotators. We then want to manually review our findings before blocking or sinkholing them.
More often than not, a query to our database doesn’t return anything: the name appears to have no traffic and no IP addresses, so no reputation indicators can be displayed either.
At the same time, the DNS name we are investigating actually resolves to a set of IPs that was already associated with many other names serving a similar purpose, names we just blocked. This information will eventually be available, when we might not need it any more.
As a workaround, our API can automatically resolve a name using a dedicated resolver if no records have been found in the database.
But by the time we manually look up a name, it might not resolve any more, or we might get different or fewer records than what was actually observed by our public resolvers.
Up-to-date data is available, but it is still in transit.
Buying more hardware is a way to mitigate this issue. But rethinking the system according to how we are actually using it is also an option.
In-memory storage
HBase is fast for random I/O. Memory caching systems are faster.
For web applications, Memcache is frequently used as a simple caching layer for SQL databases. Would it be worth using the same approach for our DNS database?
An initial experiment was conducted using Elliptics Network, mirroring the schema we were already using with HBase.
Is a pure key/value store the best fit? As it happens, we almost exclusively need to:
– Retrieve a set of items given a key: list the records for a given name and a given type, list the names mapping to a given server IP, list the recent queries sent by a given client IP.
– Perform simple operations on sets, such as the intersection of the set of blocklisted names with the names mapping to a given IP.
– Maintain counters, such as the number of suspicious domains observed on an IP address.
– Do graph traversals, without requiring excessive network I/O.
– Perform pattern matching.
Based on these constraints, Redis looked like a perfect candidate.
In order to populate the Redis database, we wrote a daemon that listens to the ZeroMQ stream of authoritative logs, and translates it into Redis queries.
Each record is indexed twice: once as DNAME|t|QTYPE and once as RR|t|QTYPE. For example, an IPv4 record of 127.0.53.53 observed for www.ovh. will be stored in Redis using two different keys: www.ovh.|t|A and 127.0.53.53|t|A.
The actual values for these records are Redis sets. This is a compact data structure, and no extra deduplication effort is necessary. No range queries are required to retrieve records and names given a key. Unions and intersections can be directly handled by Redis. This makes common operations such as “what are the distinct names mapping to this list of IPs that we are not blocking yet” simple and efficient.
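As a condensed sketch of this indexing step, using recent versions of the redis crate: the DNAME|t|QTYPE key layout comes from the text, the 24-hour expiration is described just below, and the function names, the temporary key, and the blocklist set are hypothetical.

```rust
use redis::Commands;

// Index one observed (name, type, rdata) tuple under both keys described above.
fn index_record(
    con: &mut redis::Connection,
    name: &str,  // e.g. "www.ovh."
    qtype: &str, // e.g. "A"
    rdata: &str, // e.g. "127.0.53.53"
) -> redis::RedisResult<()> {
    let by_name = format!("{}|t|{}", name, qtype);
    let by_rdata = format!("{}|t|{}", rdata, qtype);

    // Sets deduplicate for free.
    let _: i64 = con.sadd(&by_name, rdata)?;
    let _: i64 = con.sadd(&by_rdata, name)?;

    // Expire 24 hours after the most recent update.
    let _: i64 = con.expire(&by_name, 86_400)?;
    let _: i64 = con.expire(&by_rdata, 86_400)?;
    Ok(())
}

// "Distinct names mapping to this list of IPs that we are not blocking yet":
// a union of per-IP sets minus a (hypothetical) blocklist set, computed by Redis.
fn unblocked_names(
    con: &mut redis::Connection,
    ip_keys: &[String],  // e.g. ["127.0.53.53|t|A", ...]
    blocklist_key: &str, // set of names we already block
) -> redis::RedisResult<Vec<String>> {
    // A fixed temporary key keeps the sketch short; a unique name would be safer.
    let _: i64 = con.sunionstore("tmp:candidates", ip_keys)?;
    let names: Vec<String> = con.sdiff(&["tmp:candidates", blocklist_key][..])?;
    let _: i64 = con.del("tmp:candidates")?;
    Ok(names)
}
```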
Keys are set to expire 24 hours after their most recent update. HBase is not going anywhere and remains where historical data is stored. However, the API our tools rely on to access the DNS database now transparently sends queries both to Redis and to HBase, and merges the results.
Clients get the best of both worlds: historical data with a timeline, as well as fresh data just logged by the resolvers.
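As a rough illustration of that merge, assuming fresh records live in a Redis set under the key layout above; hbase_lookup is a hypothetical stand-in for the real HBase client call and ignores timelines entirely.

```rust
use std::collections::BTreeSet;

use redis::Commands;

// Hypothetical stand-in for the HBase client: returns historical record values for a key.
fn hbase_lookup(_key: &str) -> Vec<String> {
    Vec::new() // placeholder
}

// Query both stores and merge: fresh records from Redis, history from HBase.
fn lookup_records(con: &mut redis::Connection, key: &str) -> redis::RedisResult<Vec<String>> {
    let fresh: Vec<String> = con.smembers(key)?;
    let historical = hbase_lookup(key);

    // Deduplicate while keeping a stable order.
    let merged: BTreeSet<String> = fresh.into_iter().chain(historical).collect();
    Ok(merged.into_iter().collect())
}
```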
Live reputation scores
Redis also offers a very interesting data structure that we previously blogged about: HyperLogLog.
HyperLogLog is incredibly useful to estimate the cardinality of a set, with very low memory requirements.
How about using it to compute reputation scores?
Using the same code base, we wrote another daemon listening to the same input stream, looking for names that were flagged as suspicious, as well as for IPs present in the threat intelligence feeds retrieved by MLSec’s Combine tool. Matching (IP, name) pairs are sent to Redis, with the IP as the key and the name added to the related HyperLogLog structure.
This provides an IP reputation score that gets updated nearly in real time. This reputation score is based on recent evidence rather than on historical, possibly outdated data.
The key expires one day after its most recent update, but the individual elements of the HyperLogLog obviously don’t. An IP address that stops being used for malicious purposes will thus get its reputation automatically cleaned up after one day of quarantine. But an IP address used for malicious activities over a long period of time won’t expire, and its reputation will keep getting worse, even if only one malicious name is observed every day.
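A minimal sketch of that update path with the redis crate: the rep: key prefix and the helper names are assumptions, but the PFADD followed by a refreshed 24-hour EXPIRE mirrors the behaviour just described.

```rust
use redis::Commands;

// Record a suspicious (ip, name) observation: the IP is the key, the name goes
// into a HyperLogLog. The "rep:" prefix is an assumption for illustration.
fn record_suspicious(
    con: &mut redis::Connection,
    ip: &str,
    suspicious_name: &str,
) -> redis::RedisResult<()> {
    let key = format!("rep:{}", ip);
    let _: i64 = con.pfadd(&key, suspicious_name)?;
    // Refresh the quarantine window: quiet IPs are forgotten after a day.
    let _: i64 = con.expire(&key, 86_400)?;
    Ok(())
}

// The score is the estimated number of distinct suspicious names recently seen on that IP.
fn ip_reputation(con: &mut redis::Connection, ip: &str) -> redis::RedisResult<i64> {
    con.pfcount(format!("rep:{}", ip))
}
```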
In order to investigate live campaigns, this reputation system turns out to be a more useful estimator of the “badness” of an IP address than the Bayesian average we previously computed with a daily batch process (“RIP score”).
How about using the same system to compute prefix and ASN reputation scores?
These additional reputation scores were initially implemented to provide additional features to our first classifier. However, they are virtually useless for interactive investigations. Displaying long tables with lists of numbers is possibly the worst way to help humans make a decision.
A single number labeling a full /24 or ASN space is also way too coarse. It is a low-importance feature that can help a classifier correctly label data whose features closely resemble examples it was trained on, but low-importance features are hard to interpret.
Using pipelining, retrieving the values associated with multiple keys in Redis is nearly as fast as retrieving the value associated with a single key.
Therefore, given a reference IP address, we can quickly retrieve the individual reputation of all the IP addresses of the same subnet.
Given a DNS name, this makes it easy to show the reputation of not only the IP addresses it resolves to, but also the reputation of the neighborhood for all these addresses.
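For instance, the reputation of every host in a surrounding /24 can be fetched in a single pipelined round trip (reusing the hypothetical rep: prefix from the earlier sketch):

```rust
use std::net::Ipv4Addr;

// Retrieve the reputation of every address in the /24 around a reference IP
// with one pipelined round trip.
fn neighborhood_reputation(
    con: &mut redis::Connection,
    reference: Ipv4Addr,
) -> redis::RedisResult<Vec<(Ipv4Addr, i64)>> {
    let [a, b, c, _] = reference.octets();
    let ips: Vec<Ipv4Addr> = (0u8..=255).map(|d| Ipv4Addr::new(a, b, c, d)).collect();

    let mut pipe = redis::pipe();
    for ip in &ips {
        pipe.cmd("PFCOUNT").arg(format!("rep:{}", ip));
    }
    // One network round trip, 256 answers.
    let counts: Vec<i64> = pipe.query(con)?;
    Ok(ips.into_iter().zip(counts).collect())
}
```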
A single graph summarizes what we know about a name from an IP-based point of view, using our own labels combined with 37 third-party threat intelligence feeds.
Pattern matching
Common exploit kits, as well as many phishing pages, are trivial to track if the DNS names they use follow an identifiable pattern.
Most of what we are blocking today is achieved by grepping the log stream for known IP addresses and known name patterns. Discovering new names that follow a known pattern helps identify IP addresses dedicated to malicious activities, which, in turn, can be used to discover new names to block even if these do not follow the same pattern.
When these names are used by an exploit kit, we sinkhole some of them in order to reconstruct the infection chain. It is not uncommon for a rotator to serve different exploit kits, or to remain operational after the pattern used by the exploit kit changes. We have been milking the same rotator for over a year to follow the evolution of the Nuclear Exploit Kit.
After a pattern change has been detected, we need to scan our DNS database for matching records, validate the pattern, find all the names that we missed, and see if we can use them to constitute a new set of dedicated IP addresses, name servers and registrants to watch and block.
Redis supports full scans of the key space, with optional pattern matching. While not the fastest operation, it can still scan 45 million keys on a busy server in about 3 minutes without blocking other operations. Looking for ‘paypal‘ returns no fewer than 1,973 names observed during a 24-hour period, most of them smelling really phishy.
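A sketch of such a scan with the redis crate; the helper and the pattern are illustrative. SCAN being cursor-based is what keeps the server responsive during the three-minute walk mentioned above.

```rust
use redis::Commands;

// Walk the whole key space with SCAN + MATCH; unlike KEYS, this does not block
// the server while the cursor makes progress.
fn matching_keys(con: &mut redis::Connection, pattern: &str) -> redis::RedisResult<Vec<String>> {
    let keys: Vec<String> = con.scan_match::<_, String>(pattern)?.collect();
    Ok(keys)
}

// Example: matching_keys(&mut con, "*paypal*")
```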
Redis as a graph database
Using Redis as a graph database is not uncommon: Pinterest stores its entire follower graph in Redis, and open source projects such as Related also show that Redis can be an option for dynamic graphs.
The data we have in Redis is a live view of what happened on the Internet over the past 24 hours, as seen from our 23 points of presence.
Raw log files only allowed us to look at fractions of the graph retroactively. But the speed of Redis is a game changer: we can now mine and visualize a graph made of live data, and hopefully use it to identify threats as they are happening.
Besides being blazing fast, Redis provides two powerful features for achieving this goal: the ability to return random keys (entry points to traverse the graph) and server-side Lua scripting.
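A toy example of both features combined: pick a random vertex with RANDOMKEY, then expand two hops server-side with a short Lua script. The key layout matches the earlier sketch; note that reading keys computed inside the script, as done here, is fine on a single instance but would not be valid in cluster mode, and a real traversal would check the type of the random key first.

```rust
// Pick a random entry point, then let Redis walk name -> IPs -> sibling names
// in a single EVAL instead of one round trip per set member.
fn random_two_hop(con: &mut redis::Connection) -> redis::RedisResult<(String, Vec<String>)> {
    // RANDOMKEY returns nil on an empty database.
    let start: Option<String> = redis::cmd("RANDOMKEY").query(con)?;
    let start = match start {
        Some(key) => key,
        None => return Ok((String::new(), Vec::new())),
    };

    let script = redis::Script::new(
        r#"
        local siblings = {}
        local members = redis.call('SMEMBERS', KEYS[1])
        for _, m in ipairs(members) do
            local hop = redis.call('SMEMBERS', m .. '|t|' .. ARGV[1])
            for _, name in ipairs(hop) do
                table.insert(siblings, name)
            end
        end
        return siblings
        "#,
    );
    // Duplicates are left in; deduplication could happen in Lua or client-side.
    let siblings: Vec<String> = script.key(&start).arg("A").invoke(con)?;
    Ok((start, siblings))
}
```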
Big data, small memory footprint
At the time of writing, the graph representing 24 hours of activity, along with the IP reputation from 37 threat intelligence feeds in addition to our own system, is made of approximately 46 million vertices and 174 million edges.
Using Redis, this entire graph requires only 21 GB of memory. Compressed dumps do not exceed 4 GB.
And the current version doesn’t even try to be memory efficient: keys and values are stored as plain ASCII strings. IP addresses would obviously benefit from binary encoding, and a simple Huffman coding of DNS names could also significantly reduce the memory footprint.
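As an example of the kind of saving being described, an IPv4-based key could be packed into 6 raw bytes instead of a dotted-quad string plus separator and type; Redis keys are binary-safe, and this helper is purely illustrative.

```rust
use std::net::Ipv4Addr;

// Pack an IPv4-based key into 6 raw bytes: 4 for the address, 2 for the
// numeric record type.
fn ipv4_key(ip: Ipv4Addr, qtype: u16) -> Vec<u8> {
    let mut key = Vec::with_capacity(6);
    key.extend_from_slice(&ip.octets());         // 4 bytes instead of "127.0.53.53"
    key.extend_from_slice(&qtype.to_be_bytes()); // 2 bytes instead of "|t|A"
    key
}

// Example: ipv4_key("127.0.53.53".parse().unwrap(), 1) -- qtype 1 is A
```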
Caveats
With the addition of a single virtual machine to our research infrastructure, we significantly improved its performance and its robustness: the Hadoop cluster being temporarily inaccessible is not critical any more.
However, Redis has a few peculiarities and limitations to be aware of.
- Redis provides two ways to persist data to disk, with different tradeoffs but a common requirement: memory pages have to be copied if they are modified while a background save operation is in progress. In our case, virtually all the pages get modified before the dump operation completes, meaning that nearly 42 GB of memory are required to save the database without swapping. As an alternative, we tried Ardb. Unfortunately, the RANDOMKEY operation not being supported was a showstopper.
- Keys expire, individual set elements don’t. For this reason, IPs observed more than 24 hours ago can remain in a set as long as new IP addresses for the same name are observed. The practical implications are negligible, and this can be worked around by tagging each element with a coarse timestamp, as sketched below.
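A sketch of that workaround, with hypothetical helper names: each member is prefixed with a coarse day counter, and stale members are filtered out (or could be lazily removed) at read time.

```rust
use redis::Commands;

// Coarse day counter used to tag set members.
fn day_stamp(unix_secs: u64) -> u64 {
    unix_secs / 86_400
}

// Store a member tagged with the day it was observed.
fn add_tagged(con: &mut redis::Connection, key: &str, member: &str, now: u64) -> redis::RedisResult<()> {
    let tagged = format!("{}|{}", day_stamp(now), member);
    let _: i64 = con.sadd(key, tagged)?;
    Ok(())
}

// Return only members tagged today or yesterday, approximating "the past 24 hours".
fn fresh_members(con: &mut redis::Connection, key: &str, now: u64) -> redis::RedisResult<Vec<String>> {
    let today = day_stamp(now);
    let members: Vec<String> = con.smembers(key)?;
    Ok(members
        .into_iter()
        .filter_map(|m| {
            let (stamp, member) = m.split_once('|')?;
            let stamp: u64 = stamp.parse().ok()?;
            (stamp + 1 >= today).then(|| member.to_string())
        })
        .collect())
}
```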
Similar to the servers producing the realtime feeds, the servers populating Redis with data have been written in Rust. Initially, the Redis client library was quite limited, both in terms of usability and features. But massive improvements were recently made, which resulted in an additional performance boost for our applications.
Exciting times
Our DNS database is unique, both in the quality of its data and in its features. The realtime feed already allowed us to watch for specific events and react right after a query was received by the OpenDNS resolvers. We now provide the same timeliness and freshness guarantees for ad-hoc queries.
Keeping the entire graph in memory also opens new perspectives. Reputation systems and classifiers leveraging it can drastically help fight modern malware, which is characterized by its agility.