Previously, we introduced our real-time API, and Senior Research Scientist Ping Yan recently blogged about how she used it to find Black Friday scams.
The data feed, described in the post mentioned above, is constantly consumed by multiple processors or stream interpreters. In this blog post, we will focus on one processor dedicated to spotting a specific category of suspicious IP addresses.
It is uncommon for an IP address to suddenly have many new domain names map to it when there were none before. Of course, a hosting service, a load-balancing service, a CDN, or a user moving a lot of domains to a new server can follow this pattern, but benign cases are both infrequent and relatively easy to distinguish from suspicious activity.
In our research, we define an IP address as being “dormant” if fewer than N names mapping to it have been observed in the past 7 days, and as “hyperactive” if more than M names mapping to it have been observed during the past 4 hours.
One stream we generate is a list of recently observed pairs (name, IP address). This stream is a perfect candidate for our task.
{"asn":30962,"name":"dentro.de.","owner":"dentro.de.","rr":"62.108.32.81","server_ip":"82.115.108.50","ts":1386104400,"ttl":3600,"type":"A"}
{"asn":8972,"name":"www.benm.at.","owner":"benm.at.","rr":"80.86.80.177","server_ip":"193.46.215.55","ts":1386104400,"ttl":900,"type":"A"}
{"asn":25847,"name":"model-trains-store.com.","owner":"model-trains-store.com.","rr":"64.64.3.139","server_ip":"64.64.3.136","ts":1386104400,"ttl":14400,"type":"A"}
{"asn":8685,"name":"www.engin.tv.","owner":"engin.tv.","rr":"213.155.113.195","server_ip":"212.58.3.7","ts":1386104400,"ttl":600,"type":"A"}
{"asn":29648,"name":"info-03.surgutneftegas.ru.","owner":"surgutneftegas.ru.","rr":"77.233.191.6","server_ip":"83.149.32.2","ts":1386104400,"ttl":3600,"type":"A"}
{"asn":20485,"name":"info-03.surgutneftegas.ru.","owner":"surgutneftegas.ru.","rr":"62.33.202.6","server_ip":"83.149.32.2","ts":1386104400,"ttl":3600,"type":"A"}
{"asn":3462,"name":"36-233-153-101.dynamic-ip.hinet.net.","owner":"dynamic-ip.hinet.net.","rr":"36.233.153.101","server_ip":"168.95.1.19","ts":1386104400,"ttl":86400,"type":"A"}
{"asn":20773,"name":"www.electronic-thingks.de.","owner":"electronic-thingks.de.","rr":"83.169.26.138","server_ip":"80.237.128.10","ts":1386104400,"ttl":86400,"type":"A"}
{"asn":9198,"name":"89.218.160.130.metro.online.kz.","owner":"metro.online.kz.","rr":"89.218.160.130","server_ip":"212.19.149.53","ts":1386104400,"ttl":86400,"type":"A"}
However, keeping track of all the names observed for all the IPs observed can require quite a lot of memory, especially when all we need is a bunch of counters.
Furthermore, these counters do not have to be accurate. When an IP address becomes “hyperactive,” new names are usually piling up at a very high rate, so the IP will eventually be labeled.
Instead of keeping track of individual domain names that mapped to each IP, we use the HyperLogLog algorithm that we ported to the Rust programming language.
The beauty of this algorithm is that the cost of each update and the memory used by an estimator remain constant no matter how many elements are in the set.
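To make the constant-memory claim concrete, here is a minimal sketch of a HyperLogLog estimator in Rust. It is not the port mentioned above: the type name `HyperLogLog`, the choice of 4096 registers, and the use of the standard library's `DefaultHasher` are all assumptions made for illustration. Every estimator holds the same fixed number of one-byte registers, whether it has seen ten names or ten million.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Assumed precision: 12 index bits, i.e. 4096 one-byte registers per estimator.
const PRECISION: u32 = 12;
const NUM_REGISTERS: usize = 1 << PRECISION;

/// Illustrative HyperLogLog estimator (hypothetical type, not the authors' port).
pub struct HyperLogLog {
    registers: Vec<u8>,
}

impl HyperLogLog {
    pub fn new() -> Self {
        HyperLogLog { registers: vec![0; NUM_REGISTERS] }
    }

    /// Record one element: one hash computation and one register update.
    pub fn insert<T: Hash + ?Sized>(&mut self, item: &T) {
        let mut hasher = DefaultHasher::new();
        item.hash(&mut hasher);
        let h = hasher.finish();
        // The top PRECISION bits of the hash select a register.
        let idx = (h >> (64 - PRECISION)) as usize;
        // The remaining bits are moved to the top; a guard bit bounds the zero run.
        let rest = (h << PRECISION) | (1u64 << (PRECISION - 1));
        // Rank = position of the first 1 bit in the remaining bits.
        let rank = rest.leading_zeros() as u8 + 1;
        if rank > self.registers[idx] {
            self.registers[idx] = rank;
        }
    }

    /// Estimate the number of distinct elements inserted so far.
    pub fn estimate(&self) -> f64 {
        let m = NUM_REGISTERS as f64;
        let alpha = 0.7213 / (1.0 + 1.079 / m);
        let sum: f64 = self.registers.iter().map(|&r| 2f64.powi(-(r as i32))).sum();
        let raw = alpha * m * m / sum;
        let zeros = self.registers.iter().filter(|&&r| r == 0).count();
        if raw <= 2.5 * m && zeros > 0 {
            // Small-range correction: fall back to linear counting.
            m * (m / zeros as f64).ln()
        } else {
            raw
        }
    }
}
```

With 4096 registers, the expected relative error is roughly 1.6%, which is more than enough precision for thresholds as coarse as the ones used here.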
Our stream processor keeps an in-memory set of IPs, and for each IP, two HyperLogLog estimators.
The former (“current”) estimates the number of names recently observed for a given IP. The latter (“archive”) estimates the number of names observed more than 4 hours ago.
When a new entry for an IP is read from the stream, we check the age of the “current” estimator. If this estimator has been in use for more than 4 hours, we merge its content into the “archive” estimator and reset the “current” estimator.
Thanks to the HyperLogLog algorithm, merging is a very fast and constant-time operation.
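Merging two HyperLogLog estimators amounts to taking, register by register, the larger of the two values. The function below is a sketch of that step on raw register arrays; the name `merge_registers` and the byte-slice representation are assumptions for illustration, not the API of any particular library. The work depends only on the number of registers, which is fixed, and not on how many names each estimator has seen.

```rust
/// Merge `src` into `dst`: each register keeps the larger of the two ranks.
/// Both sketches must have been built with the same register count and hash function.
pub fn merge_registers(dst: &mut [u8], src: &[u8]) {
    assert_eq!(dst.len(), src.len(), "sketches must have the same precision");
    for (d, s) in dst.iter_mut().zip(src) {
        *d = (*d).max(*s);
    }
}
```

The merged registers are identical to the ones a single estimator would hold had it observed both sets of names, so nothing is lost when the “current” estimator is folded into the “archive” one.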
In order to detect hyperactive IPs that recently transitioned from being dormant, the stream processor estimates the number of names seen for each IP using the “archive” estimator, then the number of names seen for the same IP using the “current” estimator. If the archived count is below N (which we empirically set to 3) and the current count is above or equal to M (currently 10), we print the current cardinality, the name and the IP:
88 5fd40.93taotao.com. 23.104.41.152
52 2l7d9.jjrnp.com. 23.244.38.15
153 14q3f.wzstorm.com. 23.244.38.77
107 shishicaizuiyizhongjiangdewanfa.gzhsfisher.com. 23.235.132.36
71 qo73p.yqhxnhcl.com. 172.246.178.62
95 mianfeiqipaiyouxipingtai.gzaqgy.com. 23.244.57.126
136 35441.dlyjzs.com. 23.244.38.85
46 ppyulechengwangzhandizhishishime.5udate.com. 173.234.231.103
99 ouzhoubeijuesai.axcych58.com. 23.244.57.92
45 gongjihuichengyuan.jjkho.com. 5.226.171.35
12 overlay.ringtonematcher.com. 216.137.55.127
46 i-mhow.com. 141.101.117.162
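The rotation and detection steps described above can be sketched as follows. To keep the example self-contained, an exact `HashSet` stands in for the HyperLogLog estimators, so this is explicitly not the memory-efficient version; `Counter`, `IpState`, `Detector` and `observe` are hypothetical names. The thresholds are the ones quoted in the text (N = 3, M = 10, a 4-hour window). The sketch never expires names from the archive, so the 7-day horizon is left out.

```rust
use std::collections::{HashMap, HashSet};

const WINDOW_SECS: u64 = 4 * 3600; // maximum age of the "current" estimator
const N: usize = 3; // dormant: fewer than N archived names
const M: usize = 10; // hyperactive: at least M current names

/// Exact stand-in for a cardinality estimator (NOT HyperLogLog).
#[derive(Default)]
struct Counter {
    names: HashSet<String>,
}

impl Counter {
    fn insert(&mut self, name: &str) {
        self.names.insert(name.to_string());
    }
    fn merge(&mut self, other: &Counter) {
        self.names.extend(other.names.iter().cloned());
    }
    fn count(&self) -> usize {
        self.names.len()
    }
}

struct IpState {
    current: Counter,
    archive: Counter,
    current_since: u64, // timestamp at which `current` was last reset
}

#[derive(Default)]
pub struct Detector {
    ips: HashMap<String, IpState>,
}

impl Detector {
    /// Feed one (name, ip, timestamp) observation. Returns the current
    /// cardinality when the IP looks dormant in the archive but hyperactive now.
    pub fn observe(&mut self, name: &str, ip: &str, ts: u64) -> Option<usize> {
        let state = self.ips.entry(ip.to_string()).or_insert_with(|| IpState {
            current: Counter::default(),
            archive: Counter::default(),
            current_since: ts,
        });
        // Rotate: fold an expired "current" estimator into the archive, then reset it.
        if ts.saturating_sub(state.current_since) > WINDOW_SECS {
            let old = std::mem::take(&mut state.current);
            state.archive.merge(&old);
            state.current_since = ts;
        }
        state.current.insert(name);
        let (archived, current) = (state.archive.count(), state.current.count());
        if archived < N && current >= M {
            Some(current)
        } else {
            None
        }
    }
}
```

Swapping `Counter` for a real HyperLogLog estimator would leave `observe` unchanged, since it only relies on insert, merge and count.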
Sorting recent entries of this new stream yields domain names mapping to the most hyperactive IPs:
571 sge.su
553 sxo.su
These domains happen to be currently used by the Caphaw trojan.
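The sorting step is simple enough to show in a few lines. This sketch assumes the processor's output uses the "count name ip" layout shown earlier, one entry per line; the function name `top_entries` is hypothetical.

```rust
/// Parse "count name ip" lines and return (count, name) pairs, largest count first.
/// Lines that do not start with a numeric count followed by a name are skipped.
pub fn top_entries(lines: &str) -> Vec<(u64, String)> {
    let mut out: Vec<(u64, String)> = lines
        .lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            let count = fields.next()?.parse::<u64>().ok()?;
            let name = fields.next()?.to_string();
            Some((count, name))
        })
        .collect();
    // Stable sort, descending by count.
    out.sort_by(|a, b| b.0.cmp(&a.0));
    out
}
```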
Filtering by name patterns and TTLs immediately shows more interesting domains (listed below) being used by the Nuclear exploit pack:
81 thinkmetal.biz
46 cosmogift.biz
37 lightcasa.biz
36 movieprice.biz
32 moviehello.biz
31 timequality.biz
31 infoobesity.biz
31 comwin.biz
30 flypanda.biz
26 expertsurvey.biz
20 eurosync.biz
18 spymac.biz
18 sharerebel.biz
16 cybervirtual.biz
10 drcoupon.biz
These domains can be active for a very short period of time, so blocking them as fast as possible is critical.
To put all this in context, the OpenDNS Security Graph is centered on the concept of being fast, predictive, and adaptive. We want to block malware and botnets before they even manifest themselves as a problem. The real-time API, and the stream processors built on it, allow us to react very quickly, even before the data is recorded in our databases. Sketching algorithms such as HyperLogLog make that possible on big data, with little effort, little hardware, and low latency.