Discovering Malicious Domains Using Co-Occurrences

Security Research Team
Updated — March 5, 2020 • 6 minute read

The infection chain for serving a single piece of malware is frequently made up of many constantly changing domains. The security community finds thousands of new sites serving malware or acting as intermediaries every day. Hosts used to control botnets also change constantly in order to be resilient to takedowns.

In this context, we need to discover and block new suspicious domains as soon as possible. In order to do so, we use different models, each of them capturing different sets of domains. Once we have evidence of a server distributing malware or acting as a command-and-control server, the first thing we usually do is try to find other domains used, or soon-to-be-used, by the same malware.

From DNS queries to a discovery algorithm

Let’s take a look at the data we have, and at how we use it to discover new, possibly related domains starting from a given domain name.

The log files we get from DNS resolvers are unstructured text files, containing the responses we send to clients (in this snippet, IPs have been made up):

2013-07-18 13:59:26.397060500 9.2 74.234.12.159 74.234.12.159 208.67.222.222 normal 0 - www.youtube.com. 1 0 - 0 800000 0 com.youtube m1.wrw
2013-07-18 13:59:26.397062500 9.2 74.234.14.200 74.234.14.200 208.67.222.222 normal 0 - dnl-01.geo.kaspersky.com. 1 0 - 0 102000000 0 com.kaspersky m1.wrw
2013-07-18 13:59:26.397063500 9.2 74.231.231.2 74.231.231.2 208.67.222.222 normal 0 - jelcz.sexibl.com. 1 0 - 0 0 0 com.sexibl m1.wrw
2013-07-18 13:59:26.397065500 9.2 47.136.196.21 47.136.196.21 208.67.222.222 normal 0 - outerlink.com. 15 0 - 0 0 0 com.outerlink m1.wrw
2013-07-18 13:59:26.397066500 9.2 93.47.212.8 93.47.212.8 208.67.222.222 normal 0 - aol.com. 15 0 - 0 80A40000 0 com.aol m1.wrw
2013-07-18 13:59:26.397085500 9.2 91.74.34.126 91.74.34.126 208.67.222.222 normal 0 - img.zszywka.pl. 1 0 - 0 0 0 pl.zszywka m1.wrw
2013-07-18 13:59:26.397086500 9.2 122.162.60.1 122.162.60.1 208.67.222.222 normal 0 - rtb.metrigo.com. 1 0 - 0 0 0 com.metrigo m1.wrw
2013-07-18 13:59:26.397087500 9.2 213.199.198.182 213.199.198.182 208.67.220.220 normal 0 - www.ogame.com.ar. 1 0 - 0 800 0 ar.com.ogame m1.wrw
2013-07-18 13:59:26.397088500 9.2 213.199.198.182 213.199.198.182 208.67.222.222 normal 0 - www.ogame.com.ar. 1 0 - 0 800 0 ar.com.ogame m1.wrw
2013-07-18 13:59:26.397092500 9.2 176.111.228.151 176.111.228.151 208.67.222.222 normal 0 - osne.ws. 1 0 - 0 0 0 ws.osne m1.wrw
2013-07-18 13:59:26.397093500 9.2 92.61.45.119 92.61.45.119 208.67.222.222 normal 0 - alt3.gmail-smtp-in.l.google.com. 1 0 - 0 800000 0 com.google m1.wrw
2013-07-18 13:59:26.397094500 9.2 47.136.211.87 47.136.211.87 208.67.220.220 normal 0 - mx.dca.untd.com. 1 0 - 0 102000000 0 com.untd m1.wrw
2013-07-18 13:59:26.397099500 9.2 47.136.211.87 47.136.211.87 208.67.222.222 normal 0 - yahoo.com. 15 0 - 0 800000 0 com.yahoo m1.wrw
2013-07-18 13:59:26.397100500 9.2 93.127.85.16 93.127.85.16 208.67.222.222 normal 0 - mx-1.naver.com. 1 0 - 0 1000840000 0 com.naver m1.wrw
2013-07-18 13:59:26.397101500 9.2 91.253.53.128 91.253.53.128 208.67.220.220 normal 0 - 2.s.dziennik.pl. 1 0 - 0 40000 0 pl.dziennik m1.wrw
2013-07-18 13:59:26.397102500 9.2 74.234.15.41 74.234.15.41 208.67.220.220 normal 0 - ssl.gstatic.com. 1 0 - 0 0 0 com.gstatic m1.wrw
2013-07-18 13:59:26.397105500 9.2 37.47.206.18 37.47.206.18 208.67.222.222 normal 0 - nickel.champlain.edu. 1 0 - 0 200000000 0 edu.champlain m1.wrw

Basically, the only information we have for each query is an approximate timestamp, a client IP address, a query type and the name.
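As a rough illustration, the four fields mentioned above can be pulled out of a log line with a few lines of Python. The field positions are an assumption inferred from the sample lines, not a documented format, and `Query` and `parse_line` are hypothetical names. In the DNS protocol, query type 1 is an A record and 15 is an MX record.

```python
from datetime import datetime
from typing import NamedTuple


class Query(NamedTuple):
    timestamp: datetime
    client_ip: str
    qtype: int
    name: str


def parse_line(line: str) -> Query:
    """Parse one resolver log line (field positions are assumed)."""
    fields = line.split()
    date, time = fields[0], fields[1]
    # The log has nanosecond precision; datetime only supports microseconds,
    # so the fractional part is truncated to 6 digits.
    seconds, _, fraction = time.partition(".")
    ts = datetime.strptime(f"{date} {seconds}.{fraction[:6]}",
                           "%Y-%m-%d %H:%M:%S.%f")
    return Query(
        timestamp=ts,
        client_ip=fields[3],                  # assumed: client IP address
        qtype=int(fields[10]),                # assumed: numeric query type
        name=fields[9].rstrip(".").lower(),   # assumed: queried name
    )
```

Everything else on the line is ignored here, since the rest of the post only relies on these four values.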

Unlike web sites, there are no explicit links between resources. And unlike the HTTP protocol, the DNS protocol doesn’t provide any tracking information like refer(r)ers or cookies.

An intuition, though, is that temporal proximity can be used to predict how related two domain names might be.

Cleaning the data

While data preparation might not sound like the most exciting step of an algorithm, it actually plays a critical role in this case.

Remember that we don’t have any reliable way to identify the device that sent a query. All we see is an IP address. But home users are frequently assigned a dynamic IP address.

In addition, many devices sitting behind a router can share a single external IP address. In this case, we can see many queries within a short time window that are totally unrelated to each other.

We also observe client IP addresses sending a lot of queries in a row for a small number of domains, which can introduce bias.

The first thing we do to mitigate these cases is remove entries from client IPs that have sent more queries than 99.9% of client IPs. The assumption here is that these client IPs are likely to be used by many devices at the same time. They might also be the target of a DNS amplification attack. Ignoring them improves the performance of our algorithm.

We then remove duplicate (client ip, name) tuples, keeping only the most recent entries.

This ensures that a pair of names can’t be seen more than once for a given IP, and reduces the number of queries for a single hour from 4 billion down to around 400 million.

We also remove queries for invalid domain names, which account for 2% of the unique (client ip, name) tuples.
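The first two cleaning steps can be sketched as follows. This is a minimal in-memory illustration, not the production pipeline: `clean` is a hypothetical helper, queries are assumed to be `(timestamp, client_ip, name)` tuples, and the removal of invalid names is left out.

```python
from bisect import bisect_left
from collections import Counter


def clean(queries, percentile=0.999):
    """Drop the heaviest client IPs, then keep the latest (ip, name) entry."""
    queries = list(queries)
    counts = Counter(ip for _, ip, _ in queries)
    ranked = sorted(counts.values())
    n = len(ranked)

    def is_heavy(ip):
        # Fraction of client IPs that sent strictly fewer queries than `ip`.
        return bisect_left(ranked, counts[ip]) / n >= percentile

    latest = {}
    for ts, ip, name in queries:
        if is_heavy(ip):
            continue
        key = (ip, name)
        # Keep only the most recent timestamp for each (client ip, name).
        if key not in latest or ts > latest[key]:
            latest[key] = ts
    return latest
```

The result maps each remaining (client ip, name) pair to a single timestamp, so a pair of names can be counted at most once per client IP.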

Temporal proximity of related malicious domains

We need to validate our intuition that DNS queries for related domains might be frequently observed in a small time window.

Let \(M\) be the set of domain names that we already flagged as malicious, and \(t_i(c)\) the timestamp of the first query sent by a client \(c\) for the domain \(i\).

For each client, we find all the unique pairs of domains \((i,j)\) such that \((i,j) \in M^2\) and \(i < j\). We then compute the time difference \(\left| t_i(c)-t_j(c) \right|\), which can’t be zero due to the architecture of our DNS resolvers.
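A small sketch of this step, assuming the first-query timestamps have already been collected per client in a nested dictionary (`first_seen` and `malicious_time_diffs` are hypothetical names, and timestamps are plain numbers in seconds):

```python
from itertools import combinations


def malicious_time_diffs(first_seen, malicious):
    """first_seen: {client: {domain: timestamp of the first query}}."""
    diffs = []
    for seen in first_seen.values():
        flagged = sorted(d for d in seen if d in malicious)
        # combinations() over a sorted list yields each pair once, with i < j.
        for i, j in combinations(flagged, 2):
            diffs.append(abs(seen[i] - seen[j]))
    return diffs
```

The returned list is the raw material for a histogram of time differences between malicious domains.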

The histogram of \(\left| t_i(c)-t_j(c) \right|\) for all clients and all pairs of malicious domains appears to be gamma distributed, with a shape of 0.56 and a scale of 134.65, for timestamps expressed in seconds.

[Figure: distribution of the time differences]

The probability for two malicious domains looked up by a client \(c\) to be related can thus be expressed as

\[ f_c(i,j) = \frac{0.04 \, e^{-0.007 \left| t_i(c)-t_j(c) \right|}}{\left| t_i(c)-t_j(c) \right|^{0.436}} \]
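The constants in this expression appear to be those of a gamma density with the shape and scale quoted earlier. The sketch below recomputes them with the standard library only, using the standard gamma density formula. The exponent 0.436 in the text presumably comes from an unrounded shape value, which is an assumption on our part.

```python
import math

SHAPE = 0.56     # shape quoted in the text
SCALE = 134.65   # scale quoted in the text, in seconds


def gamma_pdf(x, shape=SHAPE, scale=SCALE):
    """Gamma density: x^(k-1) * exp(-x/theta) / (Gamma(k) * theta^k)."""
    return (x ** (shape - 1) * math.exp(-x / scale)) / (
        math.gamma(shape) * scale ** shape
    )


# Leading coefficient and decay rate of the density.
coefficient = 1 / (math.gamma(SHAPE) * SCALE ** SHAPE)
rate = 1 / SCALE
```

With these parameters the coefficient comes out at roughly 0.04 and the rate at roughly 0.007, matching the formula. The density decreases steadily as the time difference grows, so queries sent close together weigh the most.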

The co-occurrence score

Let \(D_c\) be the set of domain names looked up by a client \(c\).

In order to compute a global co-occurrence score, we define \(g(i,j)\) as the summation of \(f_c(i,j)\) for every client \(c\) having sent queries both for \(i\) and for \(j\). If no clients have sent a query for both \(i\) and \(j\), \(g(i,j) = 0\).
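A minimal sketch of this summation, reusing the constants from the per-client formula. The input layout and the names `f` and `cooccurrence` are assumptions made for illustration, and time differences are assumed to be nonzero, as stated earlier.

```python
import math
from collections import defaultdict
from itertools import combinations


def f(dt):
    """Per-client score for a time difference dt, in seconds (dt != 0)."""
    return 0.04 * math.exp(-0.007 * dt) / dt ** 0.436


def cooccurrence(first_seen):
    """first_seen: {client: {domain: first-query timestamp}} -> {(i, j): g}."""
    g = defaultdict(float)
    for seen in first_seen.values():
        # Every pair of domains looked up by this client contributes once.
        for i, j in combinations(sorted(seen), 2):
            score = f(abs(seen[i] - seen[j]))
            g[(i, j)] += score
            g[(j, i)] += score  # store both orders: g is symmetric
    return g
```

Pairs that no client looked up together never appear in the dictionary, so reading them with `g.get((i, j), 0.0)` gives the zero required by the definition.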

For any given malicious domain, we could use this function to find other domains sharing the same web pages, the same infection chain, or serving the needs of the same malware sample.

Unfortunately, this doesn’t work very well in practice.

It’s not uncommon for malicious web sites to load resources from third parties, like banners and social network widgets.
Furthermore, users who are infected or in the process of being infected keep reading their email and browsing common web services.

For example, queries for google-analytics.com, yieldmanager.com, scorecardresearch.com, msftncsi.com and apple.com are frequently seen around the same time as queries for unrelated domains, including malicious ones.

And the high co-occurrence score of these domains paired with others doesn’t help us discover new domain names that we should take a closer look at.

We thus refine the function to lower the score for \((i,j)\) if \(j\) happens to be a domain requested by a lot of client IPs.

Let \(D\) be the set of all domain names observed.

\[ s(i,j) = \frac{g(i,j)}{\sum_{k \in D} g(k,j)} \]

The actual score we use is normalized:

\[ s'(i,j) = \frac{g(i,j)}{\left( \sum_{k \in D} g(k,j) \right) \sum_{k \in D} s(i,k)} \]
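To make the two formulas concrete, here is a small sketch that computes the normalized score for one seed domain. It assumes the co-occurrence values are held in a dictionary keyed by ordered pairs, with both orders present. `normalized_scores` is a hypothetical name.

```python
from collections import defaultdict


def normalized_scores(g, i):
    """g: {(k, j): co-occurrence}. Returns {j: s'(i, j)} for seed domain i."""
    # Denominator of s: total co-occurrence of each j with every k in D.
    totals = defaultdict(float)
    for (k, j), value in g.items():
        totals[j] += value
    # s(i, j) = g(i, j) / sum_k g(k, j)
    s = {j: value / totals[j]
         for (k, j), value in g.items() if k == i and totals[j] > 0}
    # s'(i, j) = s(i, j) / sum_k s(i, k)
    norm = sum(s.values())
    return {j: value / norm for j, value in s.items()} if norm else {}
```

For example, if a seed co-occurs equally with a rarely queried domain and with a very popular one, the popular domain ends up with a much lower score, which is the intended effect of dividing by its total co-occurrence.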

Two case studies

A few months ago, suspicious queries for www1.hsbc.ca caught our attention.

This name didn’t resolve and never had, yet we suddenly saw a spike of traffic for it.

Like many .ca domain names, hsbc.ca usually gets traffic coming from clients in Canada, the US and Great Britain, but the geographic distribution we observed for www1.hsbc.ca was also unusual:

[Image: Screen Shot 2013-07-22 at 4.13.42 PM]

The highest co-occurrence scores for domains paired with www1.hsbc.ca were:

[Image: Screen Shot 2013-07-22 at 4.16.32 PM]

A new DGA pattern was clearly emerging here.

Diving into the co-occurrences for these DGA domains unveiled many more domains following the same pattern.

These domains happened to be C&C domains for the W32.Xpiro.D malware.

More recently, a slew of weight loss spam hit our mailboxes. Given only one domain name, we were able to figure out more, even if they happen to be hosted on different IPs:

[Image: Screen Shot 2013-07-22 at 4.41.29 PM]

By looking further at the co-occurrences, we also found the SEO/Social Media Marketing company that is very likely to be responsible for this campaign.

[Image: Screen Shot 2013-07-22 at 5.01.56 PM]

More use for the co-occurrence score

The co-occurrence score has proven to be very useful for discovering new domain names from domain names that we already know to be malicious.

But it is also the foundation for another algorithm that can lead to new malicious domains without having to explicitly provide a starting point. This algorithm will be described in detail in a forthcoming blog post.
