Elasticsearch: You Know, For Logs [Part 3]

By UmbrellaEngineering
Updated October 15, 2020

Part 3: Searching and Sorting

In Part 2 of this series, Elasticsearch proved that it could be scaled up with relative ease using different instance roles in order to balance the workload. Additionally, Elasticsearch showed that, given proper setup and configuration, it can handle several different failure cases while maintaining high availability.

This post will focus on how to optimize searching and sorting in Elasticsearch for log data.

Log Search

By default, Elasticsearch expects to be querying full-text documents, not log data. As a result, it scores every hit for relevance and sorts the hits by that score, since a single document could contain more than one occurrence of the search term. Log searches rarely need this ranking, and it can be avoided by using a “filtered” query.

For example, the following query uses a filter that doesn’t score its hits:

curl -XGET localhost:9200/_search -d '
{
  "size": 10,
  "query": {
    "filtered": {
      "filter": {
        "term": { "URL": "opendns.com/enterprise-security" }
      }
    }
  }
}
'

This query will return the first 10 documents matching the filter without having to score and sort each document.

Sorted Queries

Logs often need to be returned in a sorted order. Since documents cannot be stored pre-sorted in Elasticsearch, sorting becomes a fairly challenging problem that can greatly slow down queries. Sorting is achieved in Elasticsearch by adding a “sort” clause to a query. Hits of a sorted query will not be scored.

Example:

curl -XGET localhost:9200/_search -d '
{
  "sort": [
    { "IP Address": { "order": "asc" } },
    { "Timestamp": { "order": "desc" } }
  ],
  "query": {
    "filtered": {
      "filter": {
        "term": { "URL": "opendns.com/enterprise-security" }
      }
    }
  }
}
'

The query above matches all logs whose URL field is “opendns.com/enterprise-security” and sorts them in ascending order by IP address. Logs with the same IP address are then sorted in descending order by their timestamp field. Because no “size” is given, only the first 10 sorted hits are returned, which is the default. Elasticsearch will not score any of the documents matching this query since explicit sort fields are given.

Without proper configuration, a sorted query could result in a massive increase in response time. This is because no matter the size of your return set, every document that matches the query must take part in the sort. Even if only the top two documents are requested, Elasticsearch must compare the sort values of every matching document to determine which two come first. This means that a query has to process the same number of values regardless of how many documents it actually returns.
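The cost described above can be illustrated with a small stand-alone sketch. This is not Elasticsearch's code: the class name TopNSketch, the topN method, and the examined counter are made up for illustration, and sort values are modeled as plain timestamps. It shows that even when a bounded heap keeps only n candidates, every matching hit's sort value is still inspected once.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopNSketch {
    /** Number of sort values examined by the last call to topN (illustration only). */
    static int examined = 0;

    /** Returns the n largest timestamps in descending order. */
    static List<Long> topN(List<Long> matchingTimestamps, int n) {
        examined = 0;
        // Min-heap holding the best n values seen so far.
        PriorityQueue<Long> heap = new PriorityQueue<>();
        for (long ts : matchingTimestamps) {
            examined++;          // every hit is inspected, whatever n is
            heap.offer(ts);
            if (heap.size() > n) {
                heap.poll();     // drop the smallest of the n + 1 candidates
            }
        }
        List<Long> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }
}
```

Asking for the top 2 of 6 matching hits still examines all 6 values; only the size of the returned list changes with n.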

The default sorting process in Elasticsearch is simple. If Elasticsearch is asked to return the top n hits of a sorted query, it will do the following:

  • First, each document that matches the query is found as a “hit.”
  • Then, for every hit, each shard loads all the data from the field to be sorted into memory.
  • The field values are then sorted and the top n hits are returned from each shard.
  • Then, the top hits from each shard are aggregated and re-sorted to ensure accuracy of the results for each index.
  • Then, the top hits from each index are aggregated and re-sorted to ensure accuracy of the final results.
  • Finally, the top n hits from the cluster are returned.
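
The steps above can be sketched in plain Java. This is a simplified model under stated assumptions, not Elasticsearch's implementation: shards are modeled as in-memory lists of timestamps, the per-index merge is folded into a single merge step, and the names ShardMergeSketch, shardTopN, and clusterTopN are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ShardMergeSketch {
    /** Sorts one shard's values in descending order and keeps at most n of them. */
    static List<Long> shardTopN(List<Long> shardValues, int n) {
        List<Long> sorted = new ArrayList<>(shardValues);
        sorted.sort(Collections.reverseOrder());
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }

    /** Merges the per-shard candidates and re-sorts them to get the overall top n. */
    static List<Long> clusterTopN(List<List<Long>> shards, int n) {
        List<Long> candidates = new ArrayList<>();
        for (List<Long> shard : shards) {
            candidates.addAll(shardTopN(shard, n));
        }
        // The merged candidates are not globally ordered, so they are sorted again.
        return shardTopN(candidates, n);
    }
}
```

Each shard has to contribute n candidates rather than n divided by the number of shards, because the overall top n could all live on a single shard.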

This process is slow and unavoidable, but Elasticsearch has some options to speed it up.

Field Data

Elasticsearch stores all the values of the fields to be sorted on in its field data cache. The problem is that pulling field data into the cache at request time is a heavy operation, since all of this data is read from disk. This cost can be avoided by populating the field data cache ahead of time, before any query asks for it.

Fortunately, Elasticsearch allows field data to be loaded into the cache at the indexing stage by “eagerly” loading the field data. For example, the following template will eagerly load timestamp data for all documents of type “log” in “dnslog-*” indices:

curl -XPUT localhost:9200/_template/dnslog_template -d '
{
    "template": "dnslog-*",
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1
    },
    "mappings": {
        "log": {
            "properties": {
                "Timestamp": {
                    "type": "date",
                    "fielddata": {
                        "loading": "eager"
                    }
                },
                "URL": {
                    "type": "string",
                    "index": "not_analyzed"
                },
                "IP Address": { "type": "ip" }
            }
        }
    }
}
'

The field data cache is bounded by two settings in the elasticsearch.yml file, one for expiry and one for size:

indices.fielddata.cache.expire: 1d
indices.fielddata.cache.size: 25%

Eagerly loading field data can fill memory quickly if it isn’t controlled. These settings give you options for expiring and limiting field data. In most real-time search applications, recent data receives by far the most traffic. Setting indices.fielddata.cache.expire to 1d therefore ensures that recently indexed data can be sorted quickly, while field data from older indices does not fill up memory. If the field data cache does reach its size limit, it evicts entries using a Least Recently Used (LRU) policy.
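
To make the eviction policy concrete, here is a minimal LRU sketch built on java.util.LinkedHashMap. It only illustrates the idea and is not how Elasticsearch implements its cache: the class name LruCacheSketch is hypothetical, and the limit is a simple entry count, whereas the real setting above is expressed as a share of memory.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruCacheSketch(int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; evicts the least recently used entry
        // once the cache holds more than maxEntries entries.
        return size() > maxEntries;
    }
}
```

With a limit of two entries, reading the first entry and then inserting a third one evicts the second entry, since it is the one that was used least recently.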

Doc Values

Although using in-memory field data is the fastest way to sort, field data can consume a lot of memory. For large enough volumes of data, maintaining an in-memory field data cache may significantly slow down the application or simply be infeasible. Luckily, Elasticsearch provides an alternative to in-memory field data called “Doc Values”. Doc Values is a field data format, essentially the same as in-memory field data, except that it is stored on disk instead of in memory. Enabling Doc Values means more memory for other operations and fewer garbage collections, at the cost of only a small decrease in sorting speed (10-25% according to Elastic).

Doc Values is enabled by changing the field data format to “doc_values” in the mapping for the field(s) that will be sorted on. For example:

"log" : {
            "properties" : {
                 "Timestamp" : {"type" : "date",
                              "fielddata": {
                                 "format": "doc_values"
                                 }
                             },
                 "URL" : {"type" : "string",
                            "index": "not_analyzed"},
                 "IP Address" : {"type": "ip"}
            }
        }

Once the field data format is set to “doc_values”, Elasticsearch will automatically store the field data on disk for each document when it is indexed. It will also automatically start using the on-disk field data for sorting.

Minimize Sorting

Another option that can potentially reduce the response time of sorted queries is to take advantage of your index schema. For example, if indices are partitioned by time period, such as by day or by hour, and results are being sorted on the Timestamp field, there is no need to sort every hit. It is far more efficient to first query the most recent index, only moving on to the next index if the number of hits from the first query is smaller than the desired number of results. Applying this technique keeps the number of values that must be sorted to a minimum.

This sorting technique is not supported by Elasticsearch itself and thus must be implemented by the Elasticsearch client application. This can be done using a simple loop that takes into account the number of hits collected after each query. Here is an example using the Elasticsearch Java API:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.AndFilterBuilder;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.sort.SortOrder;
import org.joda.time.DateTime;

public List<Map<String, Object>> sortedQuery(Client c) {
    // Number of results still needed
    int limit = 100;
    List<Map<String, Object>> results = new ArrayList<Map<String, Object>>();
    DateTime time = new DateTime(System.currentTimeMillis());
    String indexPrefix = "dns-test-";

    // Build a simple query that filters on the "URL" field
    AndFilterBuilder queryFilter = FilterBuilders.andFilter();
    queryFilter.add(FilterBuilders.termFilter("URL",
            "opendns.com/enterprise-security"));

    // Note: the loop assumes enough matching logs exist. A production version
    // should also stop once it reaches the oldest available index.
    while (limit > 0) {
        // Calculate which index to query using the prefix and an hourly suffix
        String index = indexPrefix + time.toString("yyyy-MM-dd-HH");

        // Execute the query, asking only for the results still needed
        SearchResponse response = c.prepareSearch(index)
                .addSort("Timestamp", SortOrder.DESC)
                .setSize(limit)
                .setExplain(true)
                .setPostFilter(queryFilter)
                .execute()
                .actionGet();

        // The response holds at most "limit" hits, even if more documents
        // match, so iterate over the returned hits rather than the total count
        SearchHit[] hits = response.getHits().getHits();
        for (SearchHit hit : hits) {
            results.add(hit.getSource());
        }

        // Update the limit since we have already found some hits
        limit = limit - hits.length;

        // Step back one hour to follow an hourly index schema
        time = time.minusHours(1);
    }
    return results;
}

This function executes a sorted query on one index at a time in reverse chronological order, starting with the index corresponding to the current time. At the end of each loop iteration, time is decremented by one hour in order to follow an hourly index schema.
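
The index-name arithmetic in the function above can also be tried without an Elasticsearch cluster. The following sketch uses java.time instead of Joda-Time and assumes the same “dns-test-” prefix and hourly suffix pattern. The class name HourlyIndexNames, the method newestFirst, and the count parameter (a cap on how many indices to visit) are hypothetical names introduced for illustration.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

public class HourlyIndexNames {
    // Same suffix pattern as the hourly index schema used in this post.
    private static final DateTimeFormatter SUFFIX =
            DateTimeFormatter.ofPattern("yyyy-MM-dd-HH");

    /** Lists up to count hourly index names, newest first, starting at the given time. */
    static List<String> newestFirst(String prefix, LocalDateTime start, int count) {
        List<String> names = new ArrayList<>();
        LocalDateTime t = start;
        for (int i = 0; i < count; i++) {
            names.add(prefix + t.format(SUFFIX));
            t = t.minusHours(1); // step back one hourly index
        }
        return names;
    }
}
```

Capping the number of indices this way is one possible guard against looping forever when fewer matching logs exist than the caller asked for.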

Conclusion

Elasticsearch has shown that it can be configured to competently search log data without wastefully scoring and sorting results. Additionally, by querying only the required indices and properly storing field data, Elasticsearch can return sorted queries quickly and efficiently.

In Part 4 of this series, we will explore more advanced techniques and configuration in Elasticsearch that can take your cluster to the next level.

Continue to: Elasticsearch: You Know, For Logs [Part 4].
