Research

Elasticsearch: You Know, For Logs [Part 3]

By OpenDNS Engineering
Posted on May 12, 2015
Updated on October 15, 2020


Part 3: Searching and Sorting

In Part 2 of this series, Elasticsearch proved that it can be scaled up with relative ease by assigning different roles to instances in order to balance the workload. It also showed that, given proper setup and configuration, it can handle several different failure cases while maintaining high availability.

This post will focus on how to optimize searching and sorting in Elasticsearch for log data.

Log Search

By default, Elasticsearch expects to be querying full-text documents, not log data. As a result, it will score and sort each hit by relevance, since a single document could contain more than one occurrence of the search term. This overhead can be avoided when searching log data by using a “filtered” query.

For example, the following query uses a filter that doesn’t score its hits:

curl -XGET localhost:9200/_search -d '
{
  "size" : 10,
  "query": {
    "filtered": {
      "filter": {
        "term": { "URL": "opendns.com/enterprise-security" }
      }
    }
  }
}
'

This query will return the first 10 documents matching the filter without having to score and sort each document.
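For comparison, here is a sketch of the same search written as a plain scored query. Because the term appears in the query clause rather than in a filter, Elasticsearch computes a relevance score for every hit before returning results:

curl -XGET localhost:9200/_search -d '
{
  "size" : 10,
  "query": {
    "term": { "URL": "opendns.com/enterprise-security" }
  }
}
'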

Sorted Queries

Logs often need to be returned in a sorted order. Since documents cannot be stored pre-sorted in Elasticsearch, sorting becomes a fairly challenging problem that can greatly slow down queries. Sorting is achieved in Elasticsearch by adding a “sort” clause to a query. Hits of a sorted query will not be scored.

Example:

curl -XGET localhost:9200/_search -d '
{
  "sort" : [
    { "IP Address" : { "order" : "asc" } },
    { "Timestamp" : { "order" : "desc" } }
  ],
  "query": {
    "filtered": {
      "filter": {
        "term": { "URL": "opendns.com/enterprise-security" }
      }
    }
  }
}
'

The query above will return all logs whose URL field is “opendns.com/enterprise-security,” sorted in ascending order by IP address. Logs with the same IP address are then sorted in descending order by their timestamp field. Elasticsearch will not score any of the documents matching this query because an explicit sort has been specified.

Without proper configuration, a sorted query can cause a massive increase in response time. This is because, no matter the size of the return set, every document that matches the query must be sorted. Even if only the top two documents are requested, Elasticsearch must sort every matching document to determine which two those are. In other words, a query has to sort the same number of values regardless of how many documents it actually returns.
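As a concrete illustration, the following sketch asks for only the top two documents, yet Elasticsearch still has to sort every document matching the filter in order to decide which two to return:

curl -XGET localhost:9200/_search -d '
{
  "size" : 2,
  "sort" : [
    { "Timestamp" : { "order" : "desc" } }
  ],
  "query": {
    "filtered": {
      "filter": {
        "term": { "URL": "opendns.com/enterprise-security" }
      }
    }
  }
}
'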

The default sorting process in Elasticsearch is simple. If Elasticsearch is asked to return the top n hits of a sorted query, it will do the following:

  • First, each document that matches the query is identified as a “hit.”
  • Then, for every hit, each shard loads the data from the field being sorted on into memory.
  • The field values are then sorted, and the top n hits are returned from each shard.
  • The top hits from each shard are aggregated and re-sorted to ensure accurate results for each index.
  • The top hits from each index are then aggregated and re-sorted to ensure accurate final results.
  • Finally, the top n hits from the cluster are returned.

This process is slow and unavoidable, but Elasticsearch has some options to speed it up.

Field Data

Elasticsearch keeps the values of the fields being sorted on in its field data cache. The problem is that pulling field data into the cache at query time is a heavy operation, since all of that data has to be read from disk. This cost can be avoided by loading field data into the cache ahead of time.

Fortunately, Elasticsearch allows field data to be loaded into the cache at indexing time by “eagerly” loading the field data. For example, the following template eagerly loads timestamp data for all documents of type “log” in “dnslog-*” indices:

curl -XPUT localhost:9200/_template/dnslog_template -d '
{
    "template" : "dnslog-*",
    "settings" : {
        "number_of_shards" : 3,
        "number_of_replicas" : 1
    },
    "mappings" : {
        "log" : {
            "properties" : {
                "Timestamp" : {
                    "type" : "date",
                    "fielddata" : { "loading" : "eager" }
                },
                "URL" : {
                    "type" : "string",
                    "index" : "not_analyzed"
                },
                "IP Address" : { "type" : "ip" }
            }
        }
    }
}
'

The field data cache is controlled by two settings in the elasticsearch.yml file:

indices.fielddata.cache.expire: 1d
indices.fielddata.cache.size: 25%

Eagerly loading field data can fill memory quickly if it isn’t controlled, and these settings provide options for expiring and limiting it. In most real-time search applications, recent data receives by far the most traffic, so setting indices.fielddata.cache.expire to 1d ensures that recently indexed data can be sorted quickly while preventing field data from older indices from filling up memory. If the field data cache does run out of space, it evicts entries using a least-recently-used (LRU) policy.
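To check how much heap the field data cache is actually using, per-field statistics can be pulled from the nodes stats API; here is a sketch using the Timestamp field from the template above:

curl -XGET 'localhost:9200/_nodes/stats/indices/fielddata?fields=Timestamp&pretty'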

Doc Values

Although in-memory field data is the fastest way to sort, it can consume a lot of memory. For large enough volumes of data, maintaining an in-memory field data cache may significantly slow down the application or simply be unfeasible. Luckily, Elasticsearch provides an alternative to in-memory field data called “Doc Values”. Doc Values are a field data format, essentially the same as in-memory field data, except that the data is stored on disk instead of in memory. Enabling Doc Values means more memory for other operations and fewer garbage collections, at the cost of only slightly slower sorting (10–25% according to Elastic).

Doc Values are enabled by changing the field data format to “doc_values” in the mapping for each field that will be sorted on. For example:

"log" : {
            "properties" : {
                 "Timestamp" : {"type" : "date",
                              "fielddata": {
                                 "format": "doc_values"
                                 }
                             },
                 "URL" : {"type" : "string",
                            "index": "not_analyzed"},
                 "IP Address" : {"type": "ip"}
            }
        }

Once the field data format is set to “doc_values”, Elasticsearch will automatically store the field data on disk for each document when it is indexed. It will also automatically start using the on-disk field data for sorting.
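A quick way to confirm that sorting is no longer filling the heap is the cat fielddata API, which should report little or no in-memory field data once a field uses doc values; a sketch:

curl -XGET 'localhost:9200/_cat/fielddata?v&fields=Timestamp'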

Minimize Sorting

Another option that can potentially reduce the response time of sorted queries is to take advantage of your index schema. For example, if indices are partitioned by time period, such as by day or by hour, and results are sorted on the Timestamp field, there is no need to sort every hit. It is far more efficient to query the most recent index first and only move on to the next index if the first query returns fewer hits than the desired number of results. Applying this technique ensures that the minimum number of fields is sorted.

This sorting technique is not supported by Elasticsearch itself, so it must be implemented by the Elasticsearch client application. This can be done with a simple loop that takes the total hits of each query into account. Here is an example using the Elasticsearch Java API:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.AndFilterBuilder;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.sort.SortOrder;
import org.joda.time.DateTime;

public List<Map<String, Object>> SortedQuery(Client c) {
    int limit = 100;
    List<Map<String, Object>> results = new ArrayList<Map<String, Object>>();
    DateTime time = new DateTime(System.currentTimeMillis());
    String indexPrefix = "dns-test-";

    // Build a simple query that filters on the "URL" field
    AndFilterBuilder queryFilter = FilterBuilders.andFilter();
    queryFilter.add(FilterBuilders.termFilter("URL",
            "opendns.com/enterprise-security"));

    while (limit > 0) {
        // Calculate which index to query using the prefix and an hourly suffix
        String indexSuffix = time.toString("yyyy-MM-dd-HH");
        String index = indexPrefix + indexSuffix;

        // Execute the sorted, filtered query against this single index
        SearchResponse response = c.prepareSearch(index)
                .addSort("Timestamp", SortOrder.DESC)
                .setSize(limit)
                .setPostFilter(queryFilter)
                .execute()
                .actionGet();

        // Hits actually returned by this query (at most "limit" of them)
        SearchHit[] hits = response.getHits().getHits();

        // Add every returned hit to the result set
        for (SearchHit hit : hits) {
            results.add(hit.getSource());
        }

        if (hits.length < limit) {
            // This index did not fill the request: reduce the remaining limit
            // and move back one hour to follow the hourly index schema.
            // (A real application would also bound how far back to search.)
            limit = limit - hits.length;
            time = time.minusHours(1);
        } else {
            // This index alone satisfied the request
            return results;
        }
    }
    return results;
}

This function executes a sorted query against one index at a time, in reverse chronological order, starting with the index corresponding to the current time. Whenever an index does not yield enough hits to satisfy the request, the time is decremented by one hour to follow the hourly index schema and the remaining results are fetched from the next index back.
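For completeness, here is a minimal usage sketch against a local node using the Elasticsearch 1.x-era Java transport client. The host, port, and the LogSearcher class assumed to hold the SortedQuery method above are illustrative assumptions:

import java.util.List;
import java.util.Map;

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class SortedQueryExample {
    public static void main(String[] args) {
        // Connect to a local node over the transport protocol (default port 9300)
        Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));
        try {
            // LogSearcher is a hypothetical class containing the SortedQuery method above
            List<Map<String, Object>> logs = new LogSearcher().SortedQuery(client);
            System.out.println("Fetched " + logs.size() + " log entries");
        } finally {
            client.close();
        }
    }
}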

Conclusion

Elasticsearch has shown that it can be configured to search log data competently without wastefully scoring and sorting results. Additionally, by querying only the required indices and storing field data appropriately, Elasticsearch can return sorted results quickly and efficiently.

In Part 4 of this series, we will explore more advanced techniques and configuration in Elasticsearch that can take your cluster to the next level.

Continue to: Elasticsearch: You Know, For Logs [Part 4].

