• Skip to primary navigation
  • Skip to main content
  • Skip to footer

Cisco Umbrella

Enterprise network security

  • Contact Sales
  • Login
    • Umbrella Login
    • Cloudlock Login
  • Why Us
    • Why Cisco Umbrella
      • Why Try Umbrella
      • Why DNS Security
      • Why Umbrella SASE
      • Our Customers
      • Customer Stories
      • Why Cisco Secure
    • Fast Reliable Cloud
      • Global Cloud Architecture
      • Cloud Network Status
      • Global Cloud Network Activity
    • Unmatched Intelligence
      • A New Approach to Cybersecurity
      • Interactive Intelligence
      • Cyber Attack Prevention
      • Umbrella and Cisco Talos Threat Intelligence
    • Extensive Integrations
      • IT Security Integrations
      • Hardware Integrations
      • Meraki Integration
      • Cisco Umbrella and SecureX
  • Products
    • Cisco Umbrella Products
      • Cisco Umbrella Cloud Security Service
      • Recursive DNS Services
      • Cisco Umbrella SIG
      • Umbrella Investigate
      • What’s New
    • Product Packages
      • Cisco Umbrella Package Comparison
      • – DNS Security Essentials Package
      • – DNS Security Advantage Package
      • – SIG Essentials Package
      • – SIG Advantage Package
      • Umbrella Support Packages
    • Functionality
      • DNS-Layer Security
      • Secure Web Gateway
      • Cloud Access Security Broker (CASB)
      • Cloud Data Loss Prevention (DLP)
      • Cloud-Delivered Firewall
      • Cloud Malware Protection
      • Remote Browser Isolation (RBI)
    • Man on a laptop with headphones on. He is attending a Cisco Umbrella Live Demo
  • Solutions
    • SASE & SSE Solutions
      • Cisco Umbrella SASE
      • Secure Access Service Edge (SASE)
      • What is SASE
      • What is Security Service Edge (SSE)
    • Functionality Solutions
      • Web Content Filtering
      • Secure Direct Internet Access
      • Shadow IT Discovery & App Blocking
      • Fast Incident Response
      • Unified Threat Management
      • Protect Mobile Users
      • Securing Remote and Roaming Users
    • Network Solutions
      • Guest Wi-Fi Security
      • SD-WAN Security
      • Off-Network Endpoint Security
    • Industry Solutions
      • Government and Public Sector Cybersecurity
      • Financial Services Security
      • Cybersecurity for Manufacturing
      • Higher Education Security
      • K-12 Schools Security
      • Healthcare, Retail and Hospitality Security
      • Enterprise Cloud Security
      • Small Business Cybersecurity
  • Resources
    • Content Library
      • Top Resources
      • Cybersecurity Webinars
      • Events
      • Research Reports
      • Case Studies
      • Videos
      • Datasheets
      • eBooks
      • Solution Briefs
    • International Documents
      • Deutsch/German
      • Español/Spanish
      • Français/French
      • Italiano/Italian
      • 日本語/Japanese
    • Security Definitions
      • What is Secure Access Service Edge (SASE)
      • What is Security Service Edge (SSE)
      • What is a Cloud Access Security Broker (CASB)
      • Cyber Threat Categories and Definitions
    • For Customers
      • Support
      • Customer Success Webinars
      • Cisco Umbrella Studio
  • Trends & Threats
    • Market Trends
      • Hybrid Workforce
      • Rise of Remote Workers
      • Secure Internet Gateway (SIG)
    • Security Threats
      • How to Stop Phishing Attacks
      • Malware Detection and Protection
      • Ransomware is on the Rise
      • Cryptomining Malware Protection
      • Cybersecurity Threat Landscape
      • Global Cyber Threat Intelligence
    •  
    • Woman connecting confidently to any device anywhere
  • Partners
    • Channel Partners
      • Partner Program
      • Become a Partner
    • Service Providers
      • Secure Connectivity
      • Managed Security for MSSPs
      • Managed IT for MSPs
    •  
    • Person looking down at laptop. They are connecting and working securely
  • Blog
    • News & Product Posts
      • Latest Posts
      • Products & Services
      • Customer Focus
      • Feature Spotlight
    • Cybersecurity Posts
      • Security
      • Threats
      • Cybersecurity Threat Spotlight
      • Research
    •  
    • Register for a webinar - with illustration of connecting securely to the cloud
  • Contact Us
  • Umbrella Login
  • Cloudlock Login
  • Free Trial
Research

Elasticsearch: You Know, For Logs [Part 4]

Author avatar of UmbrellaEngineeringUmbrellaEngineering
Updated — October 15, 2020 • 6 minute read
View blog >

Part 4: Advanced Options

Part 3 of this series explores searching and sorting log data in Elasticsearch and how to best configure Elasticsearch for these operations.
This post will focus on some other options in Elasticsearch for speeding up indexing and searching as well as saving on storage that didn’t have a place in any of the three previous posts. Some of these options come with trade offs and are not recommended in all cases.

Bulk Indexing

Indexing documents one at a time is inefficient, especially with small documents like log files. To speed up indexing, use the Elasticsearch Bulk API. The Bulk API allows documents to be indexed in batches instead of individually, greatly increasing indexing speed.
Bulk indexing using the REST API is fairly straight forward. Following this schema given by Elastic:

action_and_meta_datan
optional_sourcen
action_and_meta_datan
optional_sourcen
....

A Bulk request will look like:

curl -XPOST 'http://localhost:9200/_bulk' -d '{
{ "index" : { "_index" : "logs", "_type" : "log" } }
{ "Timestamp" : "2009-11-15T14:12:12", "URL" : "opendns.com", "IP Address" : "127.0.0.1", "log_id" : 1}
{ "index" : { "_index" : "logs", "_type" : "log" } }
{ "Timestamp" : "2009-11-15T14:12:13", "URL" : "opendns.com/enterprise-security", "IP Address" : "127.0.0.1", "log_id" : 2}
{ "index" : { "_index" : "logs", "_type" : "log" } }
{ "Timestamp" : "2009-11-15T14:12:13", "URL" : "opendns.com/about", "IP Address" : "127.0.0.1", "log_id" : 3}
}'

A couple important configurations need to be set when using the Bulk API. Firstly, bulk requests should be done in batches of a specific size in order to optimize throughput. A batch size of Two-thousand five hundred is a good first guess; Elasticsearch suggests anywhere from 1000 to 5000.
Secondly, the Bulk API can drop documents if the Bulk Queue Size is set too low. If this happens the following remote transport exception will be thrown:
RemoteTransportException[[<>][inet[/127.0.0.1:9300]][bulk/shard]]; nested: EsRejectedExecutionException[rejected execution (queue capacity 50) on org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1@1234];
The threadpool.bulk.queue_size must be increased until this error is no longer thrown. The default value is 50; Increasing it to 1000 in Elasticsearch’s configuration(elasticsearch.yml) should solve the problem for most clusters:
threadpool.bulk.queue_size: 1000

Routing

Elasticsearch allows the ability to ensure related documents are all stored on the same shard using the routing feature. If a routing key is given to a document, Elasticsearch will read the key and route the document to the shard that contains other documents with the same key.
Example:

curl -XPOST 'http://localhost:9200/logs/log?routing=com' -d '{
    "Timestamp" : "2009-11-15T14:12:12",
    "URL" : "opendns.com/about",
    "TLD" : "com",
    "IP Address" : "127.0.0.1"
}'

Indexing a document with the command above will ensure it resides in the same shard as all other documents with the “com” routing key.
The routing path can be derived from the document itself in the mapping, similar to how the ID path is specified, for example:

$ curl -XPUT 'http://localhost:9200/logs/_mapping/log' -d '
{
  "log" : {
          "_routing" : {"path" : "TLD"},
         "_id" : {"path" : "log_id"},
        "properties" : {
           "Timestamp" : {"type" : "date"},
                "URL" : {"type" : "string",
                                "index": "not_analyzed"},
                "TLD" : {"type" : "string",
                                "index": "not_analyzed"},
           "IP Address" : {"type": "ip"},
         "log_id" : {"type" : "string",
                     "index" : "not_analyzed"}
        }
    }
}'

This mapping will route log documents by their “TLD” field.

Routing Advantages

The advantages of routing documents comes when it is time to execute search requests. When searching for documents with the same routing key, Elasticsearch knows exactly which shard the documents reside in and thus only has to send the search request to said shard.
Without Routing:
Screen Shot 2015-05-19 at 2.37.54 PM
With Routing:
Screen Shot 2015-05-19 at 2.38.45 PM
Since Elasticsearch only has to query a single shard, query response time will decrease significantly. Another added benefit is the nodes that don’t contain the target shard will not have to process any search request at all, thus saving valuable CPU cycles.
The benefits of routing documents increase with the number of shards in a cluster. Small clusters with under 50 shards will likely only see a small speed increase. Though as a cluster grows and approaches 200+ shards, the benefits of routing become more apparent:
Screen Shot 2015-05-19 at 2.39.29 PM
To perform routed queries, simply include the routing key in the search request:

curl -XGET localhost:9200/_search?routing=com -d '
{
  "query": {
    "filtered": {
      "filter": {
        "term": { "URL": { "opendns.com" }}
      }
    }
  }
}
'

Routing Disadvantages

One of the benefits of letting Elasticsearch control the routing, is documents will be uniformly spread across all shards in an index. Once user-determined routing is introduced, this uniformity is lost since every document with the same routing key must exist in the same shard. This can cause certain shards to be far larger than others, and as a result can cause nodes to be “hot spotted” by common routing values, stressing the CPU and eating through storage.
For example, say we decided to route DNS logs by TLDs. This might work great for smaller TLDs such as “.gov” and “.edu.” But what happens if all “.com” and “.org” logs get routed to the same shard? The node containing said shard will receive far more traffic than the other nodes, just because it got unlucky and happened to catch two large routing values.
After introducing routing to our tests, we saw the following variations in CPU usage over our data nodes:
Screen Shot 2015-05-19 at 2.40.50 PM
This graph shows the high volatility in CPU usage that routing can cause when there is variability in the traffic flowing to different routing keys.
Furthermore, it is possible in some cases for routing to cause issues with storage. Say for example we have 2TB of “.com” DNS logs in an index but our instances only have 1TB of disk. All 2TB of data will attempt to be routed to a single instance, which will promptly run out of storage.
For further information on routing, Sematext gives a very informative presentation on scaling massive Elasticsearch clusters that talk in detail about routing and its advantages.

Optimize

Within each Elasticsearch shard, there are several segments that get periodically merged as the segment count grows. Having multiple segments in a shard means search requests must search each segment individually, then aggregate and return the results. Having to aggregate the results means that having multiple segments per shard will slow down queries. By optimizing an index, Elasticsearch is merging every possible segment into a single segment to maximize query performance.
Running the following command will optimize an index down to a single segment:

$ curl -XPOST 'http://localhost:9200/twitter/_optimize?max_num_segments=1'

Note that this is an operation that should only be performed on old indices. For example, when indexing logs by time period, it would be smart to optimize indices whose time frame has ended since they won’t be creating new indices. It might be pointless however to optimize the index responsible for the current time period since it is active and will be creating additional segments anyways.
Also, optimizing an index is a heavy duty operation that should only be performed when the cluster can handle it.
For more information on Elasticsearch segments and the optimize API visit this page.

‘_all’ Field

By default Elasticsearch stores an ‘_all’ field for each document, which includes the contents of every field in the document. The _all field is meant to be searchable in the case where a user doesn’t want to specify the field names when searching. In most cases this functionality is not needed for log data, so it should be disabled to free up extra storage:
By adding the highlighted line to a PUT mapping request, the _all field will be disabled for documents of type “log” in index “logs”:

$ curl -XPUT 'http://localhost:9200/logs/_mapping/log' -d '
{
    "log" : {
          "_all" : {"enabled": false},
        "properties" : {
          "Timestamp" : {"type" : "date"},
               "URL" : {"type" : "string",
                              "index": "not_analyzed"},
        "IP Address" : {"type": "ip"}
          }
     }
}
'

Disabling ‘_source’

By default Elasticsearch stores the source JSON for each document in the ‘_source’ field. When executing searches, Elasticsearch will simply return the _source field. The downside to storing the source is it adds a lot of extra storage overhead for each document. If storage is a concern, it is possible to disable the _source field and instead store each field value individually. This is done through the mapping API:

$ curl -XPUT 'http://localhost:9200/logs/_mapping/log' -d '
{
  "log" : {
          "_source" : {"enabled" : false},
         "_id" : {"path" : "log_id"},
       "properties" : {
         "Timestamp" : {"type" : "date",
                                 "store" : "yes"},
              "URL" : {"type" : "string",
                              "index" : "not_analyzed",
                             "store" : "yes"},
          "IP Address" : {"type" : "ip",
                                 "store" : "yes"},
        "log_id" : {"type" : "string",
                     "index" : "not_analyzed"}
        }
    }
}'

Notice that for each field you might want to be able to retrieve must have the {“store” : “yes”} clause.
Using this mapping will significantly reduce storage requirements, but it makes queries slightly more complex. Since Elasticsearch is no longer storing the source JSON, it will have to retrieve and aggregate the desired fields before returning. This can cause high Disk I/O and slower queries.
The following query will return the stored fields without needing a ‘_source’ JSON:

curl -XGET localhost:9200/_search -d '
{
“fields” : [“Timestamp”, “URL”, “IP Address”],
  "query": {
    "filtered": {
      "filter": {
        "term": { "URL": { "opendns.com" }}
      }
    }
  }
}
'

Conclusion

This post was meant to demonstrate that Elasticsearch has several advanced features and configurations that can be used to customize a cluster for many different applications. Whether a cluster is being constrained by storage space, search speeds or indexing rates, Elasticsearch can accomodate. That said, there is still a plethora of cool features in Elasticsearch that were left out of this series.
To learn more about Elasticsearch, Elastic provides a highly detailed guide.

Suggested Blogs

  • Cloud Application Security – Risks, Questions, Insights, and Solutions July 1, 2021 3 minute read
  • Cisco Umbrella discovers evolving, complex cyberthreats in first half of 2020 August 18, 2020 6 minute read
  • New research shows consumers want cybersecurity from service providers July 7, 2020 4 minute read

Share this blog

FacebookTweetLinkedIn

Follow Us

  • Twitter
  • Facebook
  • LinkedIn
  • YouTube

Footer Sections

What we make

  • Cloud Security Service
  • DNS-Layer Network Security
  • Secure Web Gateway
  • Security Packages

Who we are

  • Global Cloud Architecture
  • Cloud Network Status
  • Cloud Network Activity
  • OpenDNS is now Umbrella
  • Cisco Umbrella Blog

Learn more

  • Webinars
  • Careers
  • Support
  • Cisco Umbrella Live Demo
  • Contact Sales
Umbrella by Cisco
208.67.222.222+208.67.220.220
2620:119:35::35+2620:119:53::53
Sign up for a Free Trial
  • Cisco Online Privacy Statement
  • Terms of Service
  • Sitemap

© 2023 Cisco Umbrella