Part 4: Advanced Options
Part 3 of this series explored searching and sorting log data in Elasticsearch and how best to configure Elasticsearch for those operations.
This post focuses on a few other Elasticsearch options for speeding up indexing and searching, as well as saving on storage, that didn't have a place in the previous three posts. Some of these options come with trade-offs and are not recommended in all cases.
Bulk Indexing
Indexing documents one at a time is inefficient, especially with small documents like log files. To speed up indexing, use the Elasticsearch Bulk API. The Bulk API allows documents to be indexed in batches instead of individually, greatly increasing indexing speed.
Bulk indexing using the REST API is fairly straightforward. A bulk request follows this schema given by Elastic:
action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
....
A Bulk request will look like:
curl -XPOST 'http://localhost:9200/_bulk' -d '
{ "index" : { "_index" : "logs", "_type" : "log" } }
{ "Timestamp" : "2009-11-15T14:12:12", "URL" : "opendns.com", "IP Address" : "127.0.0.1", "log_id" : 1}
{ "index" : { "_index" : "logs", "_type" : "log" } }
{ "Timestamp" : "2009-11-15T14:12:13", "URL" : "opendns.com/enterprise-security", "IP Address" : "127.0.0.1", "log_id" : 2}
{ "index" : { "_index" : "logs", "_type" : "log" } }
{ "Timestamp" : "2009-11-15T14:12:13", "URL" : "opendns.com/about", "IP Address" : "127.0.0.1", "log_id" : 3}
'
A couple of important configurations need to be set when using the Bulk API. First, bulk requests should be sent in batches of a specific size in order to optimize throughput. A batch size of 2,500 documents is a good first guess; Elasticsearch suggests anywhere from 1,000 to 5,000.
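As a rough sketch of what batching can look like, suppose the action and source lines have already been written to a file (the name bulk_logs.ndjson here is just a placeholder). The file can be split into batches of 2,500 documents, which is 5,000 lines since each document needs an action line plus a source line, and each batch sent to the Bulk API:

# Split the pre-built bulk file into batches of 2,500 documents (5,000 lines each)
split -l 5000 bulk_logs.ndjson batch_

# Send each batch; --data-binary preserves the newlines the Bulk API requires
for f in batch_*; do
  curl -s -XPOST 'http://localhost:9200/_bulk' --data-binary "@$f"
done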
Second, the Bulk API can drop documents if the bulk queue size is set too low. If this happens, the following remote transport exception will be thrown:
RemoteTransportException[[<>][inet[/127.0.0.1:9300]][bulk/shard]]; nested: EsRejectedExecutionException[rejected execution (queue capacity 50) on org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1@1234];
The threadpool.bulk.queue_size setting must be increased until this error is no longer thrown. The default value is 50; increasing it to 1000 in Elasticsearch's configuration file (elasticsearch.yml) should solve the problem for most clusters:
threadpool.bulk.queue_size: 1000
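To keep an eye on whether bulk requests are piling up or being rejected, the cat thread pool API can be polled (the exact column names can vary between Elasticsearch versions):

# Show active, queued, and rejected counts for the bulk thread pool on each node
curl -XGET 'http://localhost:9200/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'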
Routing
Elasticsearch's routing feature makes it possible to ensure that related documents are all stored on the same shard. If a routing key is given to a document, Elasticsearch will read the key and route the document to the shard that contains other documents with the same key.
Example:
curl -XPOST 'http://localhost:9200/logs/log?routing=com' -d '
{
  "Timestamp" : "2009-11-15T14:12:12",
  "URL" : "opendns.com/about",
  "TLD" : "com",
  "IP Address" : "127.0.0.1"
}'
Indexing a document with the command above will ensure it resides in the same shard as all other documents with the “com” routing key.
The routing path can be derived from the document itself in the mapping, similar to how the ID path is specified, for example:
$ curl -XPUT 'http://localhost:9200/logs/_mapping/log' -d '
{
  "log" : {
    "_routing" : {"path" : "TLD"},
    "_id" : {"path" : "log_id"},
    "properties" : {
      "Timestamp" : {"type" : "date"},
      "URL" : {"type" : "string", "index" : "not_analyzed"},
      "TLD" : {"type" : "string", "index" : "not_analyzed"},
      "IP Address" : {"type" : "ip"},
      "log_id" : {"type" : "string", "index" : "not_analyzed"}
    }
  }
}'
This mapping will route log documents by their “TLD” field.
Routing Advantages
The advantages of routing documents come when it is time to execute search requests. When searching for documents that share a routing key, Elasticsearch knows exactly which shard those documents reside in and thus only has to send the search request to that shard.
(Diagrams: without routing, a search request is fanned out to every shard in the index; with routing, it is sent only to the single shard holding the routing key.)
Since Elasticsearch only has to query a single shard, query response time decreases significantly. Another added benefit is that the nodes that don't contain the target shard never have to process the search request at all, saving valuable CPU cycles.
The benefits of routing documents increase with the number of shards in a cluster. Small clusters with under 50 shards will likely see only a small speed increase, but as a cluster grows and approaches 200+ shards, the benefits of routing become more apparent.
To perform routed queries, simply include the routing key in the search request:
curl -XGET 'http://localhost:9200/_search?routing=com' -d '
{
  "query": {
    "filtered": {
      "filter": {
        "term": { "URL": "opendns.com" }
      }
    }
  }
}'
Routing Disadvantages
One of the benefits of letting Elasticsearch control the routing is that documents are spread uniformly across all shards in an index. Once user-defined routing is introduced, this uniformity is lost, since every document with the same routing key must live in the same shard. This can cause certain shards to grow far larger than others, and as a result nodes can be "hot spotted" by common routing values, stressing the CPU and eating through storage.
For example, say we decided to route DNS logs by TLDs. This might work great for smaller TLDs such as “.gov” and “.edu.” But what happens if all “.com” and “.org” logs get routed to the same shard? The node containing said shard will receive far more traffic than the other nodes, just because it got unlucky and happened to catch two large routing values.
After introducing routing to our tests, we saw high volatility in CPU usage across our data nodes, driven by the variability in the traffic flowing to different routing keys.
Furthermore, in some cases routing can cause issues with storage. Say, for example, we have 2TB of ".com" DNS logs in an index but our instances only have 1TB of disk each. Elasticsearch will attempt to route all 2TB of that data to a single instance, which will promptly run out of storage.
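One quick way to spot this kind of skew is the cat shards API, which lists each shard's document count, on-disk size, and the node it lives on (shown here against the "logs" index used throughout this post):

# List every shard of the "logs" index with its size and host node
curl -XGET 'http://localhost:9200/_cat/shards/logs?v'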
For further information on routing, Sematext gives a very informative presentation on scaling massive Elasticsearch clusters that talks in detail about routing and its advantages.
Optimize
Within each Elasticsearch shard there are several segments, which are periodically merged as the segment count grows. Having multiple segments in a shard means search requests must search each segment individually and then aggregate the results, which slows down queries. Optimizing an index merges every possible segment into a single segment to maximize query performance.
Running the following command will optimize an index down to a single segment:
$ curl -XPOST 'http://localhost:9200/logs/_optimize?max_num_segments=1'
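To verify the effect, the index segments API reports the segments within each shard; after optimizing with max_num_segments=1, every shard of the index should report a single segment:

# Report per-shard segment details for the "logs" index
curl -XGET 'http://localhost:9200/logs/_segments?pretty'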
Note that this operation should only be performed on old indices. For example, when indexing logs by time period, it is smart to optimize indices whose time frame has ended, since they will no longer receive new documents. It is usually pointless, however, to optimize the index responsible for the current time period, since it is still active and will keep creating new segments anyway.
Also, optimizing an index is a heavy-duty operation that should only be performed when the cluster can handle it.
For more information on Elasticsearch segments and the optimize API, see the Elasticsearch documentation.
‘_all’ Field
By default, Elasticsearch stores an '_all' field for each document, which combines the contents of every field in the document. The _all field exists so that users can search without having to specify field names. In most cases this functionality is not needed for log data, so it should be disabled to free up extra storage.
Adding the "_all" : {"enabled" : false} line to a PUT mapping request disables the _all field for documents of type "log" in the "logs" index:
$ curl -XPUT 'http://localhost:9200/logs/_mapping/log' -d '
{
  "log" : {
    "_all" : {"enabled" : false},
    "properties" : {
      "Timestamp" : {"type" : "date"},
      "URL" : {"type" : "string", "index" : "not_analyzed"},
      "IP Address" : {"type" : "ip"}
    }
  }
}'
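For context on what is being given up, a search that doesn't name a field, like the URI search below, is run against the _all field by default; once _all is disabled, this style of query will no longer match these documents:

# With no field specified, the query string is matched against the _all field
curl -XGET 'http://localhost:9200/logs/_search?q=opendns.com'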
Disabling ‘_source’
By default, Elasticsearch stores the source JSON for each document in the '_source' field, and when executing searches it simply returns that _source field. The downside to storing the source is that it adds a lot of extra storage overhead for each document. If storage is a concern, it is possible to disable the _source field and instead store each field value individually. This is done through the mapping API:
$ curl -XPUT 'http://localhost:9200/logs/_mapping/log' -d '
{
  "log" : {
    "_source" : {"enabled" : false},
    "_id" : {"path" : "log_id"},
    "properties" : {
      "Timestamp" : {"type" : "date", "store" : "yes"},
      "URL" : {"type" : "string", "index" : "not_analyzed", "store" : "yes"},
      "IP Address" : {"type" : "ip", "store" : "yes"},
      "log_id" : {"type" : "string", "index" : "not_analyzed"}
    }
  }
}'
Notice that every field you might want to retrieve must have the {"store" : "yes"} setting.
Using this mapping will significantly reduce storage requirements, but it makes queries slightly more complex. Since Elasticsearch no longer stores the source JSON, it has to retrieve and aggregate the desired stored fields before returning them, which can cause high disk I/O and slower queries.
The following query will return the stored fields without needing a ‘_source’ JSON:
curl -XGET 'http://localhost:9200/_search' -d '
{
  "fields" : ["Timestamp", "URL", "IP Address"],
  "query": {
    "filtered": {
      "filter": {
        "term": { "URL": "opendns.com" }
      }
    }
  }
}'
Conclusion
This post was meant to demonstrate that Elasticsearch has several advanced features and configurations that can be used to customize a cluster for many different applications. Whether a cluster is constrained by storage space, search speed, or indexing rate, Elasticsearch can accommodate it. That said, there is still a plethora of cool features in Elasticsearch that were left out of this series.
To learn more about Elasticsearch, Elastic provides a highly detailed guide.