Part 3: Searching and Sorting
In Part 2 of this series, Elasticsearch proved that it could be scaled up with relative ease by assigning different instance roles to balance the workload. It also showed that, with proper setup and configuration, it can handle several different failure cases while maintaining high availability.
This post will focus on how to optimize searching and sorting in Elasticsearch for log data.
Log Search
By default, Elasticsearch expects to be querying full-text documents, not log data. As a result, it scores and sorts each hit by a relevancy factor, since a single document could contain more than one occurrence of the search term. This behavior can be avoided when searching log data by using a "filtered" query.
For example, the following query uses a filter that doesn’t score its hits:
curl -XGET localhost:9200/_search -d '
{
  "size": 10,
  "query": {
    "filtered": {
      "filter": {
        "term": { "URL": "opendns.com/enterprise-security" }
      }
    }
  }
}'
This query will return the first 10 documents matching the filter without having to score and sort each document.
Sorted Queries
Logs often need to be returned in a sorted order. Since documents cannot be stored pre-sorted in Elasticsearch, sorting becomes a fairly challenging problem that can greatly slow down queries. Sorting is achieved in Elasticsearch by adding a “sort” clause to a query. Hits of a sorted query will not be scored.
Example:
curl -XGET localhost:9200/_search -d '
{
  "sort": [
    { "IP Address": { "order": "asc" } },
    { "Timestamp": { "order": "desc" } }
  ],
  "query": {
    "filtered": {
      "filter": {
        "term": { "URL": "opendns.com/enterprise-security" }
      }
    }
  }
}'
The query above will return all logs with the URL field "opendns.com/enterprise-security," sorted in ascending order by IP address; logs with matching IP address fields are then sorted in descending order by their Timestamp field. Elasticsearch will not score any of the matching documents, since explicit sort fields are given.
Without proper configuration, a sorted query can massively increase response time, because every document that matches the query must be sorted, no matter the size of the return set. Even if only the top two documents are requested, Elasticsearch must sort every matching document to determine which two those are. In other words, a query sorts the same number of values regardless of how many documents it actually returns.
The default sorting process in Elasticsearch is straightforward. Asked to return the top n hits of a sorted query, it does the following:
- First, each document that matches the query is identified as a "hit."
- For every hit, each shard loads the values of the sort field into memory.
- Each shard sorts its hits and returns its top n.
- The top hits from each shard are aggregated and re-sorted to ensure accurate results for each index.
- The top hits from each index are aggregated and re-sorted to ensure accurate final results.
- Finally, the top n hits from the cluster are returned.
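The merge at the heart of the steps above can be sketched in plain Java. This is an illustration, not Elasticsearch internals; the shard data and the helper name are made up. Each shard contributes its own already-sorted top-n list, and the coordinating step aggregates them, re-sorts, and keeps only the global top n:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TopNMerge {
    // Merge several already-sorted (descending) per-shard top-n lists
    // and keep only the global top n, as the aggregation step does.
    static List<Long> mergeTopN(List<List<Long>> shardResults, int n) {
        List<Long> merged = new ArrayList<Long>();
        for (List<Long> shard : shardResults) {
            merged.addAll(shard);
        }
        // Re-sort the aggregated hits (descending, e.g. newest timestamp first)
        Collections.sort(merged, Collections.reverseOrder());
        return merged.subList(0, Math.min(n, merged.size()));
    }

    public static void main(String[] args) {
        // Each shard has already sorted its own hits and returned its top 3
        List<List<Long>> shards = new ArrayList<List<Long>>();
        shards.add(List.of(90L, 40L, 10L)); // shard 0
        shards.add(List.of(95L, 80L, 5L));  // shard 1
        shards.add(List.of(85L, 60L, 50L)); // shard 2
        System.out.println(mergeTopN(shards, 3)); // prints [95, 90, 85]
    }
}
```

Note that each shard must return n hits, not n/shards, because in the worst case all of the global top n live on a single shard.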
This process is slow and unavoidable, but Elasticsearch has some options to speed it up.
Field Data
Elasticsearch stores the values of the fields being sorted on in its field data cache. The problem is that pulling field data into the cache at query time is a heavy operation, since all of that data must be read from disk. This cost can be avoided by loading the cache ahead of time.
Fortunately, Elasticsearch allows field data to be loaded into the cache at indexing time by "eagerly" loading it. For example, the following template will eagerly load Timestamp field data for all documents of type "log" in "dnslog-*" indices:
curl -XPUT localhost:9200/_template/dnslog_template -d '
{
  "template": "dnslog-*",
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "log": {
      "properties": {
        "Timestamp": { "type": "date", "fielddata": { "loading": "eager" } },
        "URL": { "type": "string", "index": "not_analyzed" },
        "IP Address": { "type": "ip" }
      }
    }
  }
}'
The size of the field data cache is configured by two variables in the elasticsearch.yml file:
indices.fielddata.cache.expire: 1d
indices.fielddata.cache.size: 25%
Eagerly loading field data can fill memory quickly if it isn't controlled, and these settings give you options for expiring and limiting it. In most real-time search applications, recent data receives by far the most traffic, so setting indices.fielddata.cache.expire to 1d ensures recently indexed data can be sorted quickly while preventing field data from older indices from filling up memory. If the field data cache does run out of room, it evicts entries using a least recently used (LRU) policy.
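The LRU eviction behavior can be illustrated with a minimal Java sketch; the capacity and index names here are invented for the example, and this is not Elasticsearch's actual cache implementation. A LinkedHashMap in access order drops the least recently used entry once the size limit is exceeded:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FieldDataLru {
    // A toy LRU cache: once maxEntries is exceeded, the least recently
    // used entry is evicted, mirroring the field data cache's policy.
    static <K, V> Map<K, V> lruCache(final int maxEntries) {
        return new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public static void main(String[] args) {
        Map<String, long[]> cache = lruCache(2);
        cache.put("dnslog-2015-01-01", new long[]{1L, 2L});
        cache.put("dnslog-2015-01-02", new long[]{3L, 4L});
        cache.get("dnslog-2015-01-01");                 // touch: now most recently used
        cache.put("dnslog-2015-01-03", new long[]{5L}); // evicts the least recently used
        System.out.println(cache.keySet()); // prints [dnslog-2015-01-01, dnslog-2015-01-03]
    }
}
```

The key point for sizing the cache is the same as in Elasticsearch: hot, recently touched field data stays resident, while cold data is the first to go.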
Doc Values
Although in-memory field data is the fastest way to sort, it can consume a lot of memory. For large enough volumes of data, maintaining an in-memory field data cache may significantly slow down the application or simply be unfeasible. Luckily, Elasticsearch provides an alternative called "Doc Values": a field data format essentially the same as in-memory field data, except that it is stored on disk instead of in memory. Enabling Doc Values frees memory for other operations and reduces garbage collection, at the cost of only a small slowdown in sorting (10-25% according to Elastic).
Doc Values is enabled by changing the field data format to “doc_values” in the mapping for the field(s) that will be sorted on. For example:
"log": {
  "properties": {
    "Timestamp": { "type": "date", "fielddata": { "format": "doc_values" } },
    "URL": { "type": "string", "index": "not_analyzed" },
    "IP Address": { "type": "ip" }
  }
}
Once the field data format is set to “doc_values”, Elasticsearch will automatically store the field data on disk for each document when it is indexed. It will also automatically start using the on-disk field data for sorting.
Minimize Sorting
Another way to reduce the response time of sorted queries is to take advantage of your index schema. For example, if indices are partitioned by time period, such as by day or by hour, and results are sorted on the Timestamp field, there is no need to sort every hit. It is far more efficient to query the most recent index first, moving on to the next index only if the first query returns fewer hits than the desired number of results. Applying this technique ensures the minimum number of values are sorted.
This sorting technique is not supported by Elasticsearch itself, so it must be implemented by the client application using a simple loop that takes the total hits of each query into account. Here is an example using the Elasticsearch Java API:
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.AndFilterBuilder;
import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.sort.SortOrder;
import org.joda.time.DateTime;

public List<Map<String, Object>> sortedQuery(Client client) {
    int limit = 100;
    List<Map<String, Object>> results = new ArrayList<Map<String, Object>>();
    DateTime time = new DateTime(System.currentTimeMillis());
    String indexPrefix = "dns-test-";

    // Build a simple query that filters on the "URL" field
    AndFilterBuilder queryFilter = FilterBuilders.andFilter();
    queryFilter.add(FilterBuilders.termFilter("URL", "opendns.com/enterprise-security"));

    // Bound the walk backwards through time so the loop terminates
    // even when fewer than `limit` total hits exist
    int remainingIndices = 24;

    while (limit > 0 && remainingIndices-- > 0) {
        // Calculate which index to query using the prefix and an hourly suffix
        String index = indexPrefix + time.toString("yyyy-MM-dd-HH");

        // Execute the sorted, filtered query against this single index
        SearchResponse response = client.prepareSearch(index)
                .addSort("Timestamp", SortOrder.DESC)
                .setSize(limit)
                .setPostFilter(queryFilter)
                .execute()
                .actionGet();

        // Add every returned hit to the result set; the response holds
        // at most `limit` hits even if more documents matched
        SearchHit[] hits = response.getHits().getHits();
        for (SearchHit hit : hits) {
            results.add(hit.getSource());
        }

        // If this index alone satisfied the limit, return immediately
        if (hits.length >= limit) {
            return results;
        }

        // Otherwise, update the limit since we have already found some hits,
        // and step back one hour to the next index
        limit -= hits.length;
        time = time.minusHours(1);
    }
    return results;
}
This function executes a sorted query on one index at a time in reverse chronological order, starting with the index corresponding to the current hour. At the end of each loop iteration, time is decremented by one hour to follow an hourly index schema.
Conclusion
Elasticsearch has shown that it can be configured to search log data competently, without wastefully scoring and sorting results. Additionally, by querying only the required indices and storing field data properly, Elasticsearch can answer sorted queries quickly and efficiently.
In Part 4 of this series, we will explore more advanced techniques and configuration in Elasticsearch that can take your cluster to the next level.
Continue to: Elasticsearch: You Know, For Logs [Part 4].