The data platform team at OpenDNS is always looking at new technologies to improve our real-time search platform. Consequently, we have been keeping a close eye on Elasticsearch for quite a while and even use it for some internal tools and metrics.
OpenDNS is now looking at using Elasticsearch as a real-time search engine for our DNS log data. OpenDNS needs a powerful real-time logging and search platform for several reasons. First and foremost are our customers. Our customers need to be able to identify malicious activity on their networks as it is happening so they can respond promptly. Any time spent waiting for the data to come in is time that infections could be spreading or attacks could be gaining momentum. Similarly, we at OpenDNS use this data to monitor our own systems across several different metrics. If something goes wrong, we need to know right away so we can fix the problem before it propagates. Overall, getting data in real time means that the people monitoring it can react in real time. This reaction time could mean the difference between a minor headache and a catastrophic problem.
For Elasticsearch to solve this problem, it not only has to be real-time, but scalable and manageable as well. OpenDNS is growing quickly, so we need a system that can grow with us without introducing technical debt. Also, our engineers don’t like getting paged on Sunday at 3 a.m., so we need a system that can deal with failures automatically without missing a beat.
Part 1: Introduction and Setup
This blog post is the first in a series that will focus on Elasticsearch and how to optimize it for log data. For information on other Elasticsearch products, including their recommended real-time logging stack “ELK,” visit https://www.elastic.co/products.
Furthermore, this series will mainly show examples using Elasticsearch’s REST API, because it is simple and easy to use. With that said, Elasticsearch supports several client languages, listed here.
What is Elasticsearch?
Elasticsearch is a highly scalable search platform based on Apache Lucene. It is built from the ground up for the cloud and supports distributed indices and multitenancy. Since its release in 2010, Elasticsearch has gained many notable users and remains a very active project at Elastic under its creator Shay Banon.
Getting started with Elasticsearch
The Elasticsearch website has great documentation to walk users through installation.
Elasticsearch was designed to be distributed, so to demonstrate its full functionality it is important to set up a cluster of at least three nodes. Creating a three-node cluster should be as simple as running three Elasticsearch instances with the same cluster name. The “cluster.name” variable is found in the main Elasticsearch configuration file “elasticsearch.yml” in the “config” folder. By default a three-node cluster will include two data nodes along with one elected master node.
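As a minimal sketch, assuming a hypothetical cluster name of “dns-test-cluster,” the relevant lines in each node’s “config/elasticsearch.yml” might look like this:

cluster.name: dns-test-cluster
node.name: node-1

With the same “cluster.name” on all three machines (and a unique “node.name” per instance for readability), the nodes should discover each other and form a single cluster.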
Once Elasticsearch is installed and the cluster is connected, users need a way of visualizing their cluster. Elasticsearch supports a plugin called “Head” that does exactly this, plus a little extra.
To install Head, simply run this command from the ‘elasticsearch’ folder:
bin/plugin --install mobz/elasticsearch-head
Once installed, it can be accessed through
http://<hostname>:9200/_plugin/head/
Head gives an intuitive view of the indices and shards of an Elasticsearch cluster as well as the ability to easily browse and search through its documents. When Head is first opened, it should look something like the following:
This image shows a simple Elasticsearch cluster with three nodes: two data nodes, indicated by the black circles, and one master, indicated by the black star. On the right is an index named “dns-test-1.” Each green box represents a “shard,” which is a Lucene inverted index under the hood. By default an Elasticsearch index has five shards, each with one replica. Primary shards are shown with bold borders and replicas are shown with light borders.
Elasticsearch is designed to work straight out of the box; there is no need to worry about schemas or creating indices yet. As long as each document is given a type and an index name, Elasticsearch will index it.
Example:
curl -XPOST 'http://localhost:9200/logs/log' -d '{
"Timestamp" : "2009-11-15T14:12:12",
"URL" : "opendns.com",
"IP Address" : "127.0.0.1",
"log_id" : 1
}'
Sending a simple index request will automatically create an index called “logs” and index the given document. Elasticsearch will also automatically guess the data type of each field. Elasticsearch is pretty smart: if formatted properly, the “Timestamp” field will default to the “date” data type and the “log_id” field will default to a numeric type such as “long.” Trickier fields such as “IP Address,” however, will default to just a string, even though Elasticsearch does have a native “ip” data type. The documents themselves are indexed and stored in JSON format.
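To see exactly what Elasticsearch guessed, the dynamically generated mapping can be inspected with the Get Mapping API (the index and type names below match the example above); the same request is also a handy way to confirm any fields that get added to a mapping later on:

curl -XGET 'http://localhost:9200/logs/_mapping/log'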
Note that once the data type for a field is set, it cannot be changed. For example, if the following document:
curl -XPOST 'http://localhost:9200/logs/log' -d '{
"Timestamp" : "2009-11-15T14:12:12",
"URL" : "opendns.com/enterprise-security",
"IP Address" : "127.0.0.1",
"log_id" : "abcd"
}'
is indexed after the previous one, the following exception is thrown:
RemoteTransportException[[Reptyl][inet[/10.70.99.146:9301]][indices:data/write/index]]; nested: MapperParsingException[failed to parse [log_id]]; nested: NumberFormatException[For input string: "abcd"];
Mappings
After an initial cluster is set up, an important step is to create a type mapping for the documents being indexed. The data type and other settings of each field for a specific document type are all stored in a type mapping, which is configured using the Elasticsearch Put Mapping API. Although a default mapping will be created by Elasticsearch when a new document is indexed, the default data types are often too general. For example, any field containing only an integer will be defaulted to type “long.”
In this case, manually setting it to type “integer” might better represent the data and simultaneously save some storage space. Also, any “string” fields need to be set to “not_analyzed.” By default Elasticsearch will tokenize all string fields with an analyzer. This functionality mainly exists to support full-text search and document scoring. For log data in general, queries are exact matches only, so none of this is needed. Setting the “index” option to “not_analyzed” ensures Elasticsearch won’t waste time unnecessarily tokenizing every string field.
Before creating a mapping, any default mapping created by Elasticsearch should be deleted in order to avoid conflicts. This will also delete any documents that have been indexed into this default mapping, so be careful. Mappings are deleted with the following command:
curl -XDELETE 'http://localhost:9200/logs/log/_mapping'
Here is an example of a put mapping command that specifies a simple mapping for documents with type “log” belonging to the index “logs,” with four fields and their data types:
curl -XPUT 'http://localhost:9200/logs/_mapping/log' -d '
{
"log" : {
"properties" : {
"Timestamp" : {"type" : "date"},
"URL" : {"type" : "string", "index" : "not_analyzed"},
"IP Address" : {"type" : "ip"},
"log_id" : {"type" : "integer"}
}
}
}'
Once a mapping is set, it is important to note that indexing documents of the mapped type that contain fields not included in the mapping will add those fields to the mapping.
For example, suppose that, after applying the previous mapping, the following index request is submitted:
curl -XPOST 'http://localhost:9200/logs/log' -d '{
"Timestamp" : "2009-11-15T14:12:12",
"URL" : "opendns.com/enterprise-security",
"IP Address" : "127.0.0.1"
"log_id" : "2",
"user" : "John"
}'
The “user” field with type “string” would be added to the mapping automatically.
Document IDs
If unspecified, Elasticsearch will simply generate an ID for each document. This works fine in some cases, but often users need to be able to assign their own IDs.
In the simplest case, a document ID can be added to an index request itself, as in the following:
curl -XPUT 'http://localhost:9200/logs/log/37' -d '{
"Timestamp" : "2009-11-15T14:12:12",
"URL" : "opendns.com/enterprise-security",
"IP Address" : "127.0.0.1"
}'
Simply change the request method to PUT and tack the ID onto the end of the URL.
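To confirm the document was stored under that ID, it can be fetched back with a GET request to the same URL:

curl -XGET 'http://localhost:9200/logs/log/37'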
Alternatively, a field can be added to the mapping along with a specified path to pull the ID from the document itself.
The following mapping will tell Elasticsearch to use the “log_id” field as the document ID:
curl -XPUT 'http://localhost:9200/logs/_mapping/log' -d '
{
"log" : {
"_id" : {"path" : "log_id"},
"properties" : {
"Timestamp" : {"type" : "date"},
"URL" : {"type" : "string", "index" : "not_analyzed"},
"IP Address" : {"type" : "ip"},
"log_id" : {"type" : "integer"}
}
}
}'
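With this mapping in place, an index request like the one below (a hypothetical log entry with “log_id” 42) should be stored under document ID 42 and retrievable at “/logs/log/42”:

curl -XPOST 'http://localhost:9200/logs/log' -d '{
"Timestamp" : "2009-11-15T14:12:12",
"URL" : "opendns.com",
"IP Address" : "127.0.0.1",
"log_id" : 42
}'
curl -XGET 'http://localhost:9200/logs/log/42'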
Index Schema and Templates
With a large amount of data coming in every day, it is important to have a comprehensive way of partitioning the data within Elasticsearch. For log data, it is often intuitive to partition the data into indices based on a time interval, such as daily or hourly. Partitioning data in this way comes with several advantages. For one, data expiration becomes very easy.
Instead of relying on a TTL or other expiration methods, old indices can simply be deleted altogether. Another advantage comes when the data is queried. If a query is only looking for documents from a certain time period, it can be limited to fewer indices instead of having to query an entire cluster. This index schema is especially advantageous in the real-time search use case. Since the most recent index will likely be receiving the majority of the traffic, Elasticsearch will maintain a larger cache for this index, improving performance.
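As a rough illustration (the daily index names here are hypothetical), expiring an old day of data and restricting a search to only the most recent indices might look like the following:

curl -XDELETE 'http://localhost:9200/dnslog-2015-03-01'
curl -XGET 'http://localhost:9200/dnslog-2015-04-08,dnslog-2015-04-09/_search?q=URL:opendns.com'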
This process of creating indices, along with settings and mappings, can be automated in Elasticsearch by using an “Index Template.” The job of an index template is to automatically apply mappings and other settings to an index at the time it is created. A basic index template will contain: a mapping for each type to be indexed, the name or wildcard expression matching the indices to which the template should be applied, and the number of shards and replicas each index should contain. All you have to do is index a document with an index name that matches one of your templates and the index will be automatically created using the template (assuming the index doesn’t already exist).
For example, if indexing DNS logs by day, the index naming schema might look something like “dnslog-YYYY-MM-DD,” with each subsequent index name incremented by one day. It would be too much work to apply settings and mappings to each index individually. Instead, a template can be applied to every index matching this schema by first creating a template with a wildcard in the “template” field.
For example, an index template that would be applied to every “dnslog-YYYY-MM-DD” index would look something like:
curl -XPUT localhost:9200/_template/dns_template -d '
{
"template" : "dnslog-*",
"settings" : {
"number_of_shards" : 3,
"number_of_replicas": 1
},
"mappings" : {
"log" : {
"properties" : {
"Timestamp" : {"type" : "date"},
"URL" : {"type" : "string",
"index": "not_analyzed"},
"IP Address" : {"type": "ip"}
}
}
}
}
'
Once this template has been applied, creating a new index with mapping and settings already applied is as simple as sending an index request.
Example:
curl -XPOST 'http://localhost:9200/dnslog-2015-04-09/log' -d '{
"Timestamp" : "2015-04-09T14:12:12",
"URL" : "opendns.com/enterprise-security",
"IP Address" : "127.0.0.1"
}'
After applying the previous template, the command above will create the index “dnslog-2015-04-09” containing three shards, each with one replica, and with the “log” mapping already applied.
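To double-check that the template was picked up, the new index’s settings and mapping can be inspected directly; the responses should show three shards, one replica, and the “log” mapping from the template:

curl -XGET 'http://localhost:9200/dnslog-2015-04-09/_settings'
curl -XGET 'http://localhost:9200/dnslog-2015-04-09/_mapping'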
Conclusion
In this first post of our series, Elasticsearch has shown that it is flexible enough to be set up for log data using properly configured Index Templates and Type Mappings. In the next post in our series, we will explore the scalability and availability of Elasticsearch.
For more information on Elasticsearch, check out their website at http://www.elastic.co.
Continue to: Elasticsearch: You Know, For Logs [Part 2].