Mining DNS data can give a security researcher valuable information in the hunt for malicious domains. For example, mining authoritative DNS logs helps establish a historical record of changes in domain-to-IP mappings. Furthermore, by examining DNS traffic at the recursive level, one can make rough estimates of which malicious sites are rising in popularity. However, despite its usefulness, DNS data does have limitations. Because it is solely network data, it does not give the researcher any information about the content hosted on a given domain. To augment our DNS data, we have started to incorporate the publicly available SSL and reverse DNS datasets from Rapid7’s Internet-wide scan. This blog post discusses the method we use to index the SSL datasets so that researchers can query them easily.

The SSL dataset consists of a SHA hash of an SSL cert, a Base64 encoding of the cert, the list of IPs associated with the hash, and the common names associated with the hash. The reverse DNS data consists of IPs and their DNS PTR records. Each of these datasets also comes with a timestamp for the date of the scan. We decided to use Apache HBase as our storage solution because its schema-less key-value design suited our data modeling requirements and because it integrated well with our existing data storage stack.

Apache HBase is a distributed key-value store meant for schema-less storage of data. By schema-less, we mean that it is less rigid than a SQL schema and can handle new nested columns on the fly. The design of an HBase table can be optimized by knowing which query patterns are most likely. We wanted to be able to make the following queries regarding SSL certificates: identify all of the SSL certificates associated with a country’s IP range; view all of the IPs associated with an SSL certificate for either a given date or a range of dates; and view the Base64 encoding of a certificate for a given hash. Luckily, HBase can handle all of these questions easily if we carefully construct our table and rowkey.

An HBase table consists of a rowkey, one or more column families, column qualifiers, and a data cell that contains the value. The ‘key’ in key-value is the rowkey, which serves as the primary key of the table. For the questions we want answered, we know that our rowkey will be either a hash value or an IP. However, the rowkey cannot consist solely of the hash or IP value, because we also want to search on a combination of IP or hash and a timestamp. Therefore, our rowkey also contains a timestamp value and takes the form [sha:timestamp], where the ‘:’ serves as a delimiter within the rowkey.
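
As a minimal sketch of the [sha:timestamp] rowkey, the snippet below builds a key and a start/stop pair for scanning one hash across a range of dates. The YYYYMMDD timestamp format and the helper names `make_rowkey` and `rowkey_range` are assumptions made for illustration; the post does not specify how the timestamp is encoded.

```python
from datetime import date

DELIMITER = ":"  # delimiter between the identifier and the timestamp


def make_rowkey(identifier: str, scan_date: date) -> bytes:
    """Build a rowkey of the form identifier:YYYYMMDD (date format assumed)."""
    return f"{identifier}{DELIMITER}{scan_date:%Y%m%d}".encode("ascii")


def rowkey_range(identifier: str, start: date, end: date):
    """Return (row_start, row_stop) covering start..end for one identifier."""
    row_start = make_rowkey(identifier, start)
    # HBase stop rows are exclusive, so append a zero byte to keep the
    # end date itself inside the scanned range.
    row_stop = make_rowkey(identifier, end) + b"\x00"
    return row_start, row_stop
```

Because HBase sorts rows lexicographically by rowkey, a fixed-width date keeps all rows for one hash adjacent and in chronological order, which is what makes the date-range query a single contiguous scan.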

[Figure: example of the HBase table schema]

The above figure is an example of the HBase table schema. The arrow indicates that the rowkey points to a set of column qualifiers, each of which stores its data in its name. The data in this case is the decimal representation of the IP address. Note that in this layout we do not use the data cell.
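
To make the "decimal representation of the IP address" concrete, here is a small sketch using Python's standard `ipaddress` module. The helper names are ours, and the sketch assumes IPv4 addresses only.

```python
import ipaddress


def ip_to_decimal(ip: str) -> int:
    """Convert a dotted-quad IPv4 address to its decimal (integer) form."""
    return int(ipaddress.IPv4Address(ip))


def decimal_to_ip(value: int) -> str:
    """Convert a decimal IPv4 value back to dotted-quad notation."""
    return str(ipaddress.IPv4Address(value))


print(ip_to_decimal("192.0.2.1"))   # 3221225985
print(decimal_to_ip(3221225985))    # 192.0.2.1
```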

Every table requires at least one column family at creation. Luckily for us, the column family in this case can be given the generic name ‘data’. Nested within the column family are the various column qualifiers. Given a (hash, timestamp) pairing as a rowkey and a specified column family, each column qualifier represents an IP address that was found to serve the certificate with that hash. This creates a wide schema, because each IP receives its own column qualifier in the hash’s row and stores a null value in the value cell. We chose this schema because it allows us to quickly retrieve the IPs associated with a hash and to filter over the column qualifiers.
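
The wide row described above can be sketched in plain Python as a rowkey plus a dictionary of `family:qualifier` columns with empty values, which is the shape HappyBase expects for a write. The function names, the YYYYMMDD date format, and the client-side filter are illustrative assumptions, not the exact code we run.

```python
import ipaddress
from datetime import date

COLUMN_FAMILY = b"data"  # the generic family name used in this post


def hash_row(sha: str, scan_date: date, ips):
    """Return (rowkey, columns) for one (hash, timestamp) pairing."""
    rowkey = f"{sha}:{scan_date:%Y%m%d}".encode("ascii")
    columns = {}
    for ip in ips:
        decimal_ip = int(ipaddress.IPv4Address(ip))
        # The IP lives in the qualifier name; the cell value stays empty.
        columns[COLUMN_FAMILY + b":" + str(decimal_ip).encode("ascii")] = b""
    return rowkey, columns


def ips_between(columns, low: str, high: str):
    """Filter a row's qualifiers down to the IPs within [low, high]."""
    lo = int(ipaddress.IPv4Address(low))
    hi = int(ipaddress.IPv4Address(high))
    found = []
    for column in columns:
        qualifier = column.split(b":", 1)[1]
        # Skip non-IP qualifiers such as 'cert'.
        if qualifier.isdigit() and lo <= int(qualifier) <= hi:
            found.append(str(ipaddress.IPv4Address(int(qualifier))))
    return sorted(found, key=ipaddress.IPv4Address)
```

Storing the IP in the qualifier rather than the value means one row read returns every IP for a hash on a given date, and the qualifier names alone are enough to filter on.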

In some cases we will want to use the data cell. For example, the Base64 encoding of the certificate is stored in the data cell belonging to a column qualifier named ‘cert’. An (IP, timestamp) pairing uses a similar table schema; the column qualifier in this case is the hash of the cert found on the requested IP. This HBase schema allows us to easily scan over IP ranges to retrieve SSL hashes, or to retrieve a given Base64-encoded certificate by its hash. We relied on the HappyBase Python library and Apache Pig to import the data into our HBase cluster. In upcoming posts we will discuss some results found from analyzing these datasets.
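
For readers who want to try the ‘cert’ cell and the IP-range scan described above, here is a rough sketch built on HappyBase's `Table.put` and `Table.scan`. It is not our production import code. The zero-padded decimal IP in the rowkey, the string timestamp, storing the cert on the [sha:timestamp] row, and all function names are assumptions made for illustration.

```python
import ipaddress


def ip_rowkey(ip: str, timestamp: str) -> bytes:
    """Zero-pad the decimal IP so rowkeys sort in numeric order (assumption)."""
    return f"{int(ipaddress.IPv4Address(ip)):010d}:{timestamp}".encode("ascii")


def store_cert(table, sha: str, timestamp: str, cert_b64: str) -> None:
    """Write the Base64 certificate into the 'cert' qualifier's data cell."""
    rowkey = f"{sha}:{timestamp}".encode("ascii")
    table.put(rowkey, {b"data:cert": cert_b64.encode("ascii")})


def fetch_cert(table, sha: str):
    """Return the Base64 certificate from the first row found for this hash."""
    prefix = f"{sha}:".encode("ascii")
    for _key, data in table.scan(row_prefix=prefix,
                                 columns=[b"data:cert"], limit=1):
        return data[b"data:cert"].decode("ascii")
    return None


def hashes_for_ip_range(table, first_ip: str, last_ip: str):
    """Collect the certificate hashes seen on any IP in [first_ip, last_ip]."""
    row_start = ip_rowkey(first_ip, "")  # "<ip>:" sorts before any timestamp
    # The stop row is exclusive, so use the IP just past the end of the range.
    next_ip = int(ipaddress.IPv4Address(last_ip)) + 1
    row_stop = f"{next_ip:010d}:".encode("ascii")
    hashes = set()
    for _key, data in table.scan(row_start=row_start, row_stop=row_stop):
        # In the IP-keyed table each qualifier name is a certificate hash.
        hashes.update(col.split(b":", 1)[1].decode("ascii") for col in data)
    return hashes
```

The `table` argument would come from something like `happybase.Connection(host).table(name)`, with the host and the table names being placeholders for your own cluster. Zero-padding matters because HBase compares rowkeys as bytes: without it, the key for a shorter decimal IP could sort after a numerically larger one and break the range scan.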
