In my last blog post, we presented a method to expand and draw generic graphs in three dimensions. As many people know, a graph can be used to represent a wide range of problems or data structures. This time, we will focus on the visualization of a specific case: a Random Forest.

What are Random Forests?

Technically speaking, “Random Forests” aren’t exactly a data structure but a machine learning method for building an ensemble of decision trees. So before we can actually answer this question, let’s give a short overview of decision trees. Consider the following example:

[Image: a decision tree for classifying animals]

This model helps us classify various types of animals given a set of criteria. In this case, the criteria are fur, feathers, scales and gills. We can enter any animal as input and determine that animal’s family, even if we have never seen the animal before.
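To make this concrete, here is a minimal sketch of such a tree as plain Python. The criteria and families come from the example above, but the exact branching is illustrative rather than a faithful taxonomy:

def classify(animal):
    # Walk a tiny hand-built decision tree: each question tests one
    # criterion (fur, feathers, scales, gills) and narrows the family.
    if animal["fur"]:
        return "mammal"
    if animal["feathers"]:
        return "bird"
    if animal["scales"]:
        return "fish" if animal["gills"] else "reptile"
    return "amphibian"

# An animal we have never encountered before still lands in a family.
creature = {"fur": True, "feathers": False, "scales": False, "gills": False}
print(classify(creature))  # mammal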

Of course, animal classification is by now pretty much static and, in the absence of hybrid or newly emerged species, it’s safe to say that it will rarely change. Famous biologists and naturalists like Charles Darwin spent decades building this model by observation, so it is now generally accepted in this state.

In modern classification problems like Big Data security, owning such a precise and accurate decision tree would be a real gold mine. In our case, it would mean 0% error in all automatic decisions and 100% accuracy on all the predictions we make on unknown incoming data. In practice, however, the perfect decision tree almost never exists, so we have to deal with error thresholds.

For a good analogy to DNS filtering, imagine being a naturalist trying to decide which animals are dangerous to humans while around 600,000 animals try to approach each second. They evolve extremely fast, they come from everywhere, some are harmless, some are ferocious, some are infected, some are infected without knowing it, and the most dangerous ones know your decision patterns and mutate to bypass the radar. So how can we build a decision tree for such dynamic wildlife?

It’s simple: we gather data, we classify it manually, and we propose criteria.

Entering the Random Forest

From the known data and criteria, the generation algorithm creates a list of decision trees that match our training set. So in fact we don’t use just one decision tree but several, and we combine all of their answers to reach a general decision.

In other words, it’s like taking pictures of all the animals you encounter, extracting key characteristics (claws, fangs, size, etc.), and building a correlation between those characteristics and which animals did or didn’t attack you. The next time you meet an unknown creature, you can follow your decision tree, and if it turns out to be incorrect, you remember it and rebuild a decision pattern from your new (unfortunate) experience.
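As a rough sketch of how this plays out in code, here is a forest trained on a toy version of the animal data. It uses scikit-learn purely for illustration; it is not necessarily the library behind our classifiers. Each tree casts a vote, and the forest takes the majority:

from collections import Counter
from sklearn.ensemble import RandomForestClassifier

# Toy training set: [fur, feathers, scales, gills] -> family
X = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
y = ["mammal", "bird", "fish", "reptile"]

forest = RandomForestClassifier(n_estimators=10).fit(X, y)

unknown = [[0, 0, 1, 1]]  # a creature with scales and gills
# Ask every tree individually, then let the forest aggregate the votes.
votes = Counter(forest.classes_[int(t.predict(unknown)[0])]
                for t in forest.estimators_)
print(votes)                    # e.g. Counter({'fish': 8, 'reptile': 2})
print(forest.predict(unknown))  # the forest's general decision: ['fish']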

Today we’re giving you a couple of pictures to look at. They show one of our own random forests, captured a couple of weeks ago. The white nodes are the leaves, the purple ones are the interior nodes. The labels of the nodes describe the criterion, the error and the samples used at each decision step.

[Screenshots: four views of one of our random forests in the 3D visualization tool]

For example: bibikun.ru

Bibikun.ru has been detected as a malicious domain by our classifiers; in fact, it is one of the fast flux domains we discovered. For more information on fast flux domains, check out this presentation recently given by Dhia on the topic.

Let’s take a look at a small part of our web interface for this particular domain:

[Image: feature scores for bibikun.ru in our web interface]

This is data that we gathered and built from our resolver logs. For example, the first scores (ASN score, prefix score, RIP score) give us a global reputation score for the domain’s ASNs, prefixes and IPs. They will be the key values used in our decision tree creation.

Tracing the path of the decision inside the tree can help us understand many details. Here is what our JSON API response looks like when we request the decision path for this particular domain:

{"label":"bad","confidence":0.5,"score":-100.0,"z":0.0,
"path":[
{"vid":0,"fid":24,"sp":-0.8061,"z":0.5,"fname":"rip_score"},
{"vid":1,"fid":24,"sp":-3.7866,"z":0.9204902892407564,"fname":"rip_score"},
{"vid":2,"fid":24,"sp":-9.149,"z":0.9815885700931116,"fname":"rip_score"},
{"vid":3,"fid":22,"sp":0.2215,"z":0.9930534371394087,"fname":"pagerank"},
{"vid":4,"fid":23,"sp":-4.1412,"z":0.9936817098976564,"fname":"prefix_score"},
{"vid":5,"fid":22,"sp":0.0016,"z":0.9978593207864578,"fname":"pagerank"},
{"vid":6,"fid":20,"sp":47.7773,"z":0.9985154708204116,"fname":"popularity"},
{"vid":7,"fid":20,"sp":16.6064,"z":0.9986403806934059,"fname":"popularity"},
{"vid":8,"fid":4,"sp":0.2502,"z":0.9989628214910697,"fname":"div_rips"},
{"vid":14,"fid":18,"sp":-0.1001,"z":0.999116889601953,"fname":"asn_score"},
{"vid":15,"fid":23,"sp":-5.9561,"z":0.9993391077955239,"fname":"prefix_score"},
{"vid":16,"fid":17,"sp":1.5833,"z":0.9994826997738812,"fname":"name_split_mean"},
{"vid":20,"fid":1,"sp":86317.5,"z":0.9995031547005305,"fname":"ttls_median"},
...
{"vid":31,"fid":8,"sp":0.75,"z":0.9999068294046399,"fname":"rips_stability"},
{"vid":32,"fid":86,"sp":0.5,"z":0.9996971073754354,"fname":"796834E7A2"},
{"vid":33,"fid":23,"sp":-25.3884,"z":0.9997703964487984,"fname":"prefix_score"}],
"name":"bibikun.ru"}

Indeed, no fewer than 25 levels are traversed in order to reach a proper response. As you can see, the score of this domain is -100, so for us there is no doubt about its maliciousness.

We can see that the first steps of the traversal are “rip_score,” “pagerank,” “prefix_score,” etc. As you can imagine, these are criteria based on our reputation scores. Since the tree creation heuristic includes some randomness, the logic behind the order of the sequence isn’t necessarily obvious, but the “vid” field (or vertex ID) gives us an interesting way to visualize it.

This field points to the current node in the decision tree; starting from there, it is fairly simple to trace a path in the 3D visualization tool and observe the result.

[Screenshot: the decision path highlighted in the 3D visualization tool]

Nice, isn’t it?

This big decision forest is one way to represent experience applied to a knowledge base. It is important for us to take a deep look at it and understand where the bottlenecks are. As the wildlife grows, we want to make sure fewer and fewer users get bitten every day.
