Another one got caught today, it’s all over the papers. “Teenager Arrested in Computer Crime Scandal”, “Hacker Arrested after Bank Tampering”… Damn kids. They’re all alike.
You may have recognized the opening lines of this now legendary text. The Hacker’s Manifesto, first published in Phrack #7 in 1986, was written by “The Mentor” shortly after his arrest. It has since become part of hacker lore and remains a monument of cyber culture. Today, we would like to give it a new lease on life using OpenGraphiti, our data visualization engine.
In this article, we will present a way to do some text analysis by combining OpenGraphiti with NLTK, the Natural Language Toolkit.
First let’s have a quick look at what NLTK is and does :
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and an active discussion forum.
(Source : http://www.nltk.org)
In other words, NLTK is a text processor for human languages such as English, Spanish, French, Chinese … We will use it to parse our Hacker’s Manifesto and analyze the output with OpenGraphiti. This will shed new light, in a visual way, on the structure of the text and on how NLTK parses words and sentences. Obviously, this technique applies just as well to any other text; our use case here only serves as an example.
Parsing the text with NLTK
Now we will assume that you have NLTK installed on your machine and that you have a file called “manifesto.txt” containing the text to process. The relevant Python code to parse your text data looks like this :
import nltk

data = list()

with open("manifesto.txt", "r") as infile:
    text = infile.read()

print("Parsing sentences and tagging words with NLTK ...")

# Split the text into sentences, then tokenize and tag each one.
sentences = nltk.sent_tokenize(text)
for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    data.append(tagged)
    print(tagged)
Fairly straightforward indeed :
- Open the file, read it.
- Cut the text into sentences
- For each sentence, cut it into words
- Tag the words with an NLTK type
- Print the result
In order to illustrate NLTK’s mechanism, let’s just focus on this sentence :
Have you ever looked behind the eyes of the hacker ? Did you ever wonder what made him tick, what forces shaped him, what may have molded him ?
For that specific sentence, here is what NLTK would give us :
[('Have', 'NNP'), ('you', 'PRP'), ('ever', 'RB'), ('looked', 'VBN'), ('behind', 'IN'), ('the', 'DT'), ('eyes', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('hacker', 'NN'), ('?', '.')]
[('Did', 'NNP'), ('you', 'PRP'), ('ever', 'RB'), ('wonder', 'JJR'), ('what', 'WP'), ('made', 'VBN'), ('him', 'PRP'), ('tick', 'VBP'), (',', ','), ('what', 'WP'), ('forces', 'NNS'), ('shaped', 'VBD'), ('him', 'PRP'), (',', ','), ('what', 'WP'), ('may', 'MD'), ('have', 'VB'), ('molded', 'VBN'), ('him', 'PRP'), ('?', '.')]
As you can see, the whole text has been cut into sentences. Each sentence is represented by a list of words, and each word by a pair of elements: the first is the word itself, the second is its NLTK type (a part-of-speech tag from the Penn Treebank tagset). Now how do we know what they mean ? Well, NLTK provides a very simple way to read the documentation about those types. For example, let’s focus on VBN, the tag associated with the word “looked” in the first sentence.
$ python
>>> import nltk
>>> nltk.help.upenn_tagset('VBN')
VBN: verb, past participle
    multihulled dilapidated aerosolized chaired languished panelized used
    experimented flourished imitated reunifed factored condensed sheared
    unsettled primed dubbed desired ...
Fair enough! NLTK gives us a code to communicate the type/function of a word in a sentence. Now all we have to do is use SemanticNet to transform our tagged tokens into a nice graph.
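Before building the graph, it can help to get a feel for the tagged output by counting how often each tag appears. The snippet below is only a minimal sketch of ours, not part of the original script: it hard-codes the first tagged sentence shown earlier so that it runs without NLTK installed, and the names `tagged` and `tag_counts` are our own.

```python
from collections import Counter

# First tagged sentence from the output above, hard-coded for illustration
# so the sketch runs without NLTK.
tagged = [
    ("Have", "NNP"), ("you", "PRP"), ("ever", "RB"), ("looked", "VBN"),
    ("behind", "IN"), ("the", "DT"), ("eyes", "NNS"), ("of", "IN"),
    ("the", "DT"), ("hacker", "NN"), ("?", "."),
]

# Each item is a (word, tag) pair, so it unpacks directly in a loop.
tag_counts = Counter(tag for _word, tag in tagged)

# Print the tags from most to least frequent.
for tag, count in tag_counts.most_common():
    print(tag, count)
```

With the full `data` list produced by the parsing script, the same counter could be fed from every sentence to see which word types dominate the text.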
Build the graph
The process is fairly simple: we walk through every sentence and create a connected path following the succession of words in the sentence. If a word or edge has already been created, we don’t recreate it. Applied to the whole text, this gives us a graph in which words are connected when they appear next to each other in the text. Finally, we can type each word with its NLTK tag.
Example :
“NLTK is amazing. OpenGraphiti is great too.”
This will be parsed and tagged like this :
[(u'NLTK', 'NN'), (u'is', 'VBZ'), (u'amazing', 'VBG'), (u'.', '.')]
[(u'OpenGraphiti', 'NNP'), (u'is', 'VBZ'), (u'great', 'JJ'), (u'too', 'RB'), (u'.', '.')]
We can then create a graph defined as follows :
Nodes : NLTK, is, amazing, ., OpenGraphiti, great, too
Edges : NLTK --> is, is --> amazing, amazing --> ., OpenGraphiti --> is, is --> great, great --> too, too --> .
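To make the construction concrete, here is a minimal sketch in plain Python, without SemanticNet, that rebuilds exactly those nodes and edges from the two tagged sentences. The variable names (`tagged_sentences`, `nodes`, `edges`) are ours and only serve the illustration.

```python
# The two tagged sentences from the example, hard-coded for illustration.
tagged_sentences = [
    [("NLTK", "NN"), ("is", "VBZ"), ("amazing", "VBG"), (".", ".")],
    [("OpenGraphiti", "NNP"), ("is", "VBZ"), ("great", "JJ"),
     ("too", "RB"), (".", ".")],
]

nodes = []  # words, in order of first appearance
edges = []  # (source, target) pairs, in order of first appearance

for sentence in tagged_sentences:
    previous = None  # a path never crosses a sentence boundary
    for word, _tag in sentence:
        # Create the node only once.
        if word not in nodes:
            nodes.append(word)
        # Create the edge only once.
        if previous is not None and (previous, word) not in edges:
            edges.append((previous, word))
        previous = word

print("Nodes :", ", ".join(nodes))
print("Edges :", ", ".join(f"{src} --> {dst}" for src, dst in edges))
```

Lists keep the order of first appearance, which makes the output easy to compare with the listing; for a long text, sets or dictionaries would be a better fit for the membership checks.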
Let’s take a look at our creation algorithm using SemanticNet to perform that task.
import semanticnet as sn

graph = sn.DiGraph()

for sentence in data:
    previous = None
    for token in sentence:
        word = token[0].lower()
        tag = token[1]

        # Node ids are the lowercased words, so look them up the same way.
        if not graph.has_node(word):
            current = graph.add_node({"label": word, "type": tag}, word)
        else:
            current = word

        # Link the previous word to the current one, once per pair.
        if previous is not None:
            edges = graph.get_edges_between(previous, current)
            if not edges:
                graph.add_edge(previous, current)

        previous = current

graph.save_json("manifesto.json")
Notes
- We use the lower() method to convert every word to lowercase, so OpenGraphiti and opengraphiti are treated as the same node.
- We don’t distinguish between different types for the same word: we keep whichever appears first.
Going further
We could spice things up a little bit. For instance, we could count the number of occurrences of each edge. That would give us a Markov graph of all the word transitions. That can also be visualized.
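As a rough illustration of that idea, the sketch below counts how often each word transition occurs and then normalizes the counts per source word, which is what turns raw edge counts into Markov transition probabilities. It is an assumption-laden example of ours, not code from the original script: the sentences are hard-coded and already lowercased, and `edge_counts`, `outgoing_totals` and `transition_probability` are hypothetical names.

```python
from collections import Counter, defaultdict

# Hard-coded, already lowercased sentences (illustrative data only).
sentences = [
    ["nltk", "is", "amazing", "."],
    ["opengraphiti", "is", "great", "too", "."],
]

# Count every (source, target) transition between consecutive words.
edge_counts = Counter()
for words in sentences:
    for source, target in zip(words, words[1:]):
        edge_counts[(source, target)] += 1

# Total number of transitions leaving each source word.
outgoing_totals = defaultdict(int)
for (source, _target), count in edge_counts.items():
    outgoing_totals[source] += count

# Normalize: probability of going to `target` given we are on `source`.
transition_probability = {
    (source, target): count / outgoing_totals[source]
    for (source, target), count in edge_counts.items()
}

for (source, target), probability in transition_probability.items():
    print(f"{source} --> {target} : {probability:.2f}")
```

In a graph, the count or the probability could be stored as an attribute on each edge and used, for example, to drive edge width or color in a visualization.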
Another idea is to create a timeline to play the text over time and simulate it as if it were spoken in real time!
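One possible way to prepare such a timeline, sketched here under our own assumptions, is to record the step at which each edge first appears while reading the text word by word. This is not how OpenGraphiti itself stores or plays timelines; the `timeline` list and its `(step, source, target)` layout are hypothetical and only show the kind of data a playback would need.

```python
# Hard-coded, already lowercased sentences (illustrative data only).
sentences = [
    ["nltk", "is", "amazing", "."],
    ["opengraphiti", "is", "great", "too", "."],
]

timeline = []  # (step, source, target), ordered by first appearance
seen = set()   # edges already placed on the timeline
step = 0       # one step per word transition read from the text

for words in sentences:
    for source, target in zip(words, words[1:]):
        step += 1
        # Only the first occurrence of an edge creates a timeline event.
        if (source, target) not in seen:
            seen.add((source, target))
            timeline.append((step, source, target))

for event_step, source, target in timeline:
    print(f"t={event_step}: {source} --> {target}")
```

Replaying the events in order, with a delay between steps, would make the graph grow as the text is read.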
Visualization
Once our SemanticNet graph has been created, we can visualize it with the OpenGraphiti engine :
$ ./graphiti demo manifesto.json
We can now contemplate the result in three dimensions! We captured a video of the engine in action and we are happy to share it with you. This video first shows you the resulting graph created with the technique described above, and then plays the text over time.
We hope you enjoyed this tutorial. This use case highlights an interactive way to dig into the NLTK parser mechanism. It could very well be applied to other language parsers or text processors. It is only a simple use case to show how we can explore complex semantic data with a little bit of Python and the OpenGraphiti framework!
References
- OpenGraphiti project : http://www.opengraphiti.com
- The Hacker’s Manifesto : http://phrack.org/issues/7/3.html
- Markov Chain on Wikipedia : http://en.wikipedia.org/wiki/Markov_chain