skyler

Purpose

SemanticNet is a small Python library written to assist in the generation of data sets for the OpenGraphiti graph visualization tool. OpenGraphiti loads its graphs from a custom JSON representation, so a convenient way of generating that JSON is needed.

SemanticNet provides a standard, easy mechanism for generating these JSON graphs. The goal is to let users focus on the semantics of their graph rather than the mechanics, while still offering a way to perform more complex graph operations for those who need them.

That is, this is not Yet Another Graph Library, but rather a simplified interface for the creation of semantic graphs in JSON.

Description

SemanticNet is a Python module to assist in the creation of semantic graphs in a JSON format. Underlying SemanticNet is the networkx graph library, which can be accessed and used for more complex graph operations.

JSON representation

The graphs used by OpenGraphiti are represented as one might expect. Suppose you have a graph G = (V, E), where

V = {0, 1, 2} and
E = {(0, 1), (0, 2), (1, 2)}

Suppose further that:

  1. Vertex 0 has the attributes: {"type": "A", "id": 0}
  2. Vertex 1 has the attributes: {"type": "B", "id": 1}
  3. Vertex 2 has the attributes: {"type": "C", "id": 2}
  4. Edge (0, 1) has the attributes: {'src': 0, 'dst': 1, 'type': 'normal', 'id': 0}
  5. Edge (0, 2) has the attributes: {'src': 0, 'dst': 2, 'type': 'normal', 'id': 1}
  6. Edge (1, 2) has the attributes: {'src': 1, 'dst': 2, 'type': 'irregular', 'id': 2}

then in JSON format, it would look like:

{
 "timeline": [],
 "nodes": [
  {
   "type": "A",
   "id": 0
  },
  {
   "type": "B",
   "id": 1
  },
  {
   "type": "C",
   "id": 2
  }
 ],
 "meta": {},
 "edges": [
  {
   "src": 0,
   "dst": 1,
   "type": "normal",
   "id": 0
  },
  {
   "src": 0,
   "dst": 2,
   "type": "normal",
   "id": 1
  },
  {
   "src": 1,
   "dst": 2,
   "type": "irregular",
   "id": 2
  }
 ]
}

As you can see, there is a list of "node" objects, each of which contains the node’s attributes and ID, and a list of "edge" objects, each of which has the edge’s attributes along with the fields "src" and "dst", which indicate the source and destination vertices, respectively.
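
To make the layout concrete, here is a minimal sketch that builds the same document by hand with Python's standard json module. It does not use SemanticNet, and the helper name build_graph_json is made up for illustration.

```python
import json


def build_graph_json(nodes, edges):
    """Return OpenGraphiti-style JSON text for the given nodes and edges.

    Hypothetical helper for illustration; not part of SemanticNet.
    """
    document = {
        "timeline": [],
        "nodes": nodes,
        "meta": {},
        "edges": edges,
    }
    return json.dumps(document, indent=1)


# The example graph G = (V, E) described above.
nodes = [
    {"type": "A", "id": 0},
    {"type": "B", "id": 1},
    {"type": "C", "id": 2},
]
edges = [
    {"src": 0, "dst": 1, "type": "normal", "id": 0},
    {"src": 0, "dst": 2, "type": "normal", "id": 1},
    {"src": 1, "dst": 2, "type": "irregular", "id": 2},
]

text = build_graph_json(nodes, edges)
print(text)
```

Writing this by hand is manageable for three vertices, but keeping IDs and src/dst references consistent quickly becomes tedious, which is the bookkeeping SemanticNet is meant to take over.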

Node IDs

As previously stated, the goal of the project is to let the user focus on the semantics of their graph. Rather than forcing the user to specify each new node’s ID, the default behavior is to assign each new node a randomly generated UUID. To generate the graph in the above section, one would do something like this:

>>> import semanticnet as sn
>>> g = sn.Graph()
>>> a = g.add_node({'type': 'A'})
>>> b = g.add_node({'type': 'B'})
>>> c = g.add_node({'type': 'C'})
>>> c
UUID('a7faaba8-ec65-44d5-a911-86f3a087eeab')
>>> g.add_edge(a, b, {'type': 'normal'})
UUID('ad173385-4c76-474d-a75d-c861d5d72d9d')
>>> g.add_edge(a, c, {'type': 'normal'})
UUID('08833cd0-0d83-4e46-a3a1-3ac52e50c73e')
>>> g.add_edge(b, c, {'type': 'irregular'})
UUID('91518542-ea4a-4103-8b4c-a2f106c5aea5')
>>> g.save_json('output.json')

The contents of output.json would be similar to this, with different IDs:

{
 "timeline": [],
 "nodes": [
  {
   "type": "B",
   "id": "b7d3fbe83672412cadd6f0249455fb43"
  },
  {
   "type": "C",
   "id": "a7faaba8ec6544d5a91186f3a087eeab"
  },
  {
   "type": "A",
   "id": "8bec5b56d76c4282b3baa80454a9dee8"
  }
 ],
 "meta": {},
 "edges": [
  {
   "src": "b7d3fbe83672412cadd6f0249455fb43",
   "dst": "a7faaba8ec6544d5a91186f3a087eeab",
   "type": "irregular",
   "id": "91518542ea4a41038b4ca2f106c5aea5"
  },
  {
   "src": "b7d3fbe83672412cadd6f0249455fb43",
   "dst": "8bec5b56d76c4282b3baa80454a9dee8",
   "type": "normal",
   "id": "ad1733854c76474da75dc861d5d72d9d"
  },
  {
   "src": "a7faaba8ec6544d5a91186f3a087eeab",
   "dst": "8bec5b56d76c4282b3baa80454a9dee8",
   "type": "normal",
   "id": "08833cd00d834e46a3a13ac52e50c73e"
  }
 ]
}
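
Note that the IDs in the file are the 32-character hex form of the UUIDs printed in the session, without hyphens. The sketch below shows that correspondence using only Python's standard uuid module. That the library converts IDs this way when saving is an inference from the output above, not a statement about its internals.

```python
import uuid

# The ID of node c as printed in the interactive session.
node_id = uuid.UUID("a7faaba8-ec65-44d5-a911-86f3a087eeab")

# The .hex attribute drops the hyphens, matching the "id" field in the JSON.
hex_id = node_id.hex
print(hex_id)  # a7faaba8ec6544d5a91186f3a087eeab

# The hex string parses back to the same UUID.
round_trip = uuid.UUID(hex_id)
print(round_trip == node_id)  # True

# A fresh random ID, as in the default behavior described above.
random_id = uuid.uuid4()
print(len(random_id.hex))  # 32
```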

However, it is sometimes advantageous to use non-random IDs. Because the underlying graph is a networkx graph, an ID can be any hashable type. To specify the ID of a new node or edge, supply the optional id_ argument:

>>> a = g.add_node({'type': 'A'}, 'a')
>>> a
'a'
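
"Hashable" means the value can be used as a dictionary key. As a rough illustration in plain Python (not SemanticNet), the hypothetical helper is_hashable below checks a few candidate IDs. Strings, integers, and tuples of hashable values qualify, while lists do not.

```python
def is_hashable(value):
    """Return True if value could serve as a dictionary key."""
    try:
        hash(value)
    except TypeError:
        return False
    return True


candidate_ids = ["a", 42, ("rack-3", "host-7"), ["not", "hashable"]]
results = [is_hashable(candidate) for candidate in candidate_ids]
print(results)  # [True, True, True, False]
```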

Quick Example: File system

Here is a quick demonstration of how to generate a rooted tree from a directory of your filesystem:

#!/usr/bin/env python

import os
import sys

import semanticnet as sn

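def add_node(graph, root, label, node_type):
    """Add the entry `label` found in directory `root`, plus an edge from `root` to it."""
    data = {}

    # Tag the edge as a link when the parent directory is itself a symbolic link.
    if os.path.islink(root):
        data['type'] = 'link'

    # The full path doubles as the node ID.
    path = os.path.join(root, label)

    if not graph.has_node(path):
        graph.add_node({"type": node_type, "label": label}, path)

    graph.add_edge(root, path, data)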

if __name__ == "__main__":
    if len(sys.argv) < 2:  # sys.argv[0] is the script name itself
        print("Need a starting dir")
        sys.exit(-1)

    start = sys.argv[1]
    graph = sn.Graph()

    for root, dirs, files in os.walk(start, followlinks=True):
        print(root)

        if not graph.has_node(root):
            graph.add_node({'label': root, 'type': 'dir', 'depth': 0}, root)

        for d in dirs:
            add_node(graph, root, d, "dir")
        for f in files:
            add_node(graph, root, f, os.path.splitext(f)[1])

    graph.save_json("fs.json")

Here, we use os.walk() from Python’s standard library to traverse the given path. Directories are added with the type attribute set to “dir”, and files with the type attribute set to their file extension. This way, OpenGraphiti will be able to color the nodes and edges by type (directory, symbolic link, type of file, etc.).
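
It helps to know exactly what os.path.splitext() returns, since that value becomes the node's type. The short sketch below uses only the standard library. The helper name file_type is hypothetical and simply wraps the expression from the script.

```python
import os


def file_type(name):
    """Return the value the script above would use as a file's "type"."""
    return os.path.splitext(name)[1]


for name in ["notes.txt", "archive.tar.gz", "Makefile", ".bashrc"]:
    print(repr(name), "->", repr(file_type(name)))

# 'notes.txt' -> '.txt'
# 'archive.tar.gz' -> '.gz'
# 'Makefile' -> ''
# '.bashrc' -> ''
```

The extension keeps its leading dot, only the last suffix counts, and files without an extension (including dotfiles) all share the empty string as their type.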

This example is included in the examples folder of the source code, with several other examples, all of which are documented in the README.

Caching

Should you come across a use case where you’d like quick references to nodes or edges by more than just the ID, semanticnet provides a mechanism to cache nodes and edges by any of their attributes. For example, suppose you make the following graph:

>>> g = sn.Graph()
>>> a = g.add_node({"type": "server"}, '3caaa8c09148493dbdf02c574b95526c')
>>> b = g.add_node({"type": "server"}, '2cdfebf3bf9547f19f0412ccdfbe03b7')
>>> c = g.add_node({"type": "client"}, '3cd197c2cf5e42dc9ccd0c2adcaf4bc2')
>>> g.add_edge(a, b, {"method": "GET", "port": 80}, '5f5f44ec7c0144e29c5b7d513f92d9ab')
UUID('5f5f44ec-7c01-44e2-9c5b-7d513f92d9ab')
>>> g.add_edge(a, c, {"method": "GET", "port": 80}, '7eb91be54d3746b89a61a282bcc207bb')
UUID('7eb91be5-4d37-46b8-9a61-a282bcc207bb')
>>> g.add_edge(b, c, {"method": "POST", "port": 443}, 'c172a3599b7d4ef3bbb688277276b763')
UUID('c172a359-9b7d-4ef3-bbb6-88277276b763')

Suppose further that you want to access the nodes by their "type" attribute. You can tell semanticnet to cache the nodes by the "type" attribute, and access them like so:

>>> g.cache_nodes_by("type")
>>> g.get_nodes_by_attr("type")
{'client': [{'type': 'client', 'id': UUID('3cd197c2-cf5e-42dc-9ccd-0c2adcaf4bc2')}], 'server': [{'type': 'server', 'id': UUID('2cdfebf3-bf95-47f1-9f04-12ccdfbe03b7')}, {'type': 'server', 'id': UUID('3caaa8c0-9148-493d-bdf0-2c574b95526c')}]}

Similarly, you could get all connections grouped by port:

>>> g.cache_edges_by("port")
>>> g.get_edges_by_attr("port")
{80: [{'port': 80, 'src': UUID('3caaa8c0-9148-493d-bdf0-2c574b95526c'), 'dst': UUID('3cd197c2-cf5e-42dc-9ccd-0c2adcaf4bc2'), 'id': UUID('7eb91be5-4d37-46b8-9a61-a282bcc207bb'), 'method': 'GET'}, {'port': 80, 'src': UUID('3caaa8c0-9148-493d-bdf0-2c574b95526c'), 'dst': UUID('2cdfebf3-bf95-47f1-9f04-12ccdfbe03b7'), 'id': UUID('5f5f44ec-7c01-44e2-9c5b-7d513f92d9ab'), 'method': 'GET'}], 443: [{'port': 443, 'src': UUID('2cdfebf3-bf95-47f1-9f04-12ccdfbe03b7'), 'dst': UUID('3cd197c2-cf5e-42dc-9ccd-0c2adcaf4bc2'), 'id': UUID('c172a359-9b7d-4ef3-bbb6-88277276b763'), 'method': 'POST'}]}

You can also specify the attribute value to return only the connections on, say, port 80:

>>> g.get_edges_by_attr("port", 80)
[{'port': 80, 'src': UUID('3caaa8c0-9148-493d-bdf0-2c574b95526c'), 'dst': UUID('3cd197c2-cf5e-42dc-9ccd-0c2adcaf4bc2'), 'id': UUID('7eb91be5-4d37-46b8-9a61-a282bcc207bb'), 'method': 'GET'}, {'port': 80, 'src': UUID('3caaa8c0-9148-493d-bdf0-2c574b95526c'), 'dst': UUID('2cdfebf3-bf95-47f1-9f04-12ccdfbe03b7'), 'id': UUID('5f5f44ec-7c01-44e2-9c5b-7d513f92d9ab'), 'method': 'GET'}]

The cache is managed automatically. Any time you add or remove a node/edge with an attribute that you are caching, or modify an attribute of a node/edge, semanticnet updates the cache.
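
To give a feel for what such a cache involves, here is a toy attribute index in plain Python. It is only a sketch of the general idea under our own assumptions. The class AttrIndex and its methods are hypothetical and are not SemanticNet's implementation or API.

```python
from collections import defaultdict


class AttrIndex:
    """Toy index of items grouped by the value of one attribute.

    Illustration only; this is not how SemanticNet is implemented.
    """

    def __init__(self, attr):
        self.attr = attr
        self.by_value = defaultdict(list)

    def add(self, item):
        # Items lacking the attribute are simply not indexed.
        if self.attr in item:
            self.by_value[item[self.attr]].append(item)

    def remove(self, item):
        if self.attr not in item:
            return
        value = item[self.attr]
        bucket = self.by_value.get(value, [])
        if item in bucket:
            bucket.remove(item)
        # Drop empty buckets so stale values do not linger in the index.
        if not bucket:
            self.by_value.pop(value, None)

    def update(self, item, new_value):
        # Re-file the item under its new attribute value.
        self.remove(item)
        item[self.attr] = new_value
        self.add(item)

    def get(self, value=None):
        if value is None:
            return dict(self.by_value)
        return list(self.by_value.get(value, []))


index = AttrIndex("port")
get_ab = {"id": "e1", "method": "GET", "port": 80}
get_ac = {"id": "e2", "method": "GET", "port": 80}
post_bc = {"id": "e3", "method": "POST", "port": 443}
for edge in (get_ab, get_ac, post_bc):
    index.add(edge)

print(sorted(index.get()))  # [80, 443]
print(len(index.get(80)))   # 2

# Changing an indexed attribute moves the edge to a different bucket.
index.update(get_ac, 8080)
print(len(index.get(80)))   # 1
print(sorted(index.get()))  # [80, 443, 8080]
```

The point of the sketch is the bookkeeping: every add, remove, or attribute change has to touch the index, which is the work the library does for you.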

Closing Remarks

It is our hope that through the use of OpenGraphiti and SemanticNet, users will be able to easily visualize any semantic network they can think of. Both projects are open source, and their source code can be found on GitHub.

The SemanticNet package is available now.
