A few hundred researchers from academia and industry gathered on Monday, July 1 for the 2nd annual Graphlab Workshop at the Nikko hotel in downtown San Francisco. The event was a great success in acting as a venue to discuss challenges and opportunities the emerging large scale graph analytics community currently faces. The Umbrella Security Labs team was present at the event, and in this blog we share with you our take-aways.

GraphLab

The first talk was about a product we have been excited about for quite some time called GraphLab.

GraphLab is a new framework for running graph-parallel algorithms, mainly developed at the Carnegie Mellon University over the past six years.

While the MapReduce programming model has proven to be excellent for data-parallel tasks on massive amounts of data, it’s not very suitable for graph algorithms. These algorithms are not easily expressed in the MapReduce model, and require copying a lot of data between iterations. Efforts like Haloop partially solve the later but not the former.

The Bulk Synchronous Parallel programming model is clearly a better fit. In this model, individual components (e.g. vertices of a graph) perform computations, possibly update their own state and communicate with other components using message passing. An iteration is complete after all components are done.

In the world of graph databases, the bulk synchronous parallel model was popularized in 2010 after Google released the Pregel paper.

And it is surprisingly simple. A single-node implementation of Google Pregel fits in 100 lines of code.

Vertex-centric computations make it easy to “parallelize” any algorithm. And partitioning can also become as simple as randomly assigning vertices to hosts.

But the barrier required between two iterations can be a serious bottleneck. In addition, this model requires storing two versions of all values (the previous iteration and the new value).

GraphLab’s primary motives were to avoid the bottleneck and to achieve exceptional performance on distributed graph processing.

The GraphLab core is a high-level C++ library (with Python and Java APIs) providing the following:

– compact in-memory graph storage. Graphs can be automatically distributed across the cluster.

– a clean API to let each vertex perform computations and updates.

– a set of schedulers to distribute tasks across all available CPUs.

– utilities for loading and saving graphs (locally or through HDFS), random number generators with chosen distributions, and generic utilities for writing portable code.

GraphLab uses MPI for inter-process communications, and job configuration can be read from Zookeeper.

The first version of GraphLab made it easy to express algorithms, but didn’t ensure race-free operations. One had to choose, for each algorithm, between strong consistency (hence no race conditions, but a slower runtime) and weak consistency, ignoring conflicting writes that can be acceptable for some algorithms.

The second version of GraphLab introduced a different programming model.

A “gather” phase retrieves information from other vertices, an “apply” phase updates the current vertex, and a “scatter” phase sends message to other vertices to prepare the next iteration. These three decomposable update functors can run asynchronously, and updates can happen in batch in order to optimize inter-node communications. This accelerates convergence of many numerical algorithms.

Thanks to these changes, this second version of GraphLab, code named “PowerGraph”, can efficiently process natural graphs. The graph of DNS queries we are processing definitely fits in this category, with top domains like google.com being adjacent to over 78% edges.

Umbrella Security Labs tried GraphLab 2 a couple months ago on our research 10-node cluster, and were impressed by the results. Algorithms running at high speed allowed us to quickly build new models and check their output on a complete data set.

Furthermore, a solid set of algorithms have already been implemented on top of this incredibly fast engine. They address a wide range of problems, from the domains of graph mining, to machine learning and linear algebra.

During the workshop, the next generation of the GraphLab framework, code named WarpGraph, has been unveiled. For starters, GraphLab is now available on Github, so anybody can seamlessly contribute new code and documentation to the project.

PowerGraph focused on performance. However, the programming models were slightly more complex than the original one, requiring a lot of contortions to make some algorithms fit.

WarpGraph focuses on usability by providing a simple way to write vertex-centric programs. Programs can now be written as coroutines implementing mini map/reduce operations. Performance is comparable to–and, in some cases better than–GraphLab 2.

A live demo of GraphLab accessed through an iPython notebook for interactive data analysis brillantly concluded the first presentation of the day.

GraphChi

GraphLab keeps the full graph and all values in (distributed) memory, which is why performance improvements have been mainly achieved by increasing paralellism and reducing communications.

GraphChi, another project maintained by the same team as GraphLab, addresses a different problem: single-machine computations on large graphs that don’t fit in memory. The only option to do so is obviously to store and update the graph on disk.

GraphChi’s API is very similar to GraphLab’s, and most algorithms can already run on both, or can be adapted with minimum effort.

At first, on-disk operations sound like a terrible idea for running graph algorithms, as even a SSD is order of magnitudes slower than main memory, especially for random access.

But GraphChi introduces an technique that actually makes on-disk graph processing extremely fast.

A reasonable assumption is that the update function will only need to read/write values from/to neighbor vertices.

In GraphChi, the adjacency set is split across shards, each with approximatively the same total number of edges, and so that the largest shard can fully fit in memory (actually 1/4 of the available memory, in order to keep room for other runtime data).

Within each shard, in-edges and out-edges are stored separately as a {source vertex, {list of edges}} array, in order to keep the number of disk seeks to a minimum when updating a vertex.

This array is indexed, and sorted by the source vertex. That way, accessing the edges of vertices within an interval for a given shard only requires sequential disk access, and the window can be much smaller than the full shard.

After preprocessing the graph in order to build the shards, GraphChi loads a full shard in memory. The update operation will be applied to this specific subgraph before commiting changes made to this shard to disk and moving over to the next shard.

Some neighbors required for the update operations will already be part of the in-memory shard and can be directly updated. Others are in different shards. For each of these on-disk shards, the relevant window (remember, the array is sorted by source vertex) is loaded and possibly updated.

An iteration is complete once all shards have been sequentially loaded in memory.

The beauty of this technique is that most I/O operations are sequential, and that it efficiently uses the main memory to limit the amount of operations. Unlike GraphLab, there is no interprocess communication involved: updates are directly made in memory before being flushed in batch to disk.

The GraphChi implementation also allows online updates, storing new edges in a temporary in-memory buffer that can eventually be materialized into actual shards.

Not only large graphs can be processed on commodity hardware, but benchmarks are impressive. A laptop running GraphChi frequently achieves results comparable to a 8 nodes cluster, provided that the graph has already been converted to a set of shards.

Last, but not least, a Java implementation of GraphChi is available. While slower than the C++ version, it can be used as a Pig UDF.

Pig is already our long-time best friend. Don’t tell him, but there’s a high probably that our lab will also give labradors and chihuahuas a lot of love in the months to come.

photo 1 (1)

 

Graph Processing at Facebook Scale

Another talk we liked is from Dr. Avery Ching, a contributor to Giraph, now working at Facebook. Guess what his talk is about.:)

We personally haven’t tried Giraph. As demoed in Dr. Avery’s talk, it works quite well with Facebook-scale data. It was certainly inspiring to get some of the ideas behind Giraph such as master computation, sharded aggregators etc. We here borrowed the illustration of the single-source shortest path algorithm in Giraph. To get started, check out Giraph’s website

 

 giraph

Large-Scale Graph Clustering in MapReduce and Beyond

Dr. Vahab Mirrokni from Google research, New York presented a nice clustering scheme on large-scale graphs. Clustering is very useful and is often the first exploratory step in unveiling hidden grouping structure of a big clump, requiring little prior knowledge of the data segments. It is obviously commonly seen in social network analysis, customer segmentation studies.

In the security domain, papers [1,2] were published where network traffic patterns were studied with clustering techniques. So back to Dr. Mirrokni’s talk, the focus is again to solve the clustering problem efficiently when the old solutions are challenged by the scale of the data. This is the overhauling slide for his talk, and below are the security-related papers that you may put on your reading list.

overlappingclustering

1. “Behavioral Clustering of HTTP-Based Malware and Signature Generation Using Malicious Network Traces”, Roberto Perdisci, Wenke Lee, and Nick Feamster, USENIX NSDI 2010 

2. “BotMiner: Clustering Analysis of Network Traffic for Protocol- and Structure-Independent Botnet Detection”, Guofei Gu, Roberto Perdisci, Junjie Zhang, and Wenke Lee, USENIX Security Symposium, 2008

 

GraphBuilder 2.0

Dr. Theodore Willke from Intel Labs, presented GraphBuilder 2.0, the scalable graph construction library for Hadoop developed at Intel. GraphBuilder is open source and written in Java which makes it easy to integrate with Hadoop Mapreduce. Its main advantage is that it frees domains experts from the complexities of preliminary graph construction, such as graph formation, tabulation, compression, transformation, partitioning, output formatting, and serialization.

GraphBuilder answers the need for an ETL (extract, transform, and load) sequence specific to large scale graphs. After the initial phase of parsing the data sources and extracting features of interest, the library user can build the edge list for the graph, and use any of the built-in functions like term frequency, or word count. In the next phase, the user can deduplicate edges, remove leaf nodes, self-loops, or transform a directed graph into an undirected one.

In the graph compression phase, dictionary-based compression and simple MapReduce compression algorithms are applied which are empirically efficient at conserving memory and storage. Next, comes the partitioning phase which uses efficient heuristics such as the balanced p-way vertex cut scheme. Most large scale graphs that appear in real-world problems are arbitrary (such as the web graph, or social networks) i.e. they are far from being regular or truly random. The problem of balanced partitioning of arbitrary graphs such as power-law graphs is NP-hard, hence the importance of using partitioning heuristics that are efficient in practice.

 

The event featured another 10 to 11 equally great talks and a dozen poster demo sessions, all informative and technically inspiring. The compact single day schedule turns out to be a very efficient way to share, learn and connect. At the end of the day, each attendee got a copy of the newly released “Graph Databases” book from O’Reilly and sponsored by Neo4j, something that would definitely help those who would like to dive a bit deeper into another aspect of big learning with graphs. 

 

 

This post is categorized in: