Application Classification#

The application classification (AC) workflow is an implementation of probabilistic graph matching via belief propagation. The workflow takes two node- and edge-attributed graphs as input – a data graph G = (U_G, E_G) and a pattern graph P = (U_P, E_P). The goal is to find a subgraph S of G such that the dissimilarity between the node/edge features of P and S is minimized. The matching is optimized via loopy belief propagation, which iteratively passes messages between nodes and then updates beliefs about the optimal match.

Summary of Results#

Application classification involves a number of dense-matrix operations, which did not make it an obvious candidate for implementation in Gunrock. However, our GPU implementation using the CUDA CUB library shows substantial speedups (roughly 10-100x) over the multi-threaded OpenMP implementation.

However, there are two neighbor-reduce operations that may benefit from the kind of load balancing implemented in Gunrock. Thus, it would be useful either to expose lightweight wrappers of high-performance Gunrock primitives for easy integration into outside projects, or to come up with a workflow inside of Gunrock that makes it straightforward to program applications with many non-graph operations.

Summary of CUDA Implementation#

We implement application classification from scratch in CUDA using CUB rather than Gunrock. Application classification requires the following kernels:

  • compute pairwise distance between rows of dense matrices

  • normalize the columns of a dense matrix

  • compute maximum of columns in a dense matrix

  • gather/scatter rows of dense matrix

  • sum/max neighborhood reduce on the data/pattern graphs

Apart from the last one, these kernels do not obviously map to Gunrock’s advance/filter model.
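
As a point of reference, the first of these kernels has simple dense-array semantics. Below is an illustrative NumPy sketch; the function name and the choice of Euclidean distance are our assumptions, not a description of the exact CUDA kernel:

import numpy as np

def pairwise_row_distances(A, B):
    # A: (m, d) and B: (n, d) feature matrices; returns an (m, n)
    # matrix where entry (i, j) is the distance between A[i] and B[j].
    diffs = A[:, None, :] - B[None, :, :]        # (m, n, d)
    return np.sqrt((diffs ** 2).sum(axis=-1))    # Euclidean distance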

Pseudocode for the core belief propagation algorithm is as follows:

for iteration in range(num_iterations):

    # Update edge messages (data.num_edges x pattern.num_edges)
    # NB: both lines index with data_edges.srcs, mirroring the PNNL
    # implementation (see "Implementation limitations" below)
    edge_msg_f = diff_r[data_edges.srcs] - edge_distances
    edge_msg_r = diff_f[data_edges.srcs] - edge_distances

    # Normalize edge messages
    edge_msg_f = normprob(edge_msg_f)
    edge_msg_r = normprob(edge_msg_r)

    # Max of forward/backward messages
    f_null = columnwise_max(diff_f)                           # 1 x pattern.num_edges
    r_null = columnwise_max(diff_r)                           # 1 x pattern.num_edges
    for edge_idx, (src, dst) in enumerate(data_edges):
        max_r[src] = max(max_r[src], edge_msg_r[edge_idx], f_null)  # elementwise max
        max_f[dst] = max(max_f[dst], edge_msg_f[edge_idx], r_null)  # elementwise max

    # Update beliefs
    belief = - node_distances                              # data.num_nodes x pattern.num_nodes
    for edge_idx, (src, dst) in enumerate(pattern_edges):
        belief[:,dst] += max_f[:,edge_idx]
        belief[:,src] += max_r[:,edge_idx]

    # Normalize beliefs
    belief = normprob(belief)

    diff_f = belief[:,pattern_edges.dsts] - max_f          # data.num_nodes x pattern.num_edges
    diff_r = belief[:,pattern_edges.srcs] - max_r          # data.num_nodes x pattern.num_edges

where normprob is the column-wise log-softmax function. That is, since belief is data.num_nodes x pattern.num_nodes, each node in the pattern graph gets assigned a (log-)probability distribution over nodes in the data graph.
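
For concreteness, here is a minimal NumPy sketch of normprob, assuming the standard numerically stable formulation of the column-wise log-softmax:

import numpy as np

def normprob(x):
    # Column-wise log-softmax. Subtracting each column's max first
    # keeps the exponentials numerically stable; afterwards each
    # column of exp(result) sums to 1.
    shifted = x - x.max(axis=0, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))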

Our implementation is based on the PNNL implementation rather than the distributed GraphX reference implementation. On all of the graphs we’ve tested, the output of our implementation exactly matches the output of the PNNL code. According to PNNL, their implementation may give different results from the HIVE reference implementation (due to, e.g., different normalization schemes).

How To Run This Application on DARPA’s DGX-1#

Prereqs/input#

git clone --recursive \
        https://github.com/owensgroup/application_classification
cd application_classification
make clean
make

Running the application#

Example Command#

mkdir -p results
./main \
    ./data/georgiyData.Vertex.csv \
    ./data/georgiyData.Edges.csv \
    ./data/georgiyPattern.Vertex.csv \
    ./data/georgiyPattern.Edges.csv > results/georgiy

Example output#

$ head -n 5 results/georgiy
-8.985810e+00
-3.859019e+01
-4.470994e+01
-1.673157e+01
-1.730952e+01
$ tail -n 5 results/georgiy
-2.886186e+01
-1.499165e+01
-4.034595e+01
-1.060496e+01
-7.684015e+00
$ cat results/georgiy | openssl md5
(stdin)= bd57a5126d5f943ad5c15408d410790d

Expected Output#

The output of the program is a data.num_nodes x pattern.num_nodes matrix, where column j is a log-probability distribution over data nodes for pattern node j. This matrix is printed in row-major order as a file with data.num_nodes * pattern.num_nodes lines (one entry per line).

In Python, you can inspect the output like this:

import numpy as np

# load results
x = open('./results/georgiy').read().splitlines()
x = [float(xx) for xx in x]

# reshape to (data.num_nodes, pattern.num_nodes)
x = np.reshape(x, (1000, 10))

# exponentiate
x = np.exp(x)

# columns should sum to 1
x.sum(axis=0)
# array([1.0000001 , 1.00000001, 1.00000004, 0.99999999, 0.99999988,
#        0.99999994, 0.99999985, 1.00000001, 1.        , 0.99999988])

As previously mentioned, our output exactly matches the output of the PNNL implementation on all of the graphs we have tested.

Performance and Analysis#

We measure performance by runtime of the algorithm given

  • a node- and edge-attributed data graph

  • a node- and edge-attributed pattern graph

Though AC computes probabilities of matches between data and pattern nodes, this is a deterministic and parameterless algorithm, so we do not measure performance in terms of accuracy. Further, runtime does not depend on the values of the node/edge attributes, so we can reasonably run experiments using randomly generated values.

Implementation limitations#

As currently implemented, the algorithm allocates a number of arrays of floats:

  • 3 arrays of size data.num_edges * pattern.num_edges

  • 3 arrays of size data.num_nodes * pattern.num_edges

  • 2 arrays of size data.num_nodes * pattern.num_nodes

The combined memory usage of these arrays will be the primary bottleneck for scaling. For example, if

  • data.num_nodes = 6M

  • data.num_edges = 18M

  • pattern.num_nodes = 10

  • pattern.num_edges = 20

then the memory footprint will be on the order of 13 GB. (Note: It’s likely that reordering of operations could reduce the number of intermediate data structures required.)
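
A back-of-the-envelope check of that figure, assuming 8-byte double-precision values (the ~13 GB estimate suggests doubles; with 4-byte floats the footprint would be roughly half):

data_num_nodes, data_num_edges = 6_000_000, 18_000_000
patt_num_nodes, patt_num_edges = 10, 20
bytes_per_value = 8  # assumption: double precision

num_values = (3 * data_num_edges * patt_num_edges +
              3 * data_num_nodes * patt_num_edges +
              2 * data_num_nodes * patt_num_nodes)
print(num_values * bytes_per_value / 1e9)  # ~12.5 GB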

AC is only applicable to graphs with node and edge features. However, the runtime of the algorithm depends only on the dimension of these features, not on their values. Thus, for benchmarking purposes, we can pick a node_feature_dim and edge_feature_dim and use randomly generated features. This is helpful because we do not have a good real-world dataset for benchmarking.
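
For example, benchmark features can be generated along the following lines (the uniform distribution is an arbitrary choice, since runtime depends only on the dimensions):

import numpy as np

rng = np.random.default_rng(0)
num_nodes, num_edges = 5157, 186572          # e.g., JohnsHopkins_random
node_feature_dim, edge_feature_dim = 12, 16
node_features = rng.random((num_nodes, node_feature_dim))
edge_features = rng.random((num_edges, edge_feature_dim))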

The algorithm requires both a data graph and a smaller pattern graph. Both the size and topology of the pattern graph may affect the runtime of the algorithm. However, we do not have examples of actual pattern graphs apart from georgiyPattern, so we are forced to generate them synthetically. Specifically, we sample a node q, sample up to max_pattern_nodes of its neighbors u to form a set Q, and use the subgraph induced by Q + q as the pattern graph.
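
A minimal sketch of this neighborhood-induction procedure; the function and variable names are ours, and details may differ from our actual generator:

import random

def sample_pattern(adj, edges, max_pattern_nodes, seed=0):
    # adj: dict mapping node -> list of neighbors
    # edges: list of (src, dst) pairs from the data graph
    rng = random.Random(seed)
    q = rng.choice(sorted(adj))
    k = min(max_pattern_nodes, len(adj[q]))
    nodes = set(rng.sample(adj[q], k)) | {q}
    # keep only edges with both endpoints in the sampled node set,
    # i.e., the subgraph induced by Q + q
    pattern_edges = [(s, d) for (s, d) in edges if s in nodes and d in nodes]
    return nodes, pattern_edges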

We suspect the PNNL implementation may have a couple of minor bugs:

  • The CV matrix is log-softmax normalized in the initialization phase, updated with values from FMax and RMax, and then log-softmax normalized again. This double normalization seems strange, and is perhaps incorrect.

  • In our RepeatColumnsByDataEdges, FE and RE both use the src of an edge, which conflicts with the algorithm description in the paper. One of these should probably be using the src and the other the dst.

These bugs are easy to fix, but we left them “as is” because a) we have no easy way to verify which is correct and b) we’d like consistency of results w/ the PNNL implementation.

Comparison against existing implementations#

The original reference implementation consisted of a large amount of distributed Spark and GraphX Scala code. For ease of implementation, and to make sure performance comparisons are meaningful, we instead based our implementation on the PNNL ApplicationClassification OpenMP code.

Overall, the CUDA implementation is between 10x and 100x faster than the best run of the PNNL OpenMP code.

We compare our CUDA implementation to PNNL’s C++ OpenMP implementation on several different graphs:

georgiyData#

  • a small graph w/ real (source unknown) node/edge features (included in PNNL repo)

  • |U|=1000 |E|=20135 node_feat_dim=13 edge_feat_dim=17

rmat18#

  • a scale 18 RMAT synthetic graph w/ random node/edge features (included in PNNL repo)

  • |U|=262145 |E|=2008660 node_feat_dim=13 edge_feat_dim=17

JohnsHopkins_random#

  • Social network graph w/ random node/edge features

  • |U|=5157 |E|=186572 node_feat_dim=12 edge_feat_dim=16

For the first two, we use the georgiyPattern pattern graph included in the PNNL repo. For JohnsHopkins_random, we generate a pattern graph using the neighborhood induction described above.

georgiyData#

| implementation | threads | elapsed_ms  |
| -------------- | ------- | ----------- |
| PNNL OpenMP    | 1       | 1635.774136 |
| PNNL OpenMP    | 2       | 1405.072927 |
| PNNL OpenMP    | 4       | 1005.914927 |
| PNNL OpenMP    | 8       | 831.342936  |
| PNNL OpenMP    | 16      | 793.069839  |
| PNNL OpenMP    | 32      | 546.305180  |
| PNNL OpenMP    | 64      | 706.761122  |
| Our CUDA       | 1xP100  | 42.533      |

Takeaway: Our CUDA implementation is approximately 12.8x faster than the fastest OpenMP run (32 threads). However, this problem is quite small and both implementations run in under 1 second.

rmat18#

| implementation | threads | elapsed_ms    |
| -------------- | ------- | ------------- |
| PNNL OpenMP    | 1       | 113337.949038 |
| PNNL OpenMP    | 2       | 142036.607981 |
| PNNL OpenMP    | 4       | 109564.634800 |
| PNNL OpenMP    | 8       | 95680.689096  |
| PNNL OpenMP    | 16      | 87083.579063  |
| PNNL OpenMP    | 32      | 88772.798061  |
| PNNL OpenMP    | 64      | 82495.028973  |
| Our CUDA       | 1xP100  | 827.573       |

Takeaway: Our CUDA implementation is approximately 99x faster than the fastest PNNL run (64 threads). The absolute magnitude of the differences is much more substantial here – the CUDA implementation runs in < 1 second while the OpenMP version runs in ~ 1.5 minutes.

JohnsHopkins_random#

| implementation | threads | elapsed_ms   |
| -------------- | ------- | ------------ |
| PNNL OpenMP    | 1       | 71190.566063 |
| PNNL OpenMP    | 2       | 44293.390989 |
| PNNL OpenMP    | 4       | 31736.392021 |
| PNNL OpenMP    | 8       | 24292.662144 |
| PNNL OpenMP    | 16      | 21556.239128 |
| PNNL OpenMP    | 32      | 17082.650900 |
| PNNL OpenMP    | 64      | 19473.608017 |
| Our CUDA       | 1xP100  | 450.897      |

Takeaway: Our CUDA implementation is approximately 37x faster than the fastest PNNL run (32 threads). Again, the absolute difference in runtimes is more substantial, with our code running in < 1 second vs. ~ 20 seconds.

Performance limitations#

Profiling (on the RMAT graph) reveals that the distribution of runtime over kernels is flatter in AC than for many applications – often, a single kernel will account for > 80% of runtime, but here the most expensive kernel only accounts for 12.8% of compute time.

  • 12.8% of time spent in __reorderColumns, which is a gather/scatter operation (memory bandwidth = 281 GB/s)

  • 10.5% of time spent in __transpose

  • 12.1% of time spent in __rowSubLog, an edgewise arithmetic operation

  • 9.5% of time computing data.num_edges x pattern.num_edges similarity matrix

  • 8.1% of time doing neighborhood reductions (293 GB/s)

All of these are memory-bound operations.
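
As a rough sanity check on those bandwidth numbers, assuming the DGX-1's P100-SXM2 GPUs with a peak HBM2 bandwidth of ~732 GB/s:

peak_bw = 732.0  # GB/s; P100-SXM2 peak HBM2 bandwidth (our assumption)
for name, bw in [('__reorderColumns', 281.0), ('neighborhood reduce', 293.0)]:
    print(f'{name}: {bw / peak_bw:.0%} of peak')
# __reorderColumns: 38% of peak
# neighborhood reduce: 40% of peak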

As mentioned above, the memory use of the app could be reduced (if need be) by reducing the number of intermediate data structures. This would come at the cost of increased (re)computation time.

Next Steps#

Alternate approaches + Gunrock implications#

In the CUDA implementation, there are a number of places where we could take advantage of multiple streams to reduce runtimes. For example, in the initialization phase, we compute the pairwise distance between a) data/pattern node features and b) data/pattern edge features. These operations are completely independent of one another, and so could happen asynchronously on different streams.

The application was implemented outside of the Gunrock framework because it had a large number of (dense matrix) operations that are not explicitly supported by Gunrock, and a relatively small number of kernels that map well to Gunrock’s advance/filter paradigm. However, the code uses CUB’s DeviceSegmentedReduce a number of times – Gunrock recently added a similar operator that is load-balanced for better performance. In the future, it would be worthwhile to see what kind of speedup we could get from the Gunrock version, which should roughly be a drop-in replacement.
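
For reference, a segmented (sum) reduction reduces each contiguous segment of a value array, e.g., each node's neighborhood in CSR form. A NumPy sketch of the semantics:

import numpy as np

values = np.array([1., 2., 3., 4., 5., 6.])  # e.g., per-edge values
offsets = np.array([0, 2, 5])                # CSR-style segment starts
segment_sums = np.add.reduceat(values, offsets)
# -> array([ 3., 12.,  6.])  (segments [0:2], [2:5], [5:6])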

Notes on multi-GPU parallelization#

Most of the kernels are either row or column operations (reductions) over dense matrices, and thus relatively easy to partition across multiple GPUs. They would either be embarrassingly parallel or would require a per-GPU reduction followed by a reduction across GPUs. Replacing CUB’s DeviceSegmentedReduce with Gunrock’s implementation would give us multi-GPU support for the remaining kernel.
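
For a column-wise sum, the two-stage pattern looks like the following CPU sketch, with array shards standing in for per-GPU partitions:

import numpy as np

rng = np.random.default_rng(0)
shards = [rng.random((100, 20)) for _ in range(4)]  # per-GPU row shards

partials = [s.sum(axis=0) for s in shards]  # local (per-GPU) column sums
total = np.sum(partials, axis=0)            # combine across GPUs

assert np.allclose(total, np.vstack(shards).sum(axis=0))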

Alternatively, depending on the topology of the graph, we may be able to partition the data graph so that we can duplicate the pattern graph across nodes and run an independent instance of application classification on each partition. The partition would need to be constructed in a way that ensures that every subgraph is intact on some GPU, which implies some partial duplication of the data graph. If the data graph has a large diameter, or the pattern graph has a small diameter, this may be possible without excessive duplication. If the data graph has a small diameter, we may still be able to partition the graph by e.g. removing edges that are particularly dissimilar from edges in the pattern graph. This kind of approach is clearly very application specific, and may not be possible at all in some cases.

Notes on dynamic graphs#

In practice, it’s likely that practitioners would like to run application classification on a dynamic graph (e.g., the HIVE use case was for detecting certain suspicious patterns of communication in a cybersecurity graph). However, it is not obvious how the current algorithm would be applied in a streaming fashion without relatively major modifications. It’s more likely that the current AC algorithm would be applied to a dynamic graph via some kind of sliding window.

Notes on larger datasets#

We may be able to use a partitioning scheme like the one described in the multi-GPU section above to handle data graphs that are larger than GPU memory.

Notes on other pieces of this workload#

The documentation on the wiki includes discussion of the various featurization methods used to produce the node/edge attributes. These are beyond the scope of this work, but do include things such as computing degree, number of triangles, betweenness centrality, etc. If we wanted to build a high-performance end-to-end application classification system, we would want to implement some of these featurization methods in Gunrock as well.