Scan Statistics#

Scan statistics, as described in Priebe et al., is the generic method that computes a statistic for the neighborhood of each node in the graph, and looks for anomalies in those statistics. In this workflow, we implement a specific version of scan statistics where we compute the number of edges in the subgraph induced by the one-hop neighborhood of each node $u$ in the graph. It turns out that this statistic is equal to the number of triangles that node $u$ participates in plus the degree of $u$. Thus, we are able to implement scan statistics by making relatively minor modifications to our existing Gunrock triangle counting (TC) application.

Summary of Results#

Scan statistics applied to static graphs fits perfectly into the Gunrock framework. Using a combination of ForAll and Intersection operations, we outperform the parallel OpenMP CPU reference by up to 45.4 times speedup on the small Enron graph (provided as part of the HIVE workflows) and up to a 580 times speedup on larger graphs that feature enough computation to saturate the throughput of the GPU.

Algorithm: Scan Statistics#

Input is an undirected graph w/o self-loops.

scan_stats = [len(graph.neighbors(u)) for u in graph.nodes]
for (u, v) in graph.edges:
    if u < v:
        u_neibs = graph.neighbors(u)
        v_neibs = graph.neighbors(v)
        for shared_neib in intersect(u_neibs, v_neibs):
            scan_stats[shared_neib] += 1
return argmax([scan_stats(node) for node in graph.nodes])

Summary of Gunrock implementation#

max_scan_stat = -1
node = -1
src_node_ids[nodes]
scan_stats[nodes]

ForAll (src_node_ids, scan_stats): fill scan_stats with the degree of each node.
Intersection (src_node_ids, scan_stats): intersect neighboring nodes of both
  nodes of each edge; add 1 to scan_stats[common_node] for each common_node
  we get from the intersection.

return [scan_stats]

How To Run This Application on DARPA’s DGX-1#

Prereqs/input#

git clone --recursive https://github.com/gunrock/gunrock \
    -b dev-refactor
cd gunrock/tests/ss/
make clean && make

Running the application#

Application specific parameters:

Example command-line:

./bin/test_ss_main_10.0_x86_64 \
    --graph-type=market \
    --graph-file=./enron.mtx \
        --undirected --num_runs=10

Output#

The output of this app is an array of uint values: the scan statistics values for each node. The output file will be in .txt format with the aforementioned values.

We compare our GPU output with the HIVE CPU reference implementation implemented using OpenMP.

Details of the datasets:

DataSet

$\cardinality{V}$

$\cardinality{E}$

#dangling vertices

enron

15056

57074

0

ca

108299

186878

85166

amazon

548551

1851744

213688

coAuthorsDBLP

299067

1955352*

0

citationCiteseer

268495

2313294*

0

as-Skitter

1696415

22190596*

0

coPapersDBLP

540486

30481458*

0

pokec

1632803

30622564

0

coPapersCiteseer

434102

32073440*

0

akamai

16956250

53300364

0

soc-LiveJournal1

4847571

68475391

962

europe_osm

50912018

108109320*

0

hollywood-2009

11399905

112751422*

32662

rgg_n_2_24_s0

16777216

265114400*

1

Running time in milliseconds:

GPU

Dataset

Gunrock GPU

Speedup vs. OMP

OMP

P100

enron

0.461

45.4

20.95

P100

ca

0.219

71.6

15.681

P100

amazon

1.354

74.5

100.871

P100

coAuthorsDBLP

1.569

88.0

138.111

P100

citationCiteseer

3.936

65.22

256.694

P100

as-Skitter

111.738

579.22

64721.414

P100

coPapersDBLP

226.672

25.4

5766.4

P100

pokec

202.185

80.7

16316.474

P100

coPapersCiteseer

451.582

16.34

7378.188

P100

akamai

151.47

5.13

12596.36

P100

soc-LiveJournal1

1.548

2.59

4.016

P100

europe_osm

9.59

153.35

1470.632

P100

hollywood-2009

10032.46

18.76

188206.234

P100

rgg_n_2_24_s0

539.559

29.45

15887.536

Performance and Analysis#

We measure the performance by runtime. The whole process runs on the GPU; we don’t include data copy time and regard it as part of preprocessing. The algorithm is deterministic so does not require any accuracy comparison.

Implementation limitations#

Since the implementation only needs number of nodes’s size of memory allocated on the GPU, so the largest dataset that can fit on a single GPU is limited by GPU on-die memory size / number of nodes.

Our implementation regards the graph as undirected and is limited to graphs of that type.

Performance limitations#

We currently use atomic adds to accumulate the number of triangles for each node; atomic operations are relatively slow.

Next Steps#

Alternate approaches#

The most time-consuming operation for scan statistics is the intersection operator. We believe we can improve its performance in future work.

Currently we divide the edge lists into two groups: (1) small neighbor lists and (2) large neighbor lists. We implement two kernels (TwoSmall and TwoLarge) that cooperatively compute intersections. Our TwoSmall kernel uses one thread to compute the intersection of a node pair. Our TwoLarge kernel partitions the edge list into chunks and assigns each chunk to a separate thread block. Then each block uses the balanced path primitive to cooperatively compute intersections. However, these two kernels do not cover the case of one small neighbor list and one large neighbor list. By using a 3-kernel strategy and carefully choosing a threshold value to divide the edge list into two groups, we could potentially process intersections with the same level of workload together, gaining better load balancing and a higher GPU resource utilization.

Gunrock implications#

Gunrock is well-suited for this implementation. The major challenge is the need for accumulation when doing intersection. Intersection was originally designed to count the total number of triangles. But in scan statistics, we instead need the triangle count for each node, which introduces an extra atomic add when doing the accumulation if multiple edges share the same node, since we count triangles in parallel for each edge.

Notes on multi-GPU parallelization#

Non-ideal graph partitioning could be a bottleneck on a multi-GPU parallelization. Because we need the info of each node’s two-hop neighbors, an unbalanced workload will decrease the performance and increase the communication bottleneck.

Notes on dynamic graphs#

Scan statistics as a workload certainly has a dynamic-graph component: in its general form, it inputs a time-series graph and tries to identify any abnormal behavior in any statistics change (as mentioned in the referenced papers). To support this general workload, we would need Gunrock to support dynamic graphs in its advance and intersection operators. The application could be easily changed to record all history scan statistics and then to try to find any significant changes given a certain threshold.

Notes on larger datasets#

For a dataset which is too large to fit into a single GPU, we can leverage the multi-GPU implementation of Gunrock to make it work on multiple GPUs on a single node. The implementation won’t change a lot since Gunrock already has good support in its multi-GPU implementation. We expect the performance and memory usage to scale linearly with good graph partionling method.

For a dataset that cannot fit onto multiple GPUs on a single node, we need a distributed level of computation, which Gunrock doesn’t support yet. However, we can leverage open-source libraries such as NCCL and Horovod that support this. Performance-wise, the way of partitioning the graph as well as the properties of a graph will affect the communication bottleneck. Since we need to calculate the total number of triangles each node is involved in, if we cannot fit the entire neighborhood of a node on a single node, we need other compute resources’ help to compute its scan statistic. The worst-case senario is if the graph is fully connected, in which case we must to wait for the counting results from all other compute resources and then sum them up. In this case, if we can do good load-balanced scheduling, we can potentially minimize the communication bottleneck and reduce latency.

Notes on other pieces of this workload#

The other important piece of this work is statistical time series analysis on dynamic changes of the graph. We hope to add dynamic graph capability in the future.

Research value#

This application leverages a classic triangle-counting problem to solve a more complex statistics problem. Instead of directly using the existing solution, we solve it from a different angle: rather than counting the total number of triangles in the graph, we instead count triangles from each node’s perspective.

From this app, we see the flexibility of Gunrock’s intersection operation. This operation could potentially be used in other graph analytics problems such as community detection, etc.

References#

Scan Statistics on Enron Graphs

Statistical inference on random graphs: Comparative power analyses via Monte Carlo