GraphSearch#

The graph search (GS) workflow is a walk-based method that searches a graph for nodes that score highly on some arbitrary indicator of interest.

The use case given by the HIVE government partner was sampling a graph: given some seed nodes, and some model that can score a node as “interesting”, find lots of “interesting” nodes as quickly as possible. Their algorithm attempts to solve this problem by implementing several different strategies for walking the graph.

  • uniform: given a node u, randomly move to one of u’s neighbors (ignoring scores)

  • greedy: given a node u, walk to the neighbor of u with the maximum score

  • stochastic_greedy: given a node u, choose the neighbor to walk to with probability proportional to its score

Use of these walk-based methods is motivated by the presence of homophily in many real world social networks: we expect interesting people to have relationships with interesting people.

Summary of Results#

Graph search is a relatively minor modification to Gunrock’s random walk application, and was straightforward to implement. Though random walks are a “worst case scenario” for GPU memory bandwidth, we still achieve 3–5x speedup over a modified version of the OpenMP reference implementation.

The original OpenMP reference implementation actually ran slower with more threads – we fixed the bugs, but the benchmarking experience highlights the need for performant and hardened CPU baselines.

Until recently, Gunrock did not support parallelism within the lambda functions run by the advance operator, so neighbor selection for a given step in the walk is done sequentially. Methods for exposing more parallelism to the programmer are currently being developed via parallel neighbor reduce functions.

In an end-to-end graph search application, we’d need to implement the scoring function as well as the graph walk component. For performance, we’d likely want to implement the scoring function on the GPU as well, which makes this a good example of a “Gunrock+X” app, where we’d need to integrate the high-performance graph processing component with arbitrary user code.

Summary of Gunrock Implementation#

The scoring model can be an arbitrary function (e.g., of node metadata). For example, if we were running GS on the Twitter friends/followers graph, the scoring model might be the output of a text classifier on each user’s messages. Thus, we do not implement the scoring model in our Gunrock implementation – instead, we read scores from an input file and access them as necessary.

GS is a generalization of a random walk implementation, where there can be more variety in the transition function between nodes.

The GS uniform mode is exactly a uniform random walk, so we can use the pre-existing Gunrock application. Given a node, we compute the node to walk to as:

r = random.uniform(0, 1)
neighbors = graph.get_neighbors(node)
next_node = neighbors[floor(r * len(neighbors))]

Both the GraphSearch greedy and stochastic_greedy modes consist of small modifications to this transition function.

For greedy, we find the neighbor with maximum score:

neighbors = graph.get_neighbors(node)
next_node = neighbors[0]
next_node_score = scores[next_node]
for neighbor in neighbors:
    neighbor_score = scores[neighbor]
    if neighbor_score > next_node_score:
        next_node = neighbor
        next_node_score = neighbor_score

For stochastic_greedy, we sample a neighbor with probability proportional to its score – e.g.:

sum_neighbor_scores = 0
for neighbor in graph.get_neighbors(node):
    sum_neighbor_scores += scores[neighbor]

# draw a threshold uniformly in [0, sum_neighbor_scores)
r = random.uniform(0, 1)
r *= sum_neighbor_scores

tmp = 0
for neighbor in graph.get_neighbors(node):
    tmp += scores[neighbor]
    if r < tmp:
        next_node = neighbor
        break

In Gunrock, we create a frontier containing all of the nodes we want to walk from. Then we map the transition function over the frontier using Gunrock’s ForEach operator. Current nodes in the frontier are replaced with the chosen neighbor, and the walk is (optionally) recorded in an output array.
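
For concreteness, here is a Python-style sketch of one step of that loop. The names walk_step, choose_next, frontier, and walks are hypothetical and used only for illustration (they are not Gunrock API), and the sequential for loop stands in for the parallel ForEach operator.

# One step of the walk. `choose_next` is one of the transition functions above
# (uniform / greedy / stochastic_greedy).
def walk_step(graph, scores, frontier, walks, step, choose_next):
    # Gunrock applies the body of this loop to every frontier entry in parallel (ForEach).
    for i, node in enumerate(frontier):
        next_node = choose_next(graph, scores, node)
        frontier[i] = next_node          # replace the current node with the chosen neighbor
        if walks is not None:            # optionally record the walk in an output array
            walks[i][step] = next_node
    return frontier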

Because this is such a straightforward modification, we implement GS inside of the existing Gunrock random walk (rw) application. GS just requires adding a couple of extra flags and one extra array of size |V| to store the node values.

How To Run This Application on DARPA’s DGX-1#

Prereqs/input#

git clone --recursive https://github.com/gunrock/gunrock -b dev-refactor
cd gunrock/tests/rw/
cp ../../gunrock/util/gitsha1.c.in ../../gunrock/util/gitsha1.c
make clean
make

Running the application#

Application specific parameters#

  --walk-mode
      0 = uniform
      1 = greedy
      2 = stochastic_greedy
  --node-value-path
      If --walk-mode != 0, this is the path to node scores
  --store-walks
      0 = just do the walk -- don't actually store it anywhere
      1 = store walks in memory
  --walk-length
      Length of each walk
  --walks-per-node
      Number of walks to do per seed node
  --seed
      Seed for random number generator

Example Command#

# generate random features
python random-values.py 39 > chesapeake.values

# uniform random
./bin/test_rw_9.1_x86_64 --graph-type market --graph-file \
  ../../dataset/small/chesapeake.mtx --walk-mode 0 --seed 123

# greedy
./bin/test_rw_9.1_x86_64 --graph-type market --graph-file \
  ../../dataset/small/chesapeake.mtx --node-value-path chesapeake.values \
  --walk-mode 1

# stochastic greedy
./bin/test_rw_9.1_x86_64 --graph-type market --graph-file \
  ../../dataset/small/chesapeake.mtx --node-value-path chesapeake.values \
  --walk-mode 2 --seed 123
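
The random-values.py script is not reproduced in this document. Assuming the values file is simply one floating-point score per node, one score per line (an assumption based on the usage above), a minimal generator along these lines would produce a compatible file:

# Hypothetical sketch of a score generator like random-values.py:
# write one uniformly random score per node, one per line.
import random
import sys

num_nodes = int(sys.argv[1])    # e.g., 39 for chesapeake.mtx
for _ in range(num_nodes):
    print(random.uniform(0, 1))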

Example Output#

# ------------------------------------------------
# uniform random

Loading Matrix-market coordinate-formatted graph ...
Reading from ../../dataset/small/chesapeake.mtx:
  Parsing MARKET COO format
 (39 nodes, 340 directed edges)...
Done parsing (0 s).
  Converting 39 vertices, 340 directed edges ( ordered tuples) to CSR format...
Done converting (0s).
__________________________
--------------------------
 Elapsed: 0.001907
Using advance mode LB
Using filter mode CULL
num_nodes=39
__________________________
0    0   0   queue3      oversize :  234 ->  682
0    0   0   queue3      oversize :  234 ->  682
0    1   0   queue3      oversize :  682 ->  1085
0    1   0   queue3      oversize :  682 ->  1085
0    5   0   queue3      oversize :  1085 ->     1166
0    5   0   queue3      oversize :  1085 ->     1166
--------------------------
Run 0 elapsed: 4.551888, #iterations = 10
[[0, 38, 8, 35, 11, 25, 13, 27, 37, 7, ],
[1, 34, 1, 38, 30, 38, 29, 37, 7, 37, ],
[2, 17, 2, 38, 4, 38, 10, 18, 14, 28, ],
...
[36, 33, 0, 22, 38, 27, 37, 18, 38, 8, ],
[37, 21, 31, 17, 25, 17, 18, 32, 37, 26, ],
[38, 7, 8, 34, 6, 5, 6, 5, 38, 19, ]]
-------- NO VALIDATION --------
[rw] finished.
 avg. elapsed: 4.551888 ms
 iterations: 10
 min. elapsed: 4.551888 ms
 max. elapsed: 4.551888 ms
 load time: 60.925 ms
 preprocess time: 964.890000 ms
 postprocess time: 0.715017 ms
 total time: 970.350027 ms

# ------------------------------------------------
# greedy
# !! In this case, the output is formatted as `GPU_result:CPU_result`, for correctness checking

Loading Matrix-market coordinate-formatted graph ...
Reading from ../../dataset/small/chesapeake.mtx:
  Parsing MARKET COO format
 (39 nodes, 340 directed edges)...
Done parsing (0 s).
  Converting 39 vertices, 340 directed edges ( ordered tuples) to CSR format...
Done converting (0s).
__________________________
--------------------------
 Elapsed: 0.085831
Using advance mode LB
Using filter mode CULL
num_nodes=39
__________________________
0    0   0   queue3      oversize :  234 ->  682
0    0   0   queue3      oversize :  234 ->  682
0    1   0   queue3      oversize :  682 ->  770
0    1   0   queue3      oversize :  682 ->  770
--------------------------
Run 0 elapsed: 0.695944, #iterations = 10
[[0:0, 22:22, 32:32, 18:18, 11:11, 18:18, 11:11, 18:18, 11:11, 18:18, ],
[1:1, 22:22, 32:32, 18:18, 11:11, 18:18, 11:11, 18:18, 11:11, 18:18, ],
[2:2, 17:17, 2:2, 17:17, 2:2, 17:17, 2:2, 17:17, 2:2, 17:17, ],
...
[36:36, 33:33, 36:36, 33:33, 36:36, 33:33, 36:36, 33:33, 36:36, 33:33, ],
[37:37, 18:18, 11:11, 18:18, 11:11, 18:18, 11:11, 18:18, 11:11, 18:18, ],
[38:38, 2:2, 17:17, 2:2, 17:17, 2:2, 17:17, 2:2, 17:17, 2:2, ]]
0 errors occurred.
[rw] finished.
 avg. elapsed: 0.695944 ms
 iterations: 10
 min. elapsed: 0.695944 ms
 max. elapsed: 0.695944 ms
 load time: 44.2419 ms
 preprocess time: 974.721000 ms
 postprocess time: 0.731945 ms
 total time: 976.338863 ms

# ------------------------------------------------
# stochastic_greedy
# Output same format as `uniform` above.
# No correctness checking is implemented due to stochasticity.

Expected Output#

When run in --verbose mode, the app outputs the walks. When run in --quiet mode, it outputs performance statistics (e.g., total number of steps taken). If running greedy GraphSearch, the app also outputs the results of a correctness check. Correctness checks for uniform and stochastic_greedy are omitted because of their inherent stochasticity.

Validation#

The correctness of the implementation has been validated in outside experiments, by making sure that the output walks are valid and the distribution of transitions is as expected.

Performance and Analysis#

Performance is measured by the runtime of the app, given

  • an input graph G=(U, E)

  • set of seed nodes (hardcoded to all nodes in G)

  • number of walks per seed

  • number of steps per walk

  • a transition function (e.g., uniform|greedy|stochastic_greedy)

Implementation limitations#

The output of the random walk is a dense array of size (# seeds) * (steps per walk) * (walks per seed). When we have a large graph or long walks or multiple walks per seed, this array may exceed the size of GPU memory.
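
As a rough back-of-the-envelope check (assuming 4-byte vertex IDs – an assumption, not a measured figure – and 1 walk per seed, consistent with the step counts reported below), the directed Twitter configuration used in the benchmarks already implies an output array of roughly 3.7 GB:

# Rough output-array size for the Twitter configuration used below
seeds = 7199978        # all nodes of the directed Twitter graph are used as seeds
walks_per_seed = 1
steps_per_walk = 128   # --walk-length 128, as in the benchmarks below
bytes_per_id = 4       # assumption: 32-bit vertex IDs
print(seeds * walks_per_seed * steps_per_walk * bytes_per_id / 1e9)  # ~3.7 (GB)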

At the moment, we only support walks starting from all of the nodes in G. It would be straightforward to add a parameter that would allow the user to specify a smaller set of seed nodes.

This app can only be used for graphs that have scores associated w/ each node. In order to run benchmarks, if scores are not available we often assign uniformly random scores to nodes. The distribution of these scores may affect the runtime of the algorithm by changing data access patterns – we test on the provided Twitter dataset, but do not have a variety of other node-attributed graphs to test on.

Comparison against existing implementations#

We measure runtime on the HIVE graphsearch Twitter dataset. This graph has |U|=9291392 nodes and |E|=21741663 edges.

At a high level, the results show:

| Variant | OpenMP w/ 64 threads | Gunrock GPU | Gunrock Speedup |
| --- | --- | --- | --- |
| Directed greedy | 236ms | 64ms | 3.7x |
| Directed random | 158ms | 34ms | 4.6x |
| Undirected random | 3186ms | 630ms | 5.0x |

The undirected random walks take ~10x longer because directed walks terminate when they reach a node with no outgoing neighbors, so their average length is significantly shorter than the --walk-length parameter (in the tables below, the directed uniform walks take ~16.5M steps total across ~7.2M seeds, vs. ~914M steps for the undirected walks).

Details and raw data follow.

HIVE Python reference implementation#

We run the HIVE Python reference implementation w/ the following settings:

  • undirected graph

  • uniform transition function

  • 1000 random seeds

  • 128 steps per walk

With the uniform transition function, the run took 41 seconds. Walks are done sequentially, so runtime will scale linearly with the number of seeds. This implementation is substantially slower than even a single-threaded run of PNNL’s OpenMP code. Thus, we omit further analysis.

PNNL OpenMP implementation#

We run the PNNL OpenMP implementation on the Twitter graph w/ the following settings:

  • commit: 69864383f0fc0e8aace52be34b329a2f8a58afb6

  • 1,2,4,8,16,32 or 64 threads

  • greedy or uniform transition function

  • directed or undirected graph

We omit the greedy undirected case because the algorithm gets stuck jumping between a local maximum and its highest-scoring neighbor.

| threads | method | directed? | nseeds | elapsed_sec | nsteps | steps_per_sec |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | greedy | yes | 7199978 | 3.02876 | 16325873 | 5.39e+06 |
| 2 | greedy | yes | 7199978 | 2.83467 | 16325873 | 5.75e+06 |
| 4 | greedy | yes | 7199978 | 1.64405 | 16325873 | 9.93e+06 |
| 8 | greedy | yes | 7199978 | 0.870028 | 16325873 | 1.87e+07 |
| 16 | greedy | yes | 7199978 | 0.605769 | 16325873 | 2.69e+07 |
| 32 | greedy | yes | 7199978 | 0.43742 | 16325873 | 3.73e+07 |
| 64 | greedy | yes | 7199978 | 0.236701 | 16325873 | 6.89e+07 |
| 1 | unif. | yes | 7199978 | 14.6291 | 14510781 | 991915 |
| 2 | unif. | yes | 7199978 | 24.2175 | 14186833 | 585809 |
| 4 | unif. | yes | 7199978 | 25.1764 | 14487202 | 575427 |
| 8 | unif. | yes | 7199978 | 27.7312 | 13937449 | 502591 |
| 16 | unif. | yes | 7199978 | 30.5377 | 14062226 | 460488 |
| 32 | unif. | yes | 7199978 | 32.1057 | 13906144 | 433137 |
| 64 | unif. | yes | 7199978 | 31.2754 | 13876284 | 443680 |
| 1 | unif. | no | 100000 | 12.3982 | 12700000 | 1.024e+06 |
| 2 | unif. | no | 100000 | 19.7925 | 12700000 | 641658 |
| 4 | unif. | no | 100000 | 22.5432 | 12700000 | 563362 |
| 8 | unif. | no | 100000 | 26.1053 | 12700000 | 486491 |
| 16 | unif. | no | 100000 | 28.275 | 12700000 | 449160 |
| 32 | unif. | no | 100000 | 28.334 | 12700000 | 448224 |
| 64 | unif. | no | 100000 | 28.7419 | 12700000 | 441864 |

Note that we use fewer seeds for the undirected uniform case due to slow runtime.

Observe that the uniform random modes scale very poorly as a function of the number of cores. After investigation, this was due to two issues. First, the neighbors were being sampled incorrectly, which led to chaotic behavior. Second, the app was using a slow random number generator w/ an excessive number of seed resets. We created a PR to fix those issues here.

After these fixes, runtimes were as follows:

| threads | method | directed? | nseeds | elapsed_sec | nsteps | steps_per_sec |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | greedy | yes | 7199978 | 3.02876 | 16325873 | 5.39e+06 |
| 2 | greedy | yes | 7199978 | 2.83467 | 16325873 | 5.75e+06 |
| 4 | greedy | yes | 7199978 | 1.64405 | 16325873 | 9.93e+06 |
| 8 | greedy | yes | 7199978 | 0.870028 | 16325873 | 1.87e+07 |
| 16 | greedy | yes | 7199978 | 0.605769 | 16325873 | 2.69e+07 |
| 32 | greedy | yes | 7199978 | 0.43742 | 16325873 | 3.73e+07 |
| 64 | greedy | yes | 7199978 | 0.236701 | 16325873 | 6.89e+07 |
| 1 | unif. | yes | 7199978 | 1.49886 | 16529694 | 1.10e+07 |
| 2 | unif. | yes | 7199978 | 1.60176 | 16533004 | 1.03e+07 |
| 4 | unif. | yes | 7199978 | 0.974128 | 16538957 | 1.69e+07 |
| 8 | unif. | yes | 7199978 | 0.455227 | 16534756 | 3.63e+07 |
| 16 | unif. | yes | 7199978 | 0.257524 | 16528617 | 6.42e+07 |
| 32 | unif. | yes | 7199978 | 0.155722 | 13906144 | 1.06e+08 |
| 64 | unif. | yes | 7199978 | 0.158828 | 16537488 | 1.04e+08 |
| 1 | unif. | no | 7199978 | 125.963 | 914397206 | 1.92e+08 |
| 2 | unif. | no | 7199978 | 78.927 | 914397206 | 1.80e+08 |
| 4 | unif. | no | 7199978 | 39.7097 | 914397206 | 2.96e+08 |
| 8 | unif. | no | 7199978 | 22.5195 | 914397206 | 6.35e+08 |
| 16 | unif. | no | 7199978 | 11.0047 | 914397206 | 1.12e+09 |
| 32 | unif. | no | 7199978 | 5.56317 | 914397206 | 1.85e+09 |
| 64 | unif. | no | 7199978 | 3.18615 | 914397206 | 1.82e+09 |

Note the improved runtimes and scaling. These experiments were run with this branch at commit 6c25a0687eecebfd4393e86fa4c7308d5594b73d.

All experiments are conducted on the HIVE DGX-1.

Gunrock GPU implementation#

directed, greedy#
./bin/test_rw_9.1_x86_64 --graph-type market --graph-file dir_gs_twitter.mtx \
    --node-value-path gs_twitter.values \
    --walk-mode 1 \
    --walk-length 32 \
    --undirected=0 \
    --store-walks 0 \
    --quick \
    --num-runs 10
Loading Matrix-market coordinate-formatted graph ...
Reading from dir_gs_twitter.mtx:
  Parsing MARKET COO format
 (7199978 nodes, 21741663 directed edges)...
Done parsing (7 s).
  Converting 7199978 vertices, 21741663 directed edges ( ordered tuples) to CSR format...
Done converting (0s).
==============================================
 advance-mode=LB
Using advance mode LB
Using filter mode CULL
Run 0 elapsed: 65.273046, #iterations = 32
Run 1 elapsed: 64.157963, #iterations = 32
Run 2 elapsed: 64.009190, #iterations = 32
Run 3 elapsed: 64.055920, #iterations = 32
Run 4 elapsed: 64.069033, #iterations = 32
Run 5 elapsed: 64.002037, #iterations = 32
Run 6 elapsed: 64.031839, #iterations = 32
Run 7 elapsed: 64.036846, #iterations = 32
Run 8 elapsed: 64.065933, #iterations = 32
Run 9 elapsed: 64.047098, #iterations = 32
Validate_Results: total_neighbors_seen=298668024
Validate_Results: total_steps_taken=16325873
-------- NO VALIDATION --------
[rw] finished.
 avg. elapsed: 64.174891 ms
 iterations: 32
 min. elapsed: 64.002037 ms
 max. elapsed: 65.273046 ms
 load time: 7086.91 ms
 preprocess time: 1016.620000 ms
 postprocess time: 101.121902 ms
 total time: 2073.837996 ms
directed, uniform#
./bin/test_rw_9.1_x86_64 --graph-type market --graph-file dir_gs_twitter.mtx \
    --node-value-path gs_twitter.values \
    --walk-mode 0 \
    --walk-length 128 \
    --undirected=0 \
    --store-walks 0 \
    --quick \
    --num-runs 10 \
    --seed 123
Loading Matrix-market coordinate-formatted graph ...
Reading from dir_gs_twitter.mtx:
  Parsing MARKET COO format
 (7199978 nodes, 21741663 directed edges)...
Done parsing (7 s).
  Converting 7199978 vertices, 21741663 directed edges ( ordered tuples) to CSR format...
Done converting (1s).
==============================================
 advance-mode=LB
Using advance mode LB
Using filter mode CULL
__________________________
Run 0 elapsed: 38.613081, #iterations = 128
Run 1 elapsed: 34.458876, #iterations = 128
Run 2 elapsed: 34.530163, #iterations = 128
Run 3 elapsed: 33.849001, #iterations = 128
Run 4 elapsed: 33.759117, #iterations = 128
Run 5 elapsed: 33.967972, #iterations = 128
Run 6 elapsed: 33.873081, #iterations = 128
Run 7 elapsed: 33.970118, #iterations = 128
Run 8 elapsed: 33.756971, #iterations = 128
--------------------------
Run 9 elapsed: 33.715963, #iterations = 128
Validate_Results: total_neighbors_seen=289124779
Validate_Results: total_steps_taken=16530404
-------- NO VALIDATION --------
[rw] finished.
 avg. elapsed: 34.449434 ms
 iterations: 128
 min. elapsed: 33.715963 ms
 max. elapsed: 38.613081 ms
 load time: 7176.17 ms
 preprocess time: 1016.720000 ms
 postprocess time: 101.902962 ms
 total time: 1781.071901 ms
undirected, uniform#
./bin/test_rw_9.1_x86_64 --graph-type market --graph-file undir_gs_twitter.mtx \
    --node-value-path gs_twitter.values \
    --walk-mode 0 \
    --walk-length 128 \
    --store-walks 0 \
    --quick \
    --num-runs 10 \
    --seed 123
Loading Matrix-market coordinate-formatted graph ...
Reading from undir_gs_twitter.mtx:
  Parsing MARKET COO format
 (7199978 nodes, 43483326 directed edges)...
Done parsing (7 s).
  Converting 7199978 vertices, 43483326 directed edges ( ordered tuples) to CSR format...
Done converting (0s).
==============================================
 advance-mode=LB
Using advance mode LB
Using filter mode CULL
Run 0 elapsed: 636.021852, #iterations = 128
Run 1 elapsed: 631.129026, #iterations = 128
Run 2 elapsed: 631.053925, #iterations = 128
Run 3 elapsed: 631.713152, #iterations = 128
Run 4 elapsed: 631.028175, #iterations = 128
Run 5 elapsed: 631.374836, #iterations = 128
Run 6 elapsed: 631.196976, #iterations = 128
Run 7 elapsed: 632.030964, #iterations = 128
Run 8 elapsed: 631.026983, #iterations = 128
Run 9 elapsed: 630.996943, #iterations = 128
Validate_Results: total_neighbors_seen=75443835041
Validate_Results: total_steps_taken=914397206
-------- NO VALIDATION --------
[rw] finished.
 avg. elapsed: 631.757283 ms
 iterations: 128
 min. elapsed: 630.996943 ms
 max. elapsed: 636.021852 ms
 load time: 7705.9 ms
 preprocess time: 1010.830000 ms
 postprocess time: 102.057934 ms
 total time: 7755.448818 ms

Performance limitations#

  • For the undirected uniform settings, profiling shows that 79% of compute time is spent in the ForAll operator and 20% is spent in the curand random number generator. Device memory bandwidth in the ForAll kernel is 193 GB/s.

  • For the directed greedy settings, profiling shows that 99.5% of compute time is spent in the ForAll operator. Device memory bandwidth in the ForAll kernel is 136 GB/s.

When we do a large number of walks and/or the length of each walk is very long, there may not be enough GPU memory to store all of the walks in memory. For now, we expose the --store-walks parameter – when this is set to zero, the walk is discarded as it is computed and only the length of the walk is stored. A better solution that could be implemented in the future would be to move walks from GPU to CPU memory as they grow too large.

Optimization: In a directed walk, once we hit a node with no outgoing neighbors, we halt the walk. In the current Gunrock implementation, the enactor runs for a fixed number of iterations, regardless of whether any of the nodes are still active. It would be straightforward to add a check that terminates the app when no “living” nodes are left.
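
A minimal sketch of the kind of check we have in mind, continuing the Python-style pseudocode from the walk_step sketch above (INVALID is a hypothetical sentinel marking a walk that has already terminated; the actual enactor loop is C++):

INVALID = -1   # hypothetical sentinel for a terminated walk

for step in range(walk_length):
    frontier = walk_step(graph, scores, frontier, walks, step, choose_next)
    if all(node == INVALID for node in frontier):   # no "living" walks remain
        break                                       # stop the enactor loop early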

Next Steps#

Alternate approaches#

The size of the output array may become a significant bottleneck for large graphs. However, since none of the transition functions depends on anything besides the current node, we could reasonably move the results of the walk from GPU to CPU memory every N iterations. Properly executed, this should eliminate the largest bottleneck without unduly impacting performance.
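
A sketch of what that could look like on the host side, in Python-style pseudocode; run_walk_iterations, copy_to_host, and device_walk_buffer are hypothetical placeholders, not Gunrock APIs:

# Hypothetical host-side loop: flush partial walks to CPU memory
# every `flush_every` iterations so the GPU-side walk buffer stays small.
walk_length, flush_every = 128, 16                       # example values
cpu_walks = []
for start in range(0, walk_length, flush_every):
    steps = min(flush_every, walk_length - start)
    run_walk_iterations(steps)                           # advance every walk `steps` times on the GPU
    cpu_walks.append(copy_to_host(device_walk_buffer))   # drain the GPU-side walk buffer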

Gunrock implications#

For the greedy and stochastic_greedy transition functions, we have to sequentially iterate over all of a node’s neighbors. Simple wrappers for computing, e.g., the maximum of node scores across all of a node’s neighbors could be helpful, both for ease of programming and performance. Gunrock has a newly added NeighborReduce kernel that supports associative reductions – it should be straightforward to implement (at least) the greedy transition function with this kernel. The stochastic_greedy transition function would require a more complex reduction function along the lines of reservoir sampling.
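
For intuition, the greedy step can be phrased as an associative reduction over (score, neighbor) pairs, which is the shape of computation a neighbor-reduce kernel could evaluate in parallel. The following Python sketch only illustrates the reduction semantics, not the Gunrock API:

from functools import reduce

# Greedy transition expressed as a max-by-score reduction over neighbors.
def greedy_next(graph, scores, node):
    # max over (score, neighbor) pairs is associative, so it parallelizes;
    # ties break by neighbor id here, unlike the sequential loop above.
    pairs = [(scores[v], v) for v in graph.get_neighbors(node)]
    return reduce(max, pairs)[1]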

Notes on multi-GPU parallelization#

If the graph is small enough to be duplicated on each GPU, the implementation is trivial: just do a subset of the walks on each GPU. The scalability will be perfect, as there is no communication involved at all.

When the graph is distributed across multiple GPUs, we expect to have very poor scalability, as the ratio of computation to communication is very low. A more detailed discussion is available here.

Notes on dynamic graphs#

This workflow does not have an explicit dynamic component. However, because steps only depend on the current node, the underlying graph could change during the walks.

Notes on larger datasets#

The random accesses inherent to graph search make it a particularly difficult workflow for larger-than-GPU memory datasets. The most straightforward solution would be to let Unified Virtual Memory (UVM) in CUDA automatically handle memory movement, but we should expect to see a substantial reduction in performance.

Notes on other pieces of this workload#

In real use cases, the scoring function would be computed lazily – that is, we wouldn’t have a precomputed array with scores for each of the nodes, and we would need to run the scoring function as the walk is running. Thus, it would be critical for us to be able to call the scoring function from within Gunrock quickly and without excessive programmer overhead.