Groute performance vs. Gunrock

We noted with interest the PPoPP 2017 paper Groute: An Asynchronous Multi-GPU Programming Model for Irregular Computations by Tal Ben-Nun, Michael Sutton, Sreepathi Pai, and Keshav Pingali (DOI). This is really nice work, and we particularly admire their use of asynchronous execution. We expect (and show in the results below) that for high-diameter networks like road networks (e.g., europe_osm, road_usa), their approach is particularly beneficial. Gunrock's design is more targeted towards scale-free graphs (e.g., kron_g500, soc-LiveJournal1, twitter-mpi). Groute generally performs better than Gunrock on high-diameter, road-network-ish graphs, and also has an excellent connected-components implementation.

In their paper, the Groute authors compared against Gunrock 0.3.1 (released 9 November 2015), which was the most recent release at the time of paper submission (but had been updated to 0.4 by the time of camera-ready submission). Between the Gunrock 0.3.1 release and the time of Groute paper submission, the Gunrock team had made significant performance improvements to Gunrock. We ran the Groute PPoPP artifact locally to compare against two versions of Gunrock (methodology discussion).

The graphs at the bottom of the page use Gunrock 0.4 and Groute's PPoPP artifact, and reflect only Gunrock's non-direction-optimized BFS performance. In general, Gunrock's direction-optimized BFS (DOBFS) results on scale-free graphs are significantly better than its non-direction-optimized BFS results. We believe this comparison against Gunrock 0.4's BFS is the most appropriate comparison as of Groute's camera-ready submission (January 2017).

We choose two results from the Groute paper for discussion below: Gunrock's BFS on the soc-LiveJournal1 and kron21 datasets. The plots at the bottom of the page provide a fuller comparison of Gunrock 0.4 vs. Groute on 5 primitives across 5 datasets on 5 different GPUs.

Gunrock BFS on soc-LiveJournal1

The Groute paper reported Gunrock's best BFS time on soc-LiveJournal1 as 99.11 ms (Groute's Table 3) on M60 GPUs. Gunrock 0.4's BFS achieves 23.95 ms on this dataset on M60 GPUs.

For the 11 July 2016 version of Gunrock, we measure the following results on K40 and K80 GPUs (which achieve similar runtimes to M60s in our experiments):

Non-idempotent, not direction-optimized:

Idempotent, not direction-optimized:

Yuechao notes that he fixed a correctness bug in idempotence mode on 4 October 2016 (https://github.com/gunrock/gunrock/commit/23490d30fb330c984ba9cb3239838d5dbe2d155d). For our testing in idempotence mode only, we measured Gunrock versions both immediately before and immediately after this bug was fixed ("the performance differences were very small"). We believe running on any July-October Gunrock build would give similar performance results.

DOBFS

Multi-GPU DOBFS was enabled in Gunrock's BFS, and single-GPU direction-optimizing BFS was removed, as of 26 April 2016 (https://github.com/gunrock/gunrock/commit/1fbbc85ab07fcbb0d418202fcd5a77290b6df508). Gunrock's DOBFS behaves differently from Groute's (or anyone else's) BFS, which makes performance differences more challenging to explain. The following four results indicate Gunrock's single-GPU DOBFS performance:

Gunrock BFS on kron21

The Groute paper reported Gunrock's best BFS time on kron21 as 156.68 ms (Groute's Table 3) on M60 GPUs. Gunrock 0.4 achieves 19.315 ms on this dataset running BFS (not DOBFS) on M60 GPUs. If we switch to DOBFS, Gunrock 0.4 achieves 4.53 ms on one K80 GPU.
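As a quick arithmetic check on the numbers quoted above, the following minimal sketch computes the ratio between the BFS times reported in the Groute paper and the Gunrock 0.4 BFS times we measured, both on M60 GPUs. The dictionary and function names are ours, chosen for illustration; only the millisecond values come from the text. The DOBFS figure is left out on purpose, since it was measured on a K80 rather than an M60.

```python
# BFS runtimes in milliseconds on M60 GPUs, copied from the text above.
# Variable and function names are illustrative, not part of Gunrock or Groute.
reported_in_groute_paper_ms = {"soc-LiveJournal1": 99.11, "kron21": 156.68}
gunrock_04_bfs_ms = {"soc-LiveJournal1": 23.95, "kron21": 19.315}


def runtime_ratio(old_ms: float, new_ms: float) -> float:
    """Return how many times faster new_ms is than old_ms."""
    if new_ms <= 0:
        raise ValueError("runtime must be positive")
    return old_ms / new_ms


ratios = {
    name: runtime_ratio(reported_in_groute_paper_ms[name], gunrock_04_bfs_ms[name])
    for name in reported_in_groute_paper_ms
}

for name, ratio in ratios.items():
    print(f"{name}: Gunrock 0.4 BFS is {ratio:.2f}x faster than the reported time")
```

Under these numbers the ratios come out to roughly 4.1x for soc-LiveJournal1 and 8.1x for kron21.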

For the 11 July 2016 version of Gunrock, we measure the following results on K40 and K80 GPUs (which achieve similar runtimes to M60s in our experiments):

Other notes

In their paper, the Groute authors noted issues with Gunrock's accuracy on PageRank on multiple GPUs ("the evaluated version of Gunrock's multi-GPU PageRank produced incorrect results"). The Groute authors raised this issue on 17 September 2016 in a GitHub issue, to which we responded on 3 October 2016. We noted in our response at that time that the behavior is not an error in Gunrock; rather, the runs did not use the proper command-line options for the desired comparison.

  • We believe that properly setting command-line parameters will allow several datasets to run to completion (for both Gunrock and B40C) where the Groute paper instead reported errors or found them un-runnable.
    • For example, -src=randomize lets B40C run the kron21 dataset properly; without a randomized source, B40C (unsuccessfully) tries to find a source that reaches more than 5 edges. Here, -src=randomize results in a timing of 5.49 ms, and --src=randomize --num-gpus=4 --undirected -i=32 gives 17.49 ms for the average runtime.
    • For Gunrock + twitter, -queue-sizing=0.1 -device=0,1,2,3 allows a successful run (735.13 ms); market /data/gunrock_dataset/huge/twitter-mpi/twitter-mpi.mtx --device=0,1,2,3 --queue-sizing=0.1 --idempotence --src=randomize2 -iteration-num=32 measures 342.21 ms.
    • For Gunrock + connected components, we found Gunrock ran properly on kron21 even on a single GPU with no command-line parameters, and should be able to run on more GPUs as well (we successfully tested up to 4xK40c).

We note that Groute's circular work list overflowed on Tesla K40c for some PageRank runs with the twitter and kron datasets (the reported error was "circular worklist has overflowed, please allocate more memory"). We haven't yet worked out the right command-line switch to allocate more memory for this case, although we're sure this is a simple fix.

Full performance comparison

The following plot compares Gunrock 0.4 with Groute's PPoPP artifact and shows all GPUs on a single plot. Per-GPU plots are broken out on individual pages here: [ Tesla P100 | Tesla K40c | Tesla K40m | Tesla K80 | Tesla M60 ]

Source data, with links to the output JSON for each run