
    Gunrock: GPU Graph Analytics

    Gunrock is a CUDA library for graph-processing designed specifically for the GPU. It uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on a vertex or edge frontier. Gunrock achieves a balance between performance and expressiveness by coupling high performance GPU computing primitives and optimization strategies with a high-level programming model that allows programmers to quickly develop new graph primitives with small code size and minimal GPU programming knowledge.

    For more information, please visit our website, read Why Gunrock and our TOPC 2017 paper Gunrock: GPU Graph Analytics, look at our results, and find more details in our publications. See Release Notes to keep up with our latest changes.

    Gunrock is featured on NVIDIA's list of GPU Accelerated Libraries as the only external library for GPU graph analytics.

    Service: Jenkins
    System: Ubuntu 16.04.4 LTS
    Environment: CUDA 10.0, NVIDIA Driver 410.73, GCC/G++ 5.4, Boost 1.58.0


    git clone --recursive
    cd gunrock
    mkdir build
    cd build
    cmake ..
    make -j$(nproc)
    make test


    Getting Started with Gunrock

    Results and Analysis

    We are gradually adding summaries of our results to these web pages (please let us know if you would like other comparisons). These summaries also include a table of results along with links to the configuration and results of each individual run. We detail our methodology for our measurements here.

    For reproducibility, we maintain Gunrock configurations and results in our GitHub gunrock/io repository.

    We are happy to run experiments with other engines, particularly if those engines output results in our JSON format or in a format that can easily be parsed into it.

    Reporting Problems

    To report Gunrock bugs or request features, please file an issue directly on GitHub.


    Publications

    Yuechao Pan, Roger Pearce, and John D. Owens. Scalable Breadth-First Search on a GPU Cluster. In Proceedings of the 32nd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2018, May 2018. [http]

    Yangzihao Wang, Yuechao Pan, Andrew Davidson, Yuduo Wu, Carl Yang, Leyuan Wang, Muhammad Osama, Chenshan Yuan, Weitang Liu, Andy T. Riffel, and John D. Owens. Gunrock: GPU Graph Analytics. ACM Transactions on Parallel Computing, 4(1):3:1–3:49, August 2017. [DOI | http]

    Yuechao Pan, Yangzihao Wang, Yuduo Wu, Carl Yang, and John D. Owens. Multi-GPU Graph Analytics. In Proceedings of the 31st IEEE International Parallel and Distributed Processing Symposium, IPDPS 2017, pages 479–490, May/June 2017. [DOI | http]

    Yangzihao Wang, Sean Baxter, and John D. Owens. Mini-Gunrock: A Lightweight Graph Analytics Framework on the GPU. In Graph Algorithms Building Blocks, GABB 2017, pages 616–626, May 2017. [DOI | http]

    Leyuan Wang, Yangzihao Wang, Carl Yang, and John D. Owens. A Comparative Study on Exact Triangle Counting Algorithms on the GPU. In Proceedings of the 1st High Performance Graph Processing Workshop, HPGP '16, pages 1–8, May 2016. [DOI | http]

    Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D. Owens. Gunrock: A High-Performance Graph Processing Library on the GPU. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '16, pages 11:1–11:12, March 2016. Distinguished Paper. [DOI | http]

    Yuduo Wu, Yangzihao Wang, Yuechao Pan, Carl Yang, and John D. Owens. Performance Characterization for High-Level Programming Models for GPU Graph Analytics. In IEEE International Symposium on Workload Characterization, IISWC-2015, pages 66–75, October 2015. Best Paper finalist. [DOI | http]

    Carl Yang, Yangzihao Wang, and John D. Owens. Fast Sparse Matrix and Sparse Vector Multiplication Algorithm on the GPU. In Graph Algorithms Building Blocks, GABB 2015, pages 841–847, May 2015. [DOI | http]

    Afton Geil, Yangzihao Wang, and John D. Owens. WTF, GPU! Computing Twitter's Who-To-Follow on the GPU. In Proceedings of the Second ACM Conference on Online Social Networks, COSN '14, pages 63–68, October 2014. [DOI | http]


    Presentations

    GTC 2018, Latest Development of the Gunrock Graph Processing Library on GPUs, March 2018. [slides | video]

    GTC 2018, Writing Graph Primitives with Gunrock, March 2018. [slides | video]

    GTC 2016, Gunrock: A Fast and Programmable Multi-GPU Graph Processing Library, April 2016. [slides]

    NVIDIA webinar, April 2016. [slides]

    GPU Technology Theater at SC15, Gunrock: A Fast and Programmable Multi-GPU Graph Processing Library, November 2015. [slides | video]

    GTC 2014, High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock, March 2014. [slides | video]

    Gunrock Developers


    Thanks to the following developers who contributed code: The connected-component implementation was derived from code written by Jyothish Soman, Kothapalli Kishore, and P. J. Narayanan and described in their IPDPSW '10 paper A Fast GPU Algorithm for Graph Connectivity (DOI). The breadth-first search implementation and many of the utility functions in Gunrock are derived from the b40c library of Duane Merrill. The algorithm is described in his PPoPP '12 paper Scalable GPU Graph Traversal (DOI). Thanks to Erich Elsen and Vishal Vaidyanathan from Royal Caliber and the Onu Team for their discussion on library development and the dataset auto-generating code. Thanks to Adam McLaughlin for his technical discussion. Thanks to Oded Green for his technical discussion and an optimization in the CC primitive. Thanks to the Altair and Vega-lite teams in the Interactive Data Lab at the University of Washington for graphing help. We appreciate the technical assistance, advice, and machine access from many colleagues at NVIDIA: Chandra Cheij, Joe Eaton, Michael Garland, Mark Harris, Ujval Kapasi, David Luebke, Duane Merrill, Josh Patterson, Nikolai Sakharnykh, and Cliff Woolley.

    This work was funded by the DARPA HIVE program under AFRL Contract FA8650-18-2-7835, the DARPA XDATA program under AFRL Contract FA8750-13-C-0002, by NSF awards OAC-1740333, CCF-1629657, OCI-1032859, and CCF-1017399, by DARPA STTR award D14PC00023, and by DARPA SBIR award W911NF-16-C-0020. Our XDATA principal investigator was Eric Whyne of Data Tactics Corporation, and our DARPA program managers were Dr. Christopher White (2012–2014) and then Mr. Wade Shen (since 2015). Thanks to Chris, Wade, and DARPA business manager Gabriela Araujo for their support during the XDATA program.

    Gunrock is copyright The Regents of the University of California, 2013–2018. The library, examples, and all source code are released under Apache 2.0.

    Building Gunrock

    Gunrock's current release has been tested on Linux Mint 15 (64-bit), Ubuntu 12.04, 14.04 and 15.10 with CUDA 7.5 installed, compute architecture 3.0 and g++ 4.8. We expect Gunrock to build and run correctly on other 64-bit and 32-bit Linux distributions, Mac OS, and Windows.



    Required Dependencies:

    Optional Dependencies:


    Simple Gunrock Compilation:

    Downloading gunrock

    # Using git (recursively) download gunrock
    git clone --recursive
    # Using wget to download gunrock
    wget --no-check-certificate

    Compiling gunrock

    cd gunrock
    mkdir build && cd build
    cmake ..

    Advanced Gunrock Compilation:

    You can also compile Gunrock with more specific/advanced settings using cmake -D[OPTION]=ON/OFF. The following options are available:

    Example of compiling Gunrock with only the Breadth-First Search (BFS) primitive:

    mkdir build && cd build
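Assuming the CMake option names GUNROCK_BUILD_APPLICATIONS and GUNROCK_APP_BFS (verify the options your Gunrock version actually exposes, e.g. with cmake -LH), the configure step might look like:

```shell
# Disable building all primitives, then re-enable only BFS
# (option names are assumptions; check `cmake -LH ..` for your version)
cmake -DGUNROCK_BUILD_APPLICATIONS=OFF -DGUNROCK_APP_BFS=ON ..
make -j$(nproc)
```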

    Generating Datasets

    All dataset-related code is under the gunrock/dataset/ subdirectory. The current version of Gunrock only supports the Matrix Market coordinate graph format. The datasets are divided into two categories according to their scale. Under the dataset/small/ subdirectory, there are trivial graph datasets for testing the correctness of the graph primitives. All of them are ready to use. Under the dataset/large/ subdirectory, there are large graph datasets for doing performance regression tests.

    • To download them to your local machine, just type make in the dataset/large/ subdirectory.
    • You can also choose to download only one specific dataset by stepping into the subdirectory of that dataset and typing make inside it.


    Laboratory-tested hardware: Gunrock has been tested on GeForce GTX 970 and Tesla K40 GPUs. We have not encountered any trouble in-house with devices of CUDA compute capability >= 3.0.

    Why Gunrock?

    Gunrock is a stable, powerful, and forward-looking substrate for GPU-based graph-centric research and development. Like many graph frameworks, it leverages a bulk-synchronous programming model and targets iterative convergent graph computations. We believe that today Gunrock offers both the best performance on GPU graph analytics as well as the widest range of primitives.

    Methodology for Graph Analytics Performance

    We welcome comments from others on the methodology that we use for measuring Gunrock's performance.

    Currently, Gunrock is a library that requires no preprocessing. By this we mean that Gunrock inputs graphs in a "standard" format, e.g., compressed sparse row or coordinate, such as those available on common graph repositories (SNAP or SuiteSparse (UF)). In our experiments, we use MatrixMarket format.

    Other graph libraries may benefit from preprocessing of input datasets. We would regard any manipulation of the input dataset (e.g., reordering the input or more sophisticated preprocessing such as graph coloring or CuSha's G-Shards) to be preprocessing. We think preprocessing is an interesting future direction for Gunrock, but have not yet investigated it. We hope that any graph libraries that do preprocessing report results with both preprocessed and unmodified input datasets.

    (That being said, we do standardize input graphs in two ways: before running our experiments, we remove self-loops/duplicated edges. If the undirected flag is set, we convert the input graph to undirected. When we do so, that implies one edge in each direction, and we report edges for that graph accordingly. What we do here appears to be standard practice.)

    In general, we try to report results in two ways:

    To calculate TEPS, we require the number of edges traversed (touched), which we count dynamically. For traversal primitives, we note that components not connected to the source will not be visited, so the number of visited edges may be fewer than the number of edges in the graph. We note that precisely counting edges during the execution of a particular primitive may have performance implications, so we may approximate (see BFS).

    Notes on specific primitives follow.


    BFS

    When we count the number of edges traversed, we do so by summing the number of outbound edges for each visited vertex. For forward, non-idempotent BFS, this strategy should give us an exact count, since this strategy visits every edge incident to a visited vertex. When we enable idempotence, we may visit a node more than once and hence may visit an edge more than once. For backward (pull) BFS, when we visit a vertex, we count all edges incoming to that vertex even if we find a visited predecessor before traversing all edges (and terminate early). (To do so otherwise has performance implications.) Enterprise uses the same counting strategy.

    If a comparison library does not measure MTEPS for BFS, we compute it as the number of edges visited divided by runtime; if that library's edges-visited count is not available, we use Gunrock's.


    SSSP

    In general, we find MTEPS comparisons between different approaches to SSSP not meaningful: because an edge may be visited one or many times, there is no standard way to count edges traversed. Different algorithms may not only visit a very different number of edges (Dijkstra vs. Bellman-Ford will have very different edge-visit counts) but may also have a different number of edges visited across different invocations of the same primitive.

    When we report Gunrock's SSSP MTEPS, we use the number of edges queued as the edge-traversal count.

    To have a meaningful SSSP experiment, it is critical to have varying edge weights. SSSP measured on uniform edge weights is not meaningful (it becomes BFS). In our experiments, we set edge weights randomly/uniformly between 1 and 64.


    BC

    If a comparison library does not measure MTEPS for BC, we compute it as twice the number of edges visited in the forward phase divided by runtime (the same computation we use for Gunrock).


    PageRank

    We measure PageRank elapsed time on one iteration of PageRank. (Many other engines measure results this way, and it is difficult to extrapolate from this measurement to the runtime of the entire algorithm.)

    Programming Model

    This page describes the programming model we use in Gunrock.

    Gunrock targets graph computations that are generally expressed as "iterative convergent processes". By "iterative," we mean operations that may require running a series of steps repeatedly; by "convergent," we mean that these iterations allow us to approach the correct answer and terminate when that answer is reached. Many graph-computation programming models target a similar goal.

    Many of these programming models focus on sequencing steps of computation. Gunrock differs from these programming models in its focus on manipulating a data structure. We call this data structure a frontier of vertices or edges. The frontier represents the subset of vertices or edges that is actively participating in the computation. Gunrock operators input one or more frontiers and output one or more frontiers.

    Generically, graph operations can often be expressed via a push abstraction (graph elements "push" local private updates into a shared state) or a pull abstraction (graph elements "pull" updates into their local private state) (Besta et al. publication on push-vs.-pull, HPDC '17). Gunrock's programming model supports both of these abstractions. (For instance, Gunrock's direction-optimized breadth-first-search supports both push and pull BFS phases. Mini-Gunrock supports pull-based BFS and PR.) Push-based approaches may or may not require synchronization (such as atomics) for correct operation; this depends on the primitive. Gunrock's idempotence optimization (within its BFS implementation) is an example of a push-based primitive that does not require atomics.


    In the current Gunrock release, we support four operators.

    We note that compute operators can often be fused with a neighboring operator into a single kernel. This increases producer-consumer locality and improves performance. Thus within Gunrock, we express compute operators as "functors", which are automatically merged into their neighboring operators. Within Gunrock, we express functors in one of two flavors:

    Creating a New Graph Primitive

    To create a new graph primitive, we first put all the problem-specific data into a data structure. For BFS, we need a per-node label value and a per-node predecessor value; for CC, we need a per-edge mark value, a per-node component id value, etc. Then we map the algorithm into a combination of these operators. Next, we need to write different functors for these operators. Some graph algorithms require only one functor (BFS), but some graph algorithms need more (CC needs seven). Finally, we write an enactor to load the proper operator with the proper functor. We provide a graph primitive template. The problem, functor, and enactor files are under gunrock/app/sample, and the driver code is under tests/sample.

    Git Forking Workflow

    We are transitioning Gunrock from the Git branching workflow suggested by Vincent Driessen at nvie to a Git forking workflow.

    How the Forking Workflow Works

    As in the other Git workflows, the Forking Workflow begins with an official public repository stored on a server. But when a new developer wants to start working on the project, they do not directly clone the official repository.

    Instead, they fork the official repository to create a copy of it on the server. This new copy serves as their personal public repository—no other developers are allowed to push to it, but they can pull changes from it (we’ll see why this is important in a moment). After they have created their server-side copy, the developer performs a git clone to get a copy of it onto their local machine. This serves as their private development environment, just like in the other workflows.

    When they're ready to publish a local commit, they push the commit to their own public repository—not the official one. Then, they file a pull request with the main repository, which lets the project maintainer know that an update is ready to be integrated. The pull request also serves as a convenient discussion thread if there are issues with the contributed code.

    To integrate the feature into the official codebase, the maintainer pulls the contributor's changes into their local repository, checks to make sure it doesn't break the project, merges it into their local master branch, then pushes the master branch to the official repository on the server. The contribution is now part of the project, and other developers should pull from the official repository to synchronize their local repositories.

    Gunrock's Forking Workflow:



    Note that transitioning to this type of workflow from the branching model doesn't require much effort; we just have to start working on our forks and creating pull requests against a single dev branch.

    How to contribute?

    GoogleTest for Gunrock

    Recommended Read: Introduction: Why Google C++ Testing Framework?

    When writing a good test, we would like to cover all possible functions (or execute all lines of code). What I recommend is to write a simple test, run code coverage on it, and use the coverage report to determine which lines are not executed. This gives you a good idea of what needs to be in the test and what you are missing.

    What is code coverage?

    Code coverage is a measurement used to express which lines of code were executed by a test suite. We use three primary terms to describe each line's coverage:

    • hit indicates that the source code was executed by the test suite.
    • partial indicates that the source code was not fully executed by the test suite; there are remaining branches that were not executed.
    • miss indicates that the source code was not executed by the test suite.

    Coverage is the ratio hit / (hit + partial + miss). A code base that has 5 of its 12 lines executed by tests will receive a coverage ratio of 41% (rounding down).

    Below is an example of what lines are a hit and a miss; you can target the lines missed in the tests to improve coverage.

    Example CodeCov Stats

    Example Test Using GoogleTest

    /**
     * @brief BFS test for shared library advanced interface
     * @file test_lib_bfs.h
     */
    // Includes required for the test
    #include <stdio.h>
    #include "gunrock/gunrock.h"
    #include "gmock/gmock.h"
    #include "gtest/gtest.h"
    // Add to gunrock's namespace
    namespace gunrock {
    /* Test function; the test suite in this case is
     * sharedlibrary and the test itself is breadthfirstsearch */
    TEST(sharedlibrary, breadthfirstsearch) {
        struct GRTypes data_t;                 // data type structure
        data_t.VTXID_TYPE = VTXID_INT;         // vertex identifier
        data_t.SIZET_TYPE = SIZET_INT;         // graph size type
        data_t.VALUE_TYPE = VALUE_INT;         // attributes type
        int srcs[3] = {0, 1, 2};
        struct GRSetup *config = InitSetup(3, srcs);   // gunrock configurations
        int num_nodes = 7, num_edges = 15;  // number of nodes and edges
        int row_offsets[8]  = {0, 3, 6, 9, 11, 14, 15, 15};
        int col_indices[15] = {1, 2, 3, 0, 2, 4, 3, 4, 5, 5, 6, 2, 5, 6, 6};
        struct GRGraph *grapho = (struct GRGraph*)malloc(sizeof(struct GRGraph));
        struct GRGraph *graphi = (struct GRGraph*)malloc(sizeof(struct GRGraph));
        graphi->num_nodes   = num_nodes;
        graphi->num_edges   = num_edges;
        graphi->row_offsets = (void*)&row_offsets[0];
        graphi->col_indices = (void*)&col_indices[0];
        gunrock_bfs(grapho, graphi, config, data_t);
        int *labels = (int*)grapho->node_value1;  // labels computed by gunrock_bfs
        // IMPORTANT: the expected output is stored in an array to compare
        // against the generated labels, determining if the test passed or failed
        int result[7] = {2147483647, 2147483647, 0, 1, 1, 1, 2};
        for (int i = 0; i < graphi->num_nodes; ++i) {
            // IMPORTANT: compare the expected result with the generated labels
            EXPECT_EQ(labels[i], result[i]) << "Vectors x and y differ at index " << i;
        }
        if (labels) free(labels);
        if (graphi) free(graphi);
        if (grapho) free(grapho);
    }
    }  // namespace gunrock
    1. Create a test_<primitive>.h file and place it in the appropriate directory inside /path/to/gunrock/tests/. I will be using test_lib_bfs.h as an example.

    2. In the tests/test.cpp file, add your test file as an include: #include "bfs/test_lib_bfs.h".

    3. In your test_<primitive>.h file, create a TEST() function, which takes two parameters: TEST(<test_suite_name>, <test_name>).

    4. Use EXPECT and ASSERT to write the actual test itself. I have provided a commented example above.

    5. Now when you run the binary called unit_test, it will automatically run your test suite along with all the other Google tests. This binary is automatically compiled when Gunrock is built, and is found in /path/to/builddir/bin/unit_test.

    Final Remarks:

    Road Map

    Possible Gunrock projects

    Possible projects are in two categories: infrastructure projects that make Gunrock better but have minimal research value, and research projects that are longer-term and hopefully have research implications of use to the community.

    For any discussion on these, please use the existing Github issue (or make one).

    Infrastructure projects

    Research projects

    Gunrock Release Notes

    Gunrock release 0.5 is a feature (minor) release that adds:

    v0.5 Changelog

    All notable changes to gunrock for v0.5 are documented below:





    Known Issues:

    Frequently Asked Questions

    These are some of the most common questions we have come across during the life of the Gunrock project. If your question isn't already answered below, feel free to create an issue on GitHub.

    What does it do?

    Gunrock is a fast and efficient GPU graph processing library that provides a set of high-performance graph algorithms used in big data analytics and visualization. It also provides a set of operators that abstract the general operations in graph processing, letting other developers build high-performance graph algorithm prototypes with minimal programming effort.

    How does it do it?

    Gunrock takes advantage of the immense computational power available in commodity-level, off-the-shelf Graphics Processing Units (GPUs), originally designed to handle the parallel computational tasks in computer graphics, to perform graph traversal and computation in parallel across a GPU's thousands of computing cores.

    Who should want this?

    Gunrock is built with two kinds of users in mind: The first kind of users are programmers who build big graph analytics and visualization projects and need to use existing graph primitives provided by Gunrock. The second kind of users are programmers who want to use Gunrock's high-level, programmable abstraction to express, develop, and refine their own (and often more complicated) graph primitives.

    What is the skill set users need to use it?

    For the first kind of users, a C/C++ background is sufficient. We are also building Gunrock as a shared library with C interfaces that can be loaded by other languages such as Python and Julia. The second kind of users need the C/C++ background plus an understanding of parallel programming, especially the Bulk Synchronous Parallel (BSP) model used by Gunrock.

    What platforms/languages do people need to know in order to modify or integrate it with other tools?

    Using the exposed interface, users do not need to know CUDA or OpenCL to modify or integrate Gunrock into their own tools. However, an essential understanding of parallel programming and of the BSP model is necessary if one wants to add or modify graph primitives in Gunrock.

    Why would someone want this?

    The study of social networks, webgraphs, biological networks, and unstructured meshes in scientific simulation has raised a significant demand for efficient parallel frameworks for processing and analytics on large-scale graphs. Initial research efforts in using GPUs for graph processing and analytics are promising.

    How is it better than the current state of the art?

    Most existing CPU libraries for large-graph processing perform poorly on graphs with billions of edges. Supercomputers or expensive clusters can deliver near-real-time feedback, but at high hardware cost. With GPUs, we can achieve the same real-time feedback at much lower hardware cost. Gunrock has the best performance among the (still limited) research efforts toward GPU graph processing. Our peak traversal rate can reach 3.5 billion edges traversed per second (TEPS), and all the primitives in Gunrock show 10x to 25x speedups over equivalent single-node CPU implementations. With a set of general graph processing operators exposed to users, Gunrock is also more flexible than other GPU/CPU graph libraries in terms of programmability.

    How would someone get it?

    Gunrock is an open-source library. The code, documentation, and quick start guide are all on its GitHub page.

    Is a user account required?

    No. One can use either git clone or download directly to get the source code and documentation of Gunrock.

    Are all of its components/dependencies easy to find?

    Gunrock has three dependencies. Two of them are GPU primitive libraries that also reside on GitHub; the third is Boost (Gunrock uses the Boost Graph Library to implement CPU reference-testing algorithms). None of the dependencies requires installation: simply download or git clone them and put them in the corresponding directories. More details are in the installation section of this documentation.

    How would someone install it?

    For C/C++ programmers, integrating Gunrock into your project is easy. Since it is a template-based library, you just include its headers in your code. The simple example and all the test rigs provide detailed information on how to do this.

    For programmers who use Python, Julia, or other languages and want to call Gunrock APIs, we are building a shared library with binary-compatible C interfaces. It will be included in the next release of Gunrock.

    Can anyone install it? Do they need IT help?

    Gunrock is targeted at developers who are familiar with basic software engineering. For non-technical people, IT help might be needed.

    Does this process actually work? All the time? On all systems specified?

    Currently, Gunrock has been tested on two Linux distributions, Linux Mint and Ubuntu, but we expect it to run correctly on other Linux distributions as well. We are currently building a CMake solution to port Gunrock to Mac and Windows; this will be included in the next release of Gunrock.

    How would someone test that it's working with provided sample data?

    Test rigs are provided, as well as a small simple example, for users to test the correctness and performance of every graph primitive.

    Is the "using" of sample data clear?

    On Linux, one only needs to go to the dataset directory and run "make"; the script will automatically download all the needed datasets. One can also choose to download a single dataset in its separate directory.

    How would someone use it with their own data?

    Gunrock supports Matrix Market (.mtx) file format; users need to pre-process the graph data into this format before running Gunrock.
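For illustration, a minimal Matrix Market coordinate file might look like the following (a hypothetical toy graph; the first non-comment line gives the number of rows, columns, and entries, and each subsequent line is a 1-indexed "row column" edge):

```
%%MatrixMarket matrix coordinate pattern general
% toy example: 4 vertices, 4 directed edges
4 4 4
1 2
1 3
2 3
3 4
```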