
    Gunrock: GPU Graph Analytics

    Gunrock is a CUDA library for graph processing designed specifically for the GPU. It uses a high-level, bulk-synchronous, data-centric abstraction focused on operations on vertex or edge frontiers. Gunrock achieves a balance between performance and expressiveness by coupling high-performance GPU computing primitives and optimization strategies, particularly in the area of fine-grained load balancing, with a high-level programming model that allows programmers to quickly develop new graph primitives that scale from one to many GPUs on a node, with small code size and minimal GPU programming knowledge. For more details, see Gunrock's Overview.

    Service | System | Environment | Status
    Jenkins | Ubuntu 18.04.2 LTS | CUDA 10.1, NVIDIA Driver 418.39, GCC/G++ 7.3 | Build Status

    Quick Start Guide

    Before building Gunrock, make sure you have CUDA 7.5 or higher (CUDA 9 or higher recommended) installed on your Linux system. We also support building Gunrock in Docker images using the Dockerfiles provided under the docker subdirectory. For the complete build guide, see Building Gunrock.

    git clone --recursive https://github.com/gunrock/gunrock/
    cd gunrock
    mkdir build && cd build
    cmake .. && make -j$(nproc)
    make test
    

    Getting Started with Gunrock

    Gunrock is copyright The Regents of the University of California, 2013–2019. The library, examples, and all source code are released under Apache 2.0.

    Overview

    What is Gunrock?

    Gunrock is a stable, powerful, and forward-looking substrate for GPU-based graph-centric research and development. Like many graph frameworks, it leverages a bulk-synchronous programming model and targets iterative convergent graph computations. We believe that today Gunrock offers both the best performance on GPU graph analytics and the widest range of primitives.

    Who may use Gunrock?

    Why use Gunrock?

    What does Gunrock not do?

    Programming Model

    This page describes the programming model we use in Gunrock.

    Gunrock targets graph computations that are generally expressed as "iterative convergent processes". By "iterative," we mean operations that may require running a series of steps repeatedly; by "convergent," we mean that these iterations allow us to approach the correct answer and terminate when that answer is reached. Many graph-computation programming models target a similar goal.

    Many of these programming models focus on sequencing steps of computation. Gunrock differs from these programming models in its focus on manipulating a data structure. We call this data structure a frontier of vertices or edges. The frontier represents the subset of vertices or edges that is actively participating in the computation. Gunrock operators input one or more frontiers and output one or more frontiers.

    Generically, graph operations can often be expressed via a push abstraction (graph elements "push" local private updates into a shared state) or a pull abstraction (graph elements "pull" updates into their local private state) (Besta et al.'s publication on push vs. pull, HPDC '17). Gunrock's programming model supports both of these abstractions. (For instance, Gunrock's direction-optimized breadth-first search (BFS) supports both push and pull BFS phases; Mini-Gunrock supports pull-based BFS and PageRank (PR).) Push-based approaches may or may not require synchronization (such as atomics) for correct operation; this depends on the primitive. Gunrock's idempotence optimization (within its BFS implementation) is an example of a push-based primitive that does not require atomics.
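    To make the push and pull abstractions concrete, here is a minimal host-side sketch (plain C++ over a CSR/CSC graph; this is not Gunrock's actual API or its GPU kernels) of a BFS-like distance update expressed both ways. On the GPU, the push variant's shared-state write would typically need an atomic minimum, while the pull variant only writes each vertex's own state.

    // Illustrative sketch of push vs. pull frontier updates (not Gunrock's API).
    // The graph is stored in CSR (out-edges) and CSC (in-edges) form.
    #include <algorithm>
    #include <climits>
    #include <vector>

    // Push: each frontier vertex writes ("pushes") updates into shared state.
    // On the GPU this write would typically be an atomicMin to stay correct.
    void push_step(const std::vector<int>& row_offsets,
                   const std::vector<int>& col_indices,
                   const std::vector<int>& frontier,
                   std::vector<int>& dist) {
      for (int v : frontier)
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
          int u = col_indices[e];
          dist[u] = std::min(dist[u], dist[v] + 1);  // update shared state
        }
    }

    // Pull: each unvisited vertex reads ("pulls") from its incoming neighbors
    // and updates only its own private state, so no atomics are needed.
    void pull_step(const std::vector<int>& col_offsets,
                   const std::vector<int>& row_indices,
                   std::vector<int>& dist) {
      for (int v = 0; v < static_cast<int>(dist.size()); ++v) {
        if (dist[v] != INT_MAX) continue;  // already visited
        for (int e = col_offsets[v]; e < col_offsets[v + 1]; ++e) {
          int u = row_indices[e];
          if (dist[u] != INT_MAX) dist[v] = std::min(dist[v], dist[u] + 1);
        }
      }
    }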

    Operators

    In the current Gunrock release, we support five operators.

    Advance

    Filter

    Compute
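    Putting these operators together (again a schematic, sequential sketch rather than Gunrock's actual operator signatures or load-balanced GPU kernels), one BFS iteration can be expressed as an advance that expands the current frontier to its neighbors, followed by a filter that keeps only newly discovered vertices while a small compute step assigns their BFS depth:

    // Schematic advance -> filter (+ per-vertex compute) for one BFS iteration.
    // Names and signatures are illustrative, not Gunrock's actual API.
    #include <vector>

    using Frontier = std::vector<int>;

    // Advance: expand each frontier vertex to its neighbors (CSR traversal).
    Frontier advance(const std::vector<int>& row_offsets,
                     const std::vector<int>& col_indices,
                     const Frontier& input) {
      Frontier output;
      for (int v : input)
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e)
          output.push_back(col_indices[e]);
      return output;
    }

    // Filter: keep vertices that pass a predicate (here: not yet labeled);
    // the per-vertex "compute" assigns the BFS depth as a side effect.
    Frontier filter(const Frontier& input, std::vector<int>& label, int depth) {
      Frontier output;
      for (int v : input)
        if (label[v] < 0) {    // unvisited
          label[v] = depth;    // compute: assign BFS depth
          output.push_back(v);
        }
      return output;
    }

    // One iteration: frontier = filter(advance(row_offsets, col_indices, frontier), label, depth);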

    Gunrock's Application Cases

    The following is a wide variety of graph primitives implemented within Gunrock, including traversal-based (breadth-first search, single-source shortest path), node-ranking (HITS, SALSA, PageRank), and global (connected components, minimum spanning tree) primitives.

    The "Directory" column in the table below shows, which subdirectory within gunrock/app and examples these applications' implementations are. The version number in the "Single GPU" and "Multi-GPU" columns show which API abstraction of gunrock supports the respective application. If you are interested in helping us port an application in the older abstraction (v0.5.x) to a newer, much cleaner abstraction (v1.x.x), please see our Porting Guide.

    Application | Directory | Single GPU | Multi-GPU
    A* Search | astar | v0.5.x | -
    Betweenness Centrality | bc | v1.x.x | v0.5.x
    Breadth-First Search | bfs | v1.x.x | v1.x.x
    Connected Components | cc | v1.x.x | v0.5.x
    Graph Coloring | color | v1.x.x | -
    Geolocation | geo | v1.x.x | -
    RMAT Graph Generator | grmat | v0.5.x | -
    Graph Trend Filtering | gtf | v1.x.x | -
    Hyperlink-Induced Topic Search | hits | v1.x.x | -
    K-Nearest Neighbors | knn | v1.x.x | -
    Louvain Modularity | louvain | v1.x.x | -
    Label Propagation | lp | v0.5.x | -
    MaxFlow | mf | v1.x.x | -
    Minimum Spanning Tree | mst | v0.5.x | -
    PageRank | pr | v1.x.x | v0.5.x
    Local Graph Clustering | pr_nibble | v1.x.x | -
    Graph Projections | proj | v1.x.x | -
    Random Walk | rw | v1.x.x | -
    GraphSAGE | sage | v1.x.x | -
    Stochastic Approach for Link-Structure Analysis | salsa | v0.5.x | -
    Subgraph Matching | sm | v1.x.x | -
    Shared Nearest Neighbors | snn | v1.x.x | -
    Scan Statistics | ss | v1.x.x | -
    Single Source Shortest Path | sssp | v1.x.x | v0.5.x
    Triangle Counting | tc | v1.x.x | -
    Top K | topk | v0.5.x | -
    Vertex Nomination | vn | v1.x.x | -
    Who To Follow | wtf | v0.5.x | -

    Building Gunrock

    Gunrock's current release has been tested on Ubuntu 16.04 and 18.04 with CUDA 9+, compute capability 3.0+, and g++ 4.8+. We expect Gunrock to build and run correctly on other 64-bit and 32-bit Linux distributions as well as Mac OS X. We have an active issue investigating problems related to building Gunrock on Windows.

    Installation

    Prerequisites

    Required Dependencies:

    Optional Dependencies:

    Compilation

    Simple Gunrock Compilation:

    Downloading gunrock

    # Using git, recursively clone gunrock
    git clone --recursive https://github.com/gunrock/gunrock
    # Or, using wget, download a gunrock archive
    wget --no-check-certificate https://github.com/gunrock/gunrock/archive/master.zip
    

    Compiling gunrock

    cd gunrock
    mkdir build && cd build
    cmake ..
    make
    

    Advanced Gunrock Compilation:

    You can also compile Gunrock with more specific/advanced settings using cmake -D[OPTION]=ON/OFF. For example, the following builds only the BFS application:

    mkdir build && cd build
    cmake -DGUNROCK_BUILD_APPLICATIONS=OFF -DGUNROCK_APP_BFS=ON ..
    make
    

    Generating Datasets

    All dataset-related code is under the dataset subdirectory. The current version of Gunrock only supports Matrix Market coordinate-formatted graphs. The datasets are divided into two categories according to their scale. Under the dataset/small/ subdirectory are trivial graph datasets for testing the correctness of the graph primitives; all of them are ready to use. Under the dataset/large/ subdirectory are large graph datasets for performance regression tests.
    * To download all of them to your local machine, just type make in the dataset/large/ subdirectory.
    * To download only one specific dataset, step into that dataset's subdirectory and type make there.

    Hardware

    Laboratory-tested hardware: Gunrock with the GTX 970, Tesla K40s, GTX 1080, Tesla P100, RTX 2070, Tesla V100, and other NVIDIA cards. We have not encountered any trouble in-house with devices of CUDA compute capability >= 3.0.

    Methodology

    Methodology for Graph Analytics Performance

    We welcome comments from others on the methodology that we use for measuring Gunrock's performance.

    Currently, Gunrock is a library that requires no preprocessing. By this we mean that Gunrock inputs graphs in a "standard" format, e.g., compressed sparse row or coordinate, such as those available on common graph repositories (SNAP or SuiteSparse (UF)). In our experiments, we use MatrixMarket format.

    Other graph libraries may benefit from preprocessing of input datasets. We would regard any manipulation of the input dataset (e.g., reordering the input or more sophisticated preprocessing such as graph coloring or CuSha's G-Shards) to be preprocessing. We think preprocessing is an interesting future direction for Gunrock, but have not yet investigated it. We hope that any graph libraries that do preprocessing report results with both preprocessed and unmodified input datasets.

    (That being said, we do standardize input graphs in two ways: before running our experiments, we remove self-loops/duplicated edges. If the undirected flag is set, we convert the input graph to undirected. When we do so, that implies one edge in each direction, and we report edges for that graph accordingly. What we do here appears to be standard practice.)
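    As a sketch of this standardization step (assuming the graph arrives as a coordinate-format edge list; this is not Gunrock's actual loader code), the cleanup drops self-loops and duplicate edges and, when the undirected flag is set, stores one edge in each direction:

    // Sketch of input standardization on a coordinate-format edge list.
    #include <set>
    #include <utility>
    #include <vector>

    using Edge = std::pair<int, int>;

    std::vector<Edge> standardize(const std::vector<Edge>& edges, bool undirected) {
      std::vector<Edge> out;
      std::set<Edge> seen;
      for (const auto& edge : edges) {
        int u = edge.first, v = edge.second;
        if (u == v) continue;                       // drop self-loops
        if (!seen.insert({u, v}).second) continue;  // drop duplicate edges
        out.push_back({u, v});
        if (undirected && seen.insert({v, u}).second)
          out.push_back({v, u});                    // one edge in each direction
      }
      return out;
    }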

    In general, we try to report results in two ways:

    To calculate TEPS, we require the number of edges traversed (touched), which we count dynamically. For traversal primitives, we note that non-connected components will not be visited, so the number of visited edges may be fewer than the number of edges in the graph. We note that precisely counting edges during the execution of a particular primitive may have performance implications, so we may approximate (see BFS).
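    In equation form, the traversal rate we report (a restatement of the definition above, in millions of traversed edges per second) is:

    \text{MTEPS} = \frac{\text{edges traversed (counted dynamically)}}{\text{elapsed time (s)} \times 10^{6}}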

    Notes on some of the Gunrock primitives follow.

    Breadth-First Search (BFS)

    When we count the number of edges traversed, we do so by summing the number of outbound edges for each visited vertex. For forward, non-idempotent BFS, this strategy should give us an exact count, since this strategy visits every edge incident to a visited vertex. When we enable idempotence, we may visit a node more than once and hence may visit an edge more than once. For backward (pull) BFS, when we visit a vertex, we count all edges incoming to that vertex even if we find a visited predecessor before traversing all edges (and terminate early). (To do so otherwise has performance implications.) Enterprise uses the same counting strategy.
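    The counting strategy above amounts to summing out-degrees over visited vertices; a host-side sketch (the actual count is gathered on the GPU during the run) looks like the following:

    // Count traversed edges as the sum of out-degrees over visited vertices,
    // matching the forward, non-idempotent BFS counting strategy described above.
    #include <cstdint>
    #include <vector>

    int64_t edges_traversed(const std::vector<int>& row_offsets,
                            const std::vector<int>& bfs_depth) {  // -1 = unvisited
      int64_t count = 0;
      for (size_t v = 0; v + 1 < row_offsets.size(); ++v)
        if (bfs_depth[v] >= 0)
          count += row_offsets[v + 1] - row_offsets[v];  // out-degree of v
      return count;
    }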

    If a comparison library does not measure MTEPS for BFS, we compute it by the number of edges visited divided by runtime; if the former is not available, we use Gunrock's edges-visited count.

    Single Source Shortest Path (SSSP)

    In general we find MTEPS comparisons between different approaches to SSSP not meaningful: because an edge may be visited one or many times, there is no standard way to count edges traversed. Different algorithms may not only visit a very different number of edges (Dijkstra vs. Bellman-Ford will have very different edge visit counts) but may also have a different number of edges visited across different invocations of the same primitive.

    When we report Gunrock's SSSP MTEPS, we use the number of edges queued as the edge-traversal count.

    To have a meaningful SSSP experiment, it is critical to have varying edge weights. SSSP measured on uniform edge weights is not meaningful (it becomes BFS). In our experiments, we set edge weights randomly/uniformly between 1 and 64.
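    A minimal sketch of that weight assignment (uniform random integers in [1, 64]; the particular random-number generator here is an illustrative choice, not something Gunrock prescribes):

    // Assign uniform random integer edge weights in [1, 64] for SSSP experiments.
    #include <cstddef>
    #include <random>
    #include <vector>

    std::vector<int> random_edge_weights(std::size_t num_edges, unsigned seed = 0) {
      std::mt19937 gen(seed);
      std::uniform_int_distribution<int> weight(1, 64);
      std::vector<int> weights(num_edges);
      for (auto& w : weights) w = weight(gen);
      return weights;
    }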

    Betweenness Centrality (BC)

    If a comparison library does not measure MTEPS for BC, we compute it by twice the number of edges visited in the forward phase divided by runtime (the same computation we use for Gunrock).
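    Written out, the BC rate we report is:

    \text{MTEPS}_{\text{BC}} = \frac{2 \times \text{edges visited in the forward phase}}{\text{elapsed time (s)} \times 10^{6}}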

    PageRank (PR)

    We measure PageRank elapsed time on one iteration of PageRank. (Many other engines measure results this way, and it is difficult to extrapolate from this measurement to the runtime of the entire algorithm.)

    Results and Analysis

    We are gradually adding summaries of our results to these web pages (please let us know if you would like other comparisons). These summaries also include a table of results along with links to the configuration and results of each individual run. We detail our methodology for our measurements here.

    For reproducibility, we maintain Gunrock configurations and results in our github gunrock/io repository.

    We are happy to run experiments with other engines, particularly if those engines output results in our JSON format or in a format that can easily be parsed into it.

    Publications

    Muhammad Osama, Minh Truong, Carl Yang, Aydin Buluç and John D. Owens. Graph Coloring on the GPU. In Proceedings of the 33rd IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW '19, pages 231–240, May 2019. [DOI | http]

    Yuechao Pan, Roger Pearce, and John D. Owens. Scalable Breadth-First Search on a GPU Cluster. In Proceedings of the 31st IEEE International Parallel and Distributed Processing Symposium, IPDPS 2018, May 2018. [http]

    Yangzihao Wang, Yuechao Pan, Andrew Davidson, Yuduo Wu, Carl Yang, Leyuan Wang, Muhammad Osama, Chenshan Yuan, Weitang Liu, Andy T. Riffel, and John D. Owens. Gunrock: GPU Graph Analytics. ACM Transactions on Parallel Computing, 4(1):3:1–3:49, August 2017. [DOI | http]

    Yuechao Pan, Yangzihao Wang, Yuduo Wu, Carl Yang, and John D. Owens. Multi-GPU Graph Analytics. In Proceedings of the 31st IEEE International Parallel and Distributed Processing Symposium, IPDPS 2017, pages 479–490, May/June 2017. [DOI | http]

    Yangzihao Wang, Sean Baxter, and John D. Owens. Mini-Gunrock: A Lightweight Graph Analytics Framework on the GPU. In Graph Algorithms Building Blocks, GABB 2017, pages 616–626, May 2017. [DOI | http]

    Leyuan Wang, Yangzihao Wang, Carl Yang, and John D. Owens. A Comparative Study on Exact Triangle Counting Algorithms on the GPU. In Proceedings of the 1st High Performance Graph Processing Workshop, HPGP '16, pages 1–8, May 2016. [DOI | http]

    Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D. Owens. Gunrock: A High-Performance Graph Processing Library on the GPU. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '16, pages 11:1–11:12, March 2016. Distinguished Paper. [DOI | http]

    Yuduo Wu, Yangzihao Wang, Yuechao Pan, Carl Yang, and John D. Owens. Performance Characterization for High-Level Programming Models for GPU Graph Analytics. In IEEE International Symposium on Workload Characterization, IISWC-2015, pages 66–75, October 2015. Best Paper finalist. [DOI | http]

    Carl Yang, Yangzihao Wang, and John D. Owens. Fast Sparse Matrix and Sparse Vector Multiplication Algorithm on the GPU. In Graph Algorithms Building Blocks, GABB 2015, pages 841–847, May 2015. [DOI | http]

    Afton Geil, Yangzihao Wang, and John D. Owens. WTF, GPU! Computing Twitter's Who-To-Follow on the GPU. In Proceedings of the Second ACM Conference on Online Social Networks, COSN '14, pages 63–68, October 2014. [DOI | http]

    Presentations

    GrAPL 2019, Graph Coloring on the GPU, May 2019. [slides]

    GTC 2018, Latest Development of the Gunrock Graph Processing Library on GPUs, March 2018. [slides | video]

    GTC 2018, Writing Graph Primitives with Gunrock, March 2018. [slides | video]

    GTC 2016, Gunrock: A Fast and Programmable Multi-GPU Graph Processing Library, April 2016. [slides]

    NVIDIA webinar, April 2016. [slides]

    GPU Technology Theater at SC15, Gunrock: A Fast and Programmable Multi-GPU Graph Processing Library, November 2015. [slides | video]

    GTC 2014, High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock, March 2014. [slides | video]

    Road Map

    We have three principal near-term priorities for Gunrock development (as of June 2019):

    We also continue to pursue projects in scalability (within a node, across nodes, and across the CPU-GPU boundary); in programmability, including working with the MIT GraphIt team; in improving Gunrock's internals, both their performance and their modularity; in expanding the number and coverage of applications written in Gunrock; and in better interoperability with other tools, languages, and libraries.

    Possible Gunrock projects

    Possible projects are in two categories: infrastructure projects that make Gunrock better but have minimal research value, and research projects that are longer-term and hopefully have research implications of use to the community.

    For any discussion of these, please use the existing GitHub issue (or create one).

    Infrastructure projects

    Research projects

    Gunrock Developers

    Frequently Asked Questions

    These are some of the most common questions we have encountered during the life of the Gunrock project. If your question isn't already answered below, feel free to create an issue on GitHub.

    What does it do?

    Gunrock is a fast and efficient GPU graph processing library that provides a set of high-performance graph algorithms used in big-data analytics and visualization. It also provides a set of operators that abstract the general operations in graph processing, so that other developers can build high-performance graph algorithm prototypes with minimal programming effort.

    How does it do it?

    Gunrock takes advantage of the immense computational power available in commodity, off-the-shelf Graphics Processing Units (GPUs), originally designed to handle the parallel computational tasks of computer graphics, to perform graph traversal and computation in parallel across a GPU's thousands of computing cores.

    Who should want this?

    Gunrock is built with two kinds of users in mind. The first are programmers who build big graph analytics and visualization projects and need the existing graph primitives provided by Gunrock. The second are programmers who want to use Gunrock's high-level, programmable abstraction to express, develop, and refine their own (and often more complicated) graph primitives.

    When would Gunrock be a bad choice?

    If your graph is too small, it's a bad fit for Gunrock (and the GPU). For most workloads, a GPU isn't a good fit until the graph reaches at least a few thousand nodes. Conversely, if your graph is too large and doesn't fit into GPU memory (or, for single-node multi-GPU configurations, into the aggregate GPU memories of all GPUs on the node), it's also a bad fit for Gunrock. Finally, Gunrock also does not currently support multi-node (distributed) execution, although Gunrock would probably be a good single-node component of a future distributed graph framework. Both the out-of-core (scale-up) and multi-node (scale-out) problems are excellent research programs.
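    As a rough back-of-the-envelope check for the "too large" case (assuming 4-byte vertex indices and 4-byte edge values in CSR form; Gunrock's per-primitive arrays add to this, so this is only a lower bound):

    // Rough lower bound on CSR graph storage, assuming 4-byte indices/values.
    // Per-primitive arrays (labels, distances, frontiers, ...) come on top.
    #include <cstdint>

    int64_t csr_bytes(int64_t num_vertices, int64_t num_edges, bool has_edge_values) {
      int64_t bytes = (num_vertices + 1) * 4;       // row offsets
      bytes += num_edges * 4;                       // column indices
      if (has_edge_values) bytes += num_edges * 4;  // edge weights
      return bytes;
    }
    // Example: 50M vertices and 1.5B weighted edges is roughly 12 GB before any
    // per-primitive state, which is already near the capacity of a 16 GB GPU.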

    What is the skill set users need to use it?

    For the first kind of user, a C/C++ background is sufficient. We are also building Gunrock as a shared library with C interfaces that can be loaded by other languages such as Python and Julia. The second kind of user needs a C/C++ background as well as an understanding of parallel programming, especially the bulk-synchronous parallel (BSP) model used by Gunrock.

    What platforms/languages do people need to know in order to modify or integrate it with other tools?

    Using the exposed interface, users do not need to know CUDA to modify Gunrock or integrate it into their own tools. However, a basic understanding of parallel programming and of the BSP model is necessary if one wants to add or modify graph primitives in Gunrock.

    Why would someone want this?

    The study of social networks, webgraphs, biological networks, and unstructured meshes in scientific simulation has raised a significant demand for efficient parallel frameworks for processing and analytics on large-scale graphs. Initial research efforts in using GPUs for graph processing and analytics are promising.

    How is it better than the current state of the art?

    Gunrock delivers best-in-class performance at a low cost with a high-level, productive, flexible programming model that enables writing a large number of graph applications. GPUs are becoming ubiquitous not just on the desktop but also in the cloud; Gunrock is competitive with much larger and more expensive distributed graph frameworks at a much lower cost. It generally outperforms the best CPU frameworks across its application suite, and among GPU frameworks, enables both high performance and high productivity.

    How would someone get it?

    Gunrock is an open-source library. The code, documentation, and quick start guide are all on its GitHub page.

    Is a user account required?

    No. One can use either git clone or download directly to get the source code and documentation of Gunrock.

    Are all of its components/dependencies easy to find?

    Gunrock has two dependencies, both of which are GPU primitive libraries that also reside on GitHub. Neither requires installation: one only needs to download or git clone them and put them in the appropriate directories. More details are in the installation section of this documentation.

    How would someone install it?

    For a C/C++ programmer, integrating Gunrock into your projects is easy. Since it is a template-based library, you only need to add its include files to your code. The simple example and all the test rigs provide detailed information on how to do this.

    For programmers who use Python, Julia, or another language and want to call Gunrock APIs, we are building a shared library with binary-compatible C interfaces. It will be included in the upcoming next release of Gunrock.

    Can anyone install it? Do they need IT help?

    Gunrock is targeted at developers who are familiar with basic software engineering. For non-technical people, IT help might be needed.

    Does this process actually work? All the time? On all systems specified?

    Currently, Gunrock has been tested on two Linux distributions: Linux Mint and Ubuntu. But we expect it to run correctly on other Linux distributions too. We expect a Mac build would work (it has in the past), but don't currently have a Mac+NVIDIA machine on which we can test. Windows is not currently supported (we welcome pull requests that would allow us to support Windows).

    How would someone test that it's working with provided sample data?

    Test rigs are provided, as well as a small simple example, for users to test the correctness and performance of every graph primitive.

    Is the "using" of sample data clear?

    On Linux, one only needs to go to the dataset directory and run make; the script will automatically download all the needed datasets. One can also choose to download a single dataset from within its own subdirectory.

    How would someone use it with their own data?

    Gunrock supports the Matrix Market (.mtx) file format; users need to pre-process their graph data into this format before running Gunrock.
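    For reference, a minimal Matrix Market coordinate file for a small undirected graph looks like this (the %%MatrixMarket header and the size line, giving vertices, vertices, and edges, are required; vertex indices are 1-based):

    %%MatrixMarket matrix coordinate pattern symmetric
    % 4 vertices, 4 undirected edges
    4 4 4
    1 2
    2 3
    3 4
    4 1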

    Release Notes

    Please see our release notes and current releases.

    Acknowledgments

    Thanks to the following developers who contributed code: The connected-component implementation was derived from code written by Jyothish Soman, Kothapalli Kishore, and P. J. Narayanan and described in their IPDPSW '10 paper A Fast GPU Algorithm for Graph Connectivity (DOI). The breadth-first search implementation and many of the utility functions in Gunrock are derived from the b40c library of Duane Merrill. The algorithm is described in his PPoPP '12 paper Scalable GPU Graph Traversal (DOI). Thanks to Erich Elsen and Vishal Vaidyanathan from Royal Caliber and the Onu Team for their discussion on library development and the dataset auto-generating code. Thanks to Adam McLaughlin for his technical discussion. Thanks to Oded Green for his technical discussion and an optimization in the CC primitive. Thanks to the Altair and Vega-lite teams in the Interactive Data Lab at the University of Washington for graphing help. We appreciate the technical assistance, advice, and machine access from many colleagues at NVIDIA: Chandra Cheij, Joe Eaton, Michael Garland, Mark Harris, Ujval Kapasi, David Luebke, Duane Merrill, Josh Patterson, Nikolai Sakharnykh, and Cliff Woolley.

    This work was funded by the DARPA HIVE program under AFRL Contract FA8650-18-2-7835, the DARPA XDATA program under AFRL Contract FA8750-13-C-0002, by NSF awards OAC-1740333, CCF-1629657, OCI-1032859, and CCF-1017399, by DARPA STTR award D14PC00023, and by DARPA SBIR award W911NF-16-C-0020. Our XDATA principal investigator was Eric Whyne of Data Tactics Corporation, and our DARPA program manager is Mr. Wade Shen (since 2015); before that it was Dr. Christopher White (2012–2014). Thanks to Chris, Wade, and DARPA business manager Gabriela Araujo for their support during the XDATA program.