GraphSAGE#
GraphSAGE is a way to fit graphs into a neural network: instead of computing a vertex's embedding from the features of all its neighbors, as conventional implementations do, GraphSAGE samples some 1-hop neighbors and some 2-hop neighbors connected to those 1-hop neighbors, and computes the embedding from the features of the sampled 1-hop and 2-hop neighbors. The embedding can be considered a vector of hash-like values describing the interesting properties of a vertex.
During the training process, the adjacency matrix and the features of the nodes in the graph are fed into a neural network. The parameters (the W arrays in the algorithm) are updated after each batch according to the difference between the predicted labels and the real labels. The per-vertex features do not change, but the parameters do, so the GraphSAGE computation needs to be performed for each batch. Ultimately this embedding step should connect to the training process to complete the workflow. However, the training part is pure matrix operations, and the year 1 deliverable focuses only on the graph-related portion, which is the GraphSAGE implementation.
Summary of Results#
The vertex embedding part of the GraphSAGE algorithm is implemented in the Gunrock framework using custom CUDA kernels that exploit block-level parallelism for shorter running times. For the embedding part alone, the GPU implementation is 7.5X to 15X faster on a P100, and 20X to 30X faster on a V100, than an OpenMP implementation using 32 threads. These custom kernels achieve high utilization of the GPU hardware, especially the memory system. It is still unclear how to expose block-level parallelism for more general usage in other Gunrock applications.
Connecting the vertex embedding with the neural network training part, and making the GraphSAGE workflow complete, would be an interesting task for year 2. Testing the complete workflow for prediction accuracy and running speed would be more meaningful.
Summary of Gunrock Implementation#
Gunrock’s implementation for the 1st year covers only the embedding / inference phase, without the training phase. It is based on Algorithm 2 with K=2 of the GraphSAGE paper (“Inductive Representation Learning on Large Graphs”, https://arxiv.org/abs/1706.02216).
Given a graph G, the inputs are per-vertex features, weight matrices W^k, and a non-linear activation function (ReLU); the output from GraphSAGE is the embedding vector of each vertex. The current Gunrock implementation selects neighbors uniformly at random from the neighborhood, and simple changes to the code can enable other selection methods, such as weighted uniform, importance sampling (used by FastGCN), or a random-walk probability as in DeepWalk or Node2Vec (used by PinSage). An aggregator is a function that accumulates data from the selected neighbors; the current implementation uses the Mean aggregator, and it can easily be changed to other accumulation functions.
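As a concrete illustration of the uniform selection described above, a minimal device-side sketch follows. This is not Gunrock's actual code: the array names (row_offsets, column_indices) simply denote a CSR graph, and the fall-back to the source vertex for zero-degree neighborhoods mirrors the dangling-vertex behavior noted later in this report. Swapping the uniform pick for a weighted or random-walk-based choice would give the other selection methods.

```cuda
// Illustrative sketch only, not Gunrock's SelectNeighbor.
#include <curand_kernel.h>

__device__ int SelectNeighborUniform(
    const int   *row_offsets,     // CSR row pointers, length V+1
    const int   *column_indices,  // CSR neighbor lists
    int          v,               // vertex whose neighbor we sample
    curandState *state)           // per-thread RNG state
{
    int start  = row_offsets[v];
    int degree = row_offsets[v + 1] - start;
    if (degree == 0) return v;    // dangling vertex: fall back to itself
    // Uniform random pick; replace this line for weighted / random-walk sampling.
    int offset = curand(state) % degree;
    return column_indices[start + offset];
}
```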
The pseudocode of Gunrock’s implementation follows. The current implementation uses custom CUDA kernels, because block-level parallelism is critical for fast running speed; all other functionality, such as memory management, load and store routines, block-level parallel reduction, and graph accesses (get degree, get neighbors, etc.), is provided by the framework or the utility functions of Gunrock. The GraphSAGE algorithm uses B2, B1 and B0 for the sources, the 1-hop neighbors and the 2-hop neighbors; for easier reading of the code, we instead use sources, children and leafs for these three groups of vertices. To manage the memory usage of intermediate data, source vertices are processed in batches of size B, one batch at a time.
source_start := 0;
While (source_start < V)
    num_sources := (source_start + B > V) ? V - source_start : B;

    // Kernel1: pick children and leafs
    For children# from 0 to num_children_per_source x num_sources - 1:
        source := source_start + (children# / num_children_per_source);
        child := SelectNeighbor(source);
        leafs_feature := {0};
        For i from 1 to num_leafs_per_child:
            leaf := SelectNeighbor(child);
            leafs_feature += feature[leaf];
        children[children#] := child;
        leafs_features[children#] := leafs_feature;

    // Kernel2: Child-centric computation
    For children# from 0 to num_children_per_source x num_sources - 1:
        child := children[children#];
        child_temp := ReLu(concatenate(
            feature[child] x Wf1,
            Mean(leafs_features[children#]) x Wa1));
        child_temp /= sqrt(sum(child_temp^2));
        children_temp[children# / num_children_per_source] += child_temp;
        children_features[children# / num_children_per_source] += feature[child];

    // Kernel3: Source-centric computation
    For source# from 0 to num_sources - 1:
        source := source_start + source#;
        source_temp := ReLu(concatenate(
            feature[source] x Wf1,
            Mean(children_features[source#]) x Wa1));
        source_temp /= sqrt(sum(source_temp^2));
        result := ReLu(concatenate(
            source_temp x Wf2,
            Mean(children_temp[source#]) x Wa2));
        result /= sqrt(sum(result^2));
        source_embedding[source#] := result;

    Move source_embedding from GPU to CPU;
    source_start += B;
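To show how the block-level parallelism maps onto the CUDA programming model, below is a minimal sketch in the spirit of Kernel2, not the actual Gunrock kernel: one thread block per child, one thread per vector element, and a block-level reduction (cub::BlockReduce) to obtain the sum of squares needed for the L2 normalization. The input array here stands in for the ReLu/concatenate result, and all names and layouts are illustrative assumptions.

```cuda
#include <cub/cub.cuh>

// One block normalizes one child's vector; threads each own one element.
template <int BLOCK_SIZE>
__global__ void NormalizeChildVectors(
    const float *child_vectors,  // [num_children x dim], pre-normalization results
    float       *child_temps,    // [num_children x dim], normalized outputs
    int          dim)            // vector length, assumed <= BLOCK_SIZE
{
    typedef cub::BlockReduce<float, BLOCK_SIZE> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;
    __shared__ float norm;

    int child = blockIdx.x;      // one block per child
    int elem  = threadIdx.x;     // one thread per element
    float val = (elem < dim) ? child_vectors[child * dim + elem] : 0.0f;

    // Block-wide sum of squares; the result is valid in thread 0 only.
    float sum_sq = BlockReduce(temp_storage).Sum(val * val);
    if (elem == 0) norm = fmaxf(sqrtf(sum_sq), 1e-12f);  // guard against zero vectors
    __syncthreads();

    if (elem < dim)
        child_temps[child * dim + elem] = val / norm;    // L2 normalization
}

// Example launch: NormalizeChildVectors<128><<<num_children, 128>>>(d_in, d_out, 64);
```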
How To Run This Application on DARPA’s DGX-1#
Prereqs/input#
CUDA should have been installed, and $PATH and $LD_LIBRARY_PATH should have been set correctly to use CUDA. The current Gunrock configuration assumes that Boost (1.58.0 or 1.59.0) and METIS are installed; if not, changes need to be made in the Makefiles. DARPA’s DGX-1 had both installed when the tests were performed.
git clone --recursive https://github.com/gunrock/gunrock/ \
-b dev-refactor
cd gunrock
git submodule init
git submodule update
mkdir build
cd build
cmake ..
cd ../tests/sage
make
At this point, there should be an executable test_sage_<CUDA version>_x86_64 in tests/sage/bin.
The datasets are assumed to have been placed in /raid/data/hive and converted to the proper Matrix Market format (.mtx). At the time of testing, pokec, amazon, flickr, twitter and cit-Patents were available in that directory.
Note that GraphSAGE is an inductive representation learning algorithm, so it is reasonable to assume that there are no dangling vertices in the graph. Before running GraphSAGE, please remove dangling vertices from the graph; a sketch of one way to do this is shown below. If dangling vertices are present, each dangling vertex will be treated as its own neighbor.
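A minimal host-side sketch of one way to drop zero-degree (dangling) vertices from a CSR graph before feeding it to GraphSAGE; the function and variable names are illustrative, not Gunrock utilities.

```cuda
#include <vector>

// Compacts a CSR graph by removing zero-degree vertices.
// old_to_new maps original vertex IDs to compacted IDs, or -1 if dropped.
void RemoveDanglingVertices(
    const std::vector<int> &row_offsets,
    const std::vector<int> &column_indices,
    std::vector<int> &new_row_offsets,
    std::vector<int> &new_column_indices,
    std::vector<int> &old_to_new)
{
    int num_vertices = (int)row_offsets.size() - 1;
    old_to_new.assign(num_vertices, -1);

    // Keep only vertices that have at least one neighbor.
    int new_v = 0;
    for (int v = 0; v < num_vertices; v++)
        if (row_offsets[v + 1] > row_offsets[v])
            old_to_new[v] = new_v++;

    new_row_offsets.assign(new_v + 1, 0);
    new_column_indices.clear();
    for (int v = 0; v < num_vertices; v++) {
        if (old_to_new[v] < 0) continue;
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; e++) {
            int u = column_indices[e];
            if (old_to_new[u] >= 0)  // also drop edges pointing at dangling vertices
                new_column_indices.push_back(old_to_new[u]);
        }
        new_row_offsets[old_to_new[v] + 1] = (int)new_column_indices.size();
    }
}
```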
The testing is done with Gunrock’s dev-refactor branch at commit 0ed72d5 (Oct. 25, 2018), using CUDA 9.2 with NVIDIA driver 390.30 on a Tesla P100 GPU in the DGX-1 machine, and using CUDA 10.0 with NVIDIA driver 410.66 on a Tesla V100 GPU in Bowser (a machine used by the Gunrock team at UC Davis).
Running the application#
Application specific parameters#
The W arrays used by GraphSAGE should be produced by the training process; since the training part is not in place at the moment, we use given data files, or randomly generate the arrays if such files are not available. The array input files contain floating-point values in plain-text format; examples can be found in the gunrock/app/sage directory of the gunrock repo.
--Wa1 : std::string, default =
<weight matrix for W^1 matrix in algorithm 2, aggregation part>
dimension 64 by 128 for pokec;
It should be leaf feature length by a value you want for W2 layer
--Wa1-dim1 : int, default = 128
Wa1 matrix column dimension
--Wa2 : std::string, default =
<weight matrix for W^2 matrix in algorithm 2, aggregation part>
dimension 256 by 128 for pokec;
It should be child_temp length by output length
--Wa2-dim1 : int, default = 128
Wa2 matrix column dimension
--Wf1 : std::string, default =
<weight matrix for W^1 matrix in algorithm 2, feature part>
dimension 64 by 128 for pokec;
It should be child feature length by a value you want for W2 layer
--Wf1-dim1 : int, default = 128
Wf1 matrix column dimension
--Wf2 : std::string, default =
<weight matrix for W^2 matrix in algorithm 2, feature part>
dimension 256 by 128 for pokec;
It should be source_temp length by output length
--Wf2-dim1 : int, default = 128
Wf2 matrix column dimension
--feature-column : std::vector<int>, default = 64
feature column dimension
--features : std::string, default =
<features matrix>
dimension |V| by 64 for pokec;
--omp-threads : int, default = 32
number of threads to run CPU reference
--num-children-per-source : std::vector<int>, default = 10
number of sampled children per source
--num-leafs-per-child : std::vector<int>, default = -1
number of sampled leafs per child; default is the same as num-children-per-source
--batch-size : std::vector<int>, default = 65536
number of source vertices to process in one iteration
Example Command#
./bin/test_sage_9.2_x86_64 \
market /raid/data/hive/pokec/pokec.mtx --undirected \
--Wf1 ../../gunrock/app/sage/wf1.txt \
--Wa1 ../../gunrock/app/sage/wa1.txt \
--Wf2 ../../gunrock/app/sage/wf2.txt \
--Wa2 ../../gunrock/app/sage/wa2.txt \
--features ../../gunrock/app/sage/features.txt \
--num-runs=10 \
--batch-size=16384
When Wf1, Wa1, Wf2, Wa2, or features are not available, random values are used. The embeddings may not be useful at all, but the memory access patterns and the computation workload are the same regardless of whether the random inputs are used. Using random inputs enables us to test on any graph without requiring the associated input arrays; a sketch of how such placeholder files could be generated is shown below.
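If external placeholder files are preferred over the internally generated random values, a small generator along the following lines could be used. This is an assumption: it simply writes whitespace-separated floats, and the exact layout the parser expects should be checked against the example files in gunrock/app/sage.

```cuda
#include <cstdio>
#include <cstdlib>

// Writes a rows x cols matrix of random floats in [0, 1) as plain text.
void WriteRandomMatrix(const char *filename, int rows, int cols)
{
    FILE *f = std::fopen(filename, "w");
    if (f == NULL) return;
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++)
            std::fprintf(f, "%f ", std::rand() / (float)RAND_MAX);
        std::fprintf(f, "\n");
    }
    std::fclose(f);
}

int main()
{
    WriteRandomMatrix("wf1.txt", 64, 128);  // e.g. Wf1 for feature length 64
    return 0;
}
```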
Output#
The outputs are in the sage directory; look for the .txt files. An example for the pokec dataset is shown below:
Loading Matrix-market coordinate-formatted graph ...
... Loading progress ...
Converting 1632803 vertices, 44603928 directed edges ( ordered tuples) to CSR format...Done (0s).
Degree Histogram (1632803 vertices, 44603928 edges):
Degree 0: 0 (0.000000 %)
Degree 2^0: 163971 (10.042301 %)
Degree 2^1: 201604 (12.347111 %)
Degree 2^2: 238757 (14.622523 %)
Degree 2^3: 273268 (16.736128 %)
Degree 2^4: 298404 (18.275567 %)
Degree 2^5: 265002 (16.229882 %)
Degree 2^6: 148637 (9.103180 %)
Degree 2^7: 38621 (2.365319 %)
Degree 2^8: 4089 (0.250428 %)
Degree 2^9: 418 (0.025600 %)
Degree 2^10: 23 (0.001409 %)
Degree 2^11: 3 (0.000184 %)
Degree 2^12: 3 (0.000184 %)
Degree 2^13: 3 (0.000184 %)
==============================================
feature-column=64 num-children-per-source=10 num-leafs-per-child=-1
Computing reference value ...
==============================================
rand-seed = 1540498523
==============================================
CPU Reference elapsed: 25242.941406 ms.
Embedding validation: PASS
==============================================
batch-size=512
Using randomly generated Wf1
Using randomly generated Wa1
Using randomly generated Wf2
Using randomly generated Wa2
Using randomly generated features
Using advance mode LB
Using filter mode CULL
rand-seed = 1540498553
==============================================
==============================================
Run 0 elapsed: 2047.366858 ms, #iterations = 3190
... More runs
==============================================
==============================================
Run 9 elapsed: 2013.216972 ms, #iterations = 3190
Embedding validation: PASS
[Sage] finished.
avg. elapsed: 2042.503190 ms
iterations: 3190
min. elapsed: 2009.524107 ms
max. elapsed: 2243.710995 ms
load time: 1600.46 ms
preprocess time: 3997.880000 ms
postprocess time: 2106.487989 ms
total time: 26634.360075 ms
There is 1 OpenMP reference run on the CPU for each combination of {feature-column, num-children-per-source, num-leafs-per-child}, with the timing reported after CPU Reference elapsed. There are 10 GPU runs for each combination of the previous three parameters plus batch-size. The computation workload of the GPU runs is the same as that of the reference CPU run, just with different batch sizes. The GPU timing is reported after Run x elapsed:, and the average running time of the 10 GPU runs is reported after avg. elapsed.
The mathematical formula of the GraphSAGE algorithm is relatively simple and repeats itself three times in the form Normalize(C x Wf + Mean(D) x Wa). Because of this simplicity, it is still possible to verify the implementation by visually inspecting the code. The resulting embeddings are also checked for their L2 norm, which should be close to 1 for every vertex (a minimal sketch of such a check is given at the end of this section). Because the neighbor selection process is inherently random, it would be very difficult to do a number-by-number comparison with other implementations, including the reference. A more meaningful regression test would look at the training or validation accuracy once the full workflow is completed, which is a possibility for future work in year 2.
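As an illustration of that norm check, a minimal host-side sketch is given below; it is not the harness's actual validator, and the flat row-major layout of the embeddings is an assumption.

```cuda
#include <cmath>
#include <cstdio>

// Returns true if every embedding has an L2 norm within `tolerance` of 1.
bool CheckEmbeddingNorms(const float *embeddings, int num_vertices, int dim,
                         float tolerance)
{
    for (int v = 0; v < num_vertices; v++) {
        float sum_sq = 0.0f;
        for (int i = 0; i < dim; i++) {
            float x = embeddings[v * dim + i];
            sum_sq += x * x;
        }
        float norm = std::sqrt(sum_sq);
        if (std::fabs(norm - 1.0f) > tolerance) {
            std::printf("vertex %d has L2 norm %f\n", v, norm);
            return false;
        }
    }
    return true;
}
```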
Performance and Analysis#
The OpenMP and GPU implementations are measured for runtime. It would additionally be useful to validate the prediction accuracy of the whole pipeline once the training process is in place (perhaps in year 2).
The datasets used for experiments are:
| Dataset | \|V\| | \|E\| |
|---|---|---|
| flickr | 105938 | 4633896 |
| amazon | 548551 | 1851744 |
| pokec | 1632803 | 44603928 |
| cit-Patents | 3774768 | 33037894 |
| twitter | 7199978 | 43483326 |
| europe-osm | 50912018 | 108109320 |
The running times in milliseconds are listed below for both machines. F stands for the feature length, C stands for the number of children per source (the same as the number of leafs per child), and B denotes the batch size that produces the shortest GPU running time among {512, 1024, 2048, 4096, 8192, 16384}.
The running time on DGX-1 with Tesla P100 GPU:
| Dataset | F | C | B | Gunrock GPU (ms) | OpenMP (ms) | Speedup |
|---|---|---|---|---|---|---|
| flickr | 64 | 10 | 16384 | 113.431 | 1607.468 | 14.17 |
| flickr | 64 | 25 | 16384 | 248.523 | 2630.659 | 10.59 |
| flickr | 64 | 100 | 2048 | 1139.007 | 10702.981 | 9.40 |
| flickr | 128 | 10 | 16384 | 192.916 | 1800.150 | 9.33 |
| flickr | 128 | 25 | 8192 | 442.226 | 4022.799 | 9.10 |
| flickr | 128 | 100 | 2048 | 2116.955 | 20655.732 | 9.76 |
| amazon | 64 | 10 | 16384 | 579.342 | 8905.530 | 15.37 |
| amazon | 64 | 25 | 16384 | 1235.168 | 18034.229 | 14.60 |
| amazon | 64 | 100 | 4096 | 5486.910 | 45337.011 | 8.26 |
| amazon | 128 | 10 | 16384 | 976.759 | 9418.961 | 9.64 |
| amazon | 128 | 25 | 16384 | 2193.159 | 29645.709 | 13.52 |
| amazon | 128 | 100 | 4096 | 10112.228 | 80677.594 | 7.98 |
| pokec | 64 | 10 | 16384 | 1744.602 | 25242.941 | 14.47 |
| pokec | 64 | 25 | 16384 | 3806.404 | 34517.180 | 9.07 |
| pokec | 64 | 100 | 2048 | 17486.692 | 133920.000 | 7.66 |
| pokec | 128 | 10 | 16384 | 2950.606 | 41414.535 | 14.04 |
| pokec | 128 | 25 | 8192 | 6791.829 | 74983.391 | 11.04 |
| pokec | 128 | 100 | 2048 | 32000.378 | 297979.438 | 9.31 |
| cit-Patents | 64 | 10 | 16384 | 4005.860 | 52094.125 | 13.00 |
| cit-Patents | 64 | 25 | 16384 | 8541.500 | 93715.789 | 10.97 |
| cit-Patents | 64 | 100 | 4096 | 38494.688 | 293333.781 | 7.62 |
| cit-Patents | 128 | 10 | 16384 | 6784.752 | 95639.117 | 14.10 |
| cit-Patents | 128 | 25 | 8192 | 15250.170 | 146539.250 | 9.61 |
| cit-Patents | 128 | 100 | 4096 | 70074.883 | 530910.375 | 7.58 |
| twitter | 64 | 10 | 16384 | 7594.364 | 103753.961 | 13.66 |
| twitter | 64 | 25 | 16384 | 16212.790 | 162819.531 | 10.04 |
| twitter | 64 | 100 | 4096 | 73488.745 | 636224.625 | 8.66 |
| twitter | 128 | 10 | 16384 | 12753.125 | 147705.375 | 11.58 |
| twitter | 128 | 25 | 16384 | 28827.354 | 282497.438 | 9.80 |
| twitter | 128 | 100 | 4096 | 134027.966 | 1105741.500 | 8.25 |
| europe_osm | 64 | 10 | 16384 | 53521.982 | 611449.625 | 11.42 |
| europe_osm | 64 | 25 | 16384 | 113740.739 | 1016479.938 | 8.94 |
| europe_osm | 64 | 100 | 4096 | 509313.472 | 4441408.500 | 8.72 |
The running time on Bowser with a Tesla V100 GPU:
| Dataset | F | C | B | Gunrock GPU (ms) | OpenMP (ms) | Speedup vs. CPU | Speedup vs. P100 |
|---|---|---|---|---|---|---|---|
| flickr | 64 | 10 | 16384 | 39.894 | 1094.377 | 27.43 | 2.90 |
| flickr | 64 | 25 | 8192 | 91.810 | 2177.953 | 23.72 | 2.76 |
| flickr | 64 | 100 | 1024 | 448.888 | 8641.063 | 19.25 | 2.57 |
| flickr | 128 | 10 | 16384 | 65.131 | 1849.658 | 28.40 | 2.95 |
| flickr | 128 | 25 | 4096 | 156.965 | 3981.435 | 25.37 | 2.84 |
| flickr | 128 | 100 | 512 | 802.966 | 16471.338 | 20.51 | 2.70 |
| amazon | 64 | 10 | 16384 | 196.079 | 6142.780 | 31.33 | 2.95 |
| amazon | 64 | 25 | 8192 | 423.145 | 12674.514 | 29.95 | 2.92 |
| amazon | 64 | 100 | 2048 | 1963.691 | 49205.359 | 25.06 | 2.79 |
| amazon | 128 | 10 | 16384 | 324.067 | 9978.198 | 30.79 | 3.10 |
| amazon | 128 | 25 | 8192 | 733.181 | 21726.697 | 29.63 | 2.99 |
| amazon | 128 | 100 | 2048 | 3361.894 | 86967.164 | 25.87 | 3.01 |
| pokec | 64 | 10 | 16384 | 602.116 | 17111.664 | 28.42 | 2.84 |
| pokec | 64 | 25 | 8192 | 1379.450 | 35887.129 | 26.02 | 2.71 |
| pokec | 64 | 100 | 1024 | 6793.011 | 140751.234 | 20.72 | 2.54 |
| pokec | 128 | 10 | 16384 | 1000.310 | 28987.953 | 28.98 | 2.96 |
| pokec | 128 | 25 | 4096 | 2366.202 | 63863.352 | 26.99 | 2.82 |
| pokec | 128 | 100 | 512 | 11859.789 | 265325.281 | 22.37 | 2.64 |
| cit-Patents | 64 | 10 | 16384 | 1374.782 | 38803.195 | 28.22 | 2.95 |
| cit-Patents | 64 | 25 | 8192 | 3006.717 | 78112.516 | 25.98 | 2.87 |
| cit-Patents | 64 | 100 | 2048 | 13910.529 | 291278.125 | 20.94 | 2.78 |
| cit-Patents | 128 | 10 | 16384 | 2276.261 | 65735.234 | 28.88 | 2.99 |
| cit-Patents | 128 | 25 | 8192 | 5201.184 | 141957.922 | 27.29 | 2.96 |
| cit-Patents | 128 | 100 | 2048 | 23922.527 | 560321.438 | 23.42 | 2.94 |
| twitter | 64 | 10 | 16384 | 2575.303 | 72372.484 | 28.10 | 2.91 |
| twitter | 64 | 25 | 8192 | 5658.345 | 148938.422 | 26.32 | 2.84 |
| twitter | 64 | 100 | 2048 | 26468.915 | 569925.500 | 21.53 | 2.77 |
| twitter | 128 | 10 | 16384 | 4270.454 | 125431.766 | 29.37 | 2.98 |
| twitter | 128 | 25 | 8192 | 9744.686 | 267841.563 | 27.49 | 2.93 |
| twitter | 128 | 100 | 2048 | 45604.444 | 1098039.375 | 24.08 | 2.93 |
| europe_osm | 64 | 10 | 16384 | 18440.253 | 497153.969 | 26.96 | 2.90 |
| europe_osm | 64 | 25 | 16384 | 39883.421 | 1008675.500 | 25.29 | 2.85 |
| europe_osm | 64 | 100 | 4096 | 184128.123 | 3825227.500 | 20.77 | 2.77 |
Implementation limitations#
Memory usage: Gunrock’s GPU implementation uses B x (C x (Wf2.x + F + 1) + 2 x Wf2.x + 2 x R.x) x 4 bytes of intermediate storage, in addition to the features, which take up V x F x 4 bytes, and the graph itself. Because the batch size B can be adjusted, the main memory consumption is from the feature array. For a P100 with 16 GB of memory and a feature length of 64, the maximum number of vertices it can handle before running out of memory is about 60 million. The largest dataset tested so far is the europe_osm dataset, with 50.9M vertices and 108M edges. A back-of-the-envelope sketch of this estimate is given below.

Data types: The vertex IDs and edge IDs are both represented as 32-bit unsigned integers. Input features, weights, output embeddings and intermediate results are represented as 32-bit floating-point numbers. A not-so-recent trend in machine learning research is to use less precision in neural networks: half-precision / 16-bit floating point is quite common, and supported by recent GPUs. Using half precision cuts the computation time to about half of that of single precision, and also halves the memory needed to store the features. It would be interesting to see what happens if the data type is changed to half precision.

Graph types: The training process requires the graph to be undirected (enforced by --undirected in Gunrock’s command-line parameters). The behavior of sampling a zero-length neighbor list is undefined; currently it returns the source vertex itself.
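As a back-of-the-envelope illustration of the memory-usage formula above (this is not a Gunrock function; Wf2.x and R.x are read as the Wf2 column dimension and the output embedding length, and the values in main() assume pokec-like settings with 128 assumed for both):

```cuda
#include <cstdio>

// Rough device-memory estimate from the formula in the text, in bytes.
// B = batch size, C = children per source, F = feature length,
// Wf2_x = Wf2 column dimension, R_x = output embedding length.
double SageDeviceMemoryBytes(double V, double B, double C, double F,
                             double Wf2_x, double R_x)
{
    double intermediate = B * (C * (Wf2_x + F + 1) + 2 * Wf2_x + 2 * R_x) * 4;
    double features     = V * F * 4;
    return intermediate + features;  // the graph (CSR) itself is extra
}

int main()
{
    // pokec-like setting: |V| = 1632803, F = 64, C = 10, B = 16384.
    double bytes = SageDeviceMemoryBytes(1632803, 16384, 10, 64, 128, 128);
    std::printf("approx. %.2f GB\n", bytes / (1 << 30));
    return 0;
}
```

With these numbers the intermediate data is only a few hundred megabytes, consistent with the observation above that the per-vertex features dominate the memory footprint.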
Comparison against existing implementations#
Because the computation on each vertex is independent of other vertices, GraphSAGE is an embarrassingly parallel problem when parallelized across vertices. Running a simple test with the pokec dataset, with feature length 64 and num_children_per_source and num_leafs_per_child both at 10, the serial run (with omp-threads forced to 1) on the DGX-1 takes 366390.250 ms, as compared to 25242.941 ms using 32 threads; using 32 threads is about 14.5X faster than a single thread, which shows GraphSAGE scales quite well on the CPU.
Comparing the running time of the 32-thread OpenMP implementation and Gunrock’s GPU implementation, the P100 is about 7.5X to 15X faster. Increasing the feature length from 64 to 128 roughly doubles the computation workload, and the speedup does not change much for datasets with more than about 1M vertices. However, increasing the number of children per source and the number of leafs per child decreases the speedup: the OpenMP running time increases more slowly than the number of neighbors selected, while the GPU running time increases faster than the number of neighbors selected. This may be attributed to a cache effect: parallelism on the CPU is limited, so the data reuse rate stays high even when the number of neighbors increases; on the GPU, however, the number of sources or children processed at the same time is much larger, giving a much bigger working set, which decreases the cache hit rate and results in a longer running time.
Also interesting is comparing the runtime on the V100 and the P100: the V100 is about 3X faster than the P100 when running GraphSAGE. This large performance difference is caused by the different limiting factors when running GraphSAGE on these two GPUs; details are in the Performance limitations section. Compared to OpenMP, the V100 is about 20X to 30X faster.
Performance limitations#
We profiled GraphSAGE with the pokec dataset on a Titan Xp GPU (profiling on the P100 caused an internal error in the profiler itself; the Titan Xp has roughly the same SM design as the P100, but has only 30 SMs vs. 56 on the P100; the P100 also has 16 GB of HBM2 memory, while the Titan Xp only has 12 GB of GDDR5X; runtimes on the Titan Xp and the P100 are similar). Kernel2 takes up about 60% of the computation of a batch, 10.64 ms out of 17.76 ms for pokec with feature length 64, 10 neighbors, and a batch size of 16384. Kernel1 takes 1.56 ms, or 8.8%, and Kernel3 takes 5.52 ms, or 31.1%.
Further detail on the profile of Kernel2 on the Titan Xp shows the utilization of the memory system:
| Type | Transactions | Bandwidth | Utilization |
|---|---|---|---|
| Shared Loads | 2129920 | 31.997 GB/s | |
| Shared Stores | 1638400 | 24.613 GB/s | |
| Shared Total | 3768320 | 56.611 GB/s | Idle to low |
| Local Loads | 0 | 0 GB/s | |
| Local Stores | 0 | 0 GB/s | |
| Global Loads | 2685009922 | 1575.871 GB/s | |
| Global Stores | 0 | 0 GB/s | |
| Texture Reads | 671252480 | 2521.025 GB/s | |
| Unified Total | 3356262402 | 4096.897 GB/s | High |
| L2 Reads | 264170192 | 992.145 GB/s | |
| L2 Writes | 10485773 | 39.381 GB/s | |
| L2 Total | 274655965 | 1031.526 GB/s | Low to medium |
| Device Reads | 2181833 | 8.194 GB/s | |
| Device Writes | 543669 | 2.042 GB/s | |
| Device Total | 2725502 | 10.236 GB/s | Idle to low |
It’s clear that the unified cache is almost fully utilized, at 4 TB/s out of the 5 TB/s theoretical upper bound, and is the bottleneck. This is because the W arrays and the intermediate arrays are highly reusable.
Running the same experiment on the V100 shows a different picture within the memory system:
| Type | Transactions | Bandwidth | Utilization |
|---|---|---|---|
| Shared Loads | 2138599 | 74.166 GB/s | |
| Shared Stores | 1640842 | 56.904 GB/s | |
| Shared Total | 3779441 | 131.070 GB/s | Idle to low |
| Local Loads | 0 | 0 GB/s | |
| Local Stores | 0 | 0 GB/s | |
| Global Loads | 419594240 | 3637.862 GB/s | |
| Global Stores | 0 | 0 GB/s | |
| Texture Reads | 177448960 | 6153.897 GB/s | |
| Unified Total | 597043200 | 9791.759 GB/s | Medium |
| L2 Reads | 8209326 | 71.174 GB/s | |
| L2 Writes | 5243176 | 45.458 GB/s | |
| L2 Total | 274655965 | 116.633 GB/s | Idle to low |
| Device Reads | 2402833 | 20.832 GB/s | |
| Device Writes | 571925 | 4.959 GB/s | |
| Device Total | 2974758 | 25.791 GB/s | Idle to low |
On the V100, Kernel2 only takes 3.982 ms (about 60% of the 6.67 ms per batch), and the unified-cache throughput increases to 9.8 TB/s, more than double that of the Titan Xp. In fact, the theoretical upper bound of the V100’s L1 throughput is 28 TB/s, resulting from doubling each SM’s load throughput from 128 bytes per cycle to 256 bytes per cycle and increasing the SM count to 80. This is why the V100 can outperform the P100 and Titan Xp by about 3X. The performance bottleneck is no longer the memory system, but shifts to integer computation, which comes mainly from array-index calculations. This particular kernel also uses 32 registers per thread, which is the limit for full GPU occupancy; storing intermediate index-calculation results would help, were the register usage not already so high.
Next Steps#
Alternate approaches#
Things we tried that didn’t really work
A simple implementation that uses one thread to process the per-source or per-child computation was coded, but it runs about 10X slower than the current implementation, which uses a block to process such units. One reason is the use of block-level parallel primitives, such as block reduce. Another reason is that by using a whole block, instead of a thread, to process the same computation, the working set is greatly reduced along with the parallelism, and the whole working set can fit in the cache system. With higher parallelism, the working set is larger and is forced to be evicted to global memory, which creates a bottleneck. In fact, during the experiment, reducing the number of blocks of the thread-level-parallelism kernel improved its running time, the opposite of what would normally be expected from other kernels.
Gunrock implications#
One thing that Gunrock does not provide, or intentionally hides, is block-level parallelism. However, it comes in handy when implementing the custom kernels for GraphSAGE: each block can process a one-dimensional vector, with each thread holding one element, then use a block-level reduce to get the sum of those elements. This approach is highly efficient, and it actually reduces the parallelism and, with it, the size of the working set of data.
Notes on multi-GPU parallelization#
The main memory usage comes from the feature data, so it is critical not to duplicate the features. The side effect is that the computation needs to be divided into child-centric and source-centric parts, with a data exchange in between. It should be scalable and easy to implement.
Notes on dynamic graphs#
GraphSAGE does not have a dynamic graph component, but it should be able to work on a dynamic graph. Some of the data may be reusable if the graph has not changed significantly, but the resulting memory requirement to store the intermediate data may make data reuse impractical.
Notes on larger datasets#
If the dataset is so large that the graph and the per-vertex feature data exceed the combined GPU memory, it is possible to accumulate the features of leafs and children on the CPU, transfer the data onto the GPU, and perform the computation there. Because of the cost of the transfer, the runtime will increase significantly as compared to the case where the full data fits in GPU memory, but whether it would rise above the OpenMP implementation’s runtime is still unknown.
Notes on other pieces of this workload#
The main part of GraphSAGE workflow is actually the training process, which will be outside of Gunrock, provided by TensorFlow, PyTorch or other machine learning libraries. How to connect the training with the Gunrock GPU implementation is the main task for this workload going forward.
Research potential#
The GPU implementation of the embedding part runs a lot faster than the CPU implementation, and hits a few GPU hardware limitations; it should have a runtime comparable to other GPU implementations. But it is only useful as part of the whole workload, and only if it achieves prediction accuracy comparable to conventional / reference implementations.
An interesting question raised by the GraphSAGE GPU implementation is whether it is useful to expose block-level parallelism to higher-level programming models / APIs, and if so, how to do that. It is clear that working at the block level provides benefits, such as the ability to use block-level primitives like scan and reduce. But it also comes with costs, most importantly requiring the programmer to have knowledge of the GPU hardware. It also reduces the portability of the implementation, because two-level parallelism may not exist on other processors.