GraphSAGE#
GraphSAGE is a way to fit graphs into a neural network: instead of computing a vertex's embedding from the features of all its neighbors, as conventional implementations do, GraphSAGE samples some 1-hop neighbors and some 2-hop neighbors connected to those 1-hop neighbors, and computes the embedding from the features of the sampled 1-hop and 2-hop neighbors. The embedding can be considered a vector of hash-like values describing the interesting properties of a vertex.
During the training process, the adjacency matrix and the features of the nodes in the graph are fed into a neural network. The parameters (the W arrays in the algorithm) are updated after each batch according to the difference between the predicted labels and the real labels. The per-vertex features do not change, but the parameters do, so the GraphSAGE computation needs to be performed for each batch. Ultimately this embedding step should connect to the training process to complete the workflow. However, the training part is pure matrix operations, and the year 1 deliverable focuses only on the graph-related portion, which is the GraphSAGE implementation.
Summary of Results#
The vertex embedding part of the GraphSAGE algorithm is implemented in the Gunrock framework using custom CUDA kernels that exploit block-level parallelism for shorter running times. For the embedding part alone, the GPU implementation is 7.5X to 15X faster on a P100, and 20X to 30X faster on a V100, than an OpenMP implementation using 32 threads. These custom kernels achieve high utilization of the GPU hardware, especially the memory system. It is still unclear how to expose block-level parallelism for more general usage in other Gunrock applications.
Connecting the vertex embedding with the neural network training part, and making the GraphSAGE workflow complete, would be an interesting task for year 2. Testing the complete workflow for prediction accuracy and running speed would be more meaningful.
Summary of Gunrock Implementation#
Gunrock’s implementation for the 1st year covers only the embedding / inference phase, without the training phase. It is based on Algorithm 2 with K=2 of the GraphSAGE paper (“Inductive Representation Learning on Large Graphs”, https://arxiv.org/abs/1706.02216).
Given a graph G, the inputs are per-vertex features, weight matrices W^k, and a non-linear activation function (ReLU); the output from GraphSAGE is the embedding vector of each vertex. The current Gunrock implementation selects neighbors uniformly at random from the neighborhood, and simple changes to the code can enable other selection methods, such as weighted uniform, importance sampling (used by FastGCN), or a random-walk probability as in DeepWalk or Node2Vec (used by PinSage). An aggregator is a function that accumulates data from the selected neighbors; the current implementation uses the Mean aggregator, and it can easily be changed to other accumulation functions.
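As a concrete illustration of the uniform selection described above, a minimal device-side sketch follows. This is not Gunrock's actual code: the array names (row_offsets, column_indices) simply denote a CSR graph, and the fall-back to the source vertex for zero-degree neighborhoods mirrors the dangling-vertex behavior noted later in this report. Swapping the uniform pick for a weighted or random-walk-based choice would give the other selection methods.

```cuda
// Illustrative sketch only, not Gunrock's SelectNeighbor.
#include <curand_kernel.h>

__device__ int SelectNeighborUniform(
    const int   *row_offsets,     // CSR row pointers, length V+1
    const int   *column_indices,  // CSR neighbor lists
    int          v,               // vertex whose neighbor we sample
    curandState *state)           // per-thread RNG state
{
    int start  = row_offsets[v];
    int degree = row_offsets[v + 1] - start;
    if (degree == 0) return v;    // dangling vertex: fall back to itself
    // Uniform random pick; replace this line for weighted / random-walk sampling.
    int offset = curand(state) % degree;
    return column_indices[start + offset];
}
```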
The pseudocode of Gunrock’s implementation follows. The current implementation uses custom CUDA kernels, because block-level parallelism is critical for fast running speed; all other functionality, such as memory management, load and store routines, block-level parallel reduction, and graph accesses (get degree, get neighbors, etc.), is provided by the framework or the utility functions of Gunrock. The GraphSAGE algorithm uses B2, B1 and B0 for the sources, the 1-hop neighbors and the 2-hop neighbors; for easier reading of the code, we instead use sources, children and leafs for these three groups of vertices. To manage the memory usage of intermediate data, source vertices are processed in batches of size B, one batch at a time.
source_start := 0;
While (source_start < V)
    num_sources := (source_start + B > V) ? V - source_start : B;

    // Kernel1: pick children and leafs
    For children# from 0 to num_children_per_source x num_sources - 1:
        source := source_start + (children# / num_children_per_source);
        child := SelectNeighbor(source);
        leafs_feature := {0};
        For i from 1 to num_leafs_per_child:
            leaf := SelectNeighbor(child);
            leafs_feature += feature[leaf];
        children[children#] := child;
        leafs_features[children#] := leafs_feature;

    // Kernel2: Child-centric computation
    For children# from 0 to num_children_per_source x num_sources - 1:
        child := children[children#];
        child_temp := ReLu(concatenate(
            feature[child] x Wf1,
            Mean(leafs_features[children#]) x Wa1));
        child_temp /= sqrt(sum(child_temp^2));
        children_temp[children# / num_children_per_source] += child_temp;
        children_features[children# / num_children_per_source] += feature[child];

    // Kernel3: Source-centric computation
    For source# from 0 to num_sources - 1:
        source := source_start + source#;
        source_temp := ReLu(concatenate(
            feature[source] x Wf1,
            Mean(children_features[source#]) x Wa1));
        source_temp /= sqrt(sum(source_temp^2));
        result := ReLu(concatenate(
            source_temp x Wf2,
            Mean(children_temp[source#]) x Wa2));
        result /= sqrt(sum(result^2));
        source_embedding[source#] := result;

    Move source_embedding from GPU to CPU;
    source_start += B;
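To show how the block-level parallelism maps onto the CUDA programming model, below is a minimal sketch in the spirit of Kernel2, not the actual Gunrock kernel: one thread block per child, one thread per vector element, and a block-level reduction (cub::BlockReduce) to obtain the sum of squares needed for the L2 normalization. The input array here stands in for the ReLu/concatenate result, and all names and layouts are illustrative assumptions.

```cuda
#include <cub/cub.cuh>

// One block normalizes one child's vector; threads each own one element.
template <int BLOCK_SIZE>
__global__ void NormalizeChildVectors(
    const float *child_vectors,  // [num_children x dim], pre-normalization results
    float       *child_temps,    // [num_children x dim], normalized outputs
    int          dim)            // vector length, assumed <= BLOCK_SIZE
{
    typedef cub::BlockReduce<float, BLOCK_SIZE> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp_storage;
    __shared__ float norm;

    int child = blockIdx.x;      // one block per child
    int elem  = threadIdx.x;     // one thread per element
    float val = (elem < dim) ? child_vectors[child * dim + elem] : 0.0f;

    // Block-wide sum of squares; the result is valid in thread 0 only.
    float sum_sq = BlockReduce(temp_storage).Sum(val * val);
    if (elem == 0) norm = fmaxf(sqrtf(sum_sq), 1e-12f);  // guard against zero vectors
    __syncthreads();

    if (elem < dim)
        child_temps[child * dim + elem] = val / norm;    // L2 normalization
}

// Example launch: NormalizeChildVectors<128><<<num_children, 128>>>(d_in, d_out, 64);
```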
How To Run This Application on DARPA’s DGX-1#
Prereqs/input#
CUDA should have been installed, and $PATH and $LD_LIBRARY_PATH should have been set correctly to use CUDA. The current Gunrock configuration assumes that Boost (1.58.0 or 1.59.0) and METIS are installed; if not, changes need to be made in the Makefiles. DARPA’s DGX-1 had both installed when the tests were performed.
git clone --recursive https://github.com/gunrock/gunrock/ \
-b dev-refactor
cd gunrock
git submodule init
git submodule update
mkdir build
cd build
cmake ..
cd ../tests/sage
make
At this point, there should be an executable test_sage_<CUDA version>_x86_64 in tests/sage/bin.
The datasets are assumed to have been placed in /raid/data/hive and converted to the proper Matrix Market format (.mtx). At the time of testing, pokec, amazon, flickr, twitter and cit-Patents were available in that directory.
Note that GraphSAGE is an inductive representation learning algorithm, so it is reasonable to assume that there are no dangling vertices in the graph. Before running GraphSAGE, please remove dangling vertices from the graph; a sketch of one way to do this is shown below. If dangling vertices are present, each dangling vertex will be treated as its own neighbor.
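A minimal host-side sketch of one way to drop zero-degree (dangling) vertices from a CSR graph before feeding it to GraphSAGE; the function and variable names are illustrative, not Gunrock utilities.

```cuda
#include <vector>

// Compacts a CSR graph by removing zero-degree vertices.
// old_to_new maps original vertex IDs to compacted IDs, or -1 if dropped.
void RemoveDanglingVertices(
    const std::vector<int> &row_offsets,
    const std::vector<int> &column_indices,
    std::vector<int> &new_row_offsets,
    std::vector<int> &new_column_indices,
    std::vector<int> &old_to_new)
{
    int num_vertices = (int)row_offsets.size() - 1;
    old_to_new.assign(num_vertices, -1);

    // Keep only vertices that have at least one neighbor.
    int new_v = 0;
    for (int v = 0; v < num_vertices; v++)
        if (row_offsets[v + 1] > row_offsets[v])
            old_to_new[v] = new_v++;

    new_row_offsets.assign(new_v + 1, 0);
    new_column_indices.clear();
    for (int v = 0; v < num_vertices; v++) {
        if (old_to_new[v] < 0) continue;
        for (int e = row_offsets[v]; e < row_offsets[v + 1]; e++) {
            int u = column_indices[e];
            if (old_to_new[u] >= 0)  // also drop edges pointing at dangling vertices
                new_column_indices.push_back(old_to_new[u]);
        }
        new_row_offsets[old_to_new[v] + 1] = (int)new_column_indices.size();
    }
}
```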
The testing is done with Gunrock’s dev-refactor branch at commit 0ed72d5 (Oct. 25, 2018), using CUDA 9.2 with NVIDIA driver 390.30 on a Tesla P100 GPU in the DGX-1 machine, and using CUDA 10.0 with NVIDIA driver 410.66 on a Tesla V100 GPU in Bowser (a machine used by the Gunrock team at UC Davis).
Running the application#
Application specific parameters#
The W arrays used by GraphSAGE should be produced by the training process; since the training part is not in place at the moment, we use given data files, or randomly generate the arrays if such files are not available. The array input files contain floating-point values in plain-text format; examples can be found in the gunrock/app/sage directory of the gunrock repo.
--Wa1 : std::string, default =
<weight matrix for W^1 matrix in algorithm 2, aggregation part>
dimension 64 by 128 for pokec;
It should be leaf feature length by a value you want for W2 layer
--Wa1-dim1 : int, default = 128
Wa1 matrix column dimension
--Wa2 : std::string, default =
<weight matrix for W^2 matrix in algorithm 2, aggregation part>
dimension 256 by 128 for pokec;
It should be child_temp length by output length
--Wa2-dim1 : int, default = 128
Wa2 matrix column dimension
--Wf1 : std::string, default =
<weight matrix for W^1 matrix in algorithm 2, feature part>
dimension 64 by 128 for pokec;
It should be child feature length by a value you want for W2 layer
--Wf1-dim1 : int, default = 128
Wf1 matrix column dimension
--Wf2 : std::string, default =
<weight matrix for W^2 matrix in algorithm 2, feature part>
dimension 256 by 128 for pokec;
It should be source_temp length by output length
--Wf2-dim1 : int, default = 128
Wf2 matrix column dimension
--feature-column : std::vector<int>, default = 64
feature column dimension
--features : std::string, default =
<features matrix>
dimension |V| by 64 for pokec;
--omp-threads : int, default = 32
number of threads to run CPU reference
--num-children-per-source : std::vector<int>, default = 10
number of sampled children per source
--num-leafs-per-child : std::vector<int>, default = -1
number of sampled leafs per child; default is the same as num-children-per-source
--batch-size : std::vector<int>, default = 65536
number of source vertices to process in one iteration
Example Command#
./bin/test_sage_9.2_x86_64 \
market /raid/data/hive/pokec/pokec.mtx --undirected \
--Wf1 ../../gunrock/app/sage/wf1.txt \
--Wa1 ../../gunrock/app/sage/wa1.txt \
--Wf2 ../../gunrock/app/sage/wf2.txt \
--Wa2 ../../gunrock/app/sage/wa2.txt \
--features ../../gunrock/app/sage/features.txt \
--num-runs=10 \
--batch-size=16384
When Wf1, Wa1, Wf2, Wa2, or features are not available, random values are used. The embeddings may not be useful at all, but the memory access patterns and the computation workload are the same regardless of whether the random inputs are used. Using random inputs enables us to test on any graph without requiring the associated input arrays; a sketch of how such placeholder files could be generated is shown below.
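If external placeholder files are preferred over the internally generated random values, a small generator along the following lines could be used. This is an assumption: it simply writes whitespace-separated floats, and the exact layout the parser expects should be checked against the example files in gunrock/app/sage.

```cuda
#include <cstdio>
#include <cstdlib>

// Writes a rows x cols matrix of random floats in [0, 1) as plain text.
void WriteRandomMatrix(const char *filename, int rows, int cols)
{
    FILE *f = std::fopen(filename, "w");
    if (f == NULL) return;
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++)
            std::fprintf(f, "%f ", std::rand() / (float)RAND_MAX);
        std::fprintf(f, "\n");
    }
    std::fclose(f);
}

int main()
{
    WriteRandomMatrix("wf1.txt", 64, 128);  // e.g. Wf1 for feature length 64
    return 0;
}
```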
Output#
The outputs are in the sage directory; look for the .txt files. An example for the pokec dataset is shown below:
Loading Matrix-market coordinate-formatted graph ...
... Loading progress ...
Converting 1632803 vertices, 44603928 directed edges ( ordered tuples) to CSR format...Done (0s).
Degree Histogram (1632803 vertices, 44603928 edges):
Degree 0: 0 (0.000000 %)
Degree 2^0: 163971 (10.042301 %)
Degree 2^1: 201604 (12.347111 %)
Degree 2^2: 238757 (14.622523 %)
Degree 2^3: 273268 (16.736128 %)
Degree 2^4: 298404 (18.275567 %)
Degree 2^5: 265002 (16.229882 %)
Degree 2^6: 148637 (9.103180 %)
Degree 2^7: 38621 (2.365319 %)
Degree 2^8: 4089 (0.250428 %)
Degree 2^9: 418 (0.025600 %)
Degree 2^10: 23 (0.001409 %)
Degree 2^11: 3 (0.000184 %)
Degree 2^12: 3 (0.000184 %)
Degree 2^13: 3 (0.000184 %)
==============================================
feature-column=64 num-children-per-source=10 num-leafs-per-child=-1
Computing reference value ...
==============================================
rand-seed = 1540498523
==============================================
CPU Reference elapsed: 25242.941406 ms.
Embedding validation: PASS
==============================================
batch-size=512
Using randomly generated Wf1
Using randomly generated Wa1
Using randomly generated Wf2
Using randomly generated Wa2
Using randomly generated features
Using advance mode LB
Using filter mode CULL
rand-seed = 1540498553
==============================================
==============================================
Run 0 elapsed: 2047.366858 ms, #iterations = 3190
... More runs
==============================================
==============================================
Run 9 elapsed: 2013.216972 ms, #iterations = 3190
Embedding validation: PASS
[Sage] finished.
avg. elapsed: 2042.503190 ms
iterations: 3190
min. elapsed: 2009.524107 ms
max. elapsed: 2243.710995 ms
load time: 1600.46 ms
preprocess time: 3997.880000 ms
postprocess time: 2106.487989 ms
total time: 26634.360075 ms
There is 1 OpenMP reference run on the CPU for each combination of {feature-column, num-children-per-source, num-leafs-per-child}, with the timing reported after CPU Reference elapsed. There are 10 GPU runs for each combination of the previous three parameters plus batch-size. The computation workload of the GPU runs is the same as that of the reference CPU run, just with different batch sizes. The GPU timing is reported after Run x elapsed:, and the average running time of the 10 GPU runs is reported after avg. elapsed.
The mathematical formula of the GraphSAGE algorithm is relatively simple and repeats itself three times in the form Normalize(C x Wf + Mean(D) x Wa). Because of this simplicity, it is still possible to verify the implementation by visually inspecting the code. The resulting embeddings are also checked for their L2 norm, which should be close to 1 for every vertex (a minimal sketch of such a check is given at the end of this section). Because the neighbor selection process is inherently random, it would be very difficult to do a number-by-number comparison with other implementations, including the reference. A more meaningful regression test would look at the training or validation accuracy once the full workflow is completed, which is a possibility for future work in year 2.
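As an illustration of that norm check, a minimal host-side sketch is given below; it is not the harness's actual validator, and the flat row-major layout of the embeddings is an assumption.

```cuda
#include <cmath>
#include <cstdio>

// Returns true if every embedding has an L2 norm within `tolerance` of 1.
bool CheckEmbeddingNorms(const float *embeddings, int num_vertices, int dim,
                         float tolerance)
{
    for (int v = 0; v < num_vertices; v++) {
        float sum_sq = 0.0f;
        for (int i = 0; i < dim; i++) {
            float x = embeddings[v * dim + i];
            sum_sq += x * x;
        }
        float norm = std::sqrt(sum_sq);
        if (std::fabs(norm - 1.0f) > tolerance) {
            std::printf("vertex %d has L2 norm %f\n", v, norm);
            return false;
        }
    }
    return true;
}
```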
Performance and Analysis#
The OpenMP and GPU implementations are measured for runtime. It would additionally be useful to validate the prediction accuracy of the whole pipeline once the training process is in place (perhaps in year 2).
The datasets used for experiments are:
| Dataset | \|V\| | \|E\| |
|---|---|---|
| flickr | 105938 | 4633896 |
| amazon | 548551 | 1851744 |
| pokec | 1632803 | 44603928 |
| cit-Patents | 3774768 | 33037894 |
| twitter | 7199978 | 43483326 |
| europe-osm | 50912018 | 108109320 |
The running times in milliseconds are listed below for both machines. F stands for the feature length, C stands for the number of children per source (the same as the number of leafs per child), and B denotes the batch size that produces the shortest GPU running time among {512, 1024, 2048, 4096, 8192, 16384}.
The running time on DGX-1 with Tesla P100 GPU:
| Dataset | F | C | B | Gunrock GPU (ms) | OpenMP (ms) | Speedup |
|---|---|---|---|---|---|---|
| flickr | 64 | 10 | 16384 | 113.431 | 1607.468 | 14.17 |
| flickr | 64 | 25 | 16384 | 248.523 | 2630.659 | 10.59 |
| flickr | 64 | 100 | 2048 | 1139.007 | 10702.981 | 9.40 |
| flickr | 128 | 10 | 16384 | 192.916 | 1800.150 | 9.33 |
| flickr | 128 | 25 | 8192 | 442.226 | 4022.799 | 9.10 |
| flickr | 128 | 100 | 2048 | 2116.955 | 20655.732 | 9.76 |
| amazon | 64 | 10 | 16384 | 579.342 | 8905.530 | 15.37 |
| amazon | 64 | 25 | 16384 | 1235.168 | 18034.229 | 14.60 |
| amazon | 64 | 100 | 4096 | 5486.910 | 45337.011 | 8.26 |
| amazon | 128 | 10 | 16384 | 976.759 | 9418.961 | 9.64 |
| amazon | 128 | 25 | 16384 | 2193.159 | 29645.709 | 13.52 |
| amazon | 128 | 100 | 4096 | 10112.228 | 80677.594 | 7.98 |
| pokec | 64 | 10 | 16384 | 1744.602 | 25242.941 | 14.47 |
| pokec | 64 | 25 | 16384 | 3806.404 | 34517.180 | 9.07 |
| pokec | 64 | 100 | 2048 | 17486.692 | 133920.000 | 7.66 |
| pokec | 128 | 10 | 16384 | 2950.606 | 41414.535 | 14.04 |
| pokec | 128 | 25 | 8192 | 6791.829 | 74983.391 | 11.04 |
| pokec | 128 | 100 | 2048 | 32000.378 | 297979.438 | 9.31 |
| cit-Patents | 64 | 10 | 16384 | 4005.860 | 52094.125 | 13.00 |
| cit-Patents | 64 | 25 | 16384 | 8541.500 | 93715.789 | 10.97 |
| cit-Patents | 64 | 100 | 4096 | 38494.688 | 293333.781 | 7.62 |
| cit-Patents | 128 | 10 | 16384 | 6784.752 | 95639.117 | 14.10 |
| cit-Patents | 128 | 25 | 8192 | 15250.170 | 146539.250 | 9.61 |
| cit-Patents | 128 | 100 | 4096 | 70074.883 | 530910.375 | 7.58 |
| twitter | 64 | 10 | 16384 | 7594.364 | 103753.961 | 13.66 |
| twitter | 64 | 25 | 16384 | 16212.790 | 162819.531 | 10.04 |
| twitter | 64 | 100 | 4096 | 73488.745 | 636224.625 | 8.66 |
| twitter | 128 | 10 | 16384 | 12753.125 | 147705.375 | 11.58 |
| twitter | 128 | 25 | 16384 | 28827.354 | 282497.438 | 9.80 |
| twitter | 128 | 100 | 4096 | 134027.966 | 1105741.500 | 8.25 |
| europe_osm | 64 | 10 | 16384 | 53521.982 | 611449.625 | 11.42 |
| europe_osm | 64 | 25 | 16384 | 113740.739 | 1016479.938 | 8.94 |
| europe_osm | 64 | 100 | 4096 | 509313.472 | 4441408.500 | 8.72 |
The running time on Bowser with a Tesla V100 GPU:
| Dataset | F | C | B | Gunrock GPU (ms) | OpenMP (ms) | Speedup vs. CPU | Speedup vs. P100 |
|---|---|---|---|---|---|---|---|
| flickr | 64 | 10 | 16384 | 39.894 | 1094.377 | 27.43 | 2.90 |
| flickr | 64 | 25 | 8192 | 91.810 | 2177.953 | 23.72 | 2.76 |
| flickr | 64 | 100 | 1024 | 448.888 | 8641.063 | 19.25 | 2.57 |
| flickr | 128 | 10 | 16384 | 65.131 | 1849.658 | 28.40 | 2.95 |
| flickr | 128 | 25 | 4096 | 156.965 | 3981.435 | 25.37 | 2.84 |
| flickr | 128 | 100 | 512 | 802.966 | 16471.338 | 20.51 | 2.70 |
| amazon | 64 | 10 | 16384 | 196.079 | 6142.780 | 31.33 | 2.95 |
| amazon | 64 | 25 | 8192 | 423.145 | 12674.514 | 29.95 | 2.92 |
| amazon | 64 | 100 | 2048 | 1963.691 | 49205.359 | 25.06 | 2.79 |
| amazon | 128 | 10 | 16384 | 324.067 | 9978.198 | 30.79 | 3.10 |
| amazon | 128 | 25 | 8192 | 733.181 | 21726.697 | 29.63 | 2.99 |
| amazon | 128 | 100 | 2048 | 3361.894 | 86967.164 | 25.87 | 3.01 |
| pokec | 64 | 10 | 16384 | 602.116 | 17111.664 | 28.42 | 2.84 |
| pokec | 64 | 25 | 8192 | 1379.450 | 35887.129 | 26.02 | 2.71 |
| pokec | 64 | 100 | 1024 | 6793.011 | 140751.234 | 20.72 | 2.54 |
| pokec | 128 | 10 | 16384 | 1000.310 | 28987.953 | 28.98 | 2.96 |
| pokec | 128 | 25 | 4096 | 2366.202 | 63863.352 | 26.99 | 2.82 |
| pokec | 128 | 100 | 512 | 11859.789 | 265325.281 | 22.37 | 2.64 |
| cit-Patents | 64 | 10 | 16384 | 1374.782 | 38803.195 | 28.22 | 2.95 |
| cit-Patents | 64 | 25 | 8192 | 3006.717 | 78112.516 | 25.98 | 2.87 |
| cit-Patents | 64 | 100 | 2048 | 13910.529 | 291278.125 | 20.94 | 2.78 |
| cit-Patents | 128 | 10 | 16384 | 2276.261 | 65735.234 | 28.88 | 2.99 |
| cit-Patents | 128 | 25 | 8192 | 5201.184 | 141957.922 | 27.29 | 2.96 |
| cit-Patents | 128 | 100 | 2048 | 23922.527 | 560321.438 | 23.42 | 2.94 |
| twitter | 64 | 10 | 16384 | 2575.303 | 72372.484 | 28.10 | 2.91 |
| twitter | 64 | 25 | 8192 | 5658.345 | 148938.422 | 26.32 | 2.84 |
| twitter | 64 | 100 | 2048 | 26468.915 | 569925.500 | 21.53 | 2.77 |
| twitter | 128 | 10 | 16384 | 4270.454 | 125431.766 | 29.37 | 2.98 |
| twitter | 128 | 25 | 8192 | 9744.686 | 267841.563 | 27.49 | 2.93 |
| twitter | 128 | 100 | 2048 | 45604.444 | 1098039.375 | 24.08 | 2.93 |
| europe_osm | 64 | 10 | 16384 | 18440.253 | 497153.969 | 26.96 | 2.90 |
| europe_osm | 64 | 25 | 16384 | 39883.421 | 1008675.500 | 25.29 | 2.85 |
| europe_osm | 64 | 100 | 4096 | 184128.123 | 3825227.500 | 20.77 | 2.77 |
Implementation limitations#
Memory usage: Gunrock’s GPU implementation uses B x (C x (Wf2.x + F + 1) + 2 x Wf2.x + 2 x R.x) x 4 bytes of intermediate storage, in addition to the features, which take up V x F x 4 bytes, and the graph itself. Because the batch size B can be adjusted, the main memory consumption is from the feature array. For a P100 with 16 GB of memory and a feature length of 64, the maximum number of vertices it can handle before running out of memory is about 60 million. The largest dataset tested so far is the europe_osm dataset, with 50.9M vertices and 108M edges. A back-of-the-envelope sketch of this estimate is given below.

Data types: The vertex IDs and edge IDs are both represented as 32-bit unsigned integers. Input features, weights, output embeddings and intermediate results are represented as 32-bit floating-point numbers. A not-so-recent trend in machine learning research is to use less precision in neural networks: half-precision / 16-bit floating point is quite common, and supported by recent GPUs. Using half precision cuts the computation time to about half of that of single precision, and also halves the memory needed to store the features. It would be interesting to see what happens if the data type is changed to half precision.

Graph types: The training process requires the graph to be undirected (enforced by --undirected in Gunrock’s command-line parameters). The behavior of sampling a zero-length neighbor list is undefined; currently it returns the source vertex itself.
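As a back-of-the-envelope illustration of the memory-usage formula above (this is not a Gunrock function; Wf2.x and R.x are read as the Wf2 column dimension and the output embedding length, and the values in main() assume pokec-like settings with 128 assumed for both):

```cuda
#include <cstdio>

// Rough device-memory estimate from the formula in the text, in bytes.
// B = batch size, C = children per source, F = feature length,
// Wf2_x = Wf2 column dimension, R_x = output embedding length.
double SageDeviceMemoryBytes(double V, double B, double C, double F,
                             double Wf2_x, double R_x)
{
    double intermediate = B * (C * (Wf2_x + F + 1) + 2 * Wf2_x + 2 * R_x) * 4;
    double features     = V * F * 4;
    return intermediate + features;  // the graph (CSR) itself is extra
}

int main()
{
    // pokec-like setting: |V| = 1632803, F = 64, C = 10, B = 16384.
    double bytes = SageDeviceMemoryBytes(1632803, 16384, 10, 64, 128, 128);
    std::printf("approx. %.2f GB\n", bytes / (1 << 30));
    return 0;
}
```

With these numbers the intermediate data is only a few hundred megabytes, consistent with the observation above that the per-vertex features dominate the memory footprint.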
Comparison against existing implementations#
Because the computation on each vertex is independent of other vertices, GraphSAGE is an embarrassingly parallel problem when parallelized across vertices. Running a simple test with the pokec dataset, with feature length 64 and num_children_per_source and num_leafs_per_child both at 10, the serial run (with omp-threads forced to 1) on the DGX-1 takes 366390.250 ms, as compared to 25242.941 ms using 32 threads; using 32 threads is about 14.5X faster than a single thread, which shows GraphSAGE scales quite well on the CPU.
Comparing the running time of the 32-thread OpenMP implementation and Gunrock’s GPU implementation, the P100 is about 7.5X to 15X faster. Increasing the feature length from 64 to 128 roughly doubles the computation workload, and the speedup does not change much for datasets with more than about 1M vertices. However, increasing the number of children per source and the number of leafs per child decreases the speedup: the OpenMP running time increases more slowly than the number of neighbors selected, while the GPU running time increases faster than the number of neighbors selected. This may be attributed to a cache effect: parallelism on the CPU is limited, so the data reuse rate stays high even when the number of neighbors increases; on the GPU, however, the number of sources or children processed at the same time is much larger, giving a much bigger working set, which decreases the cache hit rate and results in a longer running time.
Also interesting is comparing the runtime on the V100 and the P100: the V100 is about 3X faster than the P100 when running GraphSAGE. This large performance difference is caused by the different limiting factors when running GraphSAGE on these two GPUs; details are in the Performance limitations section. Compared to OpenMP, the V100 is about 20X to 30X faster.
Performance limitations#
We profiled GraphSAGE with the pokec dataset on a Titan Xp GPU (profiling on the P100 caused an internal error in the profiler itself; the Titan Xp has roughly the same SM design as the P100, but has only 30 SMs vs. 56 on the P100; the P100 also has 16 GB of HBM2 memory, while the Titan Xp only has 12 GB of GDDR5X; runtimes on the Titan Xp and the P100 are similar). Kernel2 takes up about 60% of the computation of a batch, 10.64 ms out of 17.76 ms for pokec with feature length 64, 10 neighbors, and a batch size of 16384. Kernel1 takes 1.56 ms, or 8.8%, and Kernel3 takes 5.52 ms, or 31.1%.
Further detail on the profile of Kernel2 on the Titan Xp shows the utilization of the memory system:
| Type | Transactions | Bandwidth | Utilization |
|---|---|---|---|
| Shared Loads | 2129920 | 31.997 GB/s | |
| Shared Stores | 1638400 | 24.613 GB/s | |
| Shared Total | 3768320 | 56.611 GB/s | Idle to low |
| Local Loads | 0 | 0 GB/s | |
| Local Stores | 0 | 0 GB/s | |
| Global Loads | 2685009922 | 1575.871 GB/s | |
| Global Stores | 0 | 0 GB/s | |
| Texture Reads | 671252480 | 2521.025 GB/s | |
| Unified Total | 3356262402 | 4096.897 GB/s | High |
| L2 Reads | 264170192 | 992.145 GB/s | |
| L2 Writes | 10485773 | 39.381 GB/s | |
| L2 Total | 274655965 | 1031.526 GB/s | Low to medium |
| Device Reads | 2181833 | 8.194 GB/s | |
| Device Writes | 543669 | 2.042 GB/s | |
| Device Total | 2725502 | 10.236 GB/s | Idle to low |
It’s clear that the unified cache is almost fully utilized, at 4 TB/s out of the 5 TB/s theoretical upper bound, and is the bottleneck. This is because the W arrays and the intermediate arrays are highly reusable.
Running the same experiment on the V100 shows a different picture within the memory system:
| Type | Transactions | Bandwidth | Utilization |
|---|---|---|---|
| Shared Loads | 2138599 | 74.166 GB/s | |
| Shared Stores | 1640842 | 56.904 GB/s | |
| Shared Total | 3779441 | 131.070 GB/s | Idle to low |
| Local Loads | 0 | 0 GB/s | |
| Local Stores | 0 | 0 GB/s | |
| Global Loads | 419594240 | 3637.862 GB/s | |
| Global Stores | 0 | 0 GB/s | |
| Texture Reads | 177448960 | 6153.897 GB/s | |
| Unified Total | 597043200 | 9791.759 GB/s | Medium |
| L2 Reads | 8209326 | 71.174 GB/s | |
| L2 Writes | 5243176 | 45.458 GB/s | |
| L2 Total | 274655965 | 116.633 GB/s | Idle to low |
| Device Reads | 2402833 | 20.832 GB/s | |
| Device Writes | 571925 | 4.959 GB/s | |
| Device Total | 2974758 | 25.791 GB/s | Idle to low |
On the V100, Kernel2 only takes 3.982 ms (about 60% of the 6.67 ms per batch), and the unified-cache throughput increases to 9.8 TB/s, more than double that of the Titan Xp. In fact, the theoretical upper bound of the V100’s L1 throughput is 28 TB/s, resulting from doubling each SM’s load throughput from 128 bytes per cycle to 256 bytes per cycle and increasing the SM count to 80. This is why the V100 can outperform the P100 and Titan Xp by about 3X. The performance bottleneck is no longer the memory system, but shifts to integer computation, which comes mainly from array-index calculations. This particular kernel also uses 32 registers per thread, which is the limit for full GPU occupancy; storing intermediate index-calculation results would help, were the register usage not already so high.
Next Steps#
Alternate approaches#
Things we tried that didn’t really work
A simple implementation that uses one thread to process the per-source or per-child computation was coded, but it runs about 10X slower than the current implementation, which uses a block to process such units. One reason is the use of block-level parallel primitives, such as block reduce. Another reason is that by using a whole block, instead of a thread, to process the same computation, the working set is greatly reduced along with the parallelism, and the whole working set can fit in the cache system. With higher parallelism, the working set is larger and is forced to be evicted to global memory, which creates a bottleneck. In fact, during the experiment, reducing the number of blocks of the thread-level-parallelism kernel improved its running time, the opposite of what would normally be expected from other kernels.
Gunrock implications#
One thing that Gunrock does not provide, or intentionally hides, is block-level parallelism. However, it comes in handy when implementing the custom kernels for GraphSAGE: each block can process a one-dimensional vector, with each thread holding one element, then use a block-level reduce to get the sum of those elements. This approach is highly efficient, and it actually reduces the parallelism and, with it, the size of the working set of data.
Notes on multi-GPU parallelization#
The main memory usage comes from the feature data, so it is critical not to duplicate the features. The side effect is that the computation needs to be divided into child-centric and source-centric parts, with a data exchange in between. It should be scalable and easy to implement.
Notes on dynamic graphs#
GraphSAGE does not have a dynamic graph component, but it should be able to work on a dynamic graph. Some of the data may be reusable if the graph has not changed significantly, but the resulting memory requirement to store the intermediate data may make data reuse impractical.
Notes on larger datasets#
If the dataset is so large that the graph and the per-vertex feature data exceed the combined GPU memory, it is possible to accumulate the features of leafs and children on the CPU, transfer the data onto the GPU, and perform the computation there. Because of the cost of the transfer, the runtime will increase significantly as compared to the case where the full data fits in GPU memory, but whether it would rise above the OpenMP implementation’s runtime is still unknown.
Notes on other pieces of this workload#
The main part of GraphSAGE workflow is actually the training process, which will be outside of Gunrock, provided by TensorFlow, PyTorch or other machine learning libraries. How to connect the training with the Gunrock GPU implementation is the main task for this workload going forward.
Research potential#
The GPU implementation of the embedding part runs a lot faster than the CPU implementation, and hits a few GPU hardware limitations; it should have a runtime comparable to other GPU implementations. But it is only useful as part of the whole workload, and only if it achieves prediction accuracy comparable to conventional / reference implementations.
An interesting question raised by the GraphSAGE GPU implementation is whether it is useful to expose block-level parallelism to higher-level programming models / APIs, and if so, how to do that. It is clear that working at the block level provides benefits, such as the ability to use block-level primitives like scan and reduce. But it also comes with costs, most importantly requiring the programmer to have knowledge of the GPU hardware. It also reduces the portability of the implementation, because two-level parallelism may not exist on other processors.