Geolocation#
Infers user locations using the locations (latitude, longitude) of friends through spatial label propagation. Given a graph $G$, geolocation examines each vertex $v$'s neighbors and computes the spatial median of the neighbors' location list. The output is a list of predicted locations for all vertices with unknown locations.
Summary of Results#
Geolocation or geotagging is an interesting parallel problem because it is among the few that exhibit a dynamic parallelism pattern within the compute. The pattern is as follows: there is parallel compute across nodes, each node has some serial work, and within that serial work there are several parallel math operations. Even without leveraging dynamic parallelism within CUDA (kernel launches within a kernel), Geolocation performs well on the GPU because it mainly requires simple math operations rather than complicated memory movement schemes.
However, the challenge within the application is load balancing this simple compute such that each processor has roughly the same amount of work. Currently, in Gunrock, we map Geolocation using the `ForAll()` compute operator with optimizations to exit early (performing less work and fewer reads). Even without addressing the load-balancing issue with a complicated balancing scheme, on the HIVE datasets we achieve a ~100x speedup with respect to the CPU reference code (implemented in C++ and OpenMP) and a ~533x speedup with respect to the GTUSC implementation. We improve upon the algorithm by avoiding a global gather and a global synchronize, and by using 3x less memory than the GTUSC reference implementation.
Summary of Gunrock Implementation#
We took two approaches to implementing Geolocation within Gunrock:
**[Fewer Reads] Global Gather:** uses two compute operators as `ForAll()`. The first `ForAll()` is a gather operation, gathering the values of all neighbors with known locations for an active vertex `v`, and the second `ForAll()` uses those values to compute the `spatial_center`, where the spatial center of a list of points is the center of those points on the earth's surface.
```python
def gather_op(v: Vertex):
    # First ForAll(): gather the known locations of v's neighbors.
    for neighbor in G.neighbors(v):
        if isValid(neighbor.location):
            locations_list[v].append(neighbor.location)

def compute_op(v: Vertex):
    # Second ForAll(): predict v's location from the gathered list.
    if not isValid(v.location):
        v.location = spatial_center(locations_list[v])
```
**[Less Memory] Repeated Compute:** skips the global gather and uses only one compute operator as a `ForAll()` to find the spatial center of every vertex. During the spatial center computation, instead of iterating over all valid neighbors (where a valid neighbor is one with a known location), we iterate over all neighbors of each vertex, doing more random reads than the global gather approach but using 3x less memory.
```python
def spatial_center(v: Vertex):
    # One ForAll(): each vertex scans its full neighbor list directly.
    if not isValid(v.location):
        v.location = spatial_median(neighbors_list[v])
```
**[Optimization] Early Exit:** fuses the global gather approach with the repeated compute by performing one local gather for every vertex within the spatial center operator (without a costly device barrier), and exiting early if a vertex `v` has only one or two valid neighbors:
```python
def spatial_center(v: Vertex):
    if not isValid(v.location):
        if v.valid_locations == 1:
            # One known neighbor location: copy it and exit early.
            v.location = valid_neighbor[v].location
        elif v.valid_locations == 2:
            # Two known locations: their midpoint is the spatial median.
            v.location = mid_point(valid_neighbors[v].location)
        else:
            # General case: full spatial median over the neighbor list.
            v.location = spatial_median(neighbors_list[v])
```
Comparing Global Gather vs. Repeated Compute#
| Approach | Memory Usage | Memory Reads/Vertex | Device Barriers | Largest Dataset (P100) |
|---|---|---|---|---|
| Global Gather | $O(3 \cdot \lvert E \rvert)$ | # of valid locations | 1 | ~160M edges |
| Repeated Compute | $O(\lvert E \rvert)$ | degree of vertex | 0 | ~500M edges |
Note: `spatial_median()` is defined as the center of points on the earth's surface: given a set of points $Q$, the function computes the point $p$ such that `sum([haversine_distance(p, q) for q in Q])` is minimized. See `gunrock/app/geo/geo_spatial.cuh` for details on the spatial median implementation.
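For intuition, here is a minimal sequential Python sketch of a Weiszfeld-style iteration that approximates this spatial median. It only illustrates the objective above and is not the `geo_spatial.cuh` implementation; in particular, averaging raw (latitude, longitude) coordinates is an approximation that degrades near the poles and the antimeridian.

```python
import math

def haversine_distance(p, q):
    """Great-circle distance (km) between two (lat, lon) points in degrees."""
    R = 6371.0  # mean earth radius in km
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))

def spatial_median(Q, iters=1000, eps=1e-3):
    """Move a candidate point toward the minimizer of
    sum(haversine_distance(p, q) for q in Q) via inverse-distance weighting."""
    # Start from the arithmetic mean of the coordinates.
    p = (sum(q[0] for q in Q) / len(Q), sum(q[1] for q in Q) / len(Q))
    for _ in range(iters):
        # Clamp distances with eps to avoid division by zero at a data point.
        weights = [1.0 / max(haversine_distance(p, q), eps) for q in Q]
        total = sum(weights)
        p = (sum(w * q[0] for w, q in zip(weights, Q)) / total,
             sum(w * q[1] for w, q in zip(weights, Q)) / total)
    return p
```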
How To Run This Application on DARPA’s DGX-1#
Prerequisites#
```bash
git clone --recursive https://github.com/gunrock/gunrock -b dev-refactor
cd gunrock/tests/geo/
make clean && make
```
HIVE Data Preparation#
Prepare the data; skip this step if you are just running the sample dataset. Assuming we are in the `tests/geo` directory:
```bash
export TOKEN= # get this authentication TOKEN from
              # https://api-token.hiveprogram.com/#!/user
wget --header "Authorization:$TOKEN" \
     https://hiveprogram.com/data/_v0/geotagging/instagram/instagram.tar.gz
tar -xzvf instagram.tar.gz && rm instagram.tar.gz
cd instagram/graph
cp ../../generate-data.py ./
python generate-data.py
```
This will generate two files, `instagram.mtx` and `instagram.labels`, which can be used as input to the geolocation app.
Running the application#
Application-specific parameters:

- `--labels-file` : file name containing node IDs and their locations.
- `--geo-iter` : number of iterations to run geolocation (stop condition). (default: 3)
- `--spatial-iter` : number of iterations for the spatial median computation. (default: 1000)
- `--geo-complete` : runs geolocation for as many iterations as required to find locations for all nodes. (default: false, because it uses atomics)
- `--debug` : debugs label values by printing out the entire labels array (latitude, longitude). (default: false)
Example command-line:
```bash
# geolocation.mtx is a graph based on the chesapeake.mtx dataset
./bin/test_geo_10.0_x86_64 --graph-type=market --graph-file=./geolocation.mtx \
    --labels-file=./locations.labels --geo-iter=2 --geo-complete=false
```
Sample input (labels):
```
% Nodes Latitude Longitude
39 2 2
1 37.7449063493 -122.009432884
2 37.8668048274 -122.257973253
4 37.869112506 -122.25910604
6 37.6431858915 -121.816156983
11 37.8652346572 -122.250634008
19 38.2043433677 -114.300341275
21 36.7582225593 -118.167916598
22 33.9774659389 -114.886512278
30 39.2598884729 -106.804662071
31 37.880443573 -122.230147039
39 9.4276164485 -110.640705659
```
Sample output:
```
Loading Matrix-market coordinate-formatted graph ...
Reading from ./geolocation.mtx:
Parsing MARKET COO format edge-value-seed = 1539674096
(39 nodes, 340 directed edges)...
Done parsing (0 s).
Converting 39 vertices, 340 directed edges ( ordered tuples) to CSR format...
Done converting (0s).
Labels File Input: ./locations.labels
Loading Labels into an array ...
Reading from ./locations.labels:
Parsing LABELS
(39 nodes)
Done parsing (0 s).
Debugging Labels -------------
(nans represent unknown locations)
locations[ 0 ] = < 37.744907 , -122.009430 >
locations[ 1 ] = < 37.866806 , -122.257973 >
locations[ 2 ] = < nan , nan >
locations[ 3 ] = < 37.869114 , -122.259109 >
...
locations[ 35 ] = < nan , nan >
locations[ 36 ] = < nan , nan >
locations[ 37 ] = < nan , nan >
locations[ 38 ] = < 9.427616 , -110.640709 >
__________________________
______ CPU Reference _____
--------------------------
Elapsed: 0.267029
Initializing problem ...
Number of nodes for allocation: 39
Initializing enactor ...
Using advance mode LB
Using filter mode CULL
nodes=39
__________________________
0 0 0 queue3 oversize : 234 -> 342
0 0 0 queue3 oversize : 234 -> 342
--------------------------
Run 0 elapsed: 11.322021, #iterations = 2
Node [ 0 ]: Predicted = < 37.744907 , -122.009430 > Reference = < 37.744907 , -122.009430 >
Node [ 1 ]: Predicted = < 37.866806 , -122.257973 > Reference = < 37.866806 , -122.257973 >
Node [ 2 ]: Predicted = < 9.427616 , -110.640709 > Reference = < 9.427616 , -110.640709 >
Node [ 3 ]: Predicted = < 37.869114 , -122.259109 > Reference = < 37.869114 , -122.259109 >
...
Node [ 35 ]: Predicted = < 37.864429 , -122.199409 > Reference = < 37.864429 , -122.199409 >
Node [ 36 ]: Predicted = < 23.755602 , -115.803055 > Reference = < 37.807079 , -122.134163 >
Node [ 37 ]: Predicted = < 37.053715 , -115.913658 > Reference = < 37.053719 , -115.913628 >
Node [ 38 ]: Predicted = < 9.427616 , -110.640709 > Reference = < 9.427616 , -110.640709 >
0 errors occurred.
[geolocation] finished.
avg. elapsed: 11.322021 ms
iterations: 2
min. elapsed: 11.322021 ms
max. elapsed: 11.322021 ms
load time: 68.671 ms
preprocess time: 496.136000 ms
postprocess time: 0.463009 ms
total time: 508.110046 ms
```
Output#
When `quick` mode is disabled, the application runs the CPU reference implementation, which is used to validate the results from the GPU implementation by comparing each vertex's predicted latitude and longitude against the CPU's predictions. Further correctness checking was performed by comparing results to the HIVE reference implementation.

The Geolocation application also supports a `quiet` mode, which allows the user to skip the output and report only the performance metrics (note that this still runs the CPU implementation in the background, without any output).
Performance and Analysis#
Runtime is the key metric for measuring Geolocation's performance. We also check the prediction accuracy of the labels, but that is a correctness threshold: if a certain threshold is not met (when comparing results to the CPU reference code), the output is considered incorrect and that run is invalid. Therefore, this report focuses on runtime.
Implementation limitations#
Geolocation is also one of the few applications that exhibits a dynamic parallelism pattern:

1. parallel compute across the nodes,
2. serial compute per node, and
3. parallel compute within the serial compute per node.
One way to implement this would use the `ForAll()` operator for the parallel compute across the nodes, a simple while loop for the serial compute per node, and finally multiple `neighbor_reduce()` operators for the parallel work within the serial while loop, as sketched below. Currently, we do not have a way to support this within Gunrock, but moving forward we can potentially leverage kernel launches within a kernel ("dynamic parallelism") to address this limitation.
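A rough pseudocode sketch of that mapping, in the style of the earlier operator snippets. `neighbor_reduce()`, `dist()`, `initial_guess()`, and `spatial_iters` are assumed names for illustration, not existing Gunrock API, and the math is a simplified Weiszfeld-style update:

```python
def spatial_median_op(v: Vertex):        # ForAll(): parallel across nodes
    p = initial_guess(v)
    for _ in range(spatial_iters):       # serial iteration loop per node
        # Each reduction below is itself parallel work over v's neighbor
        # list; today it runs serially inside one thread, but a kernel
        # launch within the kernel could parallelize it.
        w_sum   = neighbor_reduce(v, op=add, key=lambda n: 1.0 / dist(p, n))
        lat_sum = neighbor_reduce(v, op=add, key=lambda n: n.lat / dist(p, n))
        lon_sum = neighbor_reduce(v, op=add, key=lambda n: n.lon / dist(p, n))
        p = (lat_sum / w_sum, lon_sum / w_sum)
    v.location = p
```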
Comparison against existing implementations#
| GPU | Dataset | $\lvert V \rvert$ | $\lvert E \rvert$ | Iters | Spatial Iters | GTUSC (16 threads, ms) | Gunrock CPU (ms) | Gunrock GPU (ms) |
|---|---|---|---|---|---|---|---|---|
| P100 | sample | 39 | 170 | 10 | 1000 | N/A | 0.144005 | 0.022888 |
| P100 | instagram | 23,731,995 | 82,711,740 | 10 | 1000 | 8009.491 | 1589.884033 | 15.113831 |
| V100 | twitter | 50,190,344 | 488,078,602 | 10 | 1000 | N/A | 9216.666016 | 46.108007 |
On a workload that fills the GPU, Gunrock outperforms GTUSC's OpenMP C++ implementation by ~533x. Comparing Gunrock's GPU and CPU performance, the GPU version outperforms the CPU implementation by ~100x. There is a lack of available datasets against which we can compare performance, so we use only the provided instagram and twitter datasets, plus a toy sample for a sanity check, on NVIDIA's P100 with 16 GB of global memory and V100 with 32 GB of global memory. All tested implementations meet the accuracy criteria, which are validated against the output of the original Python implementation.

- The HIVE reference implementation uses distributed PySpark.
- The GTUSC implementation uses C++ with OpenMP.
Performance limitations#
As discussed later in the "Alternate approaches" section, the current implementation of Geolocation uses a compute operator with minimal load balancing. On graphs that are not nicely distributed (where vertex degrees vary widely), the entire application will suffer significantly from load imbalance.

Profiling shows that 98.78% of the GPU compute time is spent in the `spatial_median` kernel, which gives us a clear target for load-balancing efforts within the operator: specifically, the `for` loops iterating over the neighbor list for the spatial center calculations.
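To make the imbalance concrete: in the `ForAll()` mapping, one GPU thread owns one vertex, so a thread's loop trip count equals its vertex's degree. A schematic view (`accumulate` is a hypothetical stand-in for the spatial-median loop body):

```python
def spatial_center(v: Vertex):
    # One thread per vertex: this loop runs degree(v) times. A thread
    # owning a high-degree vertex keeps iterating while warp-mates with
    # short neighbor lists have already finished and sit idle.
    for neighbor in G.neighbors(v):
        accumulate(neighbor.location)   # simplified loop body
```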
Next Steps#
Alternate approaches#
- **Neighborhood Reduce w/ Spatial Center:** We can perform better load balancing by leveraging a neighbor-reduce (`advance` operator + `cub::DeviceSegmentedReduce`) instead of a compute operator. In graphs where node degrees vary widely, the compute operator will be significantly slower than a load-balanced advance + segmented reduce.
- **Push-Based Approach:** Instead of gathering the locations of all the neighbors of an active vertex, we could instead scatter the valid locations of all active vertices to their neighbors; this is a push approach versus our current implementation's pull (the pull is done using a `ForAll()`). Similar to the global gather approach, a push-based geolocation could also suffer from load imbalance, where some vertices have to broadcast their valid locations to a long list of neighbors while others have only a few neighbors to update. A push-based approach will also require a device synchronize before the spatial center computation, but may perform better by using an `advance_op` with an atomic update; see the sketch after this list.
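As a rough illustration of the push direction, in the style of the earlier snippets. The per-edge `push_op` signature, `atomic_append`, and `locations_list` are assumptions for this sketch, not existing Gunrock API:

```python
def push_op(src: Vertex, dest: Vertex):   # advance: runs once per edge
    # Push: a vertex with a known location scatters it to each neighbor
    # that still lacks one.
    if isValid(src.location) and not isValid(dest.location):
        # Multiple sources may target the same dest concurrently,
        # hence the atomic update.
        atomic_append(locations_list[dest], src.location)

# ... device-wide synchronize here ...

def center_op(v: Vertex):                 # ForAll() after the synchronize
    if not isValid(v.location):
        v.location = spatial_center(locations_list[v])
```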
Gunrock implications#
- **The `predicted` atomic:** Geolocation and some other applications exhibit the same behavior, where the algorithm stops once all vertices' labels are predicted or determined. In Geolocation's case, once a location is predicted for every node, geolocation converges. We currently implement this with a loop and an atomic. This needs to be more of a core operation (mini-operator), such that when `isValidCount(labels) == |V|`, a stop condition is met; a sketch follows this list. Currently, we sidestep this issue by using a number-of-iterations parameter to determine the stop condition.
- **Parallel -> Serial -> Parallel:** As discussed earlier, Gunrock currently doesn't have a way to address the dynamic parallelism problem, or even a kernel launch within a kernel. In Geolocation's case, the small pieces of parallel work inside the serial loop need to be multiple neighbor reduces.
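A minimal sketch of the loop-plus-atomic convergence check that such a mini-operator would replace (all names here are illustrative, not Gunrock API):

```python
def count_op(v: Vertex):
    # Each thread atomically bumps a shared counter for its known label.
    if isValid(v.location):
        atomic_add(valid_count, 1)

while True:
    ForAll(G.vertices, spatial_center)   # one geolocation iteration
    valid_count = 0
    ForAll(G.vertices, count_op)         # loop + atomic convergence check
    if valid_count == G.num_vertices:    # i.e., isValidCount(labels) == |V|
        break
```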
Notes on multi-GPU parallelization#
The challenging part of a multi-GPU Geolocation is obtaining the updated node location from a separate device when two vertices on different devices share an edge. An interesting approach would be to leverage P2P memory bandwidth over the new NVLink connectors to exchange the small number of updates across devices; other options are simply direct accesses or explicit data movement. This is detailed further in the scaling documentation, but the communication model for multi-GPU geolocation could work as follows:
```
do
    local geolocation updates on local vertices;
    broadcast local vertices' updates;
while updates remain.
```
Notes on dynamic graphs#
Streaming graphs pose an interesting problem for the Geolocation application: when predicting the location of a certain node, if another edge is introduced, the location of that vertex has to be recomputed entirely. This can still be done in an iterative manner, where if a node is inserted as a neighbor of a vertex, that vertex's predicted location is marked invalid, and during the next iteration it is recomputed along with all the other invalid vertices (locations); a small sketch follows.
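A hedged sketch of that invalidation rule on edge insertion (`is_ground_truth` and `INVALID` are hypothetical helpers for this illustration):

```python
def insert_edge(u: Vertex, v: Vertex):
    G.add_edge(u, v)
    # A new neighbor can shift the spatial median, so mark both
    # endpoints' *predicted* locations invalid; ground-truth input
    # labels are left untouched.
    for x in (u, v):
        if not is_ground_truth(x):
            x.location = INVALID
    # The next geolocation iteration recomputes all invalid vertices
    # alongside any others that are still unknown.
```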
Notes on larger datasets#
If the datasets are larger than a single GPU's memory (or multiple GPUs' aggregate memory), the straightforward solution would be to let CUDA's Unified Virtual Memory (UVM) handle memory movement automatically.
Notes on other pieces of this workload#
Geolocation calls a lot of CUDA math functions (`sin`, `cos`, `atan`, `atan2`, `median`, `mean`, `fminf`, `fmaxf`, etc.). Some of these micro-workloads can also leverage the GPU's parallelism; for example, a mean could be implemented using a `reduce-mean/sum`. We currently don't have these math operators exposed within Gunrock in a way that they can be used in graph applications.
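For instance, a parallel mean over a vertex's neighbor values reduces to a sum-reduction followed by one division. A minimal sketch in the style of the earlier pseudocode (`neighbor_reduce` is the same assumed operator as above, not existing Gunrock API):

```python
def mean_latitude(v: Vertex):
    # Sum via a parallel tree-reduction over v's neighbor list,
    # then a single division: reduce-mean == reduce-sum / count.
    total = neighbor_reduce(v, op=add, key=lambda n: n.location.latitude)
    return total / G.degree(v)
```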
Research Potential#
Further research is required to study Geolocation's dynamic parallelism pattern, its memory access behavior, compute resource utilization, implementation details (API and core), and load-balancing strategies for dynamic parallelism on GPUs. Studying and understanding this pattern can allow us to create a more generalized approach to load balancing `parallel -> serial -> parallel` problems. It further raises the question of when dynamic parallelism is better than mapping an algorithm to a more conventional static approach (where possible).