Community Detection (Louvain)#

Community detection in graphs means grouping vertices so that vertices that are closer to each other (i.e., share more connections) are placed in the same cluster. A commonly used algorithm for community detection is Louvain (https://arxiv.org/pdf/0803.0476.pdf).
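
For reference, each pass of Louvain greedily maximizes the modularity of the current partition (Blondel 2008):

$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$

where $A_{ij}$ is the weight of the edge between vertices $i$ and $j$, $k_i$ is the weighted degree of vertex $i$, $m$ is the total edge weight in the graph, $c_i$ is the community of vertex $i$, and $\delta$ is 1 when its two arguments are equal and 0 otherwise. After the greedy vertex moves converge, each community is contracted into a single vertex and the process repeats on the smaller graph.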

Summary of Results#

The Gunrock implementation uses sort and segmented reduce to implement the Louvain algorithm, rather than the commonly used hash-table mapping. The GPU implementation is roughly 1.5x faster than the OpenMP implementation, and also faster than previous GPU work. It is still unknown whether the sort and segmented-reduce formulation maps the problem onto the GPU better than a hash table would. The modularities produced by the GPU implementation are close to those of the serial implementation, and are better when the graph is larger. A custom hash table could potentially improve the running time further. The GPU Louvain implementation should have moderate scalability across multiple GPUs in a DGX-1.

Summary of Gunrock Implementation#

The commonly used approach to implementing the Louvain algorithm relies on a hash table. However, the memory access pattern of a hash table is almost entirely random and not GPU-friendly (more in the Alternate approaches section). Instead of using a hash table to accumulate the values associated with the same key, the Gunrock GPU implementation takes another approach: sort all key-value pairs, then use segmented reduce to accumulate the values within each contiguous segment of equal keys. Because Louvain always visits all edges in the graph, there is no need to use Gunrock frontiers; the advance operator with the ALL_EDGES advance mode or a simple ForAll loop is sufficient. The pseudocode is listed below:

m2 <- sum(edge_weights);
//Outer-loop
Do
    // Pass-initialization, assign each vertex to its own community
    For each vertex v in graph:
        current_communities[v] <- v;
        weights_v2any      [v] <- 0;
        weights_v2self     [v] <- 0;
    For each edge e<v, u> in graph: //an advance operator on all edges
        weights_v2any[v] += edge_weights[e];
        if (v == u)
            weights_v2self[v] += edge_weights[e];
    For each vertex v in graph:
        weights_community2any[v] <- weights_v2any[v];
    pass_gain <- 0;

    // Modularity optimizing iterations
    Do
        // Get weights between vertex and community
        For each edge e<v, u> in graph:
            edge_pairs  [e] <- <v, current_communities[u]>;
            pair_weights[e] <- edge_weights[e];
        Sort(edge_pairs, pair_weights)
            by edge_pair.first, then edge_pair.second if tie;
        segment_offsets <- offsets of continuous edge_pairs;
        segment_weights <- SegmentedReduce(pair_weights, segment_offsets, sum);

        // Compute base modularity gains
        // if moving vertices out of current communities
        For each vertex v in graph:
            comm <- current_communities[v];
            w_v2comm <- Find weights from v to comm in segment_weights;
            gain_bases[v] = weights_v2self[v] - w_v2comm
                - (weights_v2any[v] - weights_community2any[comm])
                    * weights_v2any[v] / m2;

        // Find the max gains if moving vertices into adjacent communities
        For each vertex v in graph:
            max_gains[v] <- 0;
            next_communities[v] <- current_communities[v];
        For each seg<v, comm> segment:
            if (comm == current_communities[v])
                continue;
            gain <- gain_bases[v] + segment_weights[seg]
                - weights_community2any[comm] * weights_v2any[v] / m2;
            atomicMax(max_gains + v, gain);
            seg_gains[seg] <- gain;
        For each seg<v, comm> segment:
            if (seg_gains[seg] != max_gains[v])
                continue;
            next_communities[v] <- comm;

        // Update communities
        For each vertex v in graph:
            curr_comm <- current_communities[v];
            next_comm <- next_communities[v];
            if (curr_comm == next_comm)
                continue;

            atomicAdd(weights_community2any[next_comm],  weights_v2any[v]);
            atomicAdd(weights_community2any[curr_comm], -weights_v2any[v]);
            current_communities[v] <- next_comm;

        iteration_gain <- Reduce(max_gains, sum);
        pass_gain += iteration_gain;
    While iterations stop condition not met
    // End of modularity optimizing iterations

    // Contract the graph
    // renumber occupied communities
    new_communities <- Renumber(current_communities);
    For each edge e<v, u> in graph: //an advance operator on all edges
        edge_pairs[e] <- <new_communities[current_communities[v]],
                          new_communities[current_communities[u]]>
    Sort(edge_pairs, edge_weights) by edge_pair.first, then edge_pair.second if tie;
    segment_offsets <- offsets of continuous edge_pairs;
    new_graph.Allocate(|new_communities|, |segments|);
    new_graph.edges <- first pair of each segment;
    new_graph.edge_values <- SegmentedReduce(edge_weights, segment_offsets, sum);

While pass stop condition not met
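
The sort and segmented reduce steps above map directly onto standard GPU primitives. The sketch below illustrates the pattern with CUB's DeviceRadixSort::SortPairs and DeviceSegmentedReduce::Sum; the buffer names, the 64-bit key packing, and the omitted offset-building kernel are illustrative assumptions, and the actual Gunrock code (including its load-balanced segmented reduce, see the Alternate approaches section) differs in detail.

```cpp
// Sketch: accumulate per-(vertex, community) weights with CUB primitives.
// Each key packs (v, current_communities[u]) into one 64-bit integer, so a
// single radix sort groups identical pairs into contiguous segments.
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdint>

// Assumed to be filled by an all-edges kernel beforehand:
//   d_keys_in    : uint64_t[num_edges], key = (uint64_t(v) << 32) | comm(u)
//   d_weights_in : double[num_edges],   value = edge_weights[e]
//   d_offsets    : int[num_segments+1], boundaries of runs of equal keys
void AccumulatePairWeights(const uint64_t* d_keys_in, const double* d_weights_in,
                           uint64_t* d_keys_out, double* d_weights_out,
                           int num_edges,
                           const int* d_offsets, int num_segments,
                           double* d_segment_weights) {
  void*  d_temp = nullptr;
  size_t temp_bytes = 0;

  // Sort (key, weight) pairs; the first call only queries temp storage size.
  cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
      d_keys_in, d_keys_out, d_weights_in, d_weights_out, num_edges);
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
      d_keys_in, d_keys_out, d_weights_in, d_weights_out, num_edges);
  cudaFree(d_temp);

  // (A separate kernel would scan d_keys_out and fill d_offsets with the
  //  boundaries of consecutive equal keys; omitted here.)

  // One sum per (vertex, community) segment.
  d_temp = nullptr; temp_bytes = 0;
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes,
      d_weights_out, d_segment_weights, num_segments,
      d_offsets, d_offsets + 1);
  cudaMalloc(&d_temp, temp_bytes);
  cub::DeviceSegmentedReduce::Sum(d_temp, temp_bytes,
      d_weights_out, d_segment_weights, num_segments,
      d_offsets, d_offsets + 1);
  cudaFree(d_temp);
}
```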

How To Run This Application on DARPA’s DGX-1#

Prereqs/input#

CUDA should be installed, and $PATH and $LD_LIBRARY_PATH should be set correctly to use CUDA. The current Gunrock configuration assumes Boost (1.58.0 or 1.59.0) and METIS are installed; if not, changes need to be made in the Makefiles. DARPA's DGX-1 had both installed when the tests were performed.

git clone --recursive https://github.com/gunrock/gunrock/
cd gunrock
git checkout dev-refactor
git submodule init
git submodule update
mkdir build
cd build
cmake ..
cd ../tests/louvain
make

At this point, there should be an executable louvain_main_<CUDA version>_x86_64 in tests/louvain/bin.

The datasets are assumed to have been placed in /raid/data/hive, and converted to proper matrix market format (.mtx). At the time of testing, ca, amazon, akamai, and pokec are available in that directory. ca and amazon are taken from PNNL’s implementation, and originally use 0-based vertex indices; 1 is added to each vertex id to make them proper .mtx files.
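
For completeness, a minimal sketch of that 0-based to 1-based conversion; the file names, the pattern (unweighted) format, and the lack of error handling are assumptions:

```cpp
// Sketch: convert a 0-based whitespace-separated edge list into a 1-based
// MatrixMarket (.mtx) file. Error handling is omitted.
#include <cstdio>

int main() {
  std::FILE* in  = std::fopen("edges_0based.txt", "r");
  std::FILE* out = std::fopen("graph.mtx", "w");
  long num_vertices = 0, num_edges = 0, v, u;
  // First pass: count edges and find the largest vertex id.
  while (std::fscanf(in, "%ld %ld", &v, &u) == 2) {
    if (v + 1 > num_vertices) num_vertices = v + 1;
    if (u + 1 > num_vertices) num_vertices = u + 1;
    ++num_edges;
  }
  std::fprintf(out, "%%%%MatrixMarket matrix coordinate pattern general\n");
  std::fprintf(out, "%ld %ld %ld\n", num_vertices, num_vertices, num_edges);
  // Second pass: shift both endpoints by +1 for 1-based .mtx indexing.
  std::rewind(in);
  while (std::fscanf(in, "%ld %ld", &v, &u) == 2)
    std::fprintf(out, "%ld %ld\n", v + 1, u + 1);
  std::fclose(in);
  std::fclose(out);
  return 0;
}
```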

The testing is done with Gunrock using the dev-refactor branch at commit 2699252 (Oct. 18, 2018), using CUDA 9.1 with NVIDIA driver 390.30.

Running the application#

./bin/louvain_main_9.1_x86_64 --omp-threads=32 --iter-stats --pass-stats \
--advance-mode=ALL_EDGES --unify-segments=true --validation=each --num-runs=10 \
--graph-type=market --graph-file=/raid/data/hive/[DataSet]/[DataSet].mtx \
--jsondir=[LogDir] > [LogDir]/[DataSet].txt 2>&1

  • Add --undirected, if the graph is indeed undirected.

  • Remove --iter-stats or --pass-stats, if detailed timings are not required.

  • Remove --validation=each, to only compute the modularity for the last run.

For example, when DataSet = akamai, and LogDir = eval/DGX1-P100x1, the command is

./bin/louvain_main_9.1_x86_64 --omp-threads=32 --iter-stats --pass-stats \
--advance-mode=ALL_EDGES --unify-segments=true --validation=each --num-runs=10 \
--graph-type=market --graph-file=/raid/data/hive/akamai/akamai.mtx \
--jsondir=eval/DGX1-P100x1 > eval/DGX1-P100x1/akamai.txt 2>&1

Output#

The outputs are in the louvain directory. Look for the .txt files: the running time appears after Run x elapsed:, and the number of communities and the resulting modularity are in the line that starts with Computed:. There are 12 runs in each .txt file: 1 single-thread CPU run for reference, 1 OpenMP multi-thread run (32 threads in the example, which may not be optimal), and 10 GPU runs.

The output was compared against PNNL's results on the number of communities and modularity for the amazon and ca datasets. Note that PNNL's code does not count dangling vertices in communities, so the results listed below use the number of communities minus dangling vertices; the dataset details can be found in the next section.

The modularity of resulting communities:

| DataSet | Gunrock GPU | OMP (32T) | Serial | PNNL (8T) | PNNL (serial) |
|---|---|---|---|---|---|
| amazon | 0.908073 | 0.925721 | 0.926442 | 0.923728 | 0.925557 |
| ca | 0.711971 | 0.730217 | 0.731292 | 0.713885 | 0.727127 |

Note that for these kinds of small graphs, more parallelism can hurt the resulting modularity. Multi-thread CPU implementations from both Gunrock and PNNL yield modularities a little lower than the serial versions, and the GPU implementation sees a ~0.02 drop. The reason could be concurrent updates to communities: vertex A moves to community C, thinking vertex B is in C, but B may have simultaneously moved to another community.

However, when input graphs are larger in size, this issue seems to disappear, and modularities from the GPU implementation are sometimes even better than the serial implementation. (See detailed results below.)

The number of resulting communities:

| DataSet | Gunrock GPU | OMP (32T) | Serial | PNNL (8T) | PNNL (serial) |
|---|---|---|---|---|---|
| amazon | 7667 | 213 | 240 | 298 | 251 |
| ca | 1120 | 616 | 617 | 654 | 623 |

More parallelism also affects the number of resulting communities. On these two small datasets, the GPU implementation produces significantly more communities than all the CPU implementations; on large datasets, the differences in the number of communities are much smaller. We believe the reason may also be concurrent community updates, especially when whole-community migration happens: all vertices in community A decide to move to community B, and all vertices in community B decide to move to community A. In the serial implementation, the two communities would combine into one, but on the GPU, communities A and B simply swap their labels and never merge.

Performance and Analysis#

We measure Louvain performance with three metrics: the number of resulting communities (#Comm), the modularity of the resulting communities (Q), and the running time (Time, in seconds). Higher Q and lower running time are better. * indicates the graph is given as undirected, and the number of edges is counted after edge doubling and removal of self loops and duplicate edges; otherwise, the graph is taken as directed, and self loops and duplicate edges are also removed. If edge weights are available in the input graph, they are used as given; otherwise, the initial edge weights are set to 1.

Details of the datasets:

| DataSet | #V | #E | #dangling vertices |
|---|---|---|---|
| ca | 108299 | 186878 | 85166 |
| preferentialAttachment | 100000 | 999970* | 0 |
| caidaRouterLevel | 192244 | 1218132* | 0 |
| amazon | 548551 | 1851744 | 213688 |
| coAuthorsDBLP | 299067 | 1955352* | 0 |
| webbase-1M | 1000005 | 2105531 | 2453 |
| citationCiteseer | 268495 | 2313294* | 0 |
| cnr-2000 | 325557 | 3128710 | 0 |
| as-Skitter | 1696415 | 22190596* | 0 |
| coPapersDBLP | 540486 | 30481458* | 0 |
| pokec | 1632803 | 30622564 | 0 |
| coPapersCiteseer | 434102 | 32073440* | 0 |
| akamai | 16956250 | 53300364 | 0 |
| soc-LiveJournal1 | 4847571 | 68475391 | 962 |
| channel-500x100x100-b050 | 4802000 | 85362744 | 0 |
| europe_osm | 50912018 | 108109320* | 0 |
| hollywood-2009 | 11399905 | 112751422* | 32662 |
| rgg_n_2_24_s0 | 16777216 | 265114400* | 1 |
| uk-2002 | 18520486 | 292243663 | 37300 |

Running time in seconds:

| GPU | DataSet | Gunrock GPU | Speedup vs. OMP | OMP | Serial |
|---|---|---|---|---|---|
| P100 | ca | 0.108 | 0.24 | 0.026 | 0.065 |
| V100 | ca | 0.089 | 0.33 | 0.029 | 0.067 |
| V100 | preferentialAttachment | 0.076 | 1.26 | 0.096 | 0.235 |
| V100 | caidaRouterLevel | 0.063 | 1.03 | 0.065 | 0.229 |
| P100 | amazon | 0.160 | 1.27 | 0.203 | 0.648 |
| V100 | amazon | 0.122 | 1.62 | 0.198 | 0.631 |
| V100 | coAuthorsDBLP | 0.082 | 1.62 | 0.133 | 0.414 |
| V100 | webbase-1M | 0.107 | 1.57 | 0.168 | 0.318 |
| V100 | citationCiteseer | 0.074 | 1.50 | 0.111 | 0.432 |
| V100 | cnr-2000 | 0.235 | 0.57 | 0.133 | 0.388 |
| V100 | as-Skitter | 0.376 | 1.76 | 0.660 | 2.480 |
| V100 | coPapersDBLP | 0.358 | 1.22 | 0.437 | 1.860 |
| P100 | pokec | 0.929 | 1.34 | 1.244 | 6.521 |
| V100 | pokec | 0.624 | 1.74 | 1.083 | 6.110 |
| V100 | coPapersCiteseer | 0.353 | 1.11 | 0.391 | 1.592 |
| P100 | akamai | 1.278 | 5.13 | 6.560 | 14.427 |
| V100 | akamai | 0.934 | 6.79 | 6.343 | 13.266 |
| V100 | soc-LiveJournal1 | 1.548 | 2.59 | 4.016 | 16.311 |
| V100 | channel-500x100x100-b050 | 1.133 | 0.68 | 0.768 | 4.449 |
| V100 | europe_osm | 4.902 | 7.11 | 34.875 | 101.320 |
| V100 | hollywood-2009 | 1.230 | 1.40 | 1.721 | 9.419 |
| V100 | rgg_n_2_24_s0 | 3.378 | 0.79 | 2.664 | 17.816 |
| V100 | uk-2002 | 4.921 | 1.15 | 5.682 | 31.006 |

Resulting modularity:

| GPU | DataSet | Gunrock GPU | Gunrock - Serial | OMP | OMP - Serial | Serial |
|---|---|---|---|---|---|---|
| P100 | ca | 0.7112 | -0.0193 | 0.7302 | -0.0011 | 0.7313 |
| V100 | ca | 0.7166 | -0.0147 | 0.7290 | -0.0025 | 0.7313 |
| V100 | preferentialAttachment | 0.1758 | -0.1095 | 0.2287 | -0.0565 | 0.2852 |
| V100 | caidaRouterLevel | 0.8500 | +0.0065 | 0.8362 | -0.0073 | 0.8436 |
| P100 | amazon | 0.9081 | -0.0184 | 0.9257 | -0.0007 | 0.9264 |
| V100 | amazon | 0.9089 | -0.0175 | 0.9258 | -0.0006 | 0.9264 |
| V100 | coAuthorsDBLP | 0.8092 | -0.0179 | 0.8136 | -0.0135 | 0.8271 |
| V100 | webbase-1M | 0.8945 | -0.0613 | 0.9471 | -0.0087 | 0.9558 |
| V100 | citationCiteseer | 0.7888 | -0.0137 | 0.7605 | -0.0420 | 0.8025 |
| V100 | cnr-2000 | 0.8766 | -0.0031 | 0.8784 | -0.0013 | 0.8797 |
| V100 | as-Skitter | 0.8366 | +0.0234 | 0.8223 | +0.0091 | 0.8132 |
| V100 | coPapersDBLP | 0.8494 | +0.0003 | 0.8440 | -0.0051 | 0.8492 |
| P100 | pokec | 0.6933 | -0.0012 | 0.6914 | -0.0032 | 0.6945 |
| V100 | pokec | 0.6741 | -0.0204 | 0.6763 | -0.0183 | 0.6945 |
| V100 | coPapersCiteseer | 0.9075 | -0.0035 | 0.9059 | -0.0051 | 0.9110 |
| P100 | akamai | 0.9334 | +0.0329 | 0.9072 | +0.0067 | 0.9005 |
| V100 | akamai | 0.9333 | +0.0328 | 0.9074 | +0.0070 | 0.9005 |
| V100 | soc-LiveJournal1 | 0.7336 | +0.0097 | 0.5456 | -0.1782 | 0.7239 |
| V100 | channel-500x100x100-b050 | 0.9004 | +0.0498 | 0.9512 | +0.1007 | 0.8505 |
| V100 | europe_osm | 0.9979 | +0.0142 | 0.9844 | +0.0008 | 0.9836 |
| V100 | hollywood-2009 | 0.7432 | -0.0079 | 0.7502 | -0.0010 | 0.7511 |
| V100 | rgg_n_2_24_s0 | 0.9921 | +0.0026 | 0.9920 | +0.0024 | 0.9896 |
| V100 | uk-2002 | 0.9507 | -0.0098 | 0.9604 | +0.0000 | 0.9604 |

The number of resulting communities:

| GPU | DataSet | Gunrock GPU | OMP | Serial |
|---|---|---|---|---|
| P100 | ca | 1120 | 616 | 617 |
| V100 | ca | 1076 | 615 | 617 |
| V100 | preferentialAttachment | 18 | 14 | 39 |
| V100 | caidaRouterLevel | 410 | 467 | 745 |
| P100 | amazon | 7667 | 213 | 240 |
| V100 | amazon | 7671 | 233 | 240 |
| V100 | coAuthorsDBLP | 95 | 138 | 273 |
| V100 | webbase-1M | 4430 | 1469 | 1362 |
| V100 | citationCiteseer | 67 | 48 | 141 |
| V100 | cnr-2000 | 65621 | 59219 | 59253 |
| V100 | as-Skitter | 924 | 1945 | 2531 |
| V100 | coPapersDBLP | 70 | 111 | 237 |
| P100 | pokec | 154988 | 161709 | 166156 |
| V100 | pokec | 155100 | 162464 | 166156 |
| V100 | coPapersCiteseer | 108 | 110 | 358 |
| P100 | akamai | 90285 | 130639 | 145785 |
| V100 | akamai | 90245 | 127843 | 145785 |
| V100 | soc-LiveJournal1 | 506826 | 434272 | 447426 |
| V100 | channel-500x100x100-b050 | 24 | 54 | 12 |
| V100 | europe_osm | 17320 | 784171 | 828662 |
| V100 | hollywood-2009 | 12218 | 12593 | 12741 |
| V100 | rgg_n_2_24_s0 | 344 | 359 | 311 |
| V100 | uk-2002 | 2402560 | 2245355 | 2245678 |

Implementation limitations#

  • Memory usage Each edge in the graph needs at least 88 bytes of GPU device memory: 32 bits for the destination vertex, 64 bits for the edge weight, 64 bits x 2 x 4 for edge pairs and sorting, 32 bits for segment offsets, and 64 bits for segment weights (assuming both vertex and edge ids are 32-bit integers and edge weights are 64-bit floating point numbers). So on a 32 GB GPU, the largest graph the Louvain implementation can process has roughly 300 million edges (see the rough check after this list). The largest graph successfully run so far is uk-2002, with 292M edges and 18.5M vertices.

  • Data types The edge weights and all computation around them use double-precision floating point values (64-bit double in C/C++); we tried single precision and found it resulted in very poor modularities. The weight * weight / m2 part of the modularity calculation may be the reason for this limitation. Vertex ids should be 32-bit, as the implementation packs an edge pair into a 64-bit integer for the sort function (a packing sketch follows after this list); if fast GPU sorting becomes available for 128-bit integers, 64-bit vertex ids could be used.
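
As a rough check on the memory bound above: $32 \times 10^9$ bytes / 88 bytes per edge $\approx 3.6 \times 10^8$ edges; since per-vertex arrays and temporary sort storage also consume device memory, the practical limit lands near the ~300 million edges quoted above.

The 64-bit edge-pair representation mentioned in the Data types note amounts to packing two 32-bit ids into one radix-sortable key. A minimal sketch (the shift/mask layout is an assumption, not necessarily the exact layout Gunrock uses):

```cpp
#include <cstdint>

// Pack a (vertex, community) pair into one 64-bit radix-sortable key.
// Sorting these keys orders pairs by vertex first, then by community,
// matching the "by first, then second if tie" order in the pseudocode.
inline uint64_t PackPair(uint32_t v, uint32_t comm) {
  return (static_cast<uint64_t>(v) << 32) | comm;
}

inline uint32_t PairVertex(uint64_t key)    { return static_cast<uint32_t>(key >> 32); }
inline uint32_t PairCommunity(uint64_t key) { return static_cast<uint32_t>(key); }
```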

Comparison against existing implementations#

We compare against serial CPU and OpenMP implementations; both are Gunrock's CPU reference routines. PNNL's results and previously published work are also referenced, to make sure the resulting modularities and running times are sane.

  • Modularity The modularities vary somewhat across implementations, mostly within +/- 0.05. On small graphs, the GPU implementation sees some modularity drops; on large graphs, the GPU implementation is more likely to yield modularity at least as good as the serial implementation. The larger the graph, the better the relative modularity expected from Gunrock, as compared to the serial implementation. As mentioned in the Output section, the variation could be caused by concurrent movement of vertices. Small graphs could suffer more than larger graphs, as movements into a community are more likely to happen concurrently.

  • Running time Overall, the Gunrock implementation is 2x to 5x faster than previous GPU work (Naim 2017, Cheong 2013). Our OMP implementation is a bit faster than PNNL's, and much faster than previous multi-threaded CPU work (Lu 2015, Naim 2016). Our sequential CPU implementation is an order of magnitude faster than previous sequential CPU work (Naim 2017, Naim 2016, Blondel 2008, Cheong 2013, Lu 2015). Comparing across the Gunrock implementations, the GPU is not always the fastest: on small graphs, it can actually be slower, due to GPU kernel overheads and hardware underutilization, so for small graphs the OpenMP implementation may be a better choice. Gunrock's GPU implementation also runs slower than OpenMP in practice for mesh-like graphs, in which every vertex has only a very small neighbor list; the parallel formulation Gunrock uses does not work well on this kind of graph.

Published results (timing and modularity) from previous work are summarized in the louvain_results.xlsx file in the louvain directory.

Performance limitations#

The GPU performance bottleneck is the sort, especially in the first pass. Sorting on the GPU is much faster than on the CPU, but the CPU implementations use hash tables, which may not be suitable for the GPU; the Alternate approaches section has more details on this.

Profiling the GPU Louvain implementation on a V100 with the akamai dataset, an iteration in the first pass takes 64.08 ms, of which the sort takes 42.05 ms, about two-thirds of the iteration time. The Louvain implementation uses CUB's radix sort-pairs function; CUB is considered one of the best-performing GPU primitive libraries. Further profiling shows that during the sort kernel, the device memory controller is ~75% utilized; in other words, the sort is memory bound. This is expected, as each pass of radix sort shuffles the whole edge_pairs and pair_weights arrays, with a cost on the order of $O(|E|)$, although the memory accesses are mostly coalesced. The memory system utilization is shown below (from the NVIDIA profiler):

| Type | Transactions | Bandwidth | Utilization |
|---|---|---|---|
| Shared Loads | 2141139 | 1168.859 GB/s | |
| Shared Stores | 2486946 | 1357.636 GB/s | |
| Shared Total | 4628085 | 2526.495 GB/s | Low |
| L2 Reads | 2533302 | 345.736 GB/s | |
| L2 Writes | 2831732 | 386.464 GB/s | |
| L2 Total | 5365034 | 732.200 GB/s | Low |
| Local Loads | 588 | 80.248 MB/s | |
| Local Stores | 588 | 80.248 MB/s | |
| Global Loads | 2541637 | 346.873 GB/s | |
| Global Stores | 2675820 | 365.186 GB/s | |
| Texture Reads | 2515533 | 1373.242 GB/s | |
| Unified Cache Total | 7734166 | 2085.462 GB/s | Idle to Low |
| Device Reads | 2469290 | 336.999 GB/s | |
| Device Writes | 2596403 | 354.347 GB/s | |
| Device Memory Total | 5065693 | 691.347 GB/s | High |
| PCIe Reads | 0 | 0 B/s | None |
| PCIe Writes | 5 | 682.381 kB/s | Idle to Low |


It's clear that the bottleneck is device memory: most data are read once and written once, with little opportunity for reuse. The kernel achieves 691 GB/s of bandwidth, ~77% of the 900 GB/s peak of the 32 GB HBM2 memory. This high bandwidth utilization fits the memory access pattern: mostly regular and coalesced.

The particular issue here is not the kernel implementation itself, but the use of sorting: fully sorting the whole edge list is probably overkill. One possible improvement is to use a custom hash table on the GPU to replace the sort + segmented reduce; the hash table could also cut the memory requirement by about 50%.

Next Steps#

Alternate approaches#

Things we tried that didn’t work well#

Segmented sort and CUB segmented reduce: the original idea for modularity optimization was to use CUB's segmented sort and segmented reduce within each vertex's neighborhood to get the vertex-community weights. However, CUB uses one thread block per segment in these two routines, which creates huge load imbalance, since neighbor list lengths can differ greatly. The solution is to 1) use vertex-community pairs as keys with CUB's unsegmented sort, and 2) implement a load-balanced segmented-reduce kernel. This solution reduces the sort-reduce time by several times, and can be enabled with the --unify-segments flag on the command line.

STL hash table on CPU: the std::map class has performance issues when used in a multi-threaded environment and yields poor scalability, possibly caused by locking inside the STL. The solution is to give each thread an array of size $|V|$ for the vertices that thread processes: since a thread processes its vertices one by one and the maximum number of communities is $|V|$, a flat array of size $|V|$ can replace the hash table. This significantly reduces the running time, even with a single thread, and the multi-thread scalability is also much better.
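
A minimal sketch of that flat-array accumulator for one CPU thread; the CSR layout and names are illustrative, and the Gunrock CPU reference code differs in detail:

```cpp
// Sketch: per-thread flat array replacing a hash table when accumulating
// vertex-to-community weights on the CPU. Each thread owns one |V|-sized
// scratch array; only the entries touched for the current vertex are reset
// afterwards, so the O(|V|) allocation cost is paid only once per thread.
#include <vector>

void AccumulateNeighborhood(
    int v,
    const std::vector<int>&    row_offsets,         // CSR row offsets, size |V|+1
    const std::vector<int>&    col_indices,         // CSR columns, size |E|
    const std::vector<double>& edge_weights,        // size |E|
    const std::vector<int>&    current_communities, // size |V|
    std::vector<double>&       weight_to_comm,      // per-thread scratch, size |V|, zero-initialized
    std::vector<int>&          touched_comms) {     // per-thread list of nonzero entries
  touched_comms.clear();
  for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
    int comm = current_communities[col_indices[e]];
    if (weight_to_comm[comm] == 0) touched_comms.push_back(comm);
    weight_to_comm[comm] += edge_weights[e];
  }
  // ... pick the adjacent community with the best modularity gain here ...
  // Reset only the entries that were written, keeping the next vertex cheap.
  for (int comm : touched_comms) weight_to_comm[comm] = 0;
}
```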

Vanilla hash table on GPU: this can't be used, as the keys (vertex-community or community-community pairs) and the values (edge weights) are both 64-bit, and there is currently no support for atomic compare-and-exchange operations on 128-bit data. Even if that were available, a vanilla table only provides insert and query operations, not accumulation of values for the same key.

Custom hash table#

Hash tables can be used to accumulate the vertex-to-community weights during modularity optimization iterations, and to get the community-to-community edge weights during the graph contraction part of the algorithm. Previous implementations use hash tables as a common practice. However, as mentioned above, vanilla hash tables that store key-value pairs are not a good choice for Louvain.

Louvain not only needs to store the key-value pairs, but also to accumulate values associated with the same key. That key may be the same vertex-community pair for modularity optimization, or the same community-community pair for graph contraction. Ideally, if the hash table provides the following functionality, it would be much more suitable for Louvain:

  • Only insert the key (in the first phase of using the hash table).

  • In the next phase, query the positions of the keys, and use atomics to accumulate the values belonging to the same key, in a separate array.

  • There must be a barrier between the two phases; all insertions of the first phase need to be visible to the second phase.

This kind of hash table removes the restriction, imposed by vanilla hash table implementations, that the whole key-value pair must support an atomic compare-and-exchange operation, and it is also friendlier to value accumulation. It replaces the sort-segmented reduce part, reducing the workload from about $O(6|E|)$ to $O(2|E|)$ and the memory requirement from $48|E|$ bytes to $24|E|$ bytes. However, the memory access pattern becomes irregular and uncoalesced, so it is still unknown whether it actually yields a performance gain.
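
A minimal CUDA sketch of such a two-phase table, assuming a fixed-capacity linear-probing design; the hash function, capacity handling, and names are simplifications, not the proposed Gunrock implementation:

```cpp
// Sketch: two-phase "insert keys, then accumulate values" hash table.
// Phase 1 claims slots for 64-bit (vertex, community) keys with atomicCAS on
// the key slot only; phase 2, launched as a separate kernel (the barrier),
// finds each key's slot and atomically adds the edge weight into a separate
// value array. table_keys is assumed pre-filled with EMPTY_KEY and
// table_values with 0; linear probing, no resizing.
#include <cstdint>

#define EMPTY_KEY 0xFFFFFFFFFFFFFFFFull

__global__ void InsertKeys(const uint64_t* keys, int num_keys,
                           uint64_t* table_keys, int capacity) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= num_keys) return;
  uint64_t key = keys[i];
  int slot = (int)(key % (uint64_t)capacity);
  while (true) {
    uint64_t prev = atomicCAS((unsigned long long*)&table_keys[slot],
                              (unsigned long long)EMPTY_KEY,
                              (unsigned long long)key);
    if (prev == EMPTY_KEY || prev == key) return;  // claimed, or already present
    slot = (slot + 1) % capacity;                  // linear probing
  }
}

__global__ void AccumulateValues(const uint64_t* keys, const double* weights,
                                 int num_keys, const uint64_t* table_keys,
                                 double* table_values, int capacity) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= num_keys) return;
  uint64_t key = keys[i];
  int slot = (int)(key % (uint64_t)capacity);
  while (table_keys[slot] != key)                  // every key was inserted in phase 1
    slot = (slot + 1) % capacity;
  atomicAdd(&table_values[slot], weights[i]);
}
```

Note that atomicAdd on double requires compute capability 6.0 or newer, which the P100 and V100 used here satisfy.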

Iteration and pass stop conditions#

The concurrent nature of the GPU implementation makes the modularity gain calculation inaccurate, and whole-community migrations are also observed. Because of these, the optimization iterations and the passes may run more times than needed, resulting in longer running times, especially in the first pass. Preliminary experiments capping the number of iterations in the first pass reduce the running time by about 40%, with the modularity value roughly unchanged. The actual cap is dataset-dependent, and more investigation is needed to understand how to set it.

Gunrock implications#

The core of the Louvain implementation mainly uses all-edges advance, sort, segmented reduce, and for loops. The sort-segmented reduce operation is actually a segmented keyed reduce; if that’s a common operation that appears in more algorithms, it could be made into a new operator. The all-edges advance is used quite often in several applications, so wrapping it with a simpler operator interface could be helpful.
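
For comparison, the same keyed reduce can be expressed with Thrust in a few lines; this is only a sketch of the pattern, not a proposal for the operator's interface:

```cpp
// Sketch: the "sort + segmented reduce" pattern as a keyed reduce with Thrust.
// After sort_by_key, equal 64-bit (vertex, community) keys are adjacent, and
// reduce_by_key sums their weights in a single pass.
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdint>

size_t KeyedReduce(thrust::device_vector<uint64_t>& keys,        // |E| packed pairs
                   thrust::device_vector<double>&   weights,     // |E| edge weights
                   thrust::device_vector<uint64_t>& out_keys,    // >= |E| distinct keys
                   thrust::device_vector<double>&   out_weights) // >= |E| summed weights
{
  thrust::sort_by_key(keys.begin(), keys.end(), weights.begin());
  auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), weights.begin(),
                                    out_keys.begin(), out_weights.begin());
  return ends.first - out_keys.begin();  // number of distinct (vertex, community) pairs
}
```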

Notes on multi-GPU parallelization#

When parallelizing across multiple GPUs, the community assignments of local vertices and their neighbors need to be available locally, either explicitly by moving them with communication functions, or implicitly via peer memory access or unified virtual memory. The communication volume is on the order of $O(|V|)$ and the computation workload is at least $O(|E|)$, so the scalability should be manageable.

When the number of GPUs is small, 1D partitioning can be used to divide the edges and replicate the vertices, so there is no need to convert vertex ids across GPUs. When the number of GPUs is large, the high-/low-degree vertex separation and partitioning scheme can be used: edges are still distributed across GPUs, high-degree vertices are duplicated, and each low-degree vertex is owned by a single GPU. The boundary between these partitioning schemes is still unclear, but the 8 GPUs within a DGX-1 can likely still be considered a small number and use the simple 1D partitioning.

Notes on dynamic graphs#

Louvain is not directly related to dynamic graphs, but it should be able to run on a dynamic graph, provided the way to access all the edges is the same. The community assignment from the previous graph can be used as a good starting point if the vertex ids are consistent and the graph has not changed dramatically.

Notes on larger datasets#

The memory usage bottleneck of the current implementation is the edge pair-weight sort, which accounts for about half of the memory usage. Replacing sort-segmented reduce with a custom hash table could significantly lower the memory usage.

Louvain needs to go over the whole graph once in each modularity optimization iteration. If the graph is larger than the combined GPU memory space, which forces streaming of the graph in each iteration, the performance bottleneck becomes CPU-GPU bandwidth; that can be an order of magnitude slower than when the graph fits fully in GPU memory. Considering that the OpenMP implementation is not that slow, using it may well be faster than moving the graph across the CPU-GPU bridge multiple times.

Notes on other pieces of this workload#

All parts of Louvain are graph related, and fully implemented in Gunrock.

Research potential#

Community detection on graphs is an interesting research topic in general, and others have targeted good GPU implementations of it. Our work is the first to map Louvain onto sort and segmented reduce, so more comparisons are needed to know whether this is better than the hash-table mapping. The custom hash table implementation is worth trying out and comparing with the current sort and segmented reduce implementation. Multi-GPU implementations, particularly the graph contraction part, are a place for research investigation: it may be better to contract the graph onto a single GPU, or even the CPU, when the graph is small; but where is the threshold, and how should the multi-GPU to single-GPU or CPU switch be done?

Label propagation is another algorithm for community detection. It has a similar structure to Louvain, but has a simpler way to decide how to move the vertices. Comparing them side by side for both result quality (the modularity value) and computation speed (the running time) will be beneficial.

References#

[1] Hao Lu, Mahantesh Halappanavar, Ananth Kalyanaraman. “Parallel Heuristics for Scalable Community Detection”, https://arxiv.org/abs/1410.1237 (2015).

[2] Cheong C.Y., Huynh H.P., Lo D., Goh R.S.M. “Hierarchical Parallel Algorithm for Modularity-Based Community Detection Using GPUs”. Euro-Par 2013.

[3] Md. Naim, Fredrik Manne, Mahantesh Halappanavar, and Antonio Tumeo. “Highly Scalable Community Detection Using a GPU”, https://www.eecs.wsu.edu/~assefaw/CSC16/abstracts/naim-CSC16_paper_14.pdf (2016).

[4] M. Naim, F. Manne, M. Halappanavar, and A. Tumeo. “Community Detection on the GPU”. IPDPS 2017.

[5] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, Etienne Lefebvre, “Fast unfolding of communities in large networks”. https://arxiv.org/abs/0803.0476 (2008).