Expressing Parallel
Irregular Computations

A GPU load-balancing framework that decouples scheduling from computation.

The Problem

I. Irregular Data Doesn’t Fit Regular Hardware

Sparse matrices, power-law graphs, variable-length lists — the data structures behind machine learning, numerical simulation, and graph analytics don’t map well to the GPU’s SIMD architecture. Naïve thread mapping leaves some threads overloaded and others idle.

Prior solutions were tightly coupled to specific applications — one load-balancing strategy per codebase, impossible to reuse or compare.

The Abstraction

II. Three Concerns, Separated

Every irregular workload is made of tiles (rows, columns, blocks) and atoms (nonzeros, elements). A schedule maps balanced groups of atoms to GPU threads. Your computation stays the same regardless of which schedule you choose.

The Code

III. Same Kernel, Different Schedules

Three variants of the same SpMV kernel follow, one per schedule. The loop scaffolding adapts to each schedule, but the actual computation, the multiply-accumulate line, never changes. That's the entire point.

// Thread-mapped: one thread processes one tile (row) at a time.
using setup_t = schedule::setup<
    schedule::algorithms_t::thread_mapped,
    1, 1, index_t, offset_t>;

setup_t config(offsets, rows, nnzs);

for (auto row : config.tiles()) {
  type_t sum = 0;
  for (auto nz : config.atoms(row)) {
    sum += values[nz] * x[indices[nz]];
  }
  y[row] = sum;
}
// Work-oriented: threads divide the total work evenly; `map` below is
// this thread's work split, precomputed by the schedule.
using setup_t = schedule::setup<
    schedule::algorithms_t::work_oriented,
    128, 1, index_t, offset_t>;

setup_t config(offsets, rows, nnzs);

for (auto row : config.tiles(map)) {
  type_t sum = 0;
  for (auto nz : config.atoms(row, map)) {
    sum += values[nz] * x[indices[nz]];
  }
  y[row] = sum;
}
// Merge-path (flat): a 2-D search balances tiles and atoms jointly
// (128 threads per block, 4 items per thread).
using setup_t = schedule::setup<
    schedule::algorithms_t::merge_path_flat,
    128, 4, index_t, offset_t>;

setup_t config(meta, buffer, offsets, rows, nnzs);

for (auto vid : config.virtual_idx()) {
  // coord: this work item's merge-path coordinate, recovered from the
  // schedule; partial sums spanning a row boundary are combined across
  // threads (reduction and output write elided).
  auto nz  = config.atom_idx(vid, coord);
  auto row = config.tile_idx(coord);
  sum += values[nz] * x[indices[nz]];
}
Figure 1. SpMV kernel across three scheduling strategies. The multiply-accumulate line is invariant; only the schedule setup adapts.
Format-Generic Layouts

IV. Any Sparse Format, Same Schedules

The four scheduling algorithms talk to workloads through a small layout view contract. Any struct that exposes the contract can drive any schedule without modification.

Layout         Tile is a…   Atom is a…        Container
layout::csr    row          nonzero           csr_t
layout::csc    column       nonzero           csc_t
layout::coo    nonzero      nonzero           coo_t
layout::ell    row          bucketed nz       ell_t
layout::bcsr   block-row    dense R×C block   bcsr_t
layout::dia    row          diagonal cell     dia_t
Cite

V. PPoPP 2023

If you use loops in your research, please cite:

@inproceedings{Osama:2023:APM,
  author    = {Muhammad Osama and Serban D. Porumbescu and John D. Owens},
  title     = {A Programming Model for {GPU} Load Balancing},
  booktitle = {PPoPP 2023},
  year      = {2023},
  doi       = {10.1145/3572848.3577434},
}