Expressing Parallel
Irregular Computations

A GPU load-balancing framework that decouples scheduling from computation.

The Problem

I. Irregular Data Doesn’t Fit Regular Hardware

Sparse matrices, power-law graphs, variable-length lists — the data structures behind machine learning, numerical simulation, and graph analytics don’t map well to the GPU’s SIMD architecture. Naïve thread mapping leaves some threads overloaded and others idle.

Prior solutions were tightly coupled to specific applications — one load-balancing strategy per codebase, impossible to reuse or compare.

The Abstraction

II. Three Concerns, Separated

Every irregular workload is made of tiles (rows, columns, blocks) and atoms (nonzeros, elements). A schedule maps balanced groups of atoms to GPU threads. Your computation stays the same regardless of which schedule you choose.

The Code

III. Same Kernel, Different Schedules

Three variants of the same SpMV kernel follow, one per schedule. The loop scaffolding adapts to each schedule, but the actual computation, the multiply-accumulate line, never changes. That's the entire point.

// Thread-mapped: one thread processes one tile (row) at a time.
using setup_t = schedule::setup<
    schedule::algorithms_t::thread_mapped,
    1, 1, index_t, offset_t>;

setup_t config(offsets, rows, nnzs);

for (auto row : config.tiles()) {
  type_t sum = 0;
  for (auto nz : config.atoms(row)) {
    sum += values[nz] * x[indices[nz]];
  }
  y[row] = sum;
}
// Work-oriented: threads divide the total work evenly; `map` below is
// this thread's work split, precomputed by the schedule.
using setup_t = schedule::setup<
    schedule::algorithms_t::work_oriented,
    128, 1, index_t, offset_t>;

setup_t config(offsets, rows, nnzs);

for (auto row : config.tiles(map)) {
  type_t sum = 0;
  for (auto nz : config.atoms(row, map)) {
    sum += values[nz] * x[indices[nz]];
  }
  y[row] = sum;
}
// Merge-path (flat): a 2-D search balances tiles and atoms jointly
// (128 threads per block, 4 items per thread).
using setup_t = schedule::setup<
    schedule::algorithms_t::merge_path_flat,
    128, 4, index_t, offset_t>;

setup_t config(meta, buffer, offsets, rows, nnzs);

for (auto vid : config.virtual_idx()) {
  // coord: this work item's merge-path coordinate, recovered from the
  // schedule; partial sums spanning a row boundary are combined across
  // threads (reduction and output write elided).
  auto nz  = config.atom_idx(vid, coord);
  auto row = config.tile_idx(coord);
  sum += values[nz] * x[indices[nz]];
}
Figure 1. SpMV kernel across three scheduling strategies. The multiply-accumulate line is invariant; only the schedule setup adapts.
Format-Generic Layouts

IV. Any Sparse Format, Same Schedules

The four scheduling algorithms talk to workloads through a small layout view contract. Any struct that exposes the contract can drive any schedule without modification.

Layout         Tile is a…   Atom is a…        Container
layout::csr    row          nonzero           csr_t
layout::csc    column       nonzero           csc_t
layout::coo    nonzero      nonzero           coo_t
layout::ell    row          bucketed nz       ell_t
layout::bcsr   block-row    dense R×C block   bcsr_t
layout::dia    row          diagonal cell     dia_t
Cite

V. PPoPP 2023

If you use loops in your research, please cite:

@inproceedings{Osama:2023:APM,
  author    = {Muhammad Osama and Serban D. Porumbescu and John D. Owens},
  title     = {A Programming Model for {GPU} Load Balancing},
  booktitle = {PPoPP 2023},
  year      = {2023},
  doi       = {10.1145/3572848.3577434},
}