A GPU load-balancing framework that decouples scheduling from computation.
Sparse matrices, power-law graphs, variable-length lists — the data structures behind machine learning, numerical simulation, and graph analytics don’t map well to the GPU’s SIMD architecture. Naïve thread mapping leaves some threads overloaded and others idle.
Prior solutions were tightly coupled to specific applications — one load-balancing strategy per codebase, impossible to reuse or compare.
Every irregular workload is made of tiles (rows, columns, blocks) and atoms (nonzeros, elements). A schedule maps balanced groups of atoms to GPU threads. Your computation stays the same regardless of which schedule you choose.
Switch between tabs. The loop scaffolding adapts to each schedule, but look at the highlighted line — the actual computation never changes. That’s the entire point.
using setup_t = schedule::setup<
schedule::algorithms_t::thread_mapped,
1, 1, index_t, offset_t>;
setup_t config(offsets, rows, nnzs);
for (auto row : config.tiles()) {
type_t sum = 0;
for (auto nz : config.atoms(row)) {
sum += values[nz] * x[indices[nz]];
}
y[row] = sum;
}
using setup_t = schedule::setup<
schedule::algorithms_t::work_oriented,
128, 1, index_t, offset_t>;
setup_t config(offsets, rows, nnzs);
for (auto row : config.tiles(map)) {
type_t sum = 0;
for (auto nz : config.atoms(row, map)) {
sum += values[nz] * x[indices[nz]];
}
y[row] = sum;
}
using setup_t = schedule::setup<
schedule::algorithms_t::merge_path_flat,
128, 4, index_t, offset_t>;
setup_t config(meta, buffer, offsets, rows, nnzs);
for (auto vid : config.virtual_idx()) {
auto nz = config.atom_idx(vid, coord);
auto row = config.tile_idx(coord);
sum += values[nz] * x[indices[nz]];
}
The four scheduling algorithms talk to workloads through a small layout view contract. Any struct that exposes the contract can drive any schedule without modification.
| Layout | Tile is a… | Atom is a… | Container |
|---|---|---|---|
layout::csr | row | nonzero | csr_t |
layout::csc | column | nonzero | csc_t |
layout::coo | nonzero | nonzero | coo_t |
layout::ell | row | bucketed nz | ell_t |
layout::bcsr | block-row | dense R×C block | bcsr_t |
layout::dia | row | diagonal cell | dia_t |
If you use loops in your research, please cite: