The Tile-Atom Abstraction
The Core Idea
Every irregular workload can be described as a set of tiles and atoms:
- Tile: A logical grouping of work (a row in CSR, a column in CSC, a block-row in BCSR).
- Atom: The smallest unit of processing (a single nonzero element in a sparse matrix).
A schedule creates a mapping M from GPU processor IDs to balanced groups of tiles and atoms. The user writes their computation against this mapping using range-based loops.
Three Domains
The abstraction separates every irregular computation into three independent concerns:
1. Data
Choose your sparse format. Each format has a corresponding layout view that describes how tiles and atoms are organized:
| Format | Container | Layout | Tile | Atom |
|---|---|---|---|---|
| CSR | csr_t |
layout::csr |
row | nonzero |
| CSC | csc_t |
layout::csc |
column | nonzero |
| COO | coo_t |
layout::coo |
nonzero | nonzero |
| ELL | ell_t |
layout::ell |
row | bucketed nz |
| BCSR | bcsr_t |
layout::bcsr |
block-row | R x C block |
| DIA | dia_t |
layout::dia |
row | diagonal cell |
2. Schedule
Choose a load-balancing strategy. All four strategies consume workloads through the same layout contract:
- thread_mapped — one tile per thread, simplest possible mapping
- group_mapped — one tile per warp/group for high-degree tiles
- work_oriented — distribute atoms evenly across threads
- merge_path — optimal O(n+m) merge-based partitioning
3. Computation
Write your kernel logic once. It works unchanged across all schedules and all formats:
for (auto row : config.tiles()) {
type_t sum = 0;
for (auto nz : config.atoms(row)) {
sum += values[nz] * x[indices[nz]];
}
y[row] = sum;
}
The Layout Contract
Any struct that exposes these methods can drive any schedule:
struct my_layout {
tile_id_t num_tiles() const;
atom_id_t num_atoms() const;
atom_id_t tile_begin(tile_id_t t) const;
atom_id_t tile_end(tile_id_t t) const;
atom_id_t tile_size(tile_id_t t) const;
tile_end_iterator_t tile_end_iter() const;
};
Formal Notation
Given a sparse-irregular problem S made of tiles T, where tile T_i is a collection of atoms: the scheduler creates a map M = { P_id, T_i ... T_j } from processor IDs to groups of tiles. The scheduler function L(S) = { M_0, ..., M_m } produces the complete balanced mapping.