The Tile-Atom Abstraction

The Core Idea

Every irregular workload can be described as a set of tiles and atoms:

  • Tile: A logical grouping of work (a row in CSR, a column in CSC, a block-row in BCSR).
  • Atom: The smallest unit of processing (a single nonzero element in a sparse matrix).

A schedule creates a mapping M from GPU processor IDs to balanced groups of tiles and atoms. The user writes their computation against this mapping using range-based loops.

Three Domains

The abstraction separates every irregular computation into three independent concerns:

1. Data

Choose your sparse format. Each format has a corresponding layout view that describes how tiles and atoms are organized:

Format Container Layout Tile Atom
CSR csr_t layout::csr row nonzero
CSC csc_t layout::csc column nonzero
COO coo_t layout::coo nonzero nonzero
ELL ell_t layout::ell row bucketed nz
BCSR bcsr_t layout::bcsr block-row R x C block
DIA dia_t layout::dia row diagonal cell

2. Schedule

Choose a load-balancing strategy. All four strategies consume workloads through the same layout contract:

  • thread_mapped — one tile per thread, simplest possible mapping
  • group_mapped — one tile per warp/group for high-degree tiles
  • work_oriented — distribute atoms evenly across threads
  • merge_path — optimal O(n+m) merge-based partitioning

3. Computation

Write your kernel logic once. It works unchanged across all schedules and all formats:

for (auto row : config.tiles()) {
  type_t sum = 0;
  for (auto nz : config.atoms(row)) {
    sum += values[nz] * x[indices[nz]];
  }
  y[row] = sum;
}

The Layout Contract

Any struct that exposes these methods can drive any schedule:

struct my_layout {
  tile_id_t num_tiles() const;
  atom_id_t num_atoms() const;
  atom_id_t tile_begin(tile_id_t t) const;
  atom_id_t tile_end(tile_id_t t) const;
  atom_id_t tile_size(tile_id_t t) const;
  tile_end_iterator_t tile_end_iter() const;
};

Formal Notation

Given a sparse-irregular problem S made of tiles T, where tile T_i is a collection of atoms: the scheduler creates a map M = { P_id, T_i ... T_j } from processor IDs to groups of tiles. The scheduler function L(S) = { M_0, ..., M_m } produces the complete balanced mapping.