Scheduling Algorithms

loops provides four static load-balancing schedules. All four consume workloads through the same layout contract, so switching between them only requires changing the template arguments passed to schedule::setup<>.

thread_mapped

The simplest schedule: one tile per thread.

using setup_t = schedule::setup<
    schedule::algorithms_t::thread_mapped,
    1, 1, index_t, offset_t>;

Each thread iterates over all atoms in its assigned tile. This works best when tiles are roughly uniform in size, because a single oversized tile keeps its thread busy while the others sit idle. No shared memory or synchronization is needed.
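The mapping can be illustrated with a small host-side simulation. This is only a sketch and not the loops implementation: it assumes a CSR-style layout in which an offsets array with one more entry than there are tiles marks where each tile's atoms begin and end, and the names thread_mapped_sum, offsets, and atom_values are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of a thread-mapped schedule (not the loops API).
// Assumed layout: tile t owns atoms [offsets[t], offsets[t + 1]).
std::vector<int> thread_mapped_sum(const std::vector<int>& offsets,
                                   const std::vector<int>& atom_values,
                                   std::size_t num_threads) {
  assert(!offsets.empty() && num_threads > 0);
  const std::size_t num_tiles = offsets.size() - 1;
  std::vector<int> tile_sums(num_tiles, 0);

  // Each iteration of the outer loop plays the role of one GPU thread.
  for (std::size_t tid = 0; tid < num_threads; ++tid) {
    // A thread takes tile tid, then strides by the thread count if there
    // are more tiles than threads.
    for (std::size_t tile = tid; tile < num_tiles; tile += num_threads) {
      int sum = 0;
      // The thread walks every atom of its tile serially.
      for (int atom = offsets[tile]; atom < offsets[tile + 1]; ++atom) {
        sum += atom_values[atom];
      }
      tile_sums[tile] = sum;  // exactly one writer per tile, so no atomics
    }
  }
  return tile_sums;
}
```

Because the inner loop length equals the tile size, the slowest thread is the one holding the largest tile, which is where imbalance shows up with skewed inputs.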

group_mapped

One tile per cooperative group (typically a warp of 32 threads).

using setup_t = schedule::setup<
    schedule::algorithms_t::group_mapped,
    BLOCK_SIZE, 32, index_t, offset_t>;

The group collaborates to process all atoms in a tile, with each thread taking a strided share. Effective when individual tiles are large enough to occupy a full warp.
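The strided sharing inside a group might look like the following host-side sketch. It is an illustration under assumptions, not the library's code: the CSR-style offsets layout and the names group_mapped_sum and lane_partial are hypothetical, and the final loop over lanes stands in for whatever group-level reduction a real kernel would use.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of a group-mapped schedule (not the loops API).
// Assumed layout: tile t owns atoms [offsets[t], offsets[t + 1]).
std::vector<int> group_mapped_sum(const std::vector<int>& offsets,
                                  const std::vector<int>& atom_values,
                                  std::size_t num_groups,
                                  int group_size = 32) {
  assert(!offsets.empty() && num_groups > 0 && group_size > 0);
  const std::size_t num_tiles = offsets.size() - 1;
  std::vector<int> tile_sums(num_tiles, 0);

  for (std::size_t group = 0; group < num_groups; ++group) {
    // One tile per group; groups stride over the tiles if there are more
    // tiles than groups.
    for (std::size_t tile = group; tile < num_tiles; tile += num_groups) {
      std::vector<int> lane_partial(static_cast<std::size_t>(group_size), 0);
      // Lane l handles atoms begin + l, begin + l + group_size, ...
      for (int lane = 0; lane < group_size; ++lane) {
        for (int atom = offsets[tile] + lane; atom < offsets[tile + 1];
             atom += group_size) {
          lane_partial[static_cast<std::size_t>(lane)] += atom_values[atom];
        }
      }
      // Stand-in for a group-level reduction across the lanes.
      int sum = 0;
      for (int partial : lane_partial) sum += partial;
      tile_sums[tile] = sum;
    }
  }
  return tile_sums;
}
```

In this model a tile with fewer atoms than group_size leaves the remaining lanes with nothing to do, which is why the schedule pays off mainly on large tiles.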

work_oriented

Distributes the total atom count evenly across all threads.

using setup_t = schedule::setup<
    schedule::algorithms_t::work_oriented,
    128, 1, index_t, offset_t>;

Each thread processes a contiguous range of atoms, handling complete tiles within its range and using atomic operations for tiles that span thread boundaries. Good for skewed distributions where a few tiles dominate.
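A host-side sketch of this idea follows. It is a simplified model rather than the loops implementation: it assumes the same CSR-style offsets layout starting at 0, finds each thread's first tile with a binary search (one plausible approach), and uses a plain += where a device kernel would need an atomic add. The name work_oriented_sum is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of a work-oriented schedule (not the loops API).
// Assumed layout: tile t owns atoms [offsets[t], offsets[t + 1]), offsets[0] == 0.
std::vector<int> work_oriented_sum(const std::vector<int>& offsets,
                                   const std::vector<int>& atom_values,
                                   int num_threads) {
  assert(!offsets.empty() && offsets.front() == 0 && num_threads > 0);
  const int num_tiles = static_cast<int>(offsets.size()) - 1;
  const int num_atoms = offsets.back();
  assert(static_cast<int>(atom_values.size()) == num_atoms);
  const int per_thread = (num_atoms + num_threads - 1) / num_threads;  // ceiling
  std::vector<int> tile_sums(static_cast<std::size_t>(num_tiles), 0);

  for (int tid = 0; tid < num_threads; ++tid) {
    // Contiguous atom range owned by this simulated thread.
    const int begin = std::min(tid * per_thread, num_atoms);
    const int end = std::min(begin + per_thread, num_atoms);
    if (begin == end) continue;

    // First tile of the range: the last tile whose start offset is <= begin.
    int tile = static_cast<int>(
        std::upper_bound(offsets.begin(), offsets.end(), begin) -
        offsets.begin()) - 1;

    int atom = begin;
    while (atom < end) {
      // Stop at the end of the current tile or the end of the thread's range.
      const int stop = std::min(offsets[tile + 1], end);
      int partial = 0;
      for (; atom < stop; ++atom) partial += atom_values[atom];
      // A tile split across threads receives several partial sums, so on a
      // GPU this update would have to be atomic. The host model runs
      // serially, so += is enough here.
      tile_sums[static_cast<std::size_t>(tile)] += partial;
      ++tile;
    }
  }
  return tile_sums;
}
```

Every simulated thread touches at most per_thread atoms regardless of how the tiles are sized, which is the property that helps with skewed distributions.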

merge_path_flat

Merge-based partitioning that divides the total O(tiles + atoms) work into equal shares.

using setup_t = schedule::setup<
    schedule::algorithms_t::merge_path_flat,
    128, 4, index_t, offset_t>;

Uses a diagonal search on the merge-path of tiles and atoms to find the exact partition point for each thread block. Requires a preprocessing step (preprocess_t) to compute per-block starting coordinates. The most balanced of all schedules, but also the most complex.
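The diagonal search can be sketched on the host as below. This follows the usual merge-path formulation, in which the tile end offsets are merged with the sequence of atom indices, and is not taken from the loops source. The names coord, merge_path_search, and block_starts are hypothetical; block_starts only illustrates the kind of per-block coordinates a preprocessing step would produce and is not preprocess_t itself.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical coordinate on the merge path: tiles finished, atoms consumed.
struct coord {
  int tile;
  int atom;
};

// Binary search along one diagonal (tile + atom == diagonal) for the point
// where the merge path crosses it. Assumed layout: tile t owns atoms
// [offsets[t], offsets[t + 1]), so its end offset is offsets[t + 1].
coord merge_path_search(int diagonal, const std::vector<int>& offsets) {
  const int num_tiles = static_cast<int>(offsets.size()) - 1;
  const int num_atoms = offsets.back();
  int lo = std::max(diagonal - num_atoms, 0);
  int hi = std::min(diagonal, num_tiles);
  while (lo < hi) {
    const int pivot = lo + (hi - lo) / 2;
    // Compare the end offset of tile `pivot` with the atom index that sits
    // opposite it on this diagonal.
    if (offsets[pivot + 1] <= diagonal - pivot - 1) {
      lo = pivot + 1;  // tile `pivot` ends before that atom: path is further down
    } else {
      hi = pivot;
    }
  }
  return {lo, diagonal - lo};
}

// Starting coordinate for every block, plus one final end coordinate.
std::vector<coord> block_starts(const std::vector<int>& offsets, int num_blocks) {
  assert(!offsets.empty() && offsets.front() == 0 && num_blocks > 0);
  const int total = static_cast<int>(offsets.size()) - 1 + offsets.back();
  const int per_block = (total + num_blocks - 1) / num_blocks;  // ceiling
  std::vector<coord> starts;
  for (int b = 0; b <= num_blocks; ++b) {
    starts.push_back(merge_path_search(std::min(b * per_block, total), offsets));
  }
  return starts;
}
```

For offsets {0, 1, 9, 10} and four blocks, the path has 13 steps and the coordinates come out as (0,0), (1,3), (1,7), (2,10), (3,10), so the large middle tile is split across several blocks and no block receives more than four steps of combined tile and atom work.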

Choosing a Schedule

Schedule	Best for	Overhead
thread_mapped	Uniform tile sizes	Minimal
group_mapped	Large tiles, high degree	Low
work_oriented	Skewed distributions	Low
merge_path_flat	Any distribution (optimal)	Preprocessing step

For most workloads, start with thread_mapped and move to merge_path_flat if you observe load imbalance.