Building

loops is header-only; the only things you build are the example, benchmark, and test binaries that exercise the headers. The repository ships a CMakePresets.json, so most users never need to type a raw -D flag.

Available configure presets

Preset	Architectures	Use when
`release-native`	Host's GPU(s)	Local development on a single machine
`release-h100`	sm_90	H100 nodes
`release-a100`	sm_80	A100 nodes
`release-multi`	sm_70…sm_90	Distributing a fat binary
`debug-native`	Host's GPU(s)	Debug build with `-G -lineinfo`
`release-with-tests`	Host's GPU(s)	Build with unit tests and benchmarks enabled
`ci-multi-arch`	sm_80;sm_90	CI hosts without a GPU (CUDA 13+ compatible)

Configure and build with any of them:

cmake --preset release-h100
cmake --build --preset release-h100 -j

The output binaries land in build/<preset>/bin/.
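
When more than one preset is needed (say, both the A100 and H100 builds), the two commands above can be wrapped in a small loop. The sketch below is illustrative only: the build_presets function and the DRY_RUN variable are hypothetical names, not something the repository provides. With DRY_RUN=1 it prints the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): configure and build
# each preset named in the arguments, stopping at the first failure.
# Set DRY_RUN=1 to print the commands instead of executing them.
build_presets() {
  for preset in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "cmake --preset $preset"
      echo "cmake --build --preset $preset -j"
    else
      cmake --preset "$preset" || return 1
      cmake --build --preset "$preset" -j || return 1
    fi
  done
}
```

Calling build_presets release-a100 release-h100 would then leave one set of binaries under build/release-a100/bin/ and another under build/release-h100/bin/.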

Picking a CUDA architecture

The release-native preset sets CMAKE_CUDA_ARCHITECTURES=native, so CMake auto-detects the GPU(s) on the host at configure time. To override that for cross-compilation or fat-binary builds, pass CMAKE_CUDA_ARCHITECTURES explicitly on the command line, where it takes precedence over the preset's value:

# H100-only build
cmake --preset release-native -DCMAKE_CUDA_ARCHITECTURES=90

# Fat binary covering Volta through Hopper
cmake --preset release-native -DCMAKE_CUDA_ARCHITECTURES="70;75;80;86;89;90"

Note that CUDA 13.0 dropped sm_70 (Volta), so the fat-binary list above no longer works with that toolkit. Remove 70 from the list there, or use "80;90".
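
The toolkit-dependent choice of architecture list can be scripted. The following is a minimal sketch, not something the repository ships: both function names are hypothetical, and the CUDA 13 list assumes that only 70 needs to be removed from the Volta-through-Hopper list shown earlier (that is, that sm_75 and newer are still accepted).

```shell
#!/bin/sh
# Hypothetical helper: print a CMAKE_CUDA_ARCHITECTURES list for a given
# CUDA major version. 70 is left out for CUDA 13 and newer, because those
# toolkits no longer target Volta.
arch_list_for_cuda() {
  cuda_major="$1"
  if [ "$cuda_major" -ge 13 ]; then
    echo "75;80;86;89;90"
  else
    echo "70;75;80;86;89;90"
  fi
}

# Hypothetical helper: read `nvcc --version` output on stdin and print the
# major version. The relevant line looks like
# "Cuda compilation tools, release 12.4, V12.4.131".
cuda_major_from_nvcc_output() {
  sed -n 's/.*release \([0-9][0-9]*\)\..*/\1/p'
}
```

On a machine with the toolkit installed, the two pieces could be combined as arch_list_for_cuda "$(nvcc --version | cuda_major_from_nvcc_output)", with the result passed to -DCMAKE_CUDA_ARCHITECTURES in quotes so the shell does not split it at the semicolons.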

Without CMake presets

If your CMake is older than 3.24, the presets are unavailable. Configure the old-fashioned way:

cmake -B build -DCMAKE_CUDA_ARCHITECTURES=90 -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
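
A wrapper script can pick between the two configure styles by checking the CMake version first. The sketch below assumes a sort that supports -V (version sort, as in GNU coreutils). version_ge and configure_command_for are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on `sort -V`: the smaller of the two versions sorts first, so
# $1 >= $2 exactly when $2 ends up on the first line.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: print the configure command suited to the given
# CMake version, using the 3.24 threshold stated above.
configure_command_for() {
  if version_ge "$1" "3.24"; then
    echo "cmake --preset release-native"
  else
    echo "cmake -B build -DCMAKE_CUDA_ARCHITECTURES=90 -DCMAKE_BUILD_TYPE=Release"
  fi
}
```

The installed version can be read from the first line of cmake --version, which has the form "cmake version X.Y.Z".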

Building specific examples

Each .cu file under examples/ becomes its own target named loops.<group>.<name> (for example, the SpMV thread-mapped example is loops.spmv.thread_mapped). To build just one:

cmake --build --preset release-native --target loops.spmv.merge_path

Available SpMV example targets:

loops.spmv.original — cuSPARSE reference
loops.spmv.thread_mapped, loops.spmv.group_mapped, loops.spmv.work_oriented, loops.spmv.merge_path — CSR-backed schedules
loops.spmv.ell_thread_mapped, loops.spmv.ell_merge_path — same schedules driving an ELL layout
loops.spmv.custom_layout — user-defined layout
loops.spmv.flat_partitioned — flat_uniform_occupancy<K, csr> partitioner

Other groups: loops.spmm.thread_mapped, loops.saxpy, loops.range.
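
Since CMake 3.15, --target accepts several target names at once, so a handful of examples can be built in one invocation. As a sketch of how the loops.<group>.<name> scheme composes, the hypothetical helper below (build_command_for is not part of the repository) prints such a command for several examples of one group.

```shell
#!/bin/sh
# Hypothetical helper: print one build command covering several examples
# of a single group, following the loops.<group>.<name> naming scheme.
# Usage: build_command_for <preset> <group> <name>...
build_command_for() {
  preset="$1"
  group="$2"
  shift 2
  targets=""
  for name in "$@"; do
    targets="$targets loops.$group.$name"
  done
  echo "cmake --build --preset $preset --target$targets"
}
```

For instance, build_command_for release-native spmv thread_mapped merge_path prints a command that builds loops.spmv.thread_mapped and loops.spmv.merge_path together.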

Optional dependencies

Knob	Default	Effect
`LOOPS_BUILD_TESTS`	`OFF`	Build the unit tests under `unittests/`.
`LOOPS_BUILD_BENCHMARKS`	`OFF`	Build the NVBench-based benchmarks.
`LOOPS_USE_BUNDLED_CCCL`	`ON`	Use the Thrust / CUB / libcu++ that ship with the CUDA Toolkit. Set to `OFF` to fetch the pinned NVIDIA/CCCL via `FetchContent` instead.

The release-with-tests preset is the easiest way to flip the first two on:

cmake --preset release-with-tests
cmake --build --preset release-with-tests -j
ctest --preset release-with-tests
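
Without presets, the same knobs can be switched on with plain -D flags. The sketch below only assembles the flags from the option names in the table. enable_flags is a hypothetical helper, and it is an assumption that the release-with-tests preset sets nothing beyond these two options and a Release build type.

```shell
#!/bin/sh
# Hypothetical helper: turn option names into -D<name>=ON flags.
# Each argument is a full option name, e.g. LOOPS_BUILD_TESTS.
enable_flags() {
  flags=""
  for knob in "$@"; do
    flags="$flags -D$knob=ON"
  done
  # Strip the leading space before printing.
  echo "${flags# }"
}
```

The output of enable_flags LOOPS_BUILD_TESTS LOOPS_BUILD_BENCHMARKS can be appended to cmake -B build -DCMAKE_BUILD_TYPE=Release. After cmake --build build -j, the tests can then be run with ctest --test-dir build (CMake 3.20 or newer).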

Docker

A multi-stage docker/Dockerfile and a matching docker-compose.yml ship with the repository for users who would rather build inside a container. See docker/ for the supported CUDA_VERSION / UBUNTU_VERSION build-args and for how to wire the NVIDIA Container Toolkit into Compose.