Experimentation

If CUDA and CMake are already set up, follow the Getting Started instructions in the top-level README. If you'd rather work from a container, the project ships a Docker setup; instructions live in /docker.

Precisions

Every SpMV example is built twice, once per element type:

loops.spmv.<kernel>.f32 — float
loops.spmv.<kernel>.f64 — double

The CMake plumbing lives in examples/spmv/CMakeLists.txt. Each target resolves the declaration "using type_t = LOOPS_VALUE_T;" to its own element type, so adding a new precision (e.g. __half) is a one-line edit to the LOOPS_SPMV_PRECISIONS list.
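As a rough illustration of the per-target alias, the sketch below assumes the build injects LOOPS_VALUE_T as a compile definition (for example -DLOOPS_VALUE_T=double). That mechanism, the float fallback, and the is_f32 / is_f64 helpers are assumptions made for this example and are not taken from the project's sources.

```cpp
#include <type_traits>

// Assumption: each CMake target passes -DLOOPS_VALUE_T=<type>.
// The fallback exists only so this sketch compiles on its own.
#ifndef LOOPS_VALUE_T
#define LOOPS_VALUE_T float
#endif

// The alias named in the text; every kernel would use type_t for values.
using type_t = LOOPS_VALUE_T;

// Hypothetical compile-time tags, handy for picking tolerances per precision.
constexpr bool is_f32 = std::is_same<type_t, float>::value;
constexpr bool is_f64 = std::is_same<type_t, double>::value;

static_assert(std::is_floating_point<type_t>::value,
              "this sketch only covers float and double");
```

Compiled without any definition, type_t falls back to float; compiled with -DLOOPS_VALUE_T=double, the same source yields the f64 variant.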

Sanity Check

After a successful build, run any of the SpMV examples on the bundled chesapeake matrix:

./build/release-native/bin/loops.spmv.merge_path.f32 \
    -m datasets/chesapeake/chesapeake.mtx --validate -v

You should see something close to:

Elapsed (ms):   0.0XX
Matrix:         chesapeake.mtx
Dimensions:     39 x 39 (340)
Errors:         0

Errors: 0 is the only number that needs to match exactly; the elapsed time naturally varies with hardware. The full set of example binaries is thread_mapped, group_mapped, work_oriented, merge_path, ell_thread_mapped, ell_merge_path, custom_layout, flat_partitioned, coo_thread_mapped, csc_thread_mapped, bcsr_thread_mapped, and dia_thread_mapped. Repeat the check with each of them, in both the .f32 and .f64 variants, to confirm every schedule x layout x precision combination is healthy on your GPU.

Rigorous Validation

--validate only counts naive mismatches, i.e. entries where |y_gpu - y_ref| > tolerance. On large or ill-conditioned matrices (cant, scircuit, hub-heavy graph matrices) float32 SpMV can legitimately accumulate round-off that exceeds a fixed tolerance, so a non-zero Errors count there does not by itself indicate a bug; it may simply be float arithmetic.
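The naive check described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name count_naive_mismatches and the way the tolerance is passed in are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: count entries whose absolute difference from the
// reference exceeds one fixed tolerance, regardless of row length or scale.
template <typename T>
std::size_t count_naive_mismatches(const std::vector<T>& y_gpu,
                                   const std::vector<T>& y_ref,
                                   T tolerance) {
  assert(y_gpu.size() == y_ref.size());
  std::size_t errors = 0;
  for (std::size_t i = 0; i < y_gpu.size(); ++i) {
    if (std::fabs(y_gpu[i] - y_ref[i]) > tolerance) ++errors;
  }
  return errors;
}
```

Because the tolerance ignores how many terms a row sums and how large they are, long or large-magnitude rows can trip it even when the kernel is correct.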

To distinguish "real bug" from "expected float round-off", pass --rigorous:

./build/release-native/bin/loops.spmv.thread_mapped.f32 \
    -m datasets/cant/cant.mtx --rigorous

This recomputes the reference with double-precision accumulation and compares the GPU output against a per-row Wilkinson bound K * nnz_row * eps * row_L1. The output adds:

WilkinsonK:           8
NaiveMismatches:      7
F32BaselineOverruns:  7
GPUOverruns:          0
MaxAbsError:          0.025
MaxRelError:          0.0024
Verdict:              NOT_A_BUG

GPUOverruns == 0 means that every row's deviation from the f64 reference stays within what a valid float32 summation can produce. In other words, the kernel is correct even if the naive --validate count is non-zero.
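A per-row check against the bound K * nnz_row * eps * row_L1 might look like the sketch below. Several details are assumptions, since the text does not spell them out: row_L1 is taken as the sum of |a_ij * x_j| over the row, eps as the float32 machine epsilon, and K defaults to 8 to match the sample output. The csr_t struct and count_gpu_overruns are hypothetical names for this example only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical CSR container used only by this sketch.
struct csr_t {
  std::vector<int> offsets;   // size rows + 1
  std::vector<int> indices;   // column index of each nonzero
  std::vector<float> values;  // value of each nonzero
};

// Count rows where |y_gpu - y_ref64| exceeds K * nnz_row * eps * row_L1.
// The reference is accumulated in double precision.
std::size_t count_gpu_overruns(const csr_t& A, const std::vector<float>& x,
                               const std::vector<float>& y_gpu,
                               double K = 8.0) {
  assert(!A.offsets.empty());
  const std::size_t rows = A.offsets.size() - 1;
  assert(y_gpu.size() == rows);
  const double eps = std::numeric_limits<float>::epsilon();  // assumed eps
  std::size_t overruns = 0;
  for (std::size_t r = 0; r < rows; ++r) {
    double ref = 0.0;     // double-precision reference for this row
    double row_l1 = 0.0;  // assumed definition: sum of |a_ij * x_j|
    for (int k = A.offsets[r]; k < A.offsets[r + 1]; ++k) {
      const double term = double(A.values[k]) * double(x[A.indices[k]]);
      ref += term;
      row_l1 += std::fabs(term);
    }
    const double nnz_row = double(A.offsets[r + 1] - A.offsets[r]);
    const double bound = K * nnz_row * eps * row_l1;
    if (std::fabs(double(y_gpu[r]) - ref) > bound) ++overruns;
  }
  return overruns;
}
```

Unlike a fixed tolerance, this bound grows with the number of nonzeros and the magnitude of the terms in each row, so long or heavy rows are given proportionally more slack.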

What to Do Next

For larger inputs and full benchmarks, see Datasets for how to fetch SuiteSparse.
To re-run the paper's experiments, see Reproducing Results.
To explore the abstraction, see Abstraction, Background, and Load-Balancing API.