Phase 04 Labs — Attention Backends
Four labs that take you inside the kernels the scheduler commands. The arc: build the decode kernel's algorithm (lab-01), widen it to the prefill shape with causal bounds (lab-03), parallelize it with the mergeable-state trick (lab-04), then step back and map the stable of production backends and the selector that picks between them (lab-02).
Recommended order: 01 → 03 → 04 → 02. (The directory numbers predate labs 03 and 04; the
recommended order puts the algorithm first, then its two extensions, then the dispatcher.)
CPU labs follow the standard contract: starter.py (your work), solution.py (reference),
test_lab.py (the spec). By default the tests run the solution; set LAB_IMPL=starter to grade yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-04-attention-backends/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-04-attention-backends/labs/lab-01-paged-attention-gather -q
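The LAB_IMPL switch can be pictured with a small sketch. This is an assumption about how a test file might choose its implementation, not the actual contents of test_lab.py; the helper name pick_impl is hypothetical.

```python
import os


def pick_impl(environ=os.environ):
    """Return the module name a test file would import.

    Hypothetical helper: defaults to 'solution', and LAB_IMPL=starter
    switches to the learner's file, mirroring the contract described above.
    """
    name = environ.get("LAB_IMPL", "solution")
    if name not in ("starter", "solution"):
        raise ValueError(f"LAB_IMPL must be 'starter' or 'solution', got {name!r}")
    return name


# A test module could then load the chosen file with, for example:
#   impl = importlib.import_module(pick_impl())
```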
Contents
- lab-01-paged-attention-gather [CPU-OK]
- lab-02-backend-selection [GPU-OPT]
- lab-03-causal-prefill-attention [CPU-OK]
- lab-04-flash-decoding-partitions [CPU-OK]
- What you can do after this phase
Labs
lab-01-paged-attention-gather [CPU-OK]
The fusion lab: online-softmax (FlashAttention's running max / denominator / accumulator
recurrence) over a paged KV cache (PagedAttention's block-table gather), in ~25 lines of
numpy, proven equal to dense attention — including the partial-last-block bound and the
m = −inf first-block edge. This is the semantics of paged_attention_v1.cu, and the
foundation labs 03 and 04 build on. Skills: the recurrence and why it's exact; the
rescaling correction factor; mapping your variables onto the CUDA kernel's.
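A minimal sketch of the recurrence described above, for a single query vector. It is not the lab's solution.py; the function names, argument layout, and cache shape (num_blocks, block_size, d) are assumptions chosen for illustration.

```python
import numpy as np


def paged_attention_decode(q, k_cache, v_cache, block_table, seq_len, scale):
    """One query attends over a paged KV cache with online softmax.

    q: (d,); k_cache, v_cache: (num_blocks, block_size, d);
    block_table: physical block ids in logical order; seq_len: valid tokens.
    """
    block_size = k_cache.shape[1]
    m = -np.inf                        # running max of the scores seen so far
    l = 0.0                            # running softmax denominator
    acc = np.zeros(v_cache.shape[2])   # running unnormalized output
    for i, phys in enumerate(block_table):
        n = min(block_size, seq_len - i * block_size)  # partial-last-block bound
        if n <= 0:
            break
        s = (k_cache[phys, :n] @ q) * scale
        m_new = max(m, s.max())
        alpha = np.exp(m - m_new)      # rescaling factor; exp(-inf) = 0 on the first block
        p = np.exp(s - m_new)
        l = alpha * l + p.sum()
        acc = alpha * acc + p @ v_cache[phys, :n]
        m = m_new
    return acc / l


def dense_reference(q, k_cache, v_cache, block_table, seq_len, scale):
    """Gather the blocks into contiguous K/V, then apply an ordinary softmax."""
    k = np.concatenate([k_cache[b] for b in block_table])[:seq_len]
    v = np.concatenate([v_cache[b] for b in block_table])[:seq_len]
    s = (k @ q) * scale
    p = np.exp(s - s.max())
    return (p @ v) / p.sum()
```

Both functions should agree to floating-point tolerance for any block table and any seq_len that fits in it, including a seq_len that leaves the last block partially filled.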
lab-02-backend-selection [GPU-OPT]
Run the selector, override it (VLLM_ATTENTION_BACKEND), read get_attn_backend
(selector.py:52), and build the (GPU, dtype, model) → backend matrix — including why MLA
models force a backend while sliding windows merely filter candidates. Captured output
included for the GPU-less. Skills: the two-run kernel-bisection habit; backends differ
in the last ulp legitimately; why selection is startup-time configuration.
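The last-ulp point can be seen without any GPU. Floating-point addition is not associative, so two kernels that reduce the same values in a different order can legitimately disagree in the final bit. The sketch below uses plain Python floats as a stand-in; it does not run any vLLM backend.

```python
# Same three numbers, two reduction orders.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)        # False
print(abs(left - right))    # a difference on the order of one ulp
```

This is why comparing two backends calls for a tolerance rather than bitwise equality.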
lab-03-causal-prefill-attention [CPU-OK]
The prefill shape: M queries starting at start_pos, each attending over exactly its
causal prefix — where the mask degenerates into a loop bound and chunked prefill becomes
just "queries that don't start at zero." The payoff test proves chunked ≡ one-shot in
attention outputs (Phase 3 lab-02's theorem, at the layer that enforces it), and a
poisoned-future test makes causality violations deafening. Skills: decode vs prefill as
loop shapes; start_pos/query_start_loc metadata; why prefill is compute-bound in this
very loop nest.
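A sketch of the prefill loop shape, assuming a contiguous (unpaged) K/V for brevity and a plain per-query softmax rather than the streaming recurrence. The function name and signature are illustrative, not the lab's.

```python
import numpy as np


def causal_prefill(q, k, v, start_pos, scale):
    """q: (M, d) queries at positions start_pos .. start_pos + M - 1.
    k, v: (T, d) with T >= start_pos + M. Returns (M, d_v) outputs.
    """
    out = np.empty((q.shape[0], v.shape[1]))
    for i in range(q.shape[0]):
        end = start_pos + i + 1          # the causal mask, expressed as a loop bound
        s = (k[:end] @ q[i]) * scale
        p = np.exp(s - s.max())
        out[i] = (p @ v[:end]) / p.sum()
    return out


# Chunked prefill is the same call with a nonzero start_pos:
#   full   = causal_prefill(q, k, v, 0, scale)
#   first  = causal_prefill(q[:5], k, v, 0, scale)
#   second = causal_prefill(q[5:], k, v, 5, scale)
#   np.concatenate([first, second]) should match full.
```

Because each query slices only k[:end] and v[:end], values at future positions never enter the computation, which is what a poisoned-future test checks.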
lab-04-flash-decoding-partitions [CPU-OK]
The parallelism lab: attention state compresses to a mergeable (max, denom, unnormalized-acc) triple, so a 128k-token decode can be split across partitions computed
independently and merged exactly: any partition count, any merge order, any tree shape,
all equal to dense attention to within 1e-12. This is paged_attention_v2, flash-decoding, FlashInfer
split-k, and (stretched across GPUs) Phase 10's context parallelism. Skills: the
attention monoid; never normalize a partial; why long-context decode is where backends
differ.
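The mergeable triple can be sketched in a few lines. The names partial_state, merge, merge_all, and finalize are assumptions for illustration, not the lab's API, and K/V are contiguous here rather than paged.

```python
import numpy as np
from functools import reduce


def partial_state(q, k, v, scale):
    """Attention state for one partition: (max, denom, unnormalized acc)."""
    s = (k @ q) * scale
    m = s.max()
    p = np.exp(s - m)
    return m, p.sum(), p @ v


def merge(a, b):
    """Combine two partial states; associative and commutative up to rounding."""
    m_a, l_a, acc_a = a
    m_b, l_b, acc_b = b
    m = max(m_a, m_b)
    w_a, w_b = np.exp(m_a - m), np.exp(m_b - m)   # rescale both sides to the shared max
    return m, w_a * l_a + w_b * l_b, w_a * acc_a + w_b * acc_b


def merge_all(states):
    """Left fold over any sequence of partial states."""
    return reduce(merge, states)


def finalize(state):
    """Normalize once, at the very end; partials are never normalized."""
    _, l, acc = state
    return acc / l
```

Splitting the keys at arbitrary boundaries, computing partial_state on each slice, and folding with merge in any order should reproduce finalize(partial_state(q, k, v, scale)) on the whole sequence to floating-point tolerance.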
What you can do after this phase
Read any attention backend in vllm/v1/attention/backends/ and find the three things that
are always there: the streaming recurrence (lab-01), the shape/metadata handling for
prefill vs decode (lab-03), and the reduction strategy (lab-04). Diagnose a kernel
suspicion with the backend-override bisection (lab-02), predict which backend a deployment
runs before it starts, and explain to a colleague why paged + flash + split-KV compose
without approximation. Phase 5 freezes these kernels into CUDA graphs; Phase 7 goes below
them into GEMMs.