Phase 04 Labs — Attention Backends

Four labs that take you inside the kernels the scheduler commands. The arc: build the decode kernel's algorithm (lab-01), widen it to the prefill shape with causal bounds (lab-03), parallelize it with the mergeable-state trick (lab-04), then step back and map the stable of production backends and the selector that picks between them (lab-02).

Recommended order: 01 → 03 → 04 → 02. (Directory numbers predate labs 03–04; the recommended order runs algorithm first, then its two extensions, then the dispatcher.) CPU labs follow the standard contract: starter.py (your work), solution.py (reference), test_lab.py (the spec). By default the tests run the solution; LAB_IMPL=starter grades yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-04-attention-backends/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-04-attention-backends/labs/lab-01-paged-attention-gather -q

Labs

lab-01-paged-attention-gather [CPU-OK]

The fusion lab: online-softmax (FlashAttention's running max / denominator / accumulator recurrence) over a paged KV cache (PagedAttention's block-table gather), in ~25 lines of numpy, proven equal to dense attention — including the partial-last-block bound and the m = −inf first-block edge. This is the semantics of paged_attention_v1.cu, and the foundation labs 03 and 04 build on. Skills: the recurrence and why it's exact; the rescaling correction factor; mapping your variables onto the CUDA kernel's.
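
The recurrence described above can be sketched as follows. This is a minimal illustration, not the lab's reference solution: the function names (dense_attention, paged_decode), the cache layout (num_blocks, block_size, d), and the toy sizes are assumptions chosen for readability.

```python
import numpy as np

def dense_attention(q, K, V):
    """Reference: softmax(K q / sqrt(d)) weighted sum of V, for one query."""
    s = K @ q / np.sqrt(q.shape[0])
    p = np.exp(s - s.max())
    return (p / p.sum()) @ V

def paged_decode(q, k_cache, v_cache, block_table, seq_len):
    """Online-softmax attention over a paged KV cache (hypothetical layout).

    k_cache, v_cache: (num_blocks, block_size, d) physical storage.
    block_table: physical block ids of this sequence, in logical order.
    """
    block_size, d = k_cache.shape[1], k_cache.shape[2]
    m = -np.inf        # running max of the scores seen so far
    l = 0.0            # running softmax denominator
    acc = np.zeros(d)  # running unnormalized output
    for i, blk in enumerate(block_table):
        n = min(block_size, seq_len - i * block_size)  # partial-last-block bound
        if n <= 0:
            break
        s = k_cache[blk, :n] @ q / np.sqrt(d)
        m_new = max(m, s.max())
        alpha = np.exp(m - m_new)  # rescaling factor; exp(-inf) = 0 on the first block
        p = np.exp(s - m_new)
        l = alpha * l + p.sum()
        acc = alpha * acc + p @ v_cache[blk, :n]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
d, block_size, seq_len, num_blocks = 8, 4, 10, 6
k_cache = rng.standard_normal((num_blocks, block_size, d))
v_cache = rng.standard_normal((num_blocks, block_size, d))
block_table = [5, 1, 3]  # three scattered blocks; the last holds only 2 valid tokens
q = rng.standard_normal(d)

# Gather the logical K and V for the dense reference.
K = np.concatenate([k_cache[b] for b in block_table])[:seq_len]
V = np.concatenate([v_cache[b] for b in block_table])[:seq_len]

out = paged_decode(q, k_cache, v_cache, block_table, seq_len)
ref = dense_attention(q, K, V)
print(np.abs(out - ref).max())
```

The result matches dense attention to rounding error because the correction factor alpha only re-expresses earlier terms relative to the new maximum; no term is dropped or approximated.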

lab-02-backend-selection [GPU-OPT]

Run the selector, override it (VLLM_ATTENTION_BACKEND), read get_attn_backend (selector.py:52), and build the (GPU, dtype, model) → backend matrix, including why MLA models force a backend while sliding windows merely filter candidates. Captured output is included for readers without a GPU. Skills: the two-run kernel-bisection habit; why backends can legitimately differ in the last ulp; why selection is startup-time configuration.
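
The last-ulp point comes from floating-point addition not being associative: two kernels that reduce the same values in a different order can both be correct and still disagree in the final bits. The sketch below is plain numpy, not vLLM code, and the array size and chunking are arbitrary assumptions; it sums the same float32 values in two orders and compares both against a float64 reference.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(4096).astype(np.float32)

# Order A: one left-to-right float32 accumulator.
sequential = np.float32(0.0)
for v in x:
    sequential += v

# Order B: 64 partial sums of 64 elements each, then a sum of the partials.
chunked = np.float32(0.0)
for chunk in x.reshape(64, 64):
    part = np.float32(0.0)
    for v in chunk:
        part += v
    chunked += part

reference = x.astype(np.float64).sum()
print("sequential:", sequential)
print("chunked:   ", chunked)
print("difference:", float(sequential) - float(chunked))
```

When comparing two backends in a bisection run, this is why a tolerance check is the right test and bitwise equality is not: a tiny difference is expected, a large one is the signal.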

lab-03-causal-prefill-attention [CPU-OK]

The prefill shape: M queries starting at start_pos, each attending over exactly its causal prefix — where the mask degenerates into a loop bound and chunked prefill becomes just "queries that don't start at zero." The payoff test proves chunked ≡ one-shot in attention outputs (Phase 3 lab-02's theorem, at the layer that enforces it), and a poisoned-future test makes causality violations deafening. Skills: decode vs prefill as loop shapes; start_pos/query_start_loc metadata; why prefill is compute-bound in this very loop nest.
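
The "mask as a loop bound" idea can be sketched in a few lines. This is an illustrative version under assumed names and shapes (causal_prefill, a single head, K and V already holding the full sequence), not the lab's solution: query i sits at absolute position start_pos + i and reads keys 0 through start_pos + i inclusive.

```python
import numpy as np

def causal_prefill(Q, K, V, start_pos):
    """Attention for M queries whose first query is at absolute position start_pos."""
    M, d = Q.shape
    out = np.empty((M, V.shape[1]))
    for i in range(M):
        end = start_pos + i + 1  # the causal mask, expressed as a loop bound
        s = K[:end] @ Q[i] / np.sqrt(d)
        p = np.exp(s - s.max())
        out[i] = (p / p.sum()) @ V[:end]
    return out

rng = np.random.default_rng(1)
T, d = 12, 8
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# One-shot prefill versus two chunks; the second chunk starts at position 5.
one_shot = causal_prefill(Q, K, V, start_pos=0)
chunked = np.concatenate([
    causal_prefill(Q[:5], K, V, start_pos=0),
    causal_prefill(Q[5:], K, V, start_pos=5),
])

# Poisoned future: overwrite every key/value the first chunk must not read.
K_poison, V_poison = K.copy(), V.copy()
K_poison[5:] = np.nan
V_poison[5:] = np.nan
first_chunk = causal_prefill(Q[:5], K_poison, V_poison, start_pos=0)

print(np.abs(chunked - one_shot).max(), np.isnan(first_chunk).any())
```

If the loop bound were off by one in the wrong direction, the NaNs would propagate into first_chunk, which is what makes the poisoned-future check hard to miss.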

lab-04-flash-decoding-partitions [CPU-OK]

The parallelism lab: attention state compresses to a mergeable (max, denom, unnormalized-acc) triple, so a 128k-token decode can be split across partitions computed independently and merged exactly — any partition count, any merge order, any tree shape, all 1e-12-equal to dense. This is paged_attention_v2, flash-decoding, FlashInfer split-k, and (stretched across GPUs) Phase 10's context parallelism. Skills: the attention monoid; never normalize a partial; why long-context decode is where backends differ.
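
A small sketch of the mergeable triple, under assumed names (partial_state, merge, finalize, tree_merge) and toy sizes; it is meant to illustrate the idea, not to reproduce the lab's solution. Each partition produces (max, denom, unnormalized acc), merging rescales both sides to the shared maximum, and normalization happens exactly once at the end.

```python
import numpy as np
from functools import reduce

def partial_state(q, K, V):
    """Attention state for one partition: (max, denom, unnormalized acc)."""
    s = K @ q / np.sqrt(q.shape[0])
    m = s.max()
    p = np.exp(s - m)
    return m, p.sum(), p @ V

def merge(a, b):
    """Combine two partial states by rescaling both to the shared max."""
    (m1, l1, a1), (m2, l2, a2) = a, b
    m = max(m1, m2)
    w1, w2 = np.exp(m1 - m), np.exp(m2 - m)
    return m, w1 * l1 + w2 * l2, w1 * a1 + w2 * a2

def finalize(state):
    """Normalize once, only after every partition has been merged."""
    _, l, acc = state
    return acc / l

def tree_merge(states):
    """Pairwise (tree-shaped) reduction over a list of partial states."""
    while len(states) > 1:
        states = [merge(states[i], states[i + 1]) if i + 1 < len(states) else states[i]
                  for i in range(0, len(states), 2)]
    return states[0]

rng = np.random.default_rng(2)
T, d = 1000, 16
q = rng.standard_normal(d)
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))

# Seven uneven partitions of the KV sequence, each computed independently.
states = [partial_state(q, Kp, Vp)
          for Kp, Vp in zip(np.array_split(K, 7), np.array_split(V, 7))]

left = finalize(reduce(merge, states))
right = finalize(reduce(merge, reversed(states)))
tree = finalize(tree_merge(list(states)))

# Dense reference computed in one pass.
s = K @ q / np.sqrt(d)
p = np.exp(s - s.max())
ref = (p @ V) / p.sum()
print(np.abs(left - ref).max(), np.abs(right - ref).max(), np.abs(tree - ref).max())
```

Normalizing a partition before merging would discard its denominator, and the merge could no longer weight the two sides correctly; that is the reason behind "never normalize a partial".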

What you can do after this phase

Read any attention backend in vllm/v1/attention/backends/ and find the three things that are always there: the streaming recurrence (lab-01), the shape/metadata handling for prefill vs decode (lab-03), and the reduction strategy (lab-04). Diagnose a kernel suspicion with the backend-override bisection (lab-02), predict which backend a deployment runs before it starts, and explain to a colleague why paged + flash + split-KV compose without approximation. Phase 5 freezes these kernels into CUDA graphs; Phase 7 goes below them into GEMMs.