Phase 05 Labs — CUDA Graphs & torch.compile
Five labs on the machinery that turns a launch-bound decode loop into replayed recordings. The arc: build capture/replay and meet its two constraints (lab-01), derive the economics — crossover and ceiling (lab-02), solve the variable-batch problem with the capture-size ladder (lab-05), route decode vs mixed batches with the mode dispatch (lab-03), then measure the whole stack on real silicon (lab-04).
Recommended order: 01 → 02 → 05 → 03 → 04. (Directory numbers predate lab-05:
mechanism, economics, then the two policy layers, then the measurement.) CPU labs follow
the standard contract — starter.py (your work), solution.py (reference), test_lab.py
(the spec); default runs the solution, LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-05-cuda-graphs-and-torch-compile/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-01-graph-replay-simulator -q
Contents
- lab-01-graph-replay-simulator
[CPU-OK] - lab-02-launch-overhead
[CPU-OK] - lab-03-cudagraph-mode
[CPU-OK] - lab-04-graph-vs-eager-real
[GPU-REQ] - lab-05-capture-sizes
[CPU-OK] - What you can do after this phase
Labs
lab-01-graph-replay-simulator [CPU-OK]
Build capture/replay on CPU: a runner that records an op sequence per shape and replays
it as a single launch, copying new inputs into a static buffer first. The two infamous
constraints (fixed shape, fixed addresses) fall out as the direct price of the win — a
graph is a recording, and recordings don't take arguments. Mirrors
mini_vllm/cudagraph.py and the real CUDAGraphWrapper. Skills: capture vs replay
accounting; copyto-not-rebind; shape-keyed dispatch.
lab-02-launch-overhead [CPU-OK]
The economics as closed forms: break-even at the second call, asymptotic speedup = number of ops captured, dilution by Amdahl. Turns "graphs help low-batch decode" into formulas you can defend and extrapolate — including when not to bother (single fused op, never-repeating shapes). Skills: crossover/ceiling analysis; the upfront-cost- amortized pattern that recurs in every compilation decision.
lab-03-cudagraph-mode [CPU-OK]
Reimplement the CUDAGraphMode enum's dispatch: composite modes as (decode, mixed)
tuples, FULL for uniform decode batches, PIECEWISE (split at attention) for ragged mixed
ones, and the compile-time dependency requires_piecewise_compilation guards. The ten
lines where chunked prefill, attention metadata, and graph constraints reconcile.
Skills: the routing table; compile-time vs run-time configuration; reading the
two-pass capture log.
lab-04-graph-vs-eager-real [GPU-REQ]
The validation: enforce_eager=True vs default at batch 1/8/64 on an L4 — 2.5× fading
to 1.13×, exactly lab-02's curve, plus the capture log showing lab-03's two routines and
lab-05's 23-rung ladder. Annotated capture included for the GPU-less. Skills:
falsifiable-prediction benchmarking; extrapolating to other models/hardware; when
enforce_eager is right (tests, debugging) vs wrong (serving).
lab-05-capture-sizes [CPU-OK]
The variable-batch problem: capture a ladder of sizes, pad every batch up to the
nearest rung. Implement the ladder, the lookup, and the waste accounting — answering the
production FAQ "why is my batch of 33 running at 40?" and quantifying the
denser-ladder-vs-more-captures trade. Skills: padding as the price of replay; bucketing
continuous quantities; reading cudagraph_capture_sizes.
What you can do after this phase
Explain CUDA graphs as a systems mechanism (recording + static buffers + shape dict)
rather than GPU folklore; predict graph benefits for a given model size, batch
distribution, and hardware before measuring; decode every line of vLLM's capture-time
logging; tune cudagraph_mode and cudagraph_capture_sizes from workload evidence; and
know exactly what enforce_eager=True trades, in both directions. Phase 6 changes what's
inside the kernels (quantization); Phase 8's draft models are where graph mastery pays
double.