Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 05 Labs — CUDA Graphs & torch.compile

Five labs on the machinery that turns a launch-bound decode loop into replayed recordings. The arc: build capture/replay and meet its two constraints (lab-01), derive the economics — crossover and ceiling (lab-02), solve the variable-batch problem with the capture-size ladder (lab-05), route decode vs mixed batches with the mode dispatch (lab-03), then measure the whole stack on real silicon (lab-04).

Recommended order: 01 → 02 → 05 → 03 → 04. (Directory numbers predate lab-05: mechanism, economics, then the two policy layers, then the measurement.) CPU labs follow the standard contract — starter.py (your work), solution.py (reference), test_lab.py (the spec); default runs the solution, LAB_IMPL=starter grades yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-05-cuda-graphs-and-torch-compile/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-01-graph-replay-simulator -q

Contents


Labs

lab-01-graph-replay-simulator [CPU-OK]

Build capture/replay on CPU: a runner that records an op sequence per shape and replays it as a single launch, copying new inputs into a static buffer first. The two infamous constraints (fixed shape, fixed addresses) fall out as the direct price of the win — a graph is a recording, and recordings don't take arguments. Mirrors mini_vllm/cudagraph.py and the real CUDAGraphWrapper. Skills: capture vs replay accounting; copyto-not-rebind; shape-keyed dispatch.

lab-02-launch-overhead [CPU-OK]

The economics as closed forms: break-even at the second call, asymptotic speedup = number of ops captured, dilution by Amdahl. Turns "graphs help low-batch decode" into formulas you can defend and extrapolate — including when not to bother (single fused op, never-repeating shapes). Skills: crossover/ceiling analysis; the upfront-cost- amortized pattern that recurs in every compilation decision.

lab-03-cudagraph-mode [CPU-OK]

Reimplement the CUDAGraphMode enum's dispatch: composite modes as (decode, mixed) tuples, FULL for uniform decode batches, PIECEWISE (split at attention) for ragged mixed ones, and the compile-time dependency requires_piecewise_compilation guards. The ten lines where chunked prefill, attention metadata, and graph constraints reconcile. Skills: the routing table; compile-time vs run-time configuration; reading the two-pass capture log.

lab-04-graph-vs-eager-real [GPU-REQ]

The validation: enforce_eager=True vs default at batch 1/8/64 on an L4 — 2.5× fading to 1.13×, exactly lab-02's curve, plus the capture log showing lab-03's two routines and lab-05's 23-rung ladder. Annotated capture included for the GPU-less. Skills: falsifiable-prediction benchmarking; extrapolating to other models/hardware; when enforce_eager is right (tests, debugging) vs wrong (serving).

lab-05-capture-sizes [CPU-OK]

The variable-batch problem: capture a ladder of sizes, pad every batch up to the nearest rung. Implement the ladder, the lookup, and the waste accounting — answering the production FAQ "why is my batch of 33 running at 40?" and quantifying the denser-ladder-vs-more-captures trade. Skills: padding as the price of replay; bucketing continuous quantities; reading cudagraph_capture_sizes.

What you can do after this phase

Explain CUDA graphs as a systems mechanism (recording + static buffers + shape dict) rather than GPU folklore; predict graph benefits for a given model size, batch distribution, and hardware before measuring; decode every line of vLLM's capture-time logging; tune cudagraph_mode and cudagraph_capture_sizes from workload evidence; and know exactly what enforce_eager=True trades, in both directions. Phase 6 changes what's inside the kernels (quantization); Phase 8's draft models are where graph mastery pays double.