Lab 05-05 — Capture Sizes: Bucketing Batches into Graphs [CPU-OK]
Lab-01 left you with a tension it didn't resolve. A CUDA graph is captured per shape
(Constraint 1) — but a decode batch's size changes every step as requests join and finish
(Phase 1 lab-04 showed you the churn). Capture a graph for every possible batch size from
1 to max_num_seqs? That's hundreds of captures: minutes of startup and gigabytes of
graph memory. Capture only a few? Then most steps have no matching graph. vLLM's answer is
the capture-size ladder: a curated list of sizes, with every batch padded up to the
nearest rung. In this lab you implement the ladder, the rung lookup, and the waste
accounting — and answer the production question this mechanism generates weekly: "why is
my batch of 33 running at size 40?"
Contents
- Why this lab exists
- Background: padding as the price of replay
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
This is the lab where CUDA graphs stop being a binary feature ("on = fast") and become a
budgeted trade you can reason about quantitatively. Every rung in the ladder costs
capture time at startup and graph-pool memory forever; every gap between rungs costs
padded rows — real FLOPs spent computing garbage that's discarded — on every step that
lands in the gap. The deliverable skill: given a workload's batch-size distribution, say
whether the default ladder fits it, and what changing cudagraph_capture_sizes (or
max capture size) would buy. That's a real tuning lever (Phase 18) hiding behind an
innocuous config list.
It also explains two log lines and one metric that otherwise mystify operators: the
Capturing CUDA graphs ... 23/23 startup progress bar (that's the ladder's length — count
the rungs in default_capture_sizes(512)), the graph-pool memory in took 0.41 GiB
(lab-04's capture pass), and the small constant gap between num_running and the batch
size the profiler shows (the padding).
Background: padding as the price of replay
Replay requires the captured shape, exactly (lab-01, Constraint 2: same buffers, same sizes). A batch of 33 with a graph captured at 40 runs as follows: the 33 real rows are copied into the static input buffer, rows 34–40 are filled with junk (typically zeros or stale data — and it doesn't matter, because their outputs are never read), and the whole 40-row graph replays. The padded rows cost ~7/40 ≈ 17% extra compute for that step — almost always cheaper than the alternative (an eager step paying per-kernel launches), because decode steps are launch-overhead-dominated at exactly these small sizes (lab-02's regime, Phase 0 lab-04's roofline).
The ladder's shape encodes where padding hurts: rungs are dense at small sizes ([1, 2, 4], then every 8) because relative waste is worst there — padding 3 → 4 is 33%, padding 250 → 256 is 2.4%. Above the largest rung the engine just runs eagerly: at that much GPU work per step, launch overhead is amortized anyway and graphs stop mattering (lab-04's shrinking gap, measured).
Files
starter.py—default_capture_sizes,select_capture_size,padded_tokens,trace_waste. Your work.solution.py— reference.test_lab.py— the ladder's exact shape, exact-rung hits, round-up, eager fallback, trace accounting, and the density trade-off.
Run
LAB_IMPL=starter pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-05-capture-sizes -q
pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-05-capture-sizes -q # reference
What the tests prove
| Test | What it pins |
|---|---|
test_ladder_shape | [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64] for max 64 — dense low, every-8 above |
test_exact_rung_pays_nothing | Landing on a rung is free — and why benchmarks at batch 1/8/64 can overstate graph benefits vs your real traffic at batch 33 |
test_between_rungs_rounds_up | 33 → 40 (waste 7): the production FAQ, answered with arithmetic |
test_oversize_batch_runs_eager | Above the top rung: no padding, no graph — the eager fallback is a normal path, not an error |
test_trace_accounting | Waste summed over a step trace — the quantity you'd actually plot for a workload |
test_finer_ladder_trades_graphs_for_padding | Denser ladder = less padding but more rungs to capture: the whole design space in one assert |
Hitchhiker's notes
- Where this lives upstream:
cudagraph_capture_sizesinupstream/vllm/config/compilation.py, consumed by the model runner's dummy-run capture loop at startup and by the per-step batch padding (searchpadingpu_model_runner.py). The real ladder is the same shape as yours with a higher default ceiling (typically 512). - Padding interacts with the sampler, not just the GEMMs. Padded rows produce logits
too; the engine must not sample from them. Upstream handles it by slicing real rows
out before sampling (
logits_indicesagain — Phase 1 lab-03's guard doing one more job). When you see careful index plumbing around batch padding in a PR, this is what it's protecting. - The same bucketing idea recurs everywhere shapes must be finite: torch.compile's dynamic-shape buckets, TensorRT optimization profiles, XLA padding on TPUs (Phase 17 — where padding costs are far more dramatic). "Continuous quantity → discrete ladder + round up" is a pattern, and its failure mode is always the same: a workload that sits just above a rung, paying maximum waste consistently. Check the distribution, not the mean.
- Why not capture on demand — first time a size appears, capture it? Capture requires a warmup run, allocations, and stream quiescence: a multi-hundred-ms stall mid-serving the first time batch=37 shows up, and unbounded graph memory growth over a day of traffic. Startup capture converts an unpredictable runtime stall into a predictable boot cost — the same "pay it where you can see it" philosophy as the Phase 2 lab-03 memory profiling pass.
Going further
- Take the batch-size trace from Phase 1 lab-04's probe (lengths of each step's dict) and
run
trace_wasteover it for three ladders: default, powers-of-two only, every-4. Compute waste as a fraction of real rows — for bursty traces the answer often surprises. - Model capture cost: give each rung a fixed cost (say, 0.3 s + 8 MB) and find, for a given trace length, the ladder that minimizes total cost (capture + padded-row time). You've just turned a config knob into an optimization problem — Phase 18's worldview.
- Read the capture loop in
gpu_model_runner.py(searchcapture) and find where the ladder is iterated largest first — then work out why (memory-pool reuse: the biggest graph's buffers can be shared by the smaller ones).
References
upstream/vllm/config/compilation.py—cudagraph_capture_sizesand the mode enum (lab-03).upstream/vllm/v1/worker/gpu_model_runner.py— the capture loop and per-step padding.- vLLM blog, vLLM V1 — the compilation + capture architecture: https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
- NVIDIA, CUDA Graphs (programming guide) — what capture/replay actually does: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#cuda-graphs
- Lab-04 — the measured shrinking-gap curve this ladder is tuned against.