Lab 05-05 — Capture Sizes: Bucketing Batches into Graphs `[CPU-OK]`

Lab-01 left you with a tension it didn't resolve. A CUDA graph is captured per shape (Constraint 1) — but a decode batch's size changes every step as requests join and finish (Phase 1 lab-04 showed you the churn). Capture a graph for every possible batch size from 1 to max_num_seqs? That's hundreds of captures: minutes of startup and gigabytes of graph memory. Capture only a few? Then most steps have no matching graph. vLLM's answer is the capture-size ladder: a curated list of sizes, with every batch padded up to the nearest rung. In this lab you implement the ladder, the rung lookup, and the waste accounting — and answer the production question this mechanism generates weekly: "why is my batch of 33 running at size 40?"

Why this lab exists
Background: padding as the price of replay
Files
Run
What the tests prove
Hitchhiker's notes
Going further
References

Why this lab exists

This is the lab where CUDA graphs stop being a binary feature ("on = fast") and become a budgeted trade you can reason about quantitatively. Every rung in the ladder costs capture time at startup and graph-pool memory forever; every gap between rungs costs padded rows — real FLOPs spent computing garbage that's discarded — on every step that lands in the gap. The deliverable skill: given a workload's batch-size distribution, say whether the default ladder fits it, and what changing cudagraph_capture_sizes (or max capture size) would buy. That's a real tuning lever (Phase 18) hiding behind an innocuous config list.

It also explains two log lines and one metric that otherwise mystify operators: the Capturing CUDA graphs ... 23/23 startup progress bar (that's the ladder's length — count the rungs in default_capture_sizes(512)), the graph-pool memory in took 0.41 GiB (lab-04's capture pass), and the small constant gap between num_running and the batch size the profiler shows (the padding).

Background: padding as the price of replay

Replay requires the captured shape, exactly (lab-01, Constraint 2: same buffers, same sizes). A batch of 33 with a graph captured at 40 runs as follows: the 33 real rows are copied into the static input buffer, rows 34–40 are filled with junk (typically zeros or stale data — and it doesn't matter, because their outputs are never read), and the whole 40-row graph replays. The padded rows cost ~7/40 ≈ 17% extra compute for that step — almost always cheaper than the alternative (an eager step paying per-kernel launches), because decode steps are launch-overhead-dominated at exactly these small sizes (lab-02's regime, Phase 0 lab-04's roofline).

The ladder's shape encodes where padding hurts: rungs are dense at small sizes ([1, 2, 4], then every 8) because relative waste is worst there — padding 3 → 4 is 33%, padding 250 → 256 is 2.4%. Above the largest rung the engine just runs eagerly: at that much GPU work per step, launch overhead is amortized anyway and graphs stop mattering (lab-04's shrinking gap, measured).

Files

starter.py — default_capture_sizes, select_capture_size, padded_tokens, trace_waste. Your work.
solution.py — reference.
test_lab.py — the ladder's exact shape, exact-rung hits, round-up, eager fallback, trace accounting, and the density trade-off.

Run

LAB_IMPL=starter pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-05-capture-sizes -q
pytest phase-05-cuda-graphs-and-torch-compile/labs/lab-05-capture-sizes -q   # reference

What the tests prove

Test	What it pins
`test_ladder_shape`	`[1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64]` for max 64 — dense low, every-8 above
`test_exact_rung_pays_nothing`	Landing on a rung is free — and why benchmarks at batch 1/8/64 can overstate graph benefits vs your real traffic at batch 33
`test_between_rungs_rounds_up`	33 → 40 (waste 7): the production FAQ, answered with arithmetic
`test_oversize_batch_runs_eager`	Above the top rung: no padding, no graph — the eager fallback is a normal path, not an error
`test_trace_accounting`	Waste summed over a step trace — the quantity you'd actually plot for a workload
`test_finer_ladder_trades_graphs_for_padding`	Denser ladder = less padding but more rungs to capture: the whole design space in one assert

Hitchhiker's notes

Where this lives upstream: cudagraph_capture_sizes in upstream/vllm/config/compilation.py, consumed by the model runner's dummy-run capture loop at startup and by the per-step batch padding (search pad in gpu_model_runner.py). The real ladder is the same shape as yours with a higher default ceiling (typically 512).
Padding interacts with the sampler, not just the GEMMs. Padded rows produce logits too; the engine must not sample from them. Upstream handles it by slicing real rows out before sampling (logits_indices again — Phase 1 lab-03's guard doing one more job). When you see careful index plumbing around batch padding in a PR, this is what it's protecting.
The same bucketing idea recurs everywhere shapes must be finite: torch.compile's dynamic-shape buckets, TensorRT optimization profiles, XLA padding on TPUs (Phase 17 — where padding costs are far more dramatic). "Continuous quantity → discrete ladder + round up" is a pattern, and its failure mode is always the same: a workload that sits just above a rung, paying maximum waste consistently. Check the distribution, not the mean.
Why not capture on demand — first time a size appears, capture it? Capture requires a warmup run, allocations, and stream quiescence: a multi-hundred-ms stall mid-serving the first time batch=37 shows up, and unbounded graph memory growth over a day of traffic. Startup capture converts an unpredictable runtime stall into a predictable boot cost — the same "pay it where you can see it" philosophy as the Phase 2 lab-03 memory profiling pass.

Going further

Take the batch-size trace from Phase 1 lab-04's probe (lengths of each step's dict) and run trace_waste over it for three ladders: default, powers-of-two only, every-4. Compute waste as a fraction of real rows — for bursty traces the answer often surprises.
Model capture cost: give each rung a fixed cost (say, 0.3 s + 8 MB) and find, for a given trace length, the ladder that minimizes total cost (capture + padded-row time). You've just turned a config knob into an optimization problem — Phase 18's worldview.
Read the capture loop in gpu_model_runner.py (search capture) and find where the ladder is iterated largest first — then work out why (memory-pool reuse: the biggest graph's buffers can be shared by the smaller ones).

References

upstream/vllm/config/compilation.py — cudagraph_capture_sizes and the mode enum (lab-03).
upstream/vllm/v1/worker/gpu_model_runner.py — the capture loop and per-step padding.
vLLM blog, vLLM V1 — the compilation + capture architecture: https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
NVIDIA, CUDA Graphs (programming guide) — what capture/replay actually does: https://docs.nvidia.com/cuda/cuda-c-programming-guide/#cuda-graphs
Lab-04 — the measured shrinking-gap curve this ladder is tuned against.

vLLM Mastery — From Zero to Maintainer