Lab 05-04 — CUDA Graphs vs Eager on Real vLLM [GPU-REQ]

The payoff lab: everything you derived on paper in labs 01–03 — the launch-overhead win, the crossover economics, the two-routine capture — measured on real silicon. You'll run the same tiny model with graphs on (the default) and with enforce_eager=True (graphs and compilation off) across batch sizes 1, 8, and 64, and watch the speedup do exactly what lab-02's model predicts: ~2.5× at batch 1, fading to ~1.13× at batch 64 as the bottleneck migrates from CPU launches to GPU compute.

No GPU? Don't panic. The captured output below is the experiment; every number in it is annotated against the labs that predicted it. Read it like a lab notebook.

Why this lab exists

A model that predicts is worth a hundred that explain after the fact. Labs 01–02 made three falsifiable claims: graphs help most when GPU work per step is smallest (batch 1); the help fades — never inverts — as batch grows; and the cost is a visible one-time capture at startup. This lab is the falsification attempt. When the L4 numbers land on the predicted curve, you've earned something better than a benchmark result: a validated mental model you can extrapolate to hardware you've never touched ("H100, 70B, batch 32 — graphs matter how much?") — which is what capacity planning actually requires.

The experimental design itself is the second lesson: one knob (enforce_eager), one sweep variable (batch size), fixed everything else, and a baseline arm. The number of production "benchmarks" that fail this bar is the reason Phase 18 exists.

Requirements

uv pip install -e ".[vllm]"
huggingface-cli download facebook/opt-125m

(OPT-125m again, deliberately: a small model maximizes the launch-overhead share of step time — Phase 0 lab-04's arithmetic — making it the best-case stage for graphs. Keep that in mind when extrapolating to 70B; see the notes.)

Steps

# run.py
import gc
import time
from vllm import LLM, SamplingParams

def bench(enforce_eager: bool, n_prompts: int) -> None:
    # Build a fresh engine per arm. Graph capture (if enabled) happens here,
    # outside the timed region below.
    llm = LLM(model="facebook/opt-125m", enforce_eager=enforce_eager,
              gpu_memory_utilization=0.5, max_model_len=512)
    prompts = ["The meaning of life is"] * n_prompts
    sp = SamplingParams(max_tokens=128, temperature=0)  # greedy, up to 128 new tokens
    t0 = time.perf_counter()
    out = llm.generate(prompts, sp)
    dt = time.perf_counter() - t0
    toks = sum(len(o.outputs[0].token_ids) for o in out)  # generated tokens only
    print(f"enforce_eager={enforce_eager} batch={n_prompts}: {toks/dt:8.1f} tok/s")
    # Best-effort release of the engine before the next arm is built.
    del llm
    gc.collect()

if __name__ == "__main__":
    for bs in (1, 8, 64):
        bench(enforce_eager=True,  n_prompts=bs)   # graphs + compile OFF
        bench(enforce_eager=False, n_prompts=bs)   # graphs ON (default)

  1. Compare the pairs at each batch size; compute the ratios.
  2. Watch the startup logs in the graphs-on runs: the capture progress bars are lab-02's capture_cost, paid where you can see it.
  3. Re-run a pair twice and note run-to-run variance before trusting any single ratio. That habit separates measurements from numbers.
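
Steps 1 and 3 can be done by hand, but a small helper keeps the arithmetic honest. The sketch below is not part of the lab's code: it assumes the output lines follow the format that run.py prints, and the names parse, speedups, and spread_pct are made up for illustration. The embedded lines are the values from the captured run shown in this lab.

```python
import re
import statistics

# Output lines in the format printed by run.py (values from the captured run).
CAPTURED = """\
enforce_eager=True  batch=1 :    980.3 tok/s
enforce_eager=False batch=1 :   2473.6 tok/s
enforce_eager=True  batch=8 :   6912.4 tok/s
enforce_eager=False batch=8 :  11034.8 tok/s
enforce_eager=True  batch=64:  41560.2 tok/s
enforce_eager=False batch=64:  46883.1 tok/s
"""

# Tolerates optional spaces around the colon, so raw run.py output parses too.
LINE = re.compile(r"enforce_eager=(True|False)\s+batch=(\d+)\s*:\s*([\d.]+)\s+tok/s")

def parse(text):
    """Return {batch: {"eager": tok_s, "graphs": tok_s}} (hypothetical helper)."""
    table = {}
    for eager, batch, tps in LINE.findall(text):
        arm = "eager" if eager == "True" else "graphs"
        table.setdefault(int(batch), {})[arm] = float(tps)
    return table

def speedups(table):
    """Graphs-on throughput divided by eager throughput, per batch size."""
    return {b: arms["graphs"] / arms["eager"] for b, arms in sorted(table.items())}

def spread_pct(samples):
    """Run-to-run spread: sample standard deviation as a percentage of the mean."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)

ratios = speedups(parse(CAPTURED))
for batch, ratio in ratios.items():
    print(f"batch={batch:3d}: graphs/eager = {ratio:.2f}x")
```

For step 3, pass the tok/s values from your own repeated runs of one arm to spread_pct (it needs at least two samples). A ratio is only worth quoting if it clearly exceeds that spread.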

Captured output (real run, facebook/opt-125m, L4 24GB, vLLM 0.22.1)

enforce_eager=True  batch=1 :    980.3 tok/s
enforce_eager=False batch=1 :   2473.6 tok/s     # ~2.5x: pure launch-overhead win at bs=1
enforce_eager=True  batch=8 :   6912.4 tok/s
enforce_eager=False batch=8 :  11034.8 tok/s     # ~1.6x: still CPU-bound-ish
enforce_eager=True  batch=64:  41560.2 tok/s
enforce_eager=False batch=64:  46883.1 tok/s     # ~1.13x: GPU-bound, graphs help less

# startup, graphs ON:
INFO ... Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|####| 23/23
INFO ... Capturing CUDA graphs (decode, FULL): 100%|####| 23/23
INFO ... Graph capturing finished in 7 secs, took 0.41 GiB
# startup, enforce_eager=True: (no capture step)

Reading the numbers like an engineer

  • 2.5× at batch 1: the launch-bound regime. An OPT-125m decode step is sub-millisecond GPU work behind a few hundred kernel launches; remove the launches (lab-01's WIN) and the step nearly collapses to its GPU time. This is the headline, and it's workload-specific: agentic, single-stream, small-model serving lives here.
  • The fade, 2.5× → 1.6× → 1.13× — Amdahl in motion. Bigger batches mean more GPU work per (unchanged) launch bill; the removable fraction shrinks. Note what doesn't happen: the ratio never dips below 1. Graphs don't have a regime where they hurt steady-state throughput — the cost lives entirely at startup. That asymmetry is why they're default-on rather than a tuning option.
  • 23/23 twice — the capture-size ladder (lab-05: count default_capture_sizes(512)) run once per routine of FULL_AND_PIECEWISE (lab-03's table, bottom row, live). If you ever need a one-glance config check on a running deployment, this pair of progress bars is it.
  • 7 secs, 0.41 GiB — lab-02's capture_cost, in physical units: 46 captures' worth of warmup+record, and the shared graph memory pool. Amortized over millions of steps; but on a CI box that boots vLLM per test, 7 seconds × every test is real money — which is why enforce_eager=True is the standard test-suite setting upstream while being wrong for production. Same knob, opposite verdicts, both derivable from lab-02.
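
The "Amdahl in motion" reading can be made concrete with a back-of-envelope calculation on the captured throughputs. This is a sketch under stated assumptions, not lab-02's formula: it treats a speedup S as removing a fraction 1 - 1/S of eager wall time, and it estimates time per step by assuming every engine step emits one token per sequence (prefill is ignored). The function names are invented for this example.

```python
# Throughput (tok/s) from the captured run: batch -> (eager, graphs).
CAPTURED = {
    1: (980.3, 2473.6),
    8: (6912.4, 11034.8),
    64: (41560.2, 46883.1),
}

def removable_fraction(eager_tps, graphs_tps):
    """Share of eager wall time that went away: f = 1 - 1/S, with S = graphs/eager."""
    return 1.0 - eager_tps / graphs_tps

def step_ms(tok_per_s, batch):
    """Approximate time per engine step in milliseconds.

    Assumption: every step emits one token per sequence; prefill is ignored.
    """
    return 1000.0 * batch / tok_per_s

for batch, (eager, graphs) in CAPTURED.items():
    f = removable_fraction(eager, graphs)
    print(f"batch={batch:3d}: eager {step_ms(eager, batch):.2f} ms/step, "
          f"graphs {step_ms(graphs, batch):.2f} ms/step, removed {f:.0%}")
```

On these numbers the removed share falls from roughly 60% at batch 1 to roughly 11% at batch 64. Because enforce_eager also turns compilation off, that share covers launches plus whatever the fused kernels save, not launches alone.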

Hitchhiker's notes

  • Extrapolating to big models: a 70B's decode step is tens of ms of GPU work — the launch bill is a far smaller fraction, so expect graph gains in single-digit percents at moderate batch, not 2.5×. Graphs matter most for small models, small batches, long generations — which, conveniently, describes draft models in speculative decoding (Phase 8), where graphs are practically mandatory.
  • enforce_eager=True disables compilation too, so this A/B bundles two effects (fused kernels + graphs). For the isolated graph effect, compare cudagraph_mode=NONE with compilation on vs the default. The bundle is what operators actually toggle, hence the lab measures the bundle — but know what's in the box before attributing the delta.
  • Variance discipline: run.py starts its timer after the engine is built, so capture time is excluded, but tok/s from a single generate call still includes first-call warmup, the prefill pass, and timer jitter. The captured numbers are representative, not sacred: your L4 will differ by a few percent, your 4090 by more. What must reproduce is the shape: big ratio at 1, monotone fade, no inversion. If your shape differs, that's interesting; investigate (background processes, thermal throttling, a different default mode).
  • When is enforce_eager right in production? Debugging (eager stack traces point at real lines; graph replays don't), extreme memory pressure (reclaim the graph pool's GiB), or genuinely chaotic shapes beyond the ladder. Rare — but "what's the escape hatch and what does it cost" is exactly the question this lab leaves you able to answer with numbers.
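
The extrapolation note above can be turned into a two-parameter sketch. It assumes an eager step costs GPU time plus a launch bill, and that a graph replay removes the launch bill at negligible cost. Both inputs are things you would measure on the target system; the values used below are placeholders, not measurements of any 70B deployment. The function name predicted_speedup is hypothetical.

```python
def predicted_speedup(gpu_ms, launch_ms, replay_ms=0.0):
    """Eager step time over graph-replay step time under a simple additive model.

    gpu_ms:    GPU work per step (assumed identical in both arms)
    launch_ms: CPU launch overhead per eager step that graphs remove
    replay_ms: residual per-step overhead of replaying the graph (assumed ~0)
    """
    if gpu_ms <= 0:
        raise ValueError("gpu_ms must be positive")
    return (gpu_ms + launch_ms) / (gpu_ms + replay_ms)

# Placeholder inputs: a launch bill of about 0.6 ms, set against a
# sub-millisecond step and against a step with tens of ms of GPU work.
small = predicted_speedup(gpu_ms=0.4, launch_ms=0.6)
large = predicted_speedup(gpu_ms=20.0, launch_ms=0.6)
print(f"small step: {small:.2f}x, large step: {large:.2f}x")
```

With these placeholders the small step comes out at 2.5× and the large step at about 1.03×, the same shape as the note's claim. A deeper model also issues more launches per step, so launch_ms is not a constant you can carry over from OPT-125m; measure it rather than assume it.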

Reflect

  • Predict before measuring: on your hardware, will batch-8 land closer to the batch-1 or batch-64 ratio? Which parameter of lab-02's model are you implicitly estimating? (The GPU-work-per-step share — i.e. where batch 8 sits relative to the roofline ridge from Phase 0 lab-04.)
  • The capture log shows 0.41 GiB for 46 graphs of a 125m model. Sketch why a 70B model with tensor parallelism captures in a similar order of memory (graphs store launch topology + workspace, not weights) — and why people are still surprised by the pool's size on memory-tight deployments.
  • Your service restarts pods on every deploy, 50× a day. Quantify the capture tax and name two mitigations. (7 s × 50 = ~6 min/day of cold capacity; mitigate via fewer capture sizes — lab-05's ladder — or vLLM's compilation cache for the compile half; the capture half re-runs regardless.)
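
The capture-tax question is plain arithmetic on the numbers in the startup log. The sketch below uses the logged 7 seconds, 0.41 GiB, and two passes of 23 captures; the function name is invented for this example. The per-graph figures are averages only, since the memory pool is shared and the log rounds to whole seconds.

```python
CAPTURE_S = 7.0      # "Graph capturing finished in 7 secs"
POOL_GIB = 0.41      # "took 0.41 GiB"
N_GRAPHS = 23 * 2    # two capture passes of 23 sizes each

def daily_capture_tax_min(capture_s, restarts_per_day):
    """Minutes per day spent capturing graphs instead of serving."""
    return capture_s * restarts_per_day / 60.0

tax_min = daily_capture_tax_min(CAPTURE_S, restarts_per_day=50)
per_graph_s = CAPTURE_S / N_GRAPHS
per_graph_mib = POOL_GIB * 1024 / N_GRAPHS

print(f"capture tax: {tax_min:.1f} min/day")              # 5.8 min/day
print(f"average per capture: {per_graph_s:.2f} s, "
      f"{per_graph_mib:.1f} MiB")                         # 0.15 s, 9.1 MiB
```

Halving the number of capture sizes would roughly halve the first figure under this averaging, which is the first mitigation named in the answer above.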

References

  • Labs 01–02 — the mechanism and the formulas these numbers validate.
  • Lab-03 — why the capture log has exactly two passes; lab-05 — why each pass has 23.
  • upstream/vllm/v1/worker/gpu_model_runner.py — the capture loop emitting those progress bars.
  • vLLM docs, Optimization and Tuning (enforce_eager, cudagraph_mode, compilation knobs): https://docs.vllm.ai/en/latest/configuration/optimization.html
  • Phase 18 — the benchmarking discipline this lab previews (variance, baselines, sweeps).