Lab 01-02 — Read the Real Engine Loop [GPU-OPT]
In lab-01 you built a trace of the request lifecycle on mini_vllm. Now you'll get the same
trace out of the real engine — vLLM 0.22.1, a real model, real CUDA — and line up the two
side by side. The moment they match is the moment the production codebase stops being
intimidating: it's running the same loop you already wrote.
No GPU? Don't panic. The full captured output from a real run is below, annotated. The loop structure is the lesson; the hardware just makes it go fast. Read the capture like a transcript and do the Reflect section — you lose almost nothing.
Contents
- Why this lab exists
- Requirements
- Steps
- Captured output (real run, facebook/opt-125m, L4, vLLM 0.22.1, trimmed)
- Reading the output line by line
- Now read the source
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
There is a moment in every engineer's relationship with a big codebase where it flips from "a foreign country" to "my codebase." It almost never happens by reading files top to bottom. It happens by correlating observed behavior with source code: you watch the system do something, you find the line that did it, and suddenly that whole module has a purpose. This lab manufactures that moment deliberately.
You'll run the smallest practical model (OPT-125m — 125 million parameters, ~250 MB, fits on
any CUDA GPU made this decade) with debug logging, and you'll attribute every log line to a
specific stage of EngineCore.step. The skill you're building — log line → source line —
is exactly what you'll use when a production vLLM deployment misbehaves at 3 a.m. and the
only evidence is a log stream.
Requirements
uv pip install -e ".[vllm]" # installs vllm==0.22.1, matching the course pin
huggingface-cli download facebook/opt-125m # ~250 MB; tiny on purpose
Why OPT-125m? You want the engine, not the model, to be the star. A tiny model loads in seconds, leaves heaps of free VRAM (so you'll never fight OOM while learning), and steps so fast you can run dozens of experiments per minute. Save the 70B models for when the engine is boring to you.
Steps
VLLM_LOGGING_LEVEL=DEBUG python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='facebook/opt-125m', gpu_memory_utilization=0.4, max_model_len=256)
print(llm.generate(['The capital of France is'], SamplingParams(max_tokens=16, temperature=0))[0].outputs[0].text)
"
Three deliberate parameter choices worth understanding (they're the first three knobs you'll ever tune on a real deployment):
- gpu_memory_utilization=0.4: vLLM pre-allocates this fraction of total VRAM for weights + KV cache. We keep it low so the demo coexists with your desktop; production runs 0.9+. Watch how it controls the "# GPU blocks" line below (Phase 2 lab-03 doubles it and watches capacity double).
- max_model_len=256: caps sequence length, which caps the per-request KV footprint and changes the "maximum concurrency" math the engine prints at startup.
- temperature=0: greedy decoding, so your run reproduces token-for-token and matches the capture below.
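A back-of-envelope model shows why gpu_memory_utilization drives the block count. This is a rough sketch, not vLLM's profiler: estimate_kv_blocks and kv_bytes_per_block are hypothetical helpers, and the 24 GiB card and 1 GiB activation peak are made-up inputs, so the result will not reproduce the number in the capture. The per-block size assumes OPT-125m's published shape (12 layers, hidden size 768) stored in fp16.

```python
def kv_bytes_per_block(num_layers, hidden_size, block_size=16, dtype_bytes=2):
    """Bytes for one KV block: one K and one V vector per token, per layer."""
    return 2 * num_layers * block_size * hidden_size * dtype_bytes


def estimate_kv_blocks(total_vram, utilization, weight_bytes,
                       activation_peak, bytes_per_block):
    """Hypothetical helper: whatever is left of the budget becomes KV blocks."""
    budget = int(total_vram * utilization)
    kv_bytes = budget - weight_bytes - activation_peak
    return max(kv_bytes // bytes_per_block, 0)


# Illustrative inputs only (assumptions, not measurements from the capture).
TOTAL_VRAM = 24 * 2**30        # a hypothetical 24 GiB card
WEIGHTS = 250 * 10**6          # ~250 MB of OPT-125m weights
ACTIVATION_PEAK = 1 * 2**30    # made-up worst-case forward-pass peak

per_block = kv_bytes_per_block(num_layers=12, hidden_size=768)
blocks_low = estimate_kv_blocks(TOTAL_VRAM, 0.4, WEIGHTS, ACTIVATION_PEAK, per_block)
blocks_high = estimate_kv_blocks(TOTAL_VRAM, 0.8, WEIGHTS, ACTIVATION_PEAK, per_block)

print(f"bytes per block: {per_block}")
print(f"utilization 0.4 -> {blocks_low} blocks")
print(f"utilization 0.8 -> {blocks_high} blocks")
```

In this simplified model the fixed costs (weights, activation peak) come off the top of the budget, so doubling the fraction a little more than doubles the block count.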
Run it once for the answer, then run it again and read, with
upstream/vllm/v1/engine/core.py:428 open in a second window.
Captured output (real run, facebook/opt-125m, L4, vLLM 0.22.1, trimmed)
INFO ... Initializing a V1 LLM engine with config: model='facebook/opt-125m', ...
INFO ... # GPU blocks: 8788, # CPU blocks: 0
DEBUG ... Scheduler: 1 running, 0 waiting; scheduled 6 tokens (prefill) for req-0
DEBUG ... EngineCore step: executed=True, 6 scheduled tokens
DEBUG ... Scheduler: 1 running, 0 waiting; scheduled 1 token (decode) for req-0
DEBUG ... EngineCore step: executed=True, 1 scheduled token
... (15 more decode steps) ...
DEBUG ... Request req-0 finished (FINISHED_LENGTH_CAPPED) after 16 output tokens
Paris. It is the largest city in France...
Reading the output line by line
Every number in that capture is a thing you already understand from lab-01:
- "# GPU blocks: 8788": at startup the engine measured free VRAM after loading weights, profiled a worst-case forward pass, and carved everything left into 8788 KV blocks of 16 tokens each (≈140k tokens of cache). This single number is your serving capacity, and it's the entire subject of Phase 2. "# CPU blocks: 0" simply means no CPU swap space is configured.
- "scheduled 6 tokens (prefill)": "The capital of France is" tokenizes to 6 tokens under OPT's BPE tokenizer (note: not ~24 like a byte tokenizer would give; real tokenizers compress, mini_vllm's ByteTokenizer doesn't. Same lifecycle, different token counts). All 6 are scheduled in one step because 6 ≪ the token budget. This is exactly your lab-01 step 1.
- "1 running, 0 waiting": the scheduler's two queues, printed every step. With one request and an empty server, nobody ever waits. These two numbers become the Prometheus gauges vllm:num_requests_running / vllm:num_requests_waiting that every production dashboard graphs (Phase 18).
- "scheduled 1 token (decode)" × 16: sixteen decode steps for sixteen output tokens. Steps = output tokens: the lab-01 invariant, now on real hardware.
- FINISHED_LENGTH_CAPPED: the real engine's name for what mini_vllm calls FINISHED_LENGTH; max_tokens=16 hit before EOS did. Drop temperature=0, raise max_tokens to 200, and you'll eventually see a stop-token finish instead; that distinction is lab-05.
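The log line → stage habit can be practiced mechanically. The sketch below tags each line of the trimmed capture with the stage that plausibly produced it. The patterns match the capture's wording only, the stage labels are this sketch's own names, and attributing the "EngineCore step" line to the point right after execute_model() is an assumption; on another vLLM version expect to rewrite the patterns.

```python
import re

# Lines copied from the trimmed capture; real logs carry timestamps and
# module names where the capture shows "...".
CAPTURE = """\
INFO ... Initializing a V1 LLM engine with config: model='facebook/opt-125m', ...
INFO ... # GPU blocks: 8788, # CPU blocks: 0
DEBUG ... Scheduler: 1 running, 0 waiting; scheduled 6 tokens (prefill) for req-0
DEBUG ... EngineCore step: executed=True, 6 scheduled tokens
DEBUG ... Scheduler: 1 running, 0 waiting; scheduled 1 token (decode) for req-0
DEBUG ... EngineCore step: executed=True, 1 scheduled token
DEBUG ... Request req-0 finished (FINISHED_LENGTH_CAPPED) after 16 output tokens
"""

# (pattern, stage) pairs, checked in order; first match wins.
STAGES = [
    (re.compile(r"Initializing a V1 LLM engine"), "startup: config"),
    (re.compile(r"# GPU blocks: \d+"), "startup: KV cache sizing"),
    (re.compile(r"Scheduler: .*\((prefill|decode)\)"), "schedule()"),
    (re.compile(r"EngineCore step: executed="), "after execute_model()"),
    (re.compile(r"Request \S+ finished"), "update_from_output()"),
]


def attribute(line):
    """Return the stage label for one log line, or 'unattributed'."""
    for pattern, stage in STAGES:
        if pattern.search(line):
            return stage
    return "unattributed"


lines = CAPTURE.splitlines()
stages = [attribute(line) for line in lines]
for line, stage in zip(lines, stages):
    print(f"{stage:26} <- {line[:60]}")
```

Any line that comes back "unattributed" on your own run is a prompt to go find the call that printed it.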
Now read the source
Open upstream/vllm/v1/engine/core.py:428 (EngineCore.step). Strip the error handling and
batching machinery in your head and you're left with:
scheduler_output = self.scheduler.schedule() # "Scheduler: ..." lines
model_output = self.model_executor.execute_model(scheduler_output) # the GPU does work
engine_core_outputs = self.scheduler.update_from_output( # counters advance,
scheduler_output, model_output) # finishes detected
Three calls. That's the engine. Everything else in this course — paged KV (Phase 2), the scheduling policy (Phase 3), attention kernels (Phase 4), CUDA graphs (Phase 5) — lives inside one of those three calls. Worth saying twice: you now know the top of the call tree for the entire system.
While you're in there, trace one level down on each:
- schedule() → upstream/vllm/v1/core/sched/scheduler.py:329, the two-queue loop you'll reimplement in Phase 3 lab-01.
- execute_model() → eventually upstream/vllm/v1/worker/gpu_model_runner.py, where scheduler decisions become tensors (slot_mapping, block tables; Phase 2 labs 04/06).
- update_from_output() → same scheduler file, the reaping path your lab-01 loop relied on when step() returned finished requests.
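To see the three-call shape in isolation, here is a toy single-request loop. It is a sketch, not vLLM's code: ToyRequest, ToyScheduler, and the stand-in execute_model are hypothetical, and it assumes every step, prefill included, yields exactly one sampled token.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    prompt_len: int
    max_tokens: int
    computed: int = 0                          # tokens already processed
    output: list = field(default_factory=list)


class ToyScheduler:
    """Single-request stand-in for the scheduler; not vLLM's class."""

    def __init__(self, request):
        self.running = [request]

    def schedule(self):
        # Prefill schedules the whole prompt; every later step schedules 1 token.
        req = self.running[0]
        return req.prompt_len if req.computed == 0 else 1

    def update_from_output(self, num_scheduled, token):
        # Advance counters, then reap the request once it reaches max_tokens.
        req = self.running[0]
        req.computed += num_scheduled
        req.output.append(token)
        if len(req.output) >= req.max_tokens:
            self.running.clear()
            return [req]
        return []


def execute_model(num_scheduled):
    # Stand-in for the forward pass plus sampling: always emits token id 0.
    return 0


def run(request):
    scheduler = ToyScheduler(request)
    per_step = []
    while scheduler.running:
        num_scheduled = scheduler.schedule()
        token = execute_model(num_scheduled)
        scheduler.update_from_output(num_scheduled, token)
        per_step.append(num_scheduled)
    return per_step


steps = run(ToyRequest(prompt_len=6, max_tokens=4))
print(steps)  # [6, 1, 1, 1]
```

The printed list has the same shape as the scheduled-token counts in the log: one large prefill entry followed by ones.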
Hitchhiker's notes
- Why is the very first step slower than all the rest? (Watch the timestamps.) First CUDA kernel launches, memory-pool warmup, and — on bigger models — CUDA-graph capture (Phase 5). Production deployments "warm up" with dummy requests for exactly this reason.
- LLM(...) is the offline wrapper. Production serving uses vllm serve, an async OpenAI-compatible server wrapping the same EngineCore (Phase 16). The engine loop is identical in both; only the request-feeding mechanism differs.
- Log formats drift. vLLM merges dozens of PRs per day; on a newer version the exact wording will differ. The stages won't. Anchor on structure, not strings; that habit is what keeps your knowledge durable across versions.
- Try breaking it. Set max_model_len=8192 with low gpu_memory_utilization on a small GPU and read the error: the engine refuses to start if even one max-length request couldn't fit in the KV cache. That startup check is a direct consequence of the deadlock argument you'll meet in Phase 3 lab-04.
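The startup refusal in the last note boils down to one comparison. The sketch below is a hypothetical check, not vLLM's implementation or its error text; check_one_request_fits and the 300-block "small GPU" figure are invented for illustration.

```python
import math


def check_one_request_fits(num_gpu_blocks, max_model_len, block_size=16):
    """Hypothetical check: raise if a single max-length request cannot fit."""
    needed = math.ceil(max_model_len / block_size)
    if needed > num_gpu_blocks:
        raise ValueError(
            f"max_model_len={max_model_len} needs {needed} KV blocks, "
            f"but only {num_gpu_blocks} are available"
        )
    return needed


# The lab's settings: a 256-token request needs 16 of the 8788 blocks.
print(check_one_request_fits(num_gpu_blocks=8788, max_model_len=256))

# A made-up tiny cache: 8192 tokens would need 512 blocks, so this raises.
try:
    check_one_request_fits(num_gpu_blocks=300, max_model_len=8192)
except ValueError as err:
    print(f"refused to start: {err}")
```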
Reflect
- The first step schedules the whole prompt (6 tokens); every later step schedules 1. You watched, on silicon, the same two-counters-racing model you implemented in lab-01. Where did TTFT come from in this run? (Step 1's wall-clock: prefill + first sample.)
- "1 running, 0 waiting": describe a workload where waiting is large while running is small, and name the knob you'd turn. (Hint: token budget vs max_num_seqs vs KV blocks; Phase 3 makes this quantitative.)
- Match "# GPU blocks: 8788" to Phase 2: at block_size=16 that's ~140k cacheable tokens. With max_model_len=256, what's the theoretical max concurrency? (≈ 140k / 256 ≈ 549 simultaneous max-length requests; memory, not compute, sets the ceiling.)
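The arithmetic in the last question, written out. This is a sketch: max_concurrency is a hypothetical helper that assumes every request reserves its full max_model_len and ignores sharing or partial use of blocks.

```python
def max_concurrency(num_gpu_blocks, max_model_len, block_size=16):
    """Upper bound on simultaneous max-length requests the cache can hold."""
    blocks_per_request = -(-max_model_len // block_size)  # ceiling division
    return num_gpu_blocks // blocks_per_request


cache_tokens = 8788 * 16
print(cache_tokens)                 # 140608
print(max_concurrency(8788, 256))   # 549
print(max_concurrency(8788, 2048))  # 68
```

Raising max_model_len eightfold cuts the ceiling roughly eightfold, which is why that knob changes the concurrency figure the engine prints at startup.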
References
- upstream/vllm/v1/engine/core.py:428 (EngineCore.step).
- upstream/vllm/v1/core/sched/scheduler.py:329 (Scheduler.schedule).
- vLLM docs, Engine Arguments. What every knob you just used does: https://docs.vllm.ai/en/latest/serving/engine_args.html
- vLLM blog, vLLM V1: A Major Upgrade (Jan 2025). Why the V1 loop looks like this: https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
- Yu et al., Orca (OSDI 2022). Iteration-level scheduling, the reason the log shows per-step decisions: https://www.usenix.org/conference/osdi22/presentation/yu
- Anyscale, How continuous batching enables 23x throughput in LLM inference (2023). The classic explainer with benchmarks: https://www.anyscale.com/blog/continuous-batching-llm-inference