Lab 01-02 — Read the Real Engine Loop [GPU-OPT]

In lab-01 you built a trace of the request lifecycle on mini_vllm. Now you'll get the same trace out of the real engine — vLLM 0.22.1, a real model, real CUDA — and line up the two side by side. The moment they match is the moment the production codebase stops being intimidating: it's running the same loop you already wrote.

No GPU? Don't panic. The full captured output from a real run is below, annotated. The loop structure is the lesson; the hardware just makes it go fast. Read the capture like a transcript and do the Reflect section — you lose almost nothing.

Why this lab exists

There is a moment in every engineer's relationship with a big codebase where it flips from "a foreign country" to "my codebase." It almost never happens by reading files top to bottom. It happens by correlating observed behavior with source code: you watch the system do something, you find the line that did it, and suddenly that whole module has a purpose. This lab manufactures that moment deliberately.

You'll run the smallest practical model (OPT-125m — 125 million parameters, ~250 MB, fits on any CUDA GPU made this decade) with debug logging, and you'll attribute every log line to a specific stage of EngineCore.step. The skill you're building — log line → source line — is exactly what you'll use when a production vLLM deployment misbehaves at 3 a.m. and the only evidence is a log stream.

Requirements

uv pip install -e ".[vllm]"                # installs vllm==0.22.1, matching the course pin
huggingface-cli download facebook/opt-125m # ~250 MB; tiny on purpose

Why OPT-125m? You want the engine, not the model, to be the star. A tiny model loads in seconds, leaves heaps of free VRAM (so you'll never fight OOM while learning), and steps so fast you can run dozens of experiments per minute. Save the 70B models for when the engine is boring to you.

Steps

VLLM_LOGGING_LEVEL=DEBUG python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='facebook/opt-125m', gpu_memory_utilization=0.4, max_model_len=256)
print(llm.generate(['The capital of France is'], SamplingParams(max_tokens=16, temperature=0))[0].outputs[0].text)
"

Three deliberate parameter choices worth understanding (they're the first three knobs you'll ever tune on a real deployment):

  • gpu_memory_utilization=0.4: vLLM pre-allocates this fraction of total VRAM for weights + KV cache. We keep it low so the demo coexists with your desktop; production runs 0.9+. Watch how it controls the # GPU blocks line below (Phase 2 lab-03 doubles it and watches capacity double).
  • max_model_len=256 — caps sequence length, which caps the per-request KV footprint and changes the "maximum concurrency" math the engine prints at startup.
  • temperature=0 — greedy decoding, so your run reproduces token-for-token and matches the capture below.
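
The memory split behind the first two knobs can be sketched as plain arithmetic. This is a back-of-the-envelope model, not vLLM's actual profiling code: the function names, the 24 GiB card, and the 1 GiB activation peak are illustrative assumptions, and the OPT-125m shape (12 layers, hidden size 768, fp16) is taken from the model's published config rather than from this lab.

```python
def kv_bytes_per_token(num_layers: int, hidden_size: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies: a K and a V vector per layer."""
    return 2 * num_layers * hidden_size * dtype_bytes


def num_kv_blocks(
    total_vram_bytes: int,
    utilization: float,
    weights_bytes: int,
    activation_peak_bytes: int,
    block_size: int,
    bytes_per_token: int,
) -> int:
    """Hypothetical helper: carve what is left of the budget into KV blocks."""
    budget = int(total_vram_bytes * utilization)  # what the engine may claim
    kv_budget = budget - weights_bytes - activation_peak_bytes
    if kv_budget <= 0:
        return 0  # nothing left for the cache
    return kv_budget // (block_size * bytes_per_token)


# Assumed numbers for illustration only (not measured on the captured run).
GIB = 2**30
per_token = kv_bytes_per_token(num_layers=12, hidden_size=768, dtype_bytes=2)
for util in (0.4, 0.8):
    blocks = num_kv_blocks(
        total_vram_bytes=24 * GIB,
        utilization=util,
        weights_bytes=250 * 2**20,
        activation_peak_bytes=1 * GIB,
        block_size=16,
        bytes_per_token=per_token,
    )
    print(f"utilization={util}: {blocks} blocks")
```

The point of the sketch is the shape of the calculation: weights and activations are fixed costs, and whatever remains of the budget becomes blocks. Real block counts depend on what the engine measures at startup, so don't expect these numbers to match your log.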

Run it once for the answer, then run it again and read the log, with upstream/vllm/v1/engine/core.py:428 open in a second window.

Captured output (real run, facebook/opt-125m, L4, vLLM 0.22.1, trimmed)

INFO  ... Initializing a V1 LLM engine with config: model='facebook/opt-125m', ...
INFO  ... # GPU blocks: 8788, # CPU blocks: 0
DEBUG ... Scheduler: 1 running, 0 waiting; scheduled 6 tokens (prefill) for req-0
DEBUG ... EngineCore step: executed=True, 6 scheduled tokens
DEBUG ... Scheduler: 1 running, 0 waiting; scheduled 1 token (decode) for req-0
DEBUG ... EngineCore step: executed=True, 1 scheduled token
... (15 more decode steps) ...
DEBUG ... Request req-0 finished (FINISHED_LENGTH_CAPPED) after 16 output tokens
 Paris. It is the largest city in France...

Reading the output line by line

Every number in that capture is a thing you already understand from lab-01:

  • # GPU blocks: 8788 — at startup the engine measured free VRAM after loading weights, profiled a worst-case forward pass, and carved everything left into 8788 KV blocks of 16 tokens each (≈140k tokens of cache). This single number is your serving capacity, and it's the entire subject of Phase 2. # CPU blocks: 0 simply means no CPU swap space is configured.
  • scheduled 6 tokens (prefill) — "The capital of France is" tokenizes to 6 tokens under OPT's BPE tokenizer (note: not ~24 like a byte tokenizer would give — real tokenizers compress; mini_vllm's ByteTokenizer doesn't. Same lifecycle, different token counts). All 6 are scheduled in one step because 6 ≪ the token budget. This is exactly your lab-01 step 1.
  • 1 running, 0 waiting — the scheduler's two queues, printed every step. With one request and an empty server, nobody ever waits. These two numbers become the Prometheus gauges vllm:num_requests_running / vllm:num_requests_waiting that every production dashboard graphs (Phase 18).
  • scheduled 1 token (decode) × 16 — sixteen decode steps for sixteen output tokens. Steps = output tokens: the lab-01 invariant, now on real hardware.
  • FINISHED_LENGTH_CAPPED — the real engine's name for what mini_vllm calls FINISHED_LENGTH: max_tokens=16 hit before EOS did. Drop temperature=0, raise max_tokens to 200, and you'll eventually see a stop-token finish instead — that distinction is lab-05.
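
The "log line → source line" habit can be practiced mechanically. Below is a minimal sketch that tags each line of a capture with a stage. The regexes assume the exact wording of the trimmed capture in this lab, which will drift across versions, and the stage labels ("startup", "schedule", "step", "update") are this sketch's own names, not vLLM identifiers.

```python
import re

# A few lines copied from the trimmed capture; wording is version-specific.
CAPTURE = """\
INFO  ... # GPU blocks: 8788, # CPU blocks: 0
DEBUG ... Scheduler: 1 running, 0 waiting; scheduled 6 tokens (prefill) for req-0
DEBUG ... EngineCore step: executed=True, 6 scheduled tokens
DEBUG ... Scheduler: 1 running, 0 waiting; scheduled 1 token (decode) for req-0
DEBUG ... EngineCore step: executed=True, 1 scheduled token
DEBUG ... Request req-0 finished (FINISHED_LENGTH_CAPPED) after 16 output tokens
"""

# (stage label, pattern) pairs, checked in order. Labels are hypothetical.
STAGE_PATTERNS = [
    ("startup", re.compile(r"# GPU blocks: (\d+)")),
    ("schedule", re.compile(
        r"Scheduler: (\d+) running, (\d+) waiting; "
        r"scheduled (\d+) tokens? \((prefill|decode)\)")),
    ("step", re.compile(r"EngineCore step: executed=(True|False)")),
    ("update", re.compile(r"Request (\S+) finished \((\w+)\)")),
]


def classify(line: str) -> tuple[str, tuple[str, ...]]:
    """Return (stage, captured fields) for one log line, or ("other", ())."""
    for stage, pattern in STAGE_PATTERNS:
        match = pattern.search(line)
        if match:
            return stage, match.groups()
    return "other", ()


for line in CAPTURE.splitlines():
    stage, fields = classify(line)
    print(f"{stage:<9} {fields}")
```

If a newer vLLM changes the wording, the patterns need updating but the four stages should still be there to find.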

Now read the source

Open upstream/vllm/v1/engine/core.py:428 (EngineCore.step). Strip the error handling and batching machinery in your head and you're left with:

scheduler_output = self.scheduler.schedule()                        # "Scheduler: ..." lines
model_output = self.model_executor.execute_model(scheduler_output)  # the GPU does work
engine_core_outputs = self.scheduler.update_from_output(            # counters advance,
    scheduler_output, model_output)                                 # finishes detected

Three calls. That's the engine. Everything else in this course — paged KV (Phase 2), the scheduling policy (Phase 3), attention kernels (Phase 4), CUDA graphs (Phase 5) — lives inside one of those three calls. Worth saying twice: you now know the top of the call tree for the entire system.
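
To see the three calls cooperate without a GPU, here is a toy loop in the same shape. Everything in it (ToyRequest, ToyScheduler, toy_execute, the placeholder token id) is invented for illustration and is far simpler than mini_vllm or the real scheduler; only the schedule → execute → update ordering is taken from the source above.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRequest:
    prompt_len: int
    max_tokens: int
    num_computed: int = 0                 # tokens already run through the model
    outputs: list = field(default_factory=list)
    finished: bool = False


class ToyScheduler:
    def __init__(self, token_budget: int = 2048):
        self.token_budget = token_budget
        self.running: list[ToyRequest] = []

    def schedule(self) -> dict[int, int]:
        """Map running-list index -> number of tokens to process this step."""
        plan, budget = {}, self.token_budget
        for i, req in enumerate(self.running):
            pending = req.prompt_len + len(req.outputs) - req.num_computed
            n = min(pending, budget)
            if n > 0:
                plan[i] = n
                budget -= n
        return plan

    def update_from_output(self, plan, sampled) -> list[ToyRequest]:
        """Advance counters, append sampled tokens, reap finished requests."""
        done = []
        for i, n in plan.items():
            req = self.running[i]
            req.num_computed += n
            if i in sampled:
                req.outputs.append(sampled[i])
                if len(req.outputs) >= req.max_tokens:
                    req.finished = True
                    done.append(req)
        self.running = [r for r in self.running if not r.finished]
        return done


def toy_execute(running, plan) -> dict[int, int]:
    """Stand-in for the GPU: sample only once a request is fully caught up."""
    sampled = {}
    for i, n in plan.items():
        req = running[i]
        if req.num_computed + n == req.prompt_len + len(req.outputs):
            sampled[i] = 0  # placeholder token id
    return sampled


def run(scheduler: ToyScheduler) -> list[int]:
    """Drive the three-call loop; return tokens scheduled at each step."""
    trace = []
    while scheduler.running:
        plan = scheduler.schedule()
        if not plan:
            break  # nothing schedulable; avoid spinning forever
        sampled = toy_execute(scheduler.running, plan)
        scheduler.update_from_output(plan, sampled)
        trace.append(sum(plan.values()))
    return trace


sched = ToyScheduler()
sched.running.append(ToyRequest(prompt_len=6, max_tokens=4))
print(run(sched))
```

In this toy the prefill step also samples the first output token, so the number of steps equals the number of output tokens.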

While you're in there, trace one level down on each:

  • schedule() → upstream/vllm/v1/core/sched/scheduler.py:329, the two-queue loop you'll reimplement in Phase 3 lab-01.
  • execute_model() → eventually upstream/vllm/v1/worker/gpu_model_runner.py — where scheduler decisions become tensors (slot_mapping, block tables — Phase 2 labs 04/06).
  • update_from_output() → same scheduler file — the reaping path your lab-01 loop relied on when step() returned finished requests.

Hitchhiker's notes

  • Why is the very first step slower than all the rest? (Watch the timestamps.) First CUDA kernel launches, memory-pool warmup, and — on bigger models — CUDA-graph capture (Phase 5). Production deployments "warm up" with dummy requests for exactly this reason.
  • LLM(...) is the offline wrapper. Production serving uses vllm serve — an async OpenAI-compatible server wrapping the same EngineCore (Phase 16). The engine loop is identical in both; only the request-feeding mechanism differs.
  • Log formats drift. vLLM merges dozens of PRs per day; on a newer version the exact wording will differ. The stages won't. Anchor on structure, not strings — that habit is what keeps your knowledge durable across versions.
  • Try breaking it. Set max_model_len=8192 with low gpu_memory_utilization on a small GPU and read the error: the engine refuses to start if even one max-length request couldn't fit in the KV cache. That startup check is a direct consequence of the deadlock argument you'll meet in Phase 3 lab-04.

Reflect

  • The first step schedules the whole prompt (6 tokens); every later step schedules 1. You watched, on silicon, the same two-counters-racing model you implemented in lab-01. Where did TTFT come from in this run? (Step 1's wall-clock: prefill + first sample.)
  • "1 running, 0 waiting" — describe a workload where waiting is large while running is small, and name the knob you'd turn. (Hint: token budget vs max_num_seqs vs KV blocks — Phase 3 makes this quantitative.)
  • Match # GPU blocks: 8788 to Phase 2: at block_size=16 that's ~140k cacheable tokens. With max_model_len=256, what's the theoretical max concurrency? (≈ 140k / 256 ≈ 549 simultaneous max-length requests — memory, not compute, sets the ceiling.)
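
The capacity arithmetic in the last question is worth running once yourself. A minimal sketch using only the numbers from the capture; the function names are made up here, and the real engine's startup message may account for details this ignores.

```python
import math


def cacheable_tokens(num_blocks: int, block_size: int) -> int:
    """Total tokens the KV cache can hold."""
    return num_blocks * block_size


def max_concurrency(num_blocks: int, block_size: int, max_model_len: int) -> int:
    """How many max-length requests fit at once, counting whole blocks."""
    blocks_per_request = math.ceil(max_model_len / block_size)
    return num_blocks // blocks_per_request


NUM_GPU_BLOCKS = 8788   # from the "# GPU blocks" log line
BLOCK_SIZE = 16
MAX_MODEL_LEN = 256

print(cacheable_tokens(NUM_GPU_BLOCKS, BLOCK_SIZE))                # 140608
print(max_concurrency(NUM_GPU_BLOCKS, BLOCK_SIZE, MAX_MODEL_LEN))  # 549
```

Try changing MAX_MODEL_LEN to 8192 and see how quickly the ceiling drops: that is the same trade-off behind the "Try breaking it" note.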
