Phase 01 Labs — Architecture & Request Lifecycle

Five labs that turn the engine from a black box into your box. The arc: observe the lifecycle (lab-01), verify it on real hardware (lab-02), rebuild the loop yourself (lab-03), watch many requests share it (lab-04), and master how requests end (lab-05). Do them in order — each one's vocabulary is the next one's prerequisite.

Every [CPU-OK] lab follows the same contract: starter.py with TODOs (your work), solution.py (the reference), test_lab.py (the spec, executable). The default test run uses solution.py so the suite is always green; set LAB_IMPL=starter to grade yourself.

```shell
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-01-architecture-and-request-lifecycle/labs -m "not gpu"

# Grade your own work on one lab:
LAB_IMPL=starter pytest phase-01-architecture-and-request-lifecycle/labs/lab-01-trace-a-request -q
```
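
One way a test file could honor the LAB_IMPL switch is sketched below. This is an assumption about the mechanism, not the course's actual test_lab.py: the helper name `impl_module_name` and the validation step are hypothetical, and only the variable name LAB_IMPL and the two module names come from the contract above.

```python
import os
from collections.abc import Mapping

# Module names taken from the lab contract: solution.py and starter.py.
ALLOWED = ("solution", "starter")


def impl_module_name(env: Mapping[str, str] = os.environ) -> str:
    """Pick the implementation module; default to the reference solution.

    Hypothetical helper: the real test_lab.py may select differently.
    """
    name = env.get("LAB_IMPL", "solution")
    if name not in ALLOWED:
        raise ValueError(f"LAB_IMPL must be one of {ALLOWED}, got {name!r}")
    return name
```

A test module could then call `importlib.import_module(impl_module_name())` once at import time, so the same assertions run against either file.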

Labs

lab-01-trace-a-request [CPU-OK]

Drive the mini_vllm engine one step at a time and record every transition of a single request — status, num_computed_tokens, num_tokens — from prefill through decode to finish. You'll reconstruct, on an engine you control, exactly what VLLM_LOGGING_LEVEL=DEBUG prints on the real one, and internalize the course's central mental model: a request is two counters racing. Skills: the lifecycle state machine; prefill/decode as one mechanism; TTFT = step 1.
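
The "two counters racing" picture can be sketched in a few lines. This is a toy under stated assumptions, not the mini_vllm engine: `ToyRequest`, `trace`, and the status labels are hypothetical, there is no token budget (so the whole prompt is computed in step 1), and the only stop condition is max_tokens.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Hypothetical stand-in for an engine request; not the mini_vllm class."""
    prompt_len: int
    max_tokens: int
    num_computed_tokens: int = 0  # tokens the model has already processed
    num_output_tokens: int = 0    # tokens sampled so far
    status: str = "WAITING"       # illustrative status labels

    @property
    def num_tokens(self) -> int:
        # Prompt plus everything sampled so far.
        return self.prompt_len + self.num_output_tokens


def trace(req: ToyRequest) -> list[tuple[int, str, int, int]]:
    """Step the request to completion, recording one row per step."""
    rows = []
    step = 0
    while not req.status.startswith("FINISHED"):
        step += 1
        req.status = "RUNNING"
        # Compute every token not yet computed: the whole prompt on
        # step 1 (prefill), exactly one token afterwards (decode).
        req.num_computed_tokens = req.num_tokens
        # The computed counter caught up, so one new token is sampled
        # and num_tokens moves ahead again.
        req.num_output_tokens += 1
        if req.num_output_tokens >= req.max_tokens:
            req.status = "FINISHED_LENGTH"
        rows.append((step, req.status, req.num_computed_tokens, req.num_tokens))
    return rows


for row in trace(ToyRequest(prompt_len=4, max_tokens=3)):
    print(row)
```

Each row shows num_computed_tokens one behind num_tokens, and the first output token already exists at the end of step 1, which is why time to first token is the duration of that step in this model.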

lab-02-read-the-real-loop [GPU-OPT]

Run real vLLM 0.22.1 on a tiny model with debug logging and attribute every log line to a stage of EngineCore.step (core.py:428). The lab-01 trace and the production log line up one-to-one; that correlation is the moment the upstream codebase becomes readable. Captured, annotated output is included, so the lab also works without a GPU. Skills: log-line → source-line debugging; the three-call engine core; # GPU blocks as serving capacity.

lab-03-engine-step-by-hand [CPU-OK]

The rite of passage: given the engine's organs (scheduler, model, sampler), wire the schedule → execute → sample → update loop yourself, and prove it token-for-token identical to LLMEngine.step. Includes the one subtle rule of the whole loop — only requests whose computed tokens catch up this step may sample — with a test that catches you if you miss it. Skills: the engine's stage contract; the needs_sample invariant; testing by determinism.
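
A minimal sketch of that loop follows, with the sampling rule made explicit. Everything here is a hypothetical toy (`Req`, `toy_model`, `engine_step`, and the `token_budget` parameter are illustrative names, not the lab's or vLLM's API); the "model" is a deterministic function so that results are reproducible.

```python
from dataclasses import dataclass


@dataclass
class Req:
    tokens: list[int]  # prompt followed by sampled tokens
    max_new: int       # number of tokens to generate
    computed: int = 0  # how many entries of `tokens` have been processed
    new: int = 0       # how many tokens have been sampled

    @property
    def finished(self) -> bool:
        return self.new >= self.max_new


def toy_model(tokens: list[int]) -> int:
    """Deterministic stand-in for forward pass plus greedy sampling."""
    return sum(tokens) % 100


def engine_step(reqs: list[Req], token_budget: int) -> dict[int, int]:
    """One schedule -> execute -> sample -> update pass over all requests."""
    # Schedule: grant each live request as many tokens as the budget allows.
    plan: dict[int, int] = {}
    for i, r in enumerate(reqs):
        if r.finished or token_budget == 0:
            continue
        n = min(len(r.tokens) - r.computed, token_budget)
        plan[i] = n
        token_budget -= n

    sampled: dict[int, int] = {}
    for i, n in plan.items():
        r = reqs[i]
        r.computed += n  # execute: the toy only advances the counter
        # The subtle rule: sample only if this step caught up with the
        # request's tokens. A partial prefill chunk must not sample.
        needs_sample = r.computed == len(r.tokens)
        if needs_sample:
            tok = toy_model(r.tokens)
            r.tokens.append(tok)  # update
            r.new += 1
            sampled[i] = tok
    return sampled
```

With a five-token prompt and a budget of three, the first step returns no sample because the prefill is only partly done; dropping the `needs_sample` check would emit a token computed from an incomplete prompt.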

lab-04-watch-the-batch [CPU-OK]

Instrument the scheduler with a non-invasive probe and record the batch composition of every step while multiple requests run under a scarce token budget. You'll see prefill chunks and decodes co-scheduled in one step, requests joining and leaving the batch mid-flight — continuous batching, measured rather than described. Skills: the observe-don't-modify probe pattern; budget/chunk/defer mechanics; token-conservation identities for debugging schedulers.
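
The probe pattern and a conservation check can be sketched as below. This is a toy under assumptions: `schedule`, `with_probe`, and `run` are hypothetical names, the scheduler is a simple first-come-first-served budget split rather than the lab's scheduler, and all requests arrive before the first step.

```python
def schedule(remaining: dict[str, int], budget: int) -> dict[str, int]:
    """Toy FCFS scheduler: grant each request min(tokens owed, budget left)."""
    plan: dict[str, int] = {}
    for rid, need in remaining.items():
        if budget == 0:
            break  # later requests are deferred to the next step
        if need == 0:
            continue  # finished request, nothing to compute
        grant = min(need, budget)
        plan[rid] = grant
        budget -= grant
    return plan


def with_probe(fn, log: list[dict[str, int]]):
    """Wrap a scheduler so each plan is recorded without changing it."""
    def wrapped(remaining, budget):
        plan = fn(remaining, budget)
        log.append(dict(plan))  # store a copy, leave the plan untouched
        return plan
    return wrapped


def run(prompts: dict[str, int], outputs: int, budget: int) -> list[dict[str, int]]:
    """Run all requests to completion; return per-step batch composition."""
    log: list[dict[str, int]] = []
    probed = with_probe(schedule, log)
    remaining = dict(prompts)                 # tokens still to compute
    left = {rid: outputs for rid in prompts}  # tokens still to sample
    while any(remaining.values()):
        plan = probed(remaining, budget)
        for rid, n in plan.items():
            remaining[rid] -= n
            if remaining[rid] == 0:
                left[rid] -= 1          # caught up: sample one output token
                if left[rid] > 0:
                    remaining[rid] = 1  # the new token is computed next step
    return log


log = run({"a": 2, "b": 6}, outputs=3, budget=4)
for step, batch in enumerate(log, start=1):
    print(step, batch)
```

In this run, step 2 carries a one-token decode for "a" next to a three-token prefill chunk for "b", and "a" drops out of the batch after step 3. Two identities are useful when debugging: no step exceeds the budget, and each request's grants sum to its prompt length plus its output count minus one, since the last sampled token is never fed back.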

lab-05-stop-conditions [CPU-OK]

Dissect how requests end: EOS ("stop") vs max_tokens ("length"), the ignore_eos benchmark flag, and the boundary tie where both fire at once and the order of two if statements becomes a public API. Scripted token streams make every edge case exactly testable. Skills: status → finish_reason mapping; why ordering of stop checks is an API decision; triaging "my answer got cut off."
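
The boundary tie is easy to reproduce in isolation. The sketch below is hypothetical: `finish_reason` and its `eos_first` switch are illustrative and do not claim which order mini_vllm or vLLM actually uses; only the reason strings "stop" and "length" and the ignore_eos flag come from the description above.

```python
from typing import Optional


def finish_reason(
    output_ids: list[int],
    eos_id: int,
    max_tokens: int,
    ignore_eos: bool = False,
    eos_first: bool = True,  # hypothetical switch to show both orderings
) -> Optional[str]:
    """Return "stop", "length", or None if the request should keep going."""
    hit_eos = (not ignore_eos) and bool(output_ids) and output_ids[-1] == eos_id
    hit_len = len(output_ids) >= max_tokens
    checks = [("stop", hit_eos), ("length", hit_len)]
    if not eos_first:
        checks.reverse()
    # The first check that fires wins, so the order is visible to callers.
    for reason, fired in checks:
        if fired:
            return reason
    return None


EOS = 0
tie = [5, EOS]  # EOS arrives exactly on the max_tokens boundary
print(finish_reason(tie, EOS, max_tokens=2))                   # stop
print(finish_reason(tie, EOS, max_tokens=2, eos_first=False))  # length
print(finish_reason(tie, EOS, max_tokens=2, ignore_eos=True))  # length
```

The same token stream reports two different reasons depending only on check order, which is what a client sees when deciding whether an answer was truncated.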

What you can do after this phase

Explain to a colleague, with evidence you generated yourself: what one step of an inference engine does; why TTFT ≈ prefill and ITL ≈ one decode step; how N requests share an engine without ever stopping it; and what finish_reason will say and why. You also now hold the top of vLLM's call tree (EngineCore.step) in your head — every later phase is a descent into one of its three calls.