Phase 01 Labs — Architecture & Request Lifecycle
Five labs that turn the engine from a black box into your box. The arc: observe the lifecycle (lab-01), verify it on real hardware (lab-02), rebuild the loop yourself (lab-03), watch many requests share it (lab-04), and master how requests end (lab-05). Do them in order — each one's vocabulary is the next one's prerequisite.
Every [CPU-OK] lab follows the same contract: starter.py with TODOs (your work),
solution.py (the reference), test_lab.py (the spec, executable). The default test run
uses solution.py so the suite is always green; set LAB_IMPL=starter to grade yourself.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-01-architecture-and-request-lifecycle/labs -m "not gpu"
# Grade your own work on one lab:
LAB_IMPL=starter pytest phase-01-architecture-and-request-lifecycle/labs/lab-01-trace-a-request -q
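A minimal sketch of how a test file could honor LAB_IMPL, assuming the switch is a plain environment lookup that defaults to the reference. The helper names resolve_impl_name and load_impl are hypothetical; the real test_lab.py may wire this differently.

```python
import importlib
import os

ALLOWED_IMPLS = {"starter", "solution"}


def resolve_impl_name(env=None):
    """Return the module name to test: 'solution' unless LAB_IMPL overrides it."""
    env = os.environ if env is None else env
    name = env.get("LAB_IMPL", "solution")
    if name not in ALLOWED_IMPLS:
        raise ValueError(f"LAB_IMPL must be one of {sorted(ALLOWED_IMPLS)}, got {name!r}")
    return name


def load_impl(env=None):
    """Import starter.py or solution.py from the current lab directory (hypothetical helper)."""
    return importlib.import_module(resolve_impl_name(env))
```

With this shape, an unset variable always selects solution.py, which is why the default run stays green.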
Contents
- lab-01-trace-a-request [CPU-OK]
- lab-02-read-the-real-loop [GPU-OPT]
- lab-03-engine-step-by-hand [CPU-OK]
- lab-04-watch-the-batch [CPU-OK]
- lab-05-stop-conditions [CPU-OK]
- What you can do after this phase
Labs
lab-01-trace-a-request [CPU-OK]
Drive the mini_vllm engine one step at a time and record every transition of a single
request — status, num_computed_tokens, num_tokens — from prefill through decode to
finish. You'll reconstruct, on an engine you control, exactly what
VLLM_LOGGING_LEVEL=DEBUG prints on the real one, and internalize the course's central
mental model: a request is two counters racing. Skills: the lifecycle state machine;
prefill/decode as one mechanism; TTFT = step 1 (when the whole prompt fits in one step).
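The "two counters racing" model can be sketched without the engine at all. The following is a toy, not mini_vllm's code: ToyRequest, trace, token_budget, and the status strings are hypothetical names chosen for illustration. Only num_computed_tokens and num_tokens come from the lab description.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    num_prompt_tokens: int
    max_tokens: int
    num_computed_tokens: int = 0
    num_output_tokens: int = 0
    status: str = "WAITING"  # hypothetical status names

    @property
    def num_tokens(self):
        # Everything the request contains so far: prompt plus sampled outputs.
        return self.num_prompt_tokens + self.num_output_tokens


def trace(req, token_budget):
    """Step the toy request to completion, recording one row per step."""
    rows = []
    step = 0
    while req.status != "FINISHED":
        step += 1
        req.status = "RUNNING"
        # Compute as many pending tokens as the budget allows.
        scheduled = min(token_budget, req.num_tokens - req.num_computed_tokens)
        req.num_computed_tokens += scheduled
        # Only when computed catches up with num_tokens is a new token sampled.
        if req.num_computed_tokens == req.num_tokens:
            req.num_output_tokens += 1
            if req.num_output_tokens >= req.max_tokens:
                req.status = "FINISHED"
        rows.append((step, req.status, req.num_computed_tokens, req.num_tokens))
    return rows


for row in trace(ToyRequest(num_prompt_tokens=5, max_tokens=3), token_budget=8):
    print(row)
```

In this toy, prefill and decode are the same code path: prefill is a step where the gap between the counters is large, decode is a step where it is exactly one. With a budget smaller than the prompt, the first sampled token moves to a later step.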
lab-02-read-the-real-loop [GPU-OPT]
Run real vLLM 0.22.1 on a tiny model with debug logging and attribute every log line to a
stage of EngineCore.step (core.py:428). The lab-01 trace and the production log line up
one-to-one — that correlation is the moment the upstream codebase becomes readable. Captured,
annotated output included so the lab works without a GPU. Skills: log-line → source-line
debugging; the three-call engine core; # GPU blocks as serving capacity.
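Reading a block count as serving capacity is simple arithmetic. The sketch below assumes a paged KV cache where each block holds block_size tokens and each request occupies whole blocks. The function name and the numbers are illustrative, not values taken from a real log.

```python
def kv_capacity(num_gpu_blocks, block_size, tokens_per_request):
    """Return (total cacheable tokens, requests of the given length that fit at once)."""
    total_tokens = num_gpu_blocks * block_size
    # A request occupies whole blocks, so round its token count up.
    blocks_per_request = -(-tokens_per_request // block_size)
    return total_tokens, num_gpu_blocks // blocks_per_request


total, concurrent = kv_capacity(num_gpu_blocks=1000, block_size=16, tokens_per_request=100)
print(total, concurrent)
```

Under these assumptions, 1000 blocks of 16 tokens hold 16000 tokens, and a 100-token request needs 7 blocks, so 142 such requests fit concurrently.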
lab-03-engine-step-by-hand [CPU-OK]
The rite of passage: given the engine's organs (scheduler, model, sampler), wire the
schedule → execute → sample → update loop yourself, and prove it token-for-token identical
to LLMEngine.step. Includes the one subtle rule of the whole loop — only requests whose
computed tokens catch up this step may sample — with a test that catches you if you miss
it. Skills: the engine's stage contract; the needs_sample invariant; testing by
determinism.
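A rough sketch of the schedule → execute → sample → update loop with the sampling rule made explicit. Everything here is a stand-in: Req, schedule, execute, sample, and step are hypothetical, and the "model" just sums token ids so the result is deterministic. It is not the LLMEngine.step implementation.

```python
from dataclasses import dataclass


@dataclass
class Req:
    rid: str
    token_ids: list  # prompt tokens plus any outputs appended so far
    num_computed_tokens: int = 0


def schedule(reqs, budget):
    """Stage 1: hand each request as many pending tokens as the remaining budget allows."""
    plan = {}
    for r in reqs:
        n = min(budget, len(r.token_ids) - r.num_computed_tokens)
        if n > 0:
            plan[r.rid] = n
            budget -= n
    return plan


def execute(reqs, plan):
    """Stage 2: fake forward pass, one deterministic 'logit' per scheduled request."""
    return {
        r.rid: sum(r.token_ids[: r.num_computed_tokens + plan[r.rid]])
        for r in reqs
        if r.rid in plan
    }


def sample(logit):
    """Stage 3: fake sampler that maps a logit to a token id."""
    return logit % 100


def step(reqs, budget):
    plan = schedule(reqs, budget)
    logits = execute(reqs, plan)
    emitted = {}
    for r in reqs:
        if r.rid not in plan:
            continue
        r.num_computed_tokens += plan[r.rid]
        # The subtle rule: sample only if this step computed the request's last pending token.
        needs_sample = r.num_computed_tokens == len(r.token_ids)
        if needs_sample:
            token = sample(logits[r.rid])
            r.token_ids.append(token)  # Stage 4: update request state
            emitted[r.rid] = token
    return emitted
```

Dropping the needs_sample check would make a request emit a token from a half-finished prefill chunk, which is the mistake the lab's test is described as catching.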
lab-04-watch-the-batch [CPU-OK]
Instrument the scheduler with a non-invasive probe and record the batch composition of every step while multiple requests run under a scarce token budget. You'll see prefill chunks and decodes co-scheduled in one step, requests joining and leaving the batch mid-flight — continuous batching, measured rather than described. Skills: the observe-don't-modify probe pattern; budget/chunk/defer mechanics; token-conservation identities for debugging schedulers.
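The probe pattern and a token-conservation check can be sketched on a toy scheduler. All names below (toy_schedule, Probe, run) are hypothetical, and the scheduler is a simple first-come budget split, not the one the lab instruments.

```python
def toy_schedule(pending, budget):
    """Hypothetical scheduler: pending maps request id -> tokens still to compute."""
    plan = {}
    for rid, n in pending.items():
        take = min(n, budget)
        if take:
            plan[rid] = take
            budget -= take
    return plan


class Probe:
    """Wraps a schedule function and records each plan without altering it."""

    def __init__(self, schedule_fn):
        self._fn = schedule_fn
        self.history = []

    def __call__(self, pending, budget):
        plan = self._fn(pending, budget)
        self.history.append(dict(plan))  # copy, so later mutation cannot rewrite history
        return plan


def run(requests, budget):
    """requests maps id -> (prompt_len, max_tokens); returns the per-step batch composition."""
    pending = {rid: prompt for rid, (prompt, _) in requests.items()}
    outputs_left = {rid: max_toks for rid, (_, max_toks) in requests.items()}
    probe = Probe(toy_schedule)
    while any(pending.values()):
        plan = probe(pending, budget)
        for rid, n in plan.items():
            pending[rid] -= n
            if pending[rid] == 0:  # caught up, so a token is sampled
                outputs_left[rid] -= 1
                if outputs_left[rid] > 0:
                    pending[rid] = 1  # the new token must be computed next step
    return probe.history


history = run({"a": (3, 2), "b": (6, 2)}, budget=4)
for i, plan in enumerate(history, start=1):
    print(i, plan)
```

In step 2 of this run, request a decodes one token while request b is still prefilling a three-token chunk in the same batch. Two identities hold in the toy and are useful as assertions: no step schedules more than the budget, and each request's scheduled total equals prompt_len + max_tokens - 1, since the final sampled token is never fed back.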
lab-05-stop-conditions [CPU-OK]
Dissect how requests end: EOS ("stop") vs max_tokens ("length"), the ignore_eos
benchmark flag, and the boundary tie where both fire at once and the order of two if
statements becomes a public API. Scripted token streams make every edge case exactly
testable. Skills: status → finish_reason mapping; why ordering of stop checks is an API
decision; triaging "my answer got cut off."
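The boundary tie is easy to reproduce in a few lines. This sketch makes the check order a parameter so both outcomes are visible. It does not assert which order any particular engine uses, and finish_reason with its eos_first flag is a hypothetical helper.

```python
def finish_reason(output_ids, eos_id, max_tokens, ignore_eos=False, eos_first=True):
    """Return 'stop', 'length', or None for a request that has produced at least one token."""
    hit_eos = (not ignore_eos) and output_ids[-1] == eos_id
    hit_length = len(output_ids) >= max_tokens
    checks = [("stop", hit_eos), ("length", hit_length)]
    if not eos_first:
        checks.reverse()
    # The first check that fires wins, so order decides the tie.
    for reason, fired in checks:
        if fired:
            return reason
    return None


EOS = 2
print(finish_reason([5, 2], EOS, max_tokens=10))                    # EOS only
print(finish_reason([5, 6, 7], EOS, max_tokens=3))                  # length only
print(finish_reason([5, 6, 2], EOS, max_tokens=3))                  # tie, EOS checked first
print(finish_reason([5, 6, 2], EOS, max_tokens=3, eos_first=False)) # tie, length checked first
```

The last two calls receive identical token streams and report different reasons, which is why the order of the two checks is visible to every client that reads finish_reason.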
What you can do after this phase
Explain to a colleague, with evidence you generated yourself: what one step of an inference
engine does; why TTFT ≈ prefill and ITL ≈ one decode step; how N requests share an engine
without ever stopping it; and what finish_reason will say and why. You also now hold the
top of vLLM's call tree (EngineCore.step) in your head — every later phase is a descent
into one of its three calls.