Lab 01-03 — Rebuild the Engine Step by Hand [CPU-OK]
This is the rite-of-passage lab of Phase 1. In lab-01 you observed the engine loop from
the outside; now you will write it. You get the engine's organs — a Scheduler, a model,
a Sampler — and you must wire them into one working heartbeat:
schedule → execute → sample → update
When your version produces token-for-token identical output to LLMEngine.step (the
tests check exactly that), you will have personally implemented the function that sits at
the top of vLLM's call tree — the one every other phase of this course lives inside.
Contents
- Why this lab exists
- Background: what a "step" really is
- Files
- Run
- What to implement
- The subtle part: who gets to sample?
- What the tests prove
- How this maps to the real engine
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Reading a loop and being able to write it are different levels of knowledge, and the gap
between them is precisely where maintainers are made. Every nontrivial vLLM PR you will ever
review or write touches the contract between these four stages: the scheduler promises the
executor a batch shape; the executor promises the sampler logits in row order; the sampler
promises the scheduler a token per eligible request; update_from_output promises
everyone that the bookkeeping is consistent before the next tick. Bugs live at these seams.
After this lab, the seams are yours.
There's a second, sneakier payoff. By taking the engine apart and reassembling it, you
prove to yourself that LLMEngine contains no magic: it owns its components and runs a
four-line loop. That demystification compounds — when you later read
EngineCore.step upstream and it's 60 lines instead of 4, you'll see immediately that the
extra 56 are batching, async plumbing, and error handling, not new ideas.
Background: what a "step" really is
A step is the engine's clock tick. In one tick:
- schedule — the scheduler looks at every live request and produces a verdict: a map
{request_id: n}meaning "compute KV for the nextntokens of this request, this tick." For a fresh short prompt,n= the whole prompt (prefill). For a request mid-generation,n = 1(decode). For a long prompt under chunked prefill,n= one chunk. The genius of the design is that downstream stages don't care which of those it is. - execute — the model computes the forward pass for all scheduled tokens of all
scheduled requests in one batch. (In
mini_vllmthe toy model only needs each request's last token + position; the real engine feeds every scheduled token through the transformer and writes their KV into the paged cache — Phase 2.) - sample — for each request that caught up this tick (more below), turn its logits row into one new token id.
- update — advance
num_computed_tokensbyn, append sampled tokens, check stop conditions, reap the finished (free their KV, drop fromrunning).
The state at the end of a tick depends only on the state at the start — there is no hidden carry-over. That's why the engine can be paused, traced (lab-01), snapshotted, or driven by a test one tick at a time.
Files
starter.py—engine_step(scheduler, model, sampler)is stubbed with the full recipe in the docstring. Your work.solution.py— reference (mirrorsmini_vllm/engine.py::LLMEngine.step).test_lab.py— equivalence tests against the real engine, plus the mid-prefill edge case.
Run
LAB_IMPL=starter pytest phase-01-architecture-and-request-lifecycle/labs/lab-03-engine-step-by-hand -q
pytest phase-01-architecture-and-request-lifecycle/labs/lab-03-engine-step-by-hand -q # reference
What to implement
def engine_step(scheduler: Scheduler, model: ToyModel, sampler: Sampler) -> list[Request]
One full iteration; returns the requests that finished. You'll find the four stages spelled out in the starter docstring. Budget 30–45 minutes; if it takes longer, re-read 02-mini-build.md — the trouble is almost always stage 2.
The subtle part: who gets to sample?
Stage 2 hides the one genuinely subtle decision in the whole loop, and it's the reason this lab has an edge-case test:
A request emits a token this step iff its computed tokens catch up to all its tokens:
num_computed_tokens + n == num_tokens(that'sScheduler.needs_sample).
Why? Sampling token k+1 requires the logits at position k, which require the KV of all
tokens 0..k to exist. A mid-prefill chunk (say, tokens 0–3 of a 12-token prompt) computes
useful KV but leaves the request's tail un-computed — sampling now would be sampling from a
model that hasn't read the whole prompt. It would run without crashing, and it would
produce garbage. This is the classic class of inference bug: silently wrong, not loudly
broken. The test test_mid_prefill_chunk_emits_no_token exists so that if you ever forget
the guard, you find out in 50 ms instead of in production.
(The real engine encodes the same rule via logits_indices — the model runner gathers
logits only at each request's last scheduled position and the sampler only sees rows for
requests that caught up. Different mechanism, identical invariant.)
What the tests prove
| Test | What it pins |
|---|---|
test_single_request_matches_reference | Your loop = the engine's loop, simplest case |
test_batch_matches_reference | Row ordering: logits row i must go to scheduled request i. Shuffle them and tokens cross between requests — the "answer swap" bug that has hit real serving systems |
test_matches_under_chunked_prefill | Your loop survives n < remaining (chunks) and a tight token budget without changing output |
test_mid_prefill_chunk_emits_no_token | The needs_sample guard above |
test_empty_schedule_returns_no_finished | The idle path: an engine with nothing to do must do nothing, gracefully |
The equivalence tests work because everything is deterministic: the toy model's logits are a pure function of (seed, last token, position) and greedy sampling has no randomness. Two engines with the same seed must agree token-for-token — so any disagreement is a bug in your wiring, never noise. Hold on to this technique: determinism turns "looks right" into "provably identical," and it's how vLLM's own correctness tests pin scheduler changes.
How this maps to the real engine
Side by side with upstream/vllm/v1/engine/core.py:428:
| Your line | Upstream | Notes |
|---|---|---|
output = scheduler.schedule() | scheduler_output = self.scheduler.schedule() | Identical role; upstream's output also carries block tables & slot mappings (Phase 2) |
model.forward(last_tokens, positions) | self.model_executor.execute_model(scheduler_output) | Upstream ships the whole batch to GPU workers, possibly across processes/nodes (Phase 10) |
sampler.sample(logits[i], ...) | inside the model runner: self.sampler(...) | Upstream samples on the GPU, vectorized over the batch (Phase 9) |
scheduler.update_from_output(...) | self.scheduler.update_from_output(...) | Same name, same job |
Note what upstream does not do differently: the stage order, the catch-up rule, the reaping path. Architecture survives; implementation details scale.
Hitchhiker's notes
- Order within the batch is a contract.
rows[i]↔logits[i]. Inmini_vllmthis is a Python list; upstream it's tensor row indices (logits_indices). Either way, the scheduler and the sampler are communicating through positional agreement — one of those invisible contracts that only becomes visible when someone breaks it. update_from_outputmust run even on steps where nothing sampled. Mid-prefill steps still advancenum_computed_tokens— that's the whole point of the chunk. If you guard the update behindif sampled:, chunked prefill freezes forever. (Ask us how we know.)- Why does
engine_steptake the scheduler rather than creating one? Dependency injection isn't ceremony here: the tests hand you an engine's organs precisely so they can compare your loop against the engine that owns them. Upstream is shaped the same way for the same reason —EngineCorereceives its executor, which is what lets tests swap in fakes. - The model is fake; the bookkeeping is real.
ToyModelproduces deterministic pseudo-logits and ignores KV contents. Everything you wired — scheduling verdicts, catch-up sampling, reaping — is faithful. This split (real control plane, toy data plane) is the course's core trick, and it's also how you should unit-test engine changes upstream: the control plane rarely needs a GPU to be proven correct.
Going further
- Add a
callback(step_idx, output, sampled)parameter and rebuild lab-01's trace using your own step function. Observability-as-a-hook is exactly how vLLM's stat loggers attach to the loop. - Break the row-order contract on purpose (reverse
rowsbut notlogits) and watch which test catches it and how — the failure is instructive: outputs are plausible-looking tokens, just the wrong ones. - Time 1000 steps of a decode-only batch at batch sizes 1, 8, 64 (
time.perf_counter). Even on CPU with a toy model you'll see per-step overhead amortize with batch size — a small-scale preview of why batching is the first lever of throughput (Phase 18).
References
mini_vllm/engine.py— theLLMEngine.stepyou are reimplementing.upstream/vllm/v1/engine/core.py:428—EngineCore.step.upstream/vllm/v1/worker/gpu_model_runner.py— where execute/sample happen for real; search forlogits_indicesto find the catch-up rule's production form.- Yu et al., Orca (OSDI 2022) — §4, "iteration-level scheduling": the paper that first made "one step of everyone" the unit of work. https://www.usenix.org/conference/osdi22/presentation/yu
- vLLM blog, vLLM V1: A Major Upgrade — the rewrite that flattened the engine loop into the shape you just built: https://blog.vllm.ai/2025/01/27/v1-alpha-release.html