Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Lab 01-03 — Rebuild the Engine Step by Hand [CPU-OK]

This is the rite-of-passage lab of Phase 1. In lab-01 you observed the engine loop from the outside; now you will write it. You get the engine's organs — a Scheduler, a model, a Sampler — and you must wire them into one working heartbeat:

schedule  →  execute  →  sample  →  update

When your version produces token-for-token identical output to LLMEngine.step (the tests check exactly that), you will have personally implemented the function that sits at the top of vLLM's call tree — the one every other phase of this course lives inside.

Contents


Why this lab exists

Reading a loop and being able to write it are different levels of knowledge, and the gap between them is precisely where maintainers are made. Every nontrivial vLLM PR you will ever review or write touches the contract between these four stages: the scheduler promises the executor a batch shape; the executor promises the sampler logits in row order; the sampler promises the scheduler a token per eligible request; update_from_output promises everyone that the bookkeeping is consistent before the next tick. Bugs live at these seams. After this lab, the seams are yours.

There's a second, sneakier payoff. By taking the engine apart and reassembling it, you prove to yourself that LLMEngine contains no magic: it owns its components and runs a four-line loop. That demystification compounds — when you later read EngineCore.step upstream and it's 60 lines instead of 4, you'll see immediately that the extra 56 are batching, async plumbing, and error handling, not new ideas.

Background: what a "step" really is

A step is the engine's clock tick. In one tick:

  1. schedule — the scheduler looks at every live request and produces a verdict: a map {request_id: n} meaning "compute KV for the next n tokens of this request, this tick." For a fresh short prompt, n = the whole prompt (prefill). For a request mid-generation, n = 1 (decode). For a long prompt under chunked prefill, n = one chunk. The genius of the design is that downstream stages don't care which of those it is.
  2. execute — the model computes the forward pass for all scheduled tokens of all scheduled requests in one batch. (In mini_vllm the toy model only needs each request's last token + position; the real engine feeds every scheduled token through the transformer and writes their KV into the paged cache — Phase 2.)
  3. sample — for each request that caught up this tick (more below), turn its logits row into one new token id.
  4. update — advance num_computed_tokens by n, append sampled tokens, check stop conditions, reap the finished (free their KV, drop from running).

The state at the end of a tick depends only on the state at the start — there is no hidden carry-over. That's why the engine can be paused, traced (lab-01), snapshotted, or driven by a test one tick at a time.

Files

  • starter.pyengine_step(scheduler, model, sampler) is stubbed with the full recipe in the docstring. Your work.
  • solution.py — reference (mirrors mini_vllm/engine.py::LLMEngine.step).
  • test_lab.py — equivalence tests against the real engine, plus the mid-prefill edge case.

Run

LAB_IMPL=starter pytest phase-01-architecture-and-request-lifecycle/labs/lab-03-engine-step-by-hand -q
pytest phase-01-architecture-and-request-lifecycle/labs/lab-03-engine-step-by-hand -q   # reference

What to implement

def engine_step(scheduler: Scheduler, model: ToyModel, sampler: Sampler) -> list[Request]

One full iteration; returns the requests that finished. You'll find the four stages spelled out in the starter docstring. Budget 30–45 minutes; if it takes longer, re-read 02-mini-build.md — the trouble is almost always stage 2.

The subtle part: who gets to sample?

Stage 2 hides the one genuinely subtle decision in the whole loop, and it's the reason this lab has an edge-case test:

A request emits a token this step iff its computed tokens catch up to all its tokens: num_computed_tokens + n == num_tokens (that's Scheduler.needs_sample).

Why? Sampling token k+1 requires the logits at position k, which require the KV of all tokens 0..k to exist. A mid-prefill chunk (say, tokens 0–3 of a 12-token prompt) computes useful KV but leaves the request's tail un-computed — sampling now would be sampling from a model that hasn't read the whole prompt. It would run without crashing, and it would produce garbage. This is the classic class of inference bug: silently wrong, not loudly broken. The test test_mid_prefill_chunk_emits_no_token exists so that if you ever forget the guard, you find out in 50 ms instead of in production.

(The real engine encodes the same rule via logits_indices — the model runner gathers logits only at each request's last scheduled position and the sampler only sees rows for requests that caught up. Different mechanism, identical invariant.)

What the tests prove

TestWhat it pins
test_single_request_matches_referenceYour loop = the engine's loop, simplest case
test_batch_matches_referenceRow ordering: logits row i must go to scheduled request i. Shuffle them and tokens cross between requests — the "answer swap" bug that has hit real serving systems
test_matches_under_chunked_prefillYour loop survives n < remaining (chunks) and a tight token budget without changing output
test_mid_prefill_chunk_emits_no_tokenThe needs_sample guard above
test_empty_schedule_returns_no_finishedThe idle path: an engine with nothing to do must do nothing, gracefully

The equivalence tests work because everything is deterministic: the toy model's logits are a pure function of (seed, last token, position) and greedy sampling has no randomness. Two engines with the same seed must agree token-for-token — so any disagreement is a bug in your wiring, never noise. Hold on to this technique: determinism turns "looks right" into "provably identical," and it's how vLLM's own correctness tests pin scheduler changes.

How this maps to the real engine

Side by side with upstream/vllm/v1/engine/core.py:428:

Your lineUpstreamNotes
output = scheduler.schedule()scheduler_output = self.scheduler.schedule()Identical role; upstream's output also carries block tables & slot mappings (Phase 2)
model.forward(last_tokens, positions)self.model_executor.execute_model(scheduler_output)Upstream ships the whole batch to GPU workers, possibly across processes/nodes (Phase 10)
sampler.sample(logits[i], ...)inside the model runner: self.sampler(...)Upstream samples on the GPU, vectorized over the batch (Phase 9)
scheduler.update_from_output(...)self.scheduler.update_from_output(...)Same name, same job

Note what upstream does not do differently: the stage order, the catch-up rule, the reaping path. Architecture survives; implementation details scale.

Hitchhiker's notes

  • Order within the batch is a contract. rows[i]logits[i]. In mini_vllm this is a Python list; upstream it's tensor row indices (logits_indices). Either way, the scheduler and the sampler are communicating through positional agreement — one of those invisible contracts that only becomes visible when someone breaks it.
  • update_from_output must run even on steps where nothing sampled. Mid-prefill steps still advance num_computed_tokens — that's the whole point of the chunk. If you guard the update behind if sampled:, chunked prefill freezes forever. (Ask us how we know.)
  • Why does engine_step take the scheduler rather than creating one? Dependency injection isn't ceremony here: the tests hand you an engine's organs precisely so they can compare your loop against the engine that owns them. Upstream is shaped the same way for the same reason — EngineCore receives its executor, which is what lets tests swap in fakes.
  • The model is fake; the bookkeeping is real. ToyModel produces deterministic pseudo-logits and ignores KV contents. Everything you wired — scheduling verdicts, catch-up sampling, reaping — is faithful. This split (real control plane, toy data plane) is the course's core trick, and it's also how you should unit-test engine changes upstream: the control plane rarely needs a GPU to be proven correct.

Going further

  • Add a callback(step_idx, output, sampled) parameter and rebuild lab-01's trace using your own step function. Observability-as-a-hook is exactly how vLLM's stat loggers attach to the loop.
  • Break the row-order contract on purpose (reverse rows but not logits) and watch which test catches it and how — the failure is instructive: outputs are plausible-looking tokens, just the wrong ones.
  • Time 1000 steps of a decode-only batch at batch sizes 1, 8, 64 (time.perf_counter). Even on CPU with a toy model you'll see per-step overhead amortize with batch size — a small-scale preview of why batching is the first lever of throughput (Phase 18).

References

  • mini_vllm/engine.py — the LLMEngine.step you are reimplementing.
  • upstream/vllm/v1/engine/core.py:428EngineCore.step.
  • upstream/vllm/v1/worker/gpu_model_runner.py — where execute/sample happen for real; search for logits_indices to find the catch-up rule's production form.
  • Yu et al., Orca (OSDI 2022) — §4, "iteration-level scheduling": the paper that first made "one step of everyone" the unit of work. https://www.usenix.org/conference/osdi22/presentation/yu
  • vLLM blog, vLLM V1: A Major Upgrade — the rewrite that flattened the engine loop into the shape you just built: https://blog.vllm.ai/2025/01/27/v1-alpha-release.html