Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Lab 01-01 — Trace a Request's Lifecycle [CPU-OK]

"The first thing a systems engineer does with a black box is make it stop being one."

You are going to take one request — a single prompt — and watch every heartbeat of its life inside an inference engine: from the moment it's admitted, through its prefill, through every decode step, until the engine pronounces it finished. By the end you will have produced, with your own code, the exact trace that vLLM emits when you run it with VLLM_LOGGING_LEVEL=DEBUG — and more importantly, you'll know why every line of that trace looks the way it does.

Contents


Why this lab exists

Every hard debugging session you will ever have on an inference engine — a stuck request, a latency cliff, a throughput regression, a preemption storm — starts with the same question: "what is the engine doing with my request right now?" If you can't answer that, you're guessing. If you can, you're an engineer.

The trouble is that production engines hide the lifecycle behind a convenience API. You call llm.generate(...), you get text back, and the thousand scheduling decisions in between are invisible. This lab removes the convenience wrapper. You will drive the engine one step at a time with your own loop, and after each step you'll photograph the request's state: its status, how many of its tokens have been computed, how many exist in total.

That photograph sequence is the request lifecycle. Once you've built it yourself, the real engine's debug logs (lab-02), its Prometheus metrics (vllm:num_requests_running, vllm:num_requests_waiting — Phase 18), and its scheduler internals (Phase 3) all become readable at a glance, because they're all just different projections of the same state machine you're about to instrument.

Background: the one mental model to rule them all

Here is the single most important idea in this whole phase, the one the real vLLM scheduler is built on (see the comment block at upstream/vllm/v1/core/sched/scheduler.py:330):

There is no "prefill phase" and no "decode phase." A request is just two counters racing: num_computed_tokens chasing num_tokens.

  • num_tokens = prompt tokens + tokens generated so far. It grows by 1 every time the request samples a new token.
  • num_computed_tokens = how many of those tokens have had their KV (attention key/value) computed and stored in the cache. The model can only sample a new token when this counter has caught up — when every existing token's KV is in place.

"Prefill" is merely the situation where num_computed_tokens is far behind (the whole prompt's KV is missing) and the engine computes a big batch of it at once. "Decode" is the situation where it's exactly one behind, and each step computes one token's KV and samples one new token. The same loop handles both. This is what makes chunked prefill (Phase 3), prefix caching (Phases 2–3), and continuous batching fall out naturally instead of being special cases — and it's the single design decision that most distinguishes vLLM's V1 engine from a naive two-phase implementation.

The lifecycle states you'll observe (from mini_vllm/request.py, mirroring upstream/vllm/v1/request.py):

                 add_request()            scheduled              stop condition
   (created) ───────────────▶ WAITING ───────────────▶ RUNNING ───────────────▶ FINISHED_*
                                  ▲                       │
                                  │     memory pressure   │
                                  └──── PREEMPTED ◀───────┘        (Phase 3, lab-04)

FINISHED_* is two states in practice: FINISHED_STOPPED (hit the EOS token) and FINISHED_LENGTH (hit max_tokens). Lab-05 dissects that distinction; in this lab we pin ignore_eos=True so length is always the stop reason and the trace is deterministic.

Files

  • starter.py — implement trace_request (the manual step loop + snapshotting). Your work.
  • solution.py — a complete reference. Resist opening it until your tests pass or you're genuinely stuck; the value of the lab is the 20 minutes of thinking.
  • test_lab.py — pins the lifecycle shape: prefill-in-step-1, monotonic counters, one-token-per-decode, finish-at-cap.

Run

# Test YOUR implementation:
LAB_IMPL=starter pytest phase-01-architecture-and-request-lifecycle/labs/lab-01-trace-a-request -q

# Test the reference (default — this is why the suite is green out of the box):
pytest phase-01-architecture-and-request-lifecycle/labs/lab-01-trace-a-request -q

What to implement

def trace_request(prompt: str, max_tokens: int = 4, **engine_kwargs) -> list[Event]

where Event is (request_id, status, num_computed_tokens, num_tokens). The recipe:

  1. Build an LLMEngine(**engine_kwargs) and add one request with SamplingParams(max_tokens=max_tokens, temperature=0.0, ignore_eos=True). (Greedy + ignore-EOS = a fully deterministic, fixed-length run. Determinism is not a nicety here — it's what makes the lifecycle testable.)
  2. Loop while eng.scheduler.has_unfinished_requests(): calling eng.step() yourself. This is the whole trick: generate() would run this loop for you and hide everything; you are running it by hand so you can look between the steps.
  3. After each step, snapshot every request in eng.scheduler.running, then every request in the list step() returned (those just finished and have already been removed from running — if you only look at running, the final FINISHED_LENGTH event vanishes. This is the classic observability bug: the most interesting state transition is the one that removes the thing you're observing).

See 02-mini-build.md for the engine's anatomy if you haven't built it yet.

What you should see — and why every number is what it is

For trace_request("hello", max_tokens=4) your event list should look like this:

Event(request_id='req-0', status='RUNNING',         num_computed_tokens=5, num_tokens=6)
Event(request_id='req-0', status='RUNNING',         num_computed_tokens=6, num_tokens=7)
Event(request_id='req-0', status='RUNNING',         num_computed_tokens=7, num_tokens=8)
Event(request_id='req-0', status='FINISHED_LENGTH', num_computed_tokens=8, num_tokens=9)

Every number above is explainable, and being able to explain it is the point:

  • Why does the first event already say num_computed_tokens=5? "hello" is 5 bytes, and mini_vllm's ByteTokenizer is one token per byte, so the prompt is 5 tokens. The prompt easily fits the scheduler's token budget (default 2048), so the entire prefill happens inside step 1. You never observe num_computed_tokens < 5 because there is no "between" to observe — the counter goes 0 → 5 inside one step. (Make the prompt longer than the budget, or set long_prefill_token_threshold, and you will see intermediate values. Try it. That's chunked prefill, and it's lab 03-02.)
  • Why is num_tokens=6 in that same first event? Because step 1 didn't just prefill — the prefill caught up (num_computed == num_tokens was about to hold), so the model sampled token #1 in the same step. Prompt (5) + 1 output = 6. Prefill and first-token generation are one step, which is why TTFT (time-to-first-token) ≈ prefill time in every serving benchmark you'll ever read.
  • Why does each subsequent event advance both counters by exactly 1? That's a decode step: compute KV for the one new token, sample the next. One in, one out, forever — this lockstep is why decode is memory-bandwidth-bound (you re-read all the weights to produce a single token per request; see the roofline discussion in Phase 18).
  • Why does it finish at num_tokens=9 and not 10? max_tokens=4 counts output tokens: 5 prompt + 4 output = 9. The status is FINISHED_LENGTH because we set ignore_eos=True — the request was always going to run to its cap.
  • Why are there exactly 4 events? One snapshot per step, and the run takes exactly max_tokens steps: 1 step of (prefill + first token) and 3 pure decode steps. Burn this formula in: steps = max_tokens when the prompt fits one prefill. A 1000-token answer is a thousand trips around the engine loop. That is why decode dominates serving cost, and why everything from CUDA graphs (Phase 5) to speculative decoding (Phase 8) exists.
  • Notice what you never see: WAITING. The request is admitted in the very first schedule() because the engine is empty. On a loaded server, requests queue in WAITING — and time spent there is pure user-visible latency that no kernel optimization can fix. You'll create real WAITING time in lab-04 by starving the token budget.

What the tests prove

TestThe invariant it pinsWhy a maintainer cares
test_first_event_is_running_after_prefillStep 1 completes the prompt's KV (computed == 5)TTFT = prefill; admission happens in schedule(), not add_request()
test_counters_monotonic_and_decode_by_onenum_computed_tokens never decreases; decode advances it by exactly 1A counter going backwards means preemption (Phase 3) — in this lab it would mean your loop is broken
test_finishes_at_length_capTerminal status starts with FINISHEDFinished requests must leave running and free their KV — the reaping path
test_total_decode_steps_equals_max_tokensExactly max_tokens steps for a budget-fitting promptThe steps = output-tokens equivalence underlying every latency model

How this maps to the real engine

Open upstream/vllm/v1/engine/core.py:428 (EngineCore.step) next to your loop. The correspondence is one-to-one:

Your loopReal engineWhat it does
eng.step() calls scheduler.schedule()self.scheduler.schedule()Decide which requests compute how many tokens this step
model forward + sampler inside step()self.model_executor.execute_model(...)Run the GPU forward pass, sample
scheduler.update_from_output(...)self.scheduler.update_from_output(...)Advance counters, detect stops, reap finished
your events.append(...)VLLM_LOGGING_LEVEL=DEBUG log lines / Prometheus gaugesObservability

The real Request (upstream/vllm/v1/request.py) carries the same two counters with the same names. The real RequestStatus has the same states plus a few you'll meet later (FINISHED_ABORTED, FINISHED_IGNORED). When you read the V1 scheduler in Phase 3, you'll recognize every field because you traced it here first.

Hitchhiker's notes (gotchas & deeper cuts)

  • Don't snapshot before the first step. Between add_request() and the first schedule(), the request is WAITING with num_computed_tokens=0 — real, but the tests deliberately start observation after step 1, because that's when the engine has actually done something. If you want the WAITING event, add it; just know why the tests don't ask for it.
  • The finished request is not in scheduler.running. update_from_output reaps it before step() returns. That's why step() returns the finished list — it's your only handle on them. Real vLLM has the same shape: finished requests come back in EngineCoreOutputs, not in the scheduler's queues.
  • Why temperature=0.0? The toy model is deterministic given (last token, position), and greedy sampling makes the whole token stream reproducible. With temperature > 0 the lifecycle shape would be identical but the test for exact step counts could break if a sampled EOS sneaked in. Determinism first, then realism — a good habit for engine tests generally (the real vLLM test suite leans hard on greedy for the same reason).
  • One step ≠ one token, in general. It's one scheduling quantum. This lab's prompt fits in one chunk so steps and output tokens align; chunked prefill (Phase 3) breaks that alignment on purpose. If you internalize "step = the engine's clock tick, in which each scheduled request advances num_computed_tokens by some amount," nothing later will surprise you.

Going further

  • Re-run with long_prefill_token_threshold=2 and a 10-char prompt. You should now see RUNNING events with num_computed_tokens at 2, 4, 6, 8, 10 — and crucially, num_tokens not growing during those steps (mid-prefill steps emit no token; see Scheduler.needs_sample). You've just watched chunked prefill with your own eyes, two phases early.
  • Trace two requests at once (pass a second prompt). Watch them interleave within the same steps — that's continuous batching, and it's lab-04.
  • Compute TTFT and ITL (inter-token latency) in steps from your event list. On real hardware each step has a wall-clock cost roughly proportional to its scheduled token count; your step trace is the skeleton of every latency benchmark in Phase 18.

References

  • upstream/vllm/v1/engine/core.py:428EngineCore.step, the loop you reproduced.
  • upstream/vllm/v1/request.py — the real Request and RequestStatus.
  • upstream/vllm/v1/core/sched/scheduler.py:330 — the "no prefill phase, no decode phase" comment this lab is built around.
  • Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023) — §3 describes the request lifecycle this trace makes visible. https://arxiv.org/abs/2309.06180
  • Yu et al., Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022) — the paper that introduced iteration-level (per-step) scheduling, i.e. the reason your trace advances per step and not per request. https://www.usenix.org/conference/osdi22/presentation/yu
  • vLLM blog, vLLM V1: A Major Upgrade to vLLM's Core Architecture (Jan 2025) — the V1 engine-loop redesign you're tracing. https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
  • kipply, Transformer Inference Arithmetic — why prefill is compute-bound and decode is bandwidth-bound, the physics behind your step counts. https://kipp.ly/transformer-inference-arithmetic/