Lab 01-01 — Trace a Request's Lifecycle [CPU-OK]
"The first thing a systems engineer does with a black box is make it stop being one."
You are going to take one request — a single prompt — and watch every heartbeat of its life
inside an inference engine: from the moment it's admitted, through its prefill, through every
decode step, until the engine pronounces it finished. By the end you will have produced, with
your own code, the exact trace that vLLM emits when you run it with
VLLM_LOGGING_LEVEL=DEBUG — and more importantly, you'll know why every line of that trace
looks the way it does.
Contents
- Why this lab exists
- Background: the one mental model to rule them all
- Files
- Run
- What to implement
- What you should see — and why every number is what it is
- What the tests prove
- How this maps to the real engine
- Hitchhiker's notes (gotchas & deeper cuts)
- Going further
- References
Why this lab exists
Every hard debugging session you will ever have on an inference engine — a stuck request, a latency cliff, a throughput regression, a preemption storm — starts with the same question: "what is the engine doing with my request right now?" If you can't answer that, you're guessing. If you can, you're an engineer.
The trouble is that production engines hide the lifecycle behind a convenience API. You call
llm.generate(...), you get text back, and the thousand scheduling decisions in between are
invisible. This lab removes the convenience wrapper. You will drive the engine one step at
a time with your own loop, and after each step you'll photograph the request's state:
its status, how many of its tokens have been computed, how many exist in total.
That photograph sequence is the request lifecycle. Once you've built it yourself, the real
engine's debug logs (lab-02), its Prometheus metrics (vllm:num_requests_running,
vllm:num_requests_waiting — Phase 18), and its scheduler internals (Phase 3) all become
readable at a glance, because they're all just different projections of the same state
machine you're about to instrument.
Background: the one mental model to rule them all
Here is the single most important idea in this whole phase, the one the real vLLM scheduler
is built on (see the comment block at upstream/vllm/v1/core/sched/scheduler.py:330):
There is no "prefill phase" and no "decode phase." A request is just two counters racing:
num_computed_tokenschasingnum_tokens.
num_tokens= prompt tokens + tokens generated so far. It grows by 1 every time the request samples a new token.num_computed_tokens= how many of those tokens have had their KV (attention key/value) computed and stored in the cache. The model can only sample a new token when this counter has caught up — when every existing token's KV is in place.
"Prefill" is merely the situation where num_computed_tokens is far behind (the whole
prompt's KV is missing) and the engine computes a big batch of it at once. "Decode" is the
situation where it's exactly one behind, and each step computes one token's KV and samples
one new token. The same loop handles both. This is what makes chunked prefill (Phase 3),
prefix caching (Phases 2–3), and continuous batching fall out naturally instead of being
special cases — and it's the single design decision that most distinguishes vLLM's V1 engine
from a naive two-phase implementation.
The lifecycle states you'll observe (from mini_vllm/request.py, mirroring
upstream/vllm/v1/request.py):
add_request() scheduled stop condition
(created) ───────────────▶ WAITING ───────────────▶ RUNNING ───────────────▶ FINISHED_*
▲ │
│ memory pressure │
└──── PREEMPTED ◀───────┘ (Phase 3, lab-04)
FINISHED_* is two states in practice: FINISHED_STOPPED (hit the EOS token) and
FINISHED_LENGTH (hit max_tokens). Lab-05 dissects that distinction; in this lab we pin
ignore_eos=True so length is always the stop reason and the trace is deterministic.
Files
starter.py— implementtrace_request(the manual step loop + snapshotting). Your work.solution.py— a complete reference. Resist opening it until your tests pass or you're genuinely stuck; the value of the lab is the 20 minutes of thinking.test_lab.py— pins the lifecycle shape: prefill-in-step-1, monotonic counters, one-token-per-decode, finish-at-cap.
Run
# Test YOUR implementation:
LAB_IMPL=starter pytest phase-01-architecture-and-request-lifecycle/labs/lab-01-trace-a-request -q
# Test the reference (default — this is why the suite is green out of the box):
pytest phase-01-architecture-and-request-lifecycle/labs/lab-01-trace-a-request -q
What to implement
def trace_request(prompt: str, max_tokens: int = 4, **engine_kwargs) -> list[Event]
where Event is (request_id, status, num_computed_tokens, num_tokens). The recipe:
- Build an
LLMEngine(**engine_kwargs)and add one request withSamplingParams(max_tokens=max_tokens, temperature=0.0, ignore_eos=True). (Greedy + ignore-EOS = a fully deterministic, fixed-length run. Determinism is not a nicety here — it's what makes the lifecycle testable.) - Loop
while eng.scheduler.has_unfinished_requests():callingeng.step()yourself. This is the whole trick:generate()would run this loop for you and hide everything; you are running it by hand so you can look between the steps. - After each step, snapshot every request in
eng.scheduler.running, then every request in the liststep()returned (those just finished and have already been removed fromrunning— if you only look atrunning, the finalFINISHED_LENGTHevent vanishes. This is the classic observability bug: the most interesting state transition is the one that removes the thing you're observing).
See 02-mini-build.md for the engine's anatomy if you haven't built it yet.
What you should see — and why every number is what it is
For trace_request("hello", max_tokens=4) your event list should look like this:
Event(request_id='req-0', status='RUNNING', num_computed_tokens=5, num_tokens=6)
Event(request_id='req-0', status='RUNNING', num_computed_tokens=6, num_tokens=7)
Event(request_id='req-0', status='RUNNING', num_computed_tokens=7, num_tokens=8)
Event(request_id='req-0', status='FINISHED_LENGTH', num_computed_tokens=8, num_tokens=9)
Every number above is explainable, and being able to explain it is the point:
- Why does the first event already say
num_computed_tokens=5?"hello"is 5 bytes, andmini_vllm'sByteTokenizeris one token per byte, so the prompt is 5 tokens. The prompt easily fits the scheduler's token budget (default 2048), so the entire prefill happens inside step 1. You never observenum_computed_tokens < 5because there is no "between" to observe — the counter goes 0 → 5 inside one step. (Make the prompt longer than the budget, or setlong_prefill_token_threshold, and you will see intermediate values. Try it. That's chunked prefill, and it's lab 03-02.) - Why is
num_tokens=6in that same first event? Because step 1 didn't just prefill — the prefill caught up (num_computed == num_tokenswas about to hold), so the model sampled token #1 in the same step. Prompt (5) + 1 output = 6. Prefill and first-token generation are one step, which is why TTFT (time-to-first-token) ≈ prefill time in every serving benchmark you'll ever read. - Why does each subsequent event advance both counters by exactly 1? That's a decode step: compute KV for the one new token, sample the next. One in, one out, forever — this lockstep is why decode is memory-bandwidth-bound (you re-read all the weights to produce a single token per request; see the roofline discussion in Phase 18).
- Why does it finish at
num_tokens=9and not 10?max_tokens=4counts output tokens: 5 prompt + 4 output = 9. The status isFINISHED_LENGTHbecause we setignore_eos=True— the request was always going to run to its cap. - Why are there exactly 4 events? One snapshot per step, and the run takes exactly
max_tokenssteps: 1 step of (prefill + first token) and 3 pure decode steps. Burn this formula in: steps = max_tokens when the prompt fits one prefill. A 1000-token answer is a thousand trips around the engine loop. That is why decode dominates serving cost, and why everything from CUDA graphs (Phase 5) to speculative decoding (Phase 8) exists. - Notice what you never see:
WAITING. The request is admitted in the very firstschedule()because the engine is empty. On a loaded server, requests queue in WAITING — and time spent there is pure user-visible latency that no kernel optimization can fix. You'll create real WAITING time in lab-04 by starving the token budget.
What the tests prove
| Test | The invariant it pins | Why a maintainer cares |
|---|---|---|
test_first_event_is_running_after_prefill | Step 1 completes the prompt's KV (computed == 5) | TTFT = prefill; admission happens in schedule(), not add_request() |
test_counters_monotonic_and_decode_by_one | num_computed_tokens never decreases; decode advances it by exactly 1 | A counter going backwards means preemption (Phase 3) — in this lab it would mean your loop is broken |
test_finishes_at_length_cap | Terminal status starts with FINISHED | Finished requests must leave running and free their KV — the reaping path |
test_total_decode_steps_equals_max_tokens | Exactly max_tokens steps for a budget-fitting prompt | The steps = output-tokens equivalence underlying every latency model |
How this maps to the real engine
Open upstream/vllm/v1/engine/core.py:428 (EngineCore.step) next to your loop. The
correspondence is one-to-one:
| Your loop | Real engine | What it does |
|---|---|---|
eng.step() calls scheduler.schedule() | self.scheduler.schedule() | Decide which requests compute how many tokens this step |
model forward + sampler inside step() | self.model_executor.execute_model(...) | Run the GPU forward pass, sample |
scheduler.update_from_output(...) | self.scheduler.update_from_output(...) | Advance counters, detect stops, reap finished |
your events.append(...) | VLLM_LOGGING_LEVEL=DEBUG log lines / Prometheus gauges | Observability |
The real Request (upstream/vllm/v1/request.py) carries the same two counters with the
same names. The real RequestStatus has the same states plus a few you'll meet later
(FINISHED_ABORTED, FINISHED_IGNORED). When you read the V1 scheduler in Phase 3, you'll
recognize every field because you traced it here first.
Hitchhiker's notes (gotchas & deeper cuts)
- Don't snapshot before the first step. Between
add_request()and the firstschedule(), the request is WAITING withnum_computed_tokens=0— real, but the tests deliberately start observation after step 1, because that's when the engine has actually done something. If you want the WAITING event, add it; just know why the tests don't ask for it. - The finished request is not in
scheduler.running.update_from_outputreaps it beforestep()returns. That's whystep()returns the finished list — it's your only handle on them. Real vLLM has the same shape: finished requests come back inEngineCoreOutputs, not in the scheduler's queues. - Why
temperature=0.0? The toy model is deterministic given (last token, position), and greedy sampling makes the whole token stream reproducible. With temperature > 0 the lifecycle shape would be identical but the test for exact step counts could break if a sampled EOS sneaked in. Determinism first, then realism — a good habit for engine tests generally (the real vLLM test suite leans hard on greedy for the same reason). - One step ≠ one token, in general. It's one scheduling quantum. This lab's prompt
fits in one chunk so steps and output tokens align; chunked prefill (Phase 3) breaks that
alignment on purpose. If you internalize "step = the engine's clock tick, in which each
scheduled request advances
num_computed_tokensby some amount," nothing later will surprise you.
Going further
- Re-run with
long_prefill_token_threshold=2and a 10-char prompt. You should now seeRUNNINGevents withnum_computed_tokensat 2, 4, 6, 8, 10 — and crucially,num_tokensnot growing during those steps (mid-prefill steps emit no token; seeScheduler.needs_sample). You've just watched chunked prefill with your own eyes, two phases early. - Trace two requests at once (pass a second prompt). Watch them interleave within the same steps — that's continuous batching, and it's lab-04.
- Compute TTFT and ITL (inter-token latency) in steps from your event list. On real hardware each step has a wall-clock cost roughly proportional to its scheduled token count; your step trace is the skeleton of every latency benchmark in Phase 18.
References
upstream/vllm/v1/engine/core.py:428—EngineCore.step, the loop you reproduced.upstream/vllm/v1/request.py— the realRequestandRequestStatus.upstream/vllm/v1/core/sched/scheduler.py:330— the "no prefill phase, no decode phase" comment this lab is built around.- Kwon et al., Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023) — §3 describes the request lifecycle this trace makes visible. https://arxiv.org/abs/2309.06180
- Yu et al., Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022) — the paper that introduced iteration-level (per-step) scheduling, i.e. the reason your trace advances per step and not per request. https://www.usenix.org/conference/osdi22/presentation/yu
- vLLM blog, vLLM V1: A Major Upgrade to vLLM's Core Architecture (Jan 2025) — the V1 engine-loop redesign you're tracing. https://blog.vllm.ai/2025/01/27/v1-alpha-release.html
- kipply, Transformer Inference Arithmetic — why prefill is compute-bound and decode is bandwidth-bound, the physics behind your step counts. https://kipp.ly/transformer-inference-arithmetic/