Lab 01-05 — Stop Conditions & Finish Reasons [CPU-OK]
Every request dies. The only questions are when and what we tell the user about it. This
lab dissects the engine's stop machinery — the few lines of update_from_output that decide
whether a generation halts on the model's own EOS token or on the operator's max_tokens
cap — and the mapping from internal status to the finish_reason field that every OpenAI
API consumer in the world branches on.
It looks small. It is small. It is also the part of the engine with the highest bug-impact-to-code-size ratio: an off-by-one or a mis-ordered check here doesn't crash — it silently truncates answers, or streams one token too many, for every user, forever.
Contents
- Why this lab exists
- Background: the three ways a request ends
- Files
- Run
- What to implement
- The edge case the tests are really about
- What the tests prove
- How this maps to the real engine
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Ask anyone who's run an LLM API in production what their most common user-facing bug
report is. It won't be a crash. It will be: "the answer just… cuts off." Triaging that
report requires knowing exactly what you'll know after this lab: was it finish_reason: "length" (the operator's cap — raise max_tokens), "stop" (the model chose to end — a
prompting issue), or a stream that died without a reason (an actual bug)? The distinction is
three enum values and two if statements, and entire support rotations have burned days for
lack of it.
There's an engineering lesson too. Stop handling is where model behavior (EOS is just a
token the model can emit, with a probability like any other) meets system policy
(max_tokens is an admission-control and billing boundary). Keeping those two cleanly
separated — and correctly ordered — is a miniature of the whole serving-systems
discipline.
Background: the three ways a request ends
- The model stops itself: it samples the EOS (end-of-sequence) token. EOS is not magic: it's a vocabulary entry (id 256 in mini_vllm's ByteTokenizer; id 2 for the Llama 1/2 tokenizer; <|endoftext|> = 50256 for GPT-2) that the model learned to emit when a response is complete. The engine checks "was the token just appended the EOS?" and if so marks FINISHED_STOPPED → API finish_reason: "stop". A well-behaved model ends most chat turns this way.
- The operator stops it: num_output_tokens >= max_tokens. Marked FINISHED_LENGTH → API finish_reason: "length". To an API consumer this usually means "your answer was truncated; consider a bigger budget." To the operator it's the lever that bounds worst-case cost and KV occupancy per request, because schedulers need a worst case to exist (this is the deadlock argument you will meet in Phase 3 lab-04).
- Someone aborts it: client disconnect, admin action. Real vLLM has FINISHED_ABORTED for this; mini_vllm omits it (no clients to disconnect). Worth knowing it exists: cancellation is a first-class lifecycle path in production, and "KV freed on abort" is a real invariant people have broken.
And one anti-way that trips newcomers: ignore_eos=True (used throughout this course's
tests, and by every serious benchmark) disables check #1, so generation always runs to the
cap. Why would anyone want a model to blow through its own stop sign? Benchmarking. If
you're measuring tokens/sec, you need every request to produce a known, fixed number of
tokens regardless of what the model "wants" to say. The flag exists for load generators,
not users — and you've been benefiting from it since lab-01 without noticing: it's what made
your traces deterministic in length.
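The endings that mini_vllm implements, plus the ignore_eos override, can be sketched in a few lines. This is a minimal illustration, not the actual code in mini_vllm/request.py: the Status enum, the Params dataclass, and this standalone maybe_finish signature are assumptions made for the example. Only the status names, the EOS id 256, and the two API strings come from the text above.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Optional


class Status(Enum):  # hypothetical stand-in for the engine's status enum
    RUNNING = auto()
    FINISHED_STOPPED = auto()
    FINISHED_LENGTH = auto()


@dataclass
class Params:  # hypothetical stand-in for the sampling parameters
    max_tokens: int
    ignore_eos: bool = False


def maybe_finish(output_ids: List[int], eos_id: int, params: Params) -> Status:
    """Decide the status right after a token was appended to output_ids."""
    # Check 1: the model stopped itself (skipped entirely under ignore_eos).
    if not params.ignore_eos and output_ids and output_ids[-1] == eos_id:
        return Status.FINISHED_STOPPED
    # Check 2: the operator's cap on output tokens.
    if len(output_ids) >= params.max_tokens:
        return Status.FINISHED_LENGTH
    return Status.RUNNING


def finish_reason(status: Status) -> Optional[str]:
    """Translate an internal status into the API string (None while running)."""
    return {
        Status.FINISHED_STOPPED: "stop",
        Status.FINISHED_LENGTH: "length",
    }.get(status)


EOS = 256  # the ByteTokenizer EOS id quoted above
print(finish_reason(maybe_finish([10, EOS], EOS, Params(max_tokens=8))))
print(finish_reason(maybe_finish([10, EOS], EOS, Params(max_tokens=8, ignore_eos=True))))
```

The first call reports "stop". The second reports None: with ignore_eos set, the EOS is just another token, and the request keeps running until the cap is reached.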
Files
- starter.py: implement finish_reason (status → API string) and run_until_stop (the feed-tokens-until-something-fires simulation of the update stage). Your work.
- solution.py: reference.
- test_lab.py: the EOS path, the ignore_eos path, the length path, the boundary tie, the unfinished case, and an end-to-end engine check.
Run
LAB_IMPL=starter pytest phase-01-architecture-and-request-lifecycle/labs/lab-05-stop-conditions -q
pytest phase-01-architecture-and-request-lifecycle/labs/lab-05-stop-conditions -q # reference
What to implement
Two functions. finish_reason(request) is the status-to-API translation table.
run_until_stop(token_stream, eos_token_id, sampling_params) replays the engine's update
stage with pre-decided tokens: append one, run maybe_finish(), break if it fired. Using
a scripted token stream instead of a sampler is the trick that makes stop logic exhaustively
testable — you can place an EOS at any position you like, including exactly on the
max_tokens boundary, something you could wait a long time for a sampler to do for you.
(This is also how you should test stop-sequence handling upstream: script the stream, pin
the behavior.)
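The scripted-stream idea can be sketched independently of the lab's types. The function below is a hypothetical stand-in named replay, not the lab's run_until_stop: it takes plain integers and keyword arguments instead of a request and sampling_params object, and returns the tokens consumed plus a reason string. Treat it as an illustration of the technique, not as the reference solution.

```python
from typing import Iterable, List, Optional, Tuple


def replay(tokens: Iterable[int], eos_id: int, max_tokens: int,
           ignore_eos: bool = False) -> Tuple[List[int], Optional[str]]:
    """Feed pre-decided tokens one at a time; halt as soon as a check fires."""
    out: List[int] = []
    for tok in tokens:
        out.append(tok)  # the "update" step: append exactly one token
        if not ignore_eos and tok == eos_id:
            return out, "stop"
        if len(out) >= max_tokens:
            return out, "length"
    return out, None  # the script ran dry before any stop condition fired


EOS = 256

# EOS placed mid-stream: the token scripted after it is never consumed.
assert replay([10, 20, EOS, 30], EOS, max_tokens=8) == ([10, 20, EOS], "stop")

# No EOS in the script: the cap fires at exactly max_tokens.
assert replay(range(100), EOS, max_tokens=5) == ([0, 1, 2, 3, 4], "length")
```

Because the stream is a plain iterable, each test case is one line of data. No sampler, model, or random seed is involved.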
The edge case the tests are really about
What should happen when the model emits EOS exactly at the max_tokens boundary? Both
conditions are true simultaneously. Look at mini_vllm/request.py::maybe_finish: the EOS
check runs first, so the request reports "stop". That ordering is a deliberate,
user-visible API decision, not an accident of code layout: "stop" tells the consumer "the
answer is complete"; "length" tells them "the answer was cut off — maybe retry with a
bigger budget." On the boundary, the answer is complete — reporting "length" would
invite pointless retries (and with auto-retrying clients, real money). Real vLLM resolves
the tie the same way.
test_eos_on_the_boundary_reports_stop pins this. If someone "tidies up" maybe_finish by
reordering the checks, that test fails — which is the whole job of a test like that: turning
an invisible design decision into a tripwire. Notice the meta-lesson: whenever a function
checks two conditions that can be true at once, the order is an API. Grep any engine
you maintain for such pairs; most of them are untested.
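To see why the order is an API, it helps to put both orderings side by side. The two functions below are hypothetical miniatures written for this comparison, not copies of mini_vllm or vLLM code. They contain the same two checks and differ only in which runs first.

```python
from typing import List, Optional

EOS = 256


def reason_eos_first(out: List[int], max_tokens: int) -> Optional[str]:
    # The ordering the lab describes: EOS wins a tie.
    if out and out[-1] == EOS:
        return "stop"
    if len(out) >= max_tokens:
        return "length"
    return None


def reason_length_first(out: List[int], max_tokens: int) -> Optional[str]:
    # The "tidied up" variant: same two checks, swapped.
    if len(out) >= max_tokens:
        return "length"
    if out and out[-1] == EOS:
        return "stop"
    return None


off_boundary = [10, EOS]     # EOS arrives before the cap
boundary = [10, 20, EOS]     # EOS is token number max_tokens

# Away from the boundary the two orderings are indistinguishable...
assert reason_eos_first(off_boundary, 3) == reason_length_first(off_boundary, 3) == "stop"

# ...and only the tie exposes the difference.
assert reason_eos_first(boundary, 3) == "stop"
assert reason_length_first(boundary, 3) == "length"
```

Every test except the boundary one passes for both variants, which is why the tie needs its own test.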
What the tests prove
| Test | What it pins |
|---|---|
| test_eos_stops_generation | Tokens after EOS are never generated; the stream truly halts |
| test_ignore_eos_runs_to_length_cap | ignore_eos neutralizes check #1 only; the cap still binds |
| test_no_eos_hits_length_cap | The cap fires at exactly max_tokens, not ±1 |
| test_eos_on_the_boundary_reports_stop | The tie-break above |
| test_unfinished_request_has_no_reason | WAITING/RUNNING → None: a streaming response must not carry a finish_reason until the end |
| test_engine_reports_length_with_ignore_eos | Your mapper agrees with the engine's real loop, end to end |
How this maps to the real engine
- upstream/vllm/v1/request.py: RequestStatus and get_finished_reason(), the same mapping you wrote, plus FINISHED_ABORTED → "abort". Note upstream encodes "is finished" as an ordering on the enum (status > PREEMPTED); mini_vllm copies that trick, which is why the enum's declaration order is load-bearing in both. (A reordered enum constant breaking is_finished is exactly the kind of PR a maintainer learns to catch on sight.)
- upstream/vllm/v1/engine/output_processor.py: where statuses become the finish_reason strings in API responses, including for streaming (sent only on the final chunk; your None-until-finished mapping is what makes that correct).
- The real engine checks more stops than these two: stop strings (must be detected on detokenized text, which means stop handling interacts with the detokenizer's streaming buffer, a genuinely tricky area), stop_token_ids (per-request custom EOS lists), and min_tokens (suppress EOS before a floor, the mirror image of ignore_eos). Each is the same shape you built: a predicate over the request's tail, checked in a defined order, in update_from_output. When you read that upstream code now, it will parse as "lab-05, four more times."
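The enum-ordering trick mentioned for request.py is easy to reproduce in isolation. The sketch below is a simplified stand-in: the member list is shortened for the example (the upstream enum has more states), and only the status > PREEMPTED comparison and the three reason strings are taken from the text above.

```python
import enum
from typing import Optional


class RequestStatus(enum.IntEnum):
    # Declaration order is load-bearing: every member declared after
    # PREEMPTED is treated as a finished state by is_finished().
    WAITING = enum.auto()
    RUNNING = enum.auto()
    PREEMPTED = enum.auto()
    FINISHED_STOPPED = enum.auto()
    FINISHED_LENGTH = enum.auto()
    FINISHED_ABORTED = enum.auto()

    @staticmethod
    def is_finished(status: "RequestStatus") -> bool:
        return status > RequestStatus.PREEMPTED


_REASONS = {
    RequestStatus.FINISHED_STOPPED: "stop",
    RequestStatus.FINISHED_LENGTH: "length",
    RequestStatus.FINISHED_ABORTED: "abort",
}


def get_finished_reason(status: RequestStatus) -> Optional[str]:
    """Return the API string for a finished status, None otherwise."""
    return _REASONS.get(status)


assert not RequestStatus.is_finished(RequestStatus.RUNNING)
assert RequestStatus.is_finished(RequestStatus.FINISHED_LENGTH)
assert get_finished_reason(RequestStatus.WAITING) is None
```

Moving a non-finished member such as RUNNING below PREEMPTED in this class would make is_finished return True for it without any error, which is the failure mode the maintainer note warns about.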
Hitchhiker's notes
- EOS consumes a token of budget. In mini_vllm (and in token accounting generally) the EOS lands in output_token_ids: your test_eos_stops_generation result was [10, 20, EOS], three tokens spent. APIs differ on whether the EOS is shown (vLLM strips it from text but it exists in the token count). If you've ever wondered why an API bills N+1 tokens for an N-token answer, this is why.
- max_tokens counts output, not total. Prompt length lives in a different limit (max_model_len, which caps prompt + output together). Conflating the two produces classic admission bugs: a request with a 4000-token prompt and max_tokens=200 needs 4200 tokens of headroom, and the real scheduler must reserve for the worst case, not the average.
- Greedy + a real model can loop forever ("the the the…"), and without a length cap that request never finishes, never frees its KV, and slowly strangles the server. The cap isn't a UX nicety; it's the engine's guarantee that every admission terminates. Treat any proposal of "unlimited max_tokens" as what it is: a resource-leak feature request.
- Sampling parameters can make EOS unreachable in subtler ways than ignore_eos: a logit_bias of −∞ on the EOS id, or min_tokens while the output is still below that floor. The stop machinery composes with the sampler (Phase 9); when stops "mysteriously" don't fire, the sampler is suspect #1.
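The headroom arithmetic from the max_tokens note fits in two small helpers. Both function names are hypothetical and written only for this illustration; they are not vLLM or mini_vllm APIs, and a real scheduler's admission logic involves more than this single inequality.

```python
def fits_context(prompt_len: int, max_tokens: int, max_model_len: int) -> bool:
    """Worst case: the request spends its entire output budget."""
    return prompt_len + max_tokens <= max_model_len


def output_budget(prompt_len: int, max_tokens: int, max_model_len: int) -> int:
    """Output tokens that can actually be produced before the context is full."""
    return max(0, min(max_tokens, max_model_len - prompt_len))


# The example from the note: 4000 prompt tokens + 200 output tokens.
assert fits_context(4000, 200, max_model_len=4200)
assert not fits_context(4000, 200, max_model_len=4096)

# With only 4096 tokens of context, 96 output tokens remain.
assert output_budget(4000, 200, max_model_len=4096) == 96
```

Checking prompt_len alone against max_model_len, or max_tokens alone, would admit the 4096-token case and fail later, mid-generation.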
Going further
- Add stop-string support to run_until_stop: decode the accumulated output with ByteTokenizer after each token and halt when a given string appears. You'll immediately hit the real-world wrinkle: the stop string can straddle a token boundary, so you must check a sliding window of recent text, not just the newest fragment. Now read how upstream solves it (search stop in output_processor.py) and admire the buffering.
- Implement min_tokens: suppress the EOS check while num_output_tokens < min_tokens. One line. Then write the boundary test for it (EOS exactly at min_tokens); you know the drill now.
- In real vLLM, run a chat model and print finish_reason for: a normal question, the same with max_tokens=5, and the same with ignore_eos=True. Watch "stop", "length", "length" come back: your three paths, on production silicon.
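For the stop-string exercise, the sliding-window check can be sketched on already-decoded text fragments. This is a starting point under stated assumptions, not the upstream algorithm: find_stop is a hypothetical name, the fragments are plain strings instead of ByteTokenizer output, and trimming the stop string from the returned text is a choice made for this example.

```python
from typing import Iterable, Optional, Tuple


def find_stop(fragments: Iterable[str], stop: str) -> Tuple[str, Optional[str]]:
    """Accumulate decoded fragments; halt at the first occurrence of `stop`.

    Returns (text_before_stop, "stop"), or (all_text, None) if it never appears.
    """
    if not stop:
        raise ValueError("stop string must be non-empty")
    text = ""
    for frag in fragments:
        # A new match must end inside the fresh fragment, so it can begin at
        # most len(stop) - 1 characters before the old end of the text.
        start = max(0, len(text) - (len(stop) - 1))
        text += frag
        hit = text.find(stop, start)
        if hit != -1:
            return text[:hit], "stop"
    return text, None


# The stop string "\nUser:" is split across two fragments.
pieces = ["Hel", "lo\nUs", "er: hi"]
assert find_stop(pieces, "\nUser:") == ("Hello", "stop")

# Checking only the newest fragment would miss it.
assert not any("\nUser:" in p for p in pieces)
```

Searching from start instead of from index 0 keeps the per-token cost proportional to the fragment and stop-string lengths, not to the whole output so far.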
References
- mini_vllm/request.py: maybe_finish(), the eight lines this lab is about.
- upstream/vllm/v1/request.py: RequestStatus.get_finished_reason.
- upstream/vllm/v1/engine/output_processor.py: stop strings, streaming finish_reason.
- OpenAI API reference, Chat Completions: the finish_reason contract your mapper implements: https://platform.openai.com/docs/api-reference/chat/object
- vLLM docs, Sampling Parameters: stop, stop_token_ids, min_tokens, ignore_eos: https://docs.vllm.ai/en/latest/api/inference_params.html