Lab 01-05 — Stop Conditions & Finish Reasons [CPU-OK]

Every request dies. The only questions are when and what we tell the user about it. This lab dissects the engine's stop machinery: the few lines of update_from_output that decide whether a generation halts on the model's own EOS token or on the operator's max_tokens cap, and the mapping from internal status to the finish_reason field that OpenAI-compatible API clients branch on.

It looks small. It is small. It is also the part of the engine with the highest bug-impact-to-code-size ratio: an off-by-one or a mis-ordered check here doesn't crash — it silently truncates answers, or streams one token too many, for every user, forever.

Why this lab exists

Ask anyone who's run an LLM API in production what their most common user-facing bug report is. It won't be a crash. It will be: "the answer just… cuts off." Triaging that report requires knowing exactly what you'll know after this lab: was it finish_reason: "length" (the operator's cap — raise max_tokens), "stop" (the model chose to end — a prompting issue), or a stream that died without a reason (an actual bug)? The distinction is three enum values and two if statements, and entire support rotations have burned days for lack of it.

There's an engineering lesson too. Stop handling is where model behavior (EOS is just a token the model can emit, with a probability like any other) meets system policy (max_tokens is an admission-control and billing boundary). Keeping those two cleanly separated — and correctly ordered — is a miniature of the whole serving-systems discipline.

Background: the three ways a request ends

  1. The model stops itself by sampling the EOS (end-of-sequence) token. EOS is not magic: it's a vocabulary entry (id 256 in mini_vllm's ByteTokenizer; id 2 for Llama 1 and 2; <|endoftext|> = 50256 for GPT-2) that the model learned to emit when a response is complete. The engine checks "was the token just appended the EOS?" and if so marks FINISHED_STOPPED → API finish_reason: "stop". A well-behaved model ends most chat turns this way.
  2. The operator stops it: num_output_tokens >= max_tokens. Marked FINISHED_LENGTH → API finish_reason: "length". To an API consumer this usually means "your answer was truncated; consider a bigger budget." To the operator it's the lever that bounds worst-case cost and KV occupancy per request; schedulers need a worst case to exist (that is the premise of the deadlock argument coming in Phase 3 lab-04).
  3. Someone aborts it — client disconnect, admin action. Real vLLM has FINISHED_ABORTED for this; mini_vllm omits it (no clients to disconnect). Worth knowing it exists: cancellation is a first-class lifecycle path in production, and "KV freed on abort" is a real invariant people have broken.

And one anti-way that trips newcomers: ignore_eos=True (used throughout this course's tests, and by every serious benchmark) disables check #1, so generation always runs to the cap. Why would anyone want a model to blow through its own stop sign? Benchmarking. If you're measuring tokens/sec, you need every request to produce a known, fixed number of tokens regardless of what the model "wants" to say. The flag exists for load generators, not users — and you've been benefiting from it since lab-01 without noticing: it's what made your traces deterministic in length.
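The two checks and the ignore_eos switch can be sketched as one small predicate. This is a minimal illustration, not mini_vllm's code: stop_reason is a hypothetical helper, and only the EOS id (256) is taken from the text above.

```python
from typing import Optional

EOS = 256  # EOS id in mini_vllm's ByteTokenizer, per the text above


def stop_reason(last_token: int, num_output_tokens: int,
                max_tokens: int, ignore_eos: bool = False) -> Optional[str]:
    """Hypothetical helper: return "stop", "length", or None (keep going)."""
    # Check 1: the model's own stop sign. ignore_eos switches it off.
    if not ignore_eos and last_token == EOS:
        return "stop"
    # Check 2: the operator's cap. Nothing switches this off.
    if num_output_tokens >= max_tokens:
        return "length"
    return None


# EOS as the 2nd of at most 4 tokens:
print(stop_reason(EOS, 2, max_tokens=4))                   # stop
print(stop_reason(EOS, 2, max_tokens=4, ignore_eos=True))  # None
# With ignore_eos, only the cap can end the request:
print(stop_reason(EOS, 4, max_tokens=4, ignore_eos=True))  # length
```

Note that ignore_eos removes a stop condition but never adds one, which is why a benchmark can rely on every request producing exactly max_tokens tokens.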

Files

  • starter.py — implement finish_reason (status → API string) and run_until_stop (the feed-tokens-until-something-fires simulation of the update stage). Your work.
  • solution.py — reference.
  • test_lab.py — the EOS path, the ignore_eos path, the length path, the boundary tie, the unfinished case, and an end-to-end engine check.

Run

LAB_IMPL=starter pytest phase-01-architecture-and-request-lifecycle/labs/lab-05-stop-conditions -q
pytest phase-01-architecture-and-request-lifecycle/labs/lab-05-stop-conditions -q   # reference

What to implement

Two functions. finish_reason(request) is the status-to-API translation table. run_until_stop(token_stream, eos_token_id, sampling_params) replays the engine's update stage with pre-decided tokens: append one, run maybe_finish(), break if it fired. Using a scripted token stream instead of a sampler is the trick that makes stop logic exhaustively testable — you can place an EOS at any position you like, including exactly on the max_tokens boundary, something you could wait a long time for a sampler to do for you. (This is also how you should test stop-sequence handling upstream: script the stream, pin the behavior.)
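To show why a scripted stream makes stop logic testable, here is a rough sketch under stated assumptions. ToyParams, run_until_stop_sketch and scripted are hypothetical names, the return shape is invented for illustration, and none of this is the lab's reference solution. The recording generator lets the caller assert that nothing after the stopping token was ever pulled.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, Tuple

EOS = 256  # EOS id in mini_vllm's ByteTokenizer, per the text


@dataclass
class ToyParams:
    # Hypothetical stand-in for the lab's sampling_params argument.
    max_tokens: int
    ignore_eos: bool = False


def run_until_stop_sketch(token_stream: Iterable[int], eos_token_id: int,
                          params: ToyParams) -> Tuple[List[int], Optional[str]]:
    """Append one scripted token at a time; stop when a check fires."""
    out: List[int] = []
    for tok in token_stream:
        out.append(tok)
        if not params.ignore_eos and tok == eos_token_id:
            return out, "stop"
        if len(out) >= params.max_tokens:
            return out, "length"
    return out, None  # the script ran dry before any stop fired


def scripted(tokens: List[int], pulled: List[int]) -> Iterator[int]:
    """Yield scripted tokens, recording each one the loop actually pulls."""
    for t in tokens:
        pulled.append(t)
        yield t


pulled: List[int] = []
out, reason = run_until_stop_sketch(
    scripted([10, 20, EOS, 99], pulled), EOS, ToyParams(max_tokens=8))
print(out, reason)  # [10, 20, 256] stop
print(pulled)       # [10, 20, 256]: token 99 was never requested
```

Placing the EOS at index max_tokens - 1 in the script is all it takes to hit the boundary case on demand.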

The edge case the tests are really about

What should happen when the model emits EOS exactly at the max_tokens boundary? Both conditions are true simultaneously. Look at mini_vllm/request.py::maybe_finish: the EOS check runs first, so the request reports "stop". That ordering is a deliberate, user-visible API decision, not an accident of code layout: "stop" tells the consumer "the answer is complete"; "length" tells them "the answer was cut off — maybe retry with a bigger budget." On the boundary, the answer is complete — reporting "length" would invite pointless retries (and with auto-retrying clients, real money). Real vLLM resolves the tie the same way.

test_eos_on_the_boundary_reports_stop pins this. If someone "tidies up" maybe_finish by reordering the checks, that test fails — which is the whole job of a test like that: turning an invisible design decision into a tripwire. Notice the meta-lesson: whenever a function checks two conditions that can be true at once, the order is an API. Grep any engine you maintain for such pairs; most of them are untested.
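The "order is an API" point can be made concrete by treating the checks as an ordered list and comparing two orderings. This is an illustrative sketch only: the Check tuples and first_reason are hypothetical, and the real maybe_finish is written as plain if statements.

```python
from typing import Callable, List, Optional, Tuple

EOS = 256  # EOS id in mini_vllm's ByteTokenizer, per the text

# Each check is (finish_reason, predicate over (last_token, n_out, max_tokens)).
Check = Tuple[str, Callable[[int, int, int], bool]]

is_eos: Check = ("stop", lambda tok, n, cap: tok == EOS)
hit_cap: Check = ("length", lambda tok, n, cap: n >= cap)


def first_reason(checks: List[Check], tok: int, n: int, cap: int) -> Optional[str]:
    """Return the reason of the first check that fires, in list order."""
    for reason, pred in checks:
        if pred(tok, n, cap):
            return reason
    return None


eos_first = [is_eos, hit_cap]     # the order the text describes
length_first = [hit_cap, is_eos]  # the "tidied up" reordering

# Off the boundary the two orders are indistinguishable.
assert first_reason(eos_first, EOS, 2, 4) == first_reason(length_first, EOS, 2, 4) == "stop"
assert first_reason(eos_first, 10, 4, 4) == first_reason(length_first, 10, 4, 4) == "length"

# Only the tie exposes the difference.
print(first_reason(eos_first, EOS, 4, 4))     # stop
print(first_reason(length_first, EOS, 4, 4))  # length
```

Every non-boundary input passes under both orderings, which is why a suite without an explicit tie test cannot catch the reordering.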

What the tests prove

| Test | What it pins |
| --- | --- |
| test_eos_stops_generation | Tokens after EOS are never generated; the stream truly halts |
| test_ignore_eos_runs_to_length_cap | ignore_eos neutralizes check #1 only; the cap still binds |
| test_no_eos_hits_length_cap | The cap fires at exactly max_tokens, not ±1 |
| test_eos_on_the_boundary_reports_stop | The tie-break above |
| test_unfinished_request_has_no_reason | WAITING/RUNNING → None: a streaming response must not carry a finish_reason until the end |
| test_engine_reports_length_with_ignore_eos | Your mapper agrees with the engine's real loop, end to end |

How this maps to the real engine

  • upstream/vllm/v1/request.py: RequestStatus and get_finished_reason(), the same mapping you wrote, plus FINISHED_ABORTED → "abort". Note upstream encodes "is finished" as an ordering on the enum (status > PREEMPTED). mini_vllm copies that trick, which is why the enum's declaration order is load-bearing in both. (A reordered enum constant breaking is_finished is exactly the kind of PR a maintainer learns to catch on sight.)
  • upstream/vllm/v1/engine/output_processor.py — where statuses become the finish_reason strings in API responses, including for streaming (sent only on the final chunk — your None-until-finished mapping is what makes that correct).
  • The real engine checks more stops than these two: stop strings (must be detected on detokenized text, which means stop handling interacts with the detokenizer's streaming buffer, a genuinely tricky area), stop_token_ids (per-request custom EOS lists), and min_tokens (suppress EOS before a floor, the mirror image of ignore_eos). Each is the same shape you built: a predicate over the request's tail, checked in a defined order, in update_from_output. When you read that upstream code now, it will parse as "lab-05, three more times."
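The enum-ordering trick from the first bullet can be sketched as follows. ToyStatus is a simplified stand-in, not a copy of either codebase: the member names follow the statuses mentioned in this lab, and the real enums may contain more members.

```python
from enum import IntEnum, auto
from typing import Optional


class ToyStatus(IntEnum):
    # Declaration order is load-bearing: every finished state must come
    # after PREEMPTED, because is_finished() is a plain comparison.
    WAITING = auto()
    RUNNING = auto()
    PREEMPTED = auto()
    FINISHED_STOPPED = auto()
    FINISHED_LENGTH = auto()
    FINISHED_ABORTED = auto()


def is_finished(status: ToyStatus) -> bool:
    return status > ToyStatus.PREEMPTED


_REASONS = {
    ToyStatus.FINISHED_STOPPED: "stop",
    ToyStatus.FINISHED_LENGTH: "length",
    ToyStatus.FINISHED_ABORTED: "abort",
}


def finish_reason(status: ToyStatus) -> Optional[str]:
    """Unfinished statuses map to None, so streaming chunks carry no reason."""
    return _REASONS.get(status)


class Reordered(IntEnum):
    # A "harmless" cleanup moved a finished state above PREEMPTED.
    WAITING = auto()
    FINISHED_STOPPED = auto()
    RUNNING = auto()
    PREEMPTED = auto()


print(is_finished(ToyStatus.FINISHED_LENGTH))           # True
print(finish_reason(ToyStatus.RUNNING))                 # None
print(Reordered.FINISHED_STOPPED > Reordered.PREEMPTED) # False: the comparison now lies
```

The comparison is cheap and needs no per-status bookkeeping, at the cost of making declaration order part of the contract.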

Hitchhiker's notes

  • EOS consumes a token of budget. In mini_vllm (and in token accounting generally) the EOS lands in output_token_ids — your test_eos_stops_generation result was [10, 20, EOS], three tokens spent. APIs differ on whether the EOS is shown (vLLM strips it from text but it exists in the token count). If you've ever wondered why an API bills N+1 tokens for an N-token answer — this is why.
  • max_tokens counts output, not total. Prompt length lives in a different limit (max_model_len, which caps prompt + output together). Conflating the two produces classic admission bugs: a request with a 4000-token prompt and max_tokens=200 needs 4200 tokens of headroom, and the real scheduler must reserve for the worst case, not the average.
  • Greedy + a real model can loop forever ("the the the…") — and without a length cap, that request never finishes, never frees its KV, and slowly strangles the server. The cap isn't a UX nicety; it's the engine's guarantee that every admission terminates. Treat any proposal of "unlimited max_tokens" as what it is: a resource-leak feature request.
  • Sampling parameters can make EOS unreachable in subtler ways than ignore_eos: a logit_bias of −∞ on the EOS id, or a min_tokens floor that has not been reached yet. The stop machinery composes with the sampler (Phase 9); when stops "mysteriously" don't fire, the sampler is suspect #1.
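The headroom arithmetic from the max_tokens note can be written out as a tiny check. This is a sketch of the accounting only: worst_case_tokens, admissible and conflated_check are hypothetical helpers, and it makes no claim about whether a real scheduler rejects or clamps such a request.

```python
def worst_case_tokens(prompt_len: int, max_tokens: int) -> int:
    """Tokens the request may occupy if it runs all the way to its cap."""
    return prompt_len + max_tokens


def admissible(prompt_len: int, max_tokens: int, max_model_len: int) -> bool:
    """Correct check: prompt and output share one max_model_len budget."""
    return worst_case_tokens(prompt_len, max_tokens) <= max_model_len


def conflated_check(prompt_len: int, max_tokens: int, max_model_len: int) -> bool:
    """The classic bug: comparing only the output cap against the limit."""
    return max_tokens <= max_model_len


print(worst_case_tokens(4000, 200))      # 4200
print(admissible(4000, 200, 4096))       # False: 4200 tokens do not fit
print(conflated_check(4000, 200, 4096))  # True: the bug admits it anyway
print(admissible(4000, 200, 4200))       # True
```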

Going further

  • Add stop-string support to run_until_stop: decode the accumulated output with ByteTokenizer after each token and halt when a given string appears. You'll immediately hit the real-world wrinkle: the stop string can straddle a token boundary, so you must check a sliding window of recent text, not just the newest fragment. Now read how upstream solves it (search stop in output_processor.py) and admire the buffering.
  • Implement min_tokens: suppress the EOS check while num_output_tokens < min_tokens. One line. Then write the boundary test for it (EOS exactly at min_tokens) — you know the drill now.
  • In real vLLM, run a chat model and print finish_reason for: a normal question, the same with max_tokens=5, and the same with ignore_eos=True. Watch "stop", "length", "length" come back — your three paths, on production silicon.
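As a starting point for the stop-string exercise, the sliding-window idea might look like the sketch below. It assumes token ids 0..255 are raw bytes (a guess at how ByteTokenizer lays out its vocabulary), uses a hypothetical function name and return shape, and leaves the stop string in the output. It is not how upstream implements it.

```python
from typing import Iterable, List, Optional, Tuple


def run_with_stop_string(token_stream: Iterable[int], stop: str,
                         max_tokens: int) -> Tuple[List[int], Optional[str]]:
    """Halt when `stop` appears in the decoded tail, or at the length cap."""
    stop_bytes = stop.encode("utf-8")
    if not stop_bytes:
        raise ValueError("stop string must be non-empty")
    keep = len(stop_bytes) - 1  # bytes that could still begin a future match
    window = b""
    out: List[int] = []
    for tok in token_stream:
        out.append(tok)
        window += bytes([tok])  # assumption: ids 0..255 are raw bytes
        # The stop check runs first, mirroring the EOS-before-length order.
        if stop_bytes in window:
            return out, "stop"
        if len(out) >= max_tokens:
            return out, "length"
        # Slide the window; for a 1-byte stop string nothing needs keeping.
        window = window[-keep:] if keep > 0 else b""
    return out, None


tokens, reason = run_with_stop_string(list(b"hello END world"), "END", 64)
print(bytes(tokens), reason)  # b'hello END' stop
```

Because the window never grows past len(stop) bytes, the per-token cost stays constant however long the output gets. Whether to trim the stop string from the returned text is a separate decision this sketch does not make.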

References