Lab 09-03 — Beam Search: When Greedy Is a Trap `[CPU-OK]`

Greedy decoding answers the wrong question. It maximizes each token; what you usually want is the most probable sequence — and those diverge exactly when the locally best token leads somewhere bad. Linguists call it a garden path; you'll build one: a tiny Markov model where greedy confidently takes the 0.6-probability first step into a coin flip (joint ≈ 0.31), while the humble 0.4 step leads to near-certainty (joint 0.36). Beam search — carrying the best beam_width partial hypotheses instead of one — escapes the trap, and you'll implement it properly: the candidate pool, the pruning, and the EOS-finishes-a-beam bookkeeping that real implementations get subtly wrong.

Why this lab exists
Background: search, not sampling
Files
Run
What the tests prove
Hitchhiker's notes
Going further
References

Why this lab exists

Beam search occupies a strange place in modern LLM serving: central to the API surface (use_beam_search exists, translation/summarization workloads still ask for it), algorithmically classic — and architecturally awkward for engines like vLLM, awkward enough that V1 moved it out of the core engine entirely (it's emulated at the API layer via n parallel candidates plus rescoring-style logic, precisely because per-step branch-and-prune fights the continuous-batching machine — see the notes). You can't evaluate that design decision without knowing exactly what the algorithm requires, and the way to know is to build it: the pooled expansion, the global top-beam_width cut, the finished-set handling.

The deeper lesson is the trap itself. "Greedy is myopic" is a sentence; your TRAP fixture is a proof object — four transition probabilities that make the failure exact and checkable (0.6·0.51 < 0.4·0.9). Once you've built one garden path you'll recognize the pattern everywhere it matters: why beam search wins on tasks with constrained correct answers (translation), why it hurts open-ended generation (it finds high-probability degenerate text — the famous repetition pathology), and why sampling (labs 01/04) took over for chat.

Background: search, not sampling

Everything else in this phase draws from a distribution; beam search optimizes over one. State: a set of partial hypotheses with their cumulative log-probabilities. Per step, each live beam expands by its top beam_width tokens (more children can't help — at most beam_width survive the global cut), all candidates pool, the best beam_width survive. Two details carry the correctness:

Scores are summed log-probs — products of probabilities underflow within a sentence; logs are not an optimization but a necessity.
EOS finishes a beam: a hypothesis that emits EOS leaves the live set (extending past EOS is meaningless) but stays in the final ranking against hypotheses that kept going. Forget this and short, confident answers are silently discarded — test_eos_finishes_a_beam pins it.

Width 1 collapses to greedy exactly (test_beam_width_one_is_greedy) — beam search is a strict generalization, and the test fixture deliberately avoids probability ties, because a tie tests your tie-breaker, not your algorithm (a lesson this lab's own test suite learned the hard way; see the comment in test_lab.py).

Files

starter.py — greedy_decode and beam_search (pool → prune → finish). Your work.
solution.py — reference.
test_lab.py — the trap (greedy falls in, beam escapes), the width-1 identity, monotonicity in width, and EOS handling.

Run

LAB_IMPL=starter pytest phase-09-sampling-and-decoding-algorithms/labs/lab-03-beam-search -q
pytest phase-09-sampling-and-decoding-algorithms/labs/lab-03-beam-search -q   # reference

What the tests prove

Test	What it pins
`test_greedy_takes_the_garden_path`	Greedy picks A (0.6) and lands at joint ≈ 0.31 — locally best, globally wrong, by construction
`test_beam_escapes_the_trap`	Width 2 finds [B, C] at 0.36, strictly beating greedy's joint — the algorithm's entire reason to exist, as an inequality
`test_beam_width_one_is_greedy`	The degenerate case: same tokens, same score to 1e-12
`test_wider_beams_never_score_worse`	Monotonicity: more width can't hurt the best score (it can only widen the searched set). Useful as a sanity property — and note what it does not say: that wider is better text (see the degeneration note)
`test_eos_finishes_a_beam`	A finished beam survives un-extended and wins the final ranking at its natural length

Hitchhiker's notes

Why beam search fights continuous batching: a beam step branches (one hypothesis becomes several sharing a prefix) and prunes (hypotheses die mid-flight). Branching wants copy-on-write KV sharing (the prefix is common — Phase 2's ref_cnt machinery handles the read side, but the diverging tails need careful block forking); pruning frees KV at unpredictable times. All solvable — V0 solved it — but it threads special cases through scheduler and cache for a feature few use, which is why V1 evicted it to the API layer (vllm/beam_search.py + the OpenAI server's emulation): the engine serves beam_width parallel sequences, the wrapper does the pool-and-prune. An instructive case study in what belongs in the core.
Length bias is real and the fix is a hack that works: summed log-probs penalize length (every token adds a negative number), so beams favor short answers; production systems divide scores by len^α (α ≈ 0.6–1.0, "length normalization") before the final ranking. Your EOS test dodges this (the short answer is also the most probable) — add normalization in Going Further and construct the case where it flips the winner.
The degeneration result (Holtzman et al. — the same paper that gave you top-p in lab-01): for open-ended text, exact high-probability sequences are repetitive and dull; beam search finds them, and quality drops as width grows. Probability and quality diverge — arguably the single most important empirical fact about decoding. Beam search survives where the output is tightly constrained (translation, ASR, structured rewriting); sampling owns everything open-ended.
Beam search is the third member of a family you've now built: greedy (argmax), sampling (draw), beam (search). They share logits and differ in the decision rule — which is why vLLM's logits-processor pipeline (lab-01) sits upstream of all three, and why grammar masking (Phase 12) composes with each of them unchanged.

Going further

Add length normalization (score / len(tokens)**alpha) to the final ranking and build a fixture where α = 0 and α = 1 disagree about the winner. You've reproduced the knob every production beam implementation exposes.
Track prefix sharing among your beams: at each step, count how many tokens of KV a real engine would share via Phase 2's blocks (common prefix length × beams). The number is large — that's the efficiency beam search loses when emulated naively as independent sequences without prefix caching, and exactly what enable_prefix_caching recovers (lab-02's mechanism, applied to beams).
Implement diverse beam search (penalize candidates already chosen by sibling groups) and watch the trap fixture: diversity trades best-score for coverage — measure both.

References

upstream/vllm/beam_search.py and the OpenAI server's beam emulation — the V1 design decision discussed above.
Holtzman et al., The Curious Case of Neural Text Degeneration (2019) — why exact search loses to sampling for open-ended text: https://arxiv.org/abs/1904.09751
Wu et al., Google's Neural Machine Translation System (2016) — §7, the length normalization formula everyone copied: https://arxiv.org/abs/1609.08144
Lab-01 — the logits pipeline all decision rules share; lab-02 — the prefix-sharing machinery beams want.

vLLM Mastery — From Zero to Maintainer