Lab 09-03 — Beam Search: When Greedy Is a Trap [CPU-OK]
Greedy decoding answers the wrong question. It maximizes each token; what you usually
want is the most probable sequence — and those diverge exactly when the locally best
token leads somewhere bad. Linguists call it a garden path; you'll build one: a tiny
Markov model where greedy confidently takes the 0.6-probability first step into a coin
flip (joint ≈ 0.31), while the humble 0.4 step leads to near-certainty (joint 0.36).
Beam search — carrying the best beam_width partial hypotheses instead of one — escapes
the trap, and you'll implement it properly: the candidate pool, the pruning, and the
EOS-finishes-a-beam bookkeeping that real implementations get subtly wrong.
Contents
- Why this lab exists
- Background: search, not sampling
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Beam search occupies a strange place in modern LLM serving: central to the API surface
(use_beam_search exists, translation/summarization workloads still ask for it),
algorithmically classic — and architecturally awkward for engines like vLLM, awkward
enough that V1 moved it out of the core engine entirely (it's emulated at the API layer
via n parallel candidates plus rescoring-style logic, precisely because per-step
branch-and-prune fights the continuous-batching machine — see the notes). You can't
evaluate that design decision without knowing exactly what the algorithm requires, and
the way to know is to build it: the pooled expansion, the global top-beam_width cut,
the finished-set handling.
The deeper lesson is the trap itself. "Greedy is myopic" is a sentence; your TRAP
fixture is a proof object — four transition probabilities that make the failure exact
and checkable (0.6·0.51 < 0.4·0.9). Once you've built one garden path you'll
recognize the pattern everywhere it matters: why beam search wins on tasks with
constrained correct answers (translation), why it hurts open-ended generation (it
finds high-probability degenerate text — the famous repetition pathology), and why
sampling (labs 01/04) took over for chat.
Background: search, not sampling
Everything else in this phase draws from a distribution; beam search optimizes over
one. State: a set of partial hypotheses with their cumulative log-probabilities. Per
step, each live beam expands by its top beam_width tokens (more children can't help —
at most beam_width survive the global cut), all candidates pool, the best
beam_width survive. Two details carry the correctness:
- Scores are summed log-probs — products of probabilities underflow within a sentence; logs are not an optimization but a necessity.
- EOS finishes a beam: a hypothesis that emits EOS leaves the live set (extending
past EOS is meaningless) but stays in the final ranking against hypotheses that
kept going. Forget this and short, confident answers are silently discarded —
test_eos_finishes_a_beampins it.
Width 1 collapses to greedy exactly (test_beam_width_one_is_greedy) — beam search is
a strict generalization, and the test fixture deliberately avoids probability ties,
because a tie tests your tie-breaker, not your algorithm (a lesson this lab's own test
suite learned the hard way; see the comment in test_lab.py).
Files
starter.py—greedy_decodeandbeam_search(pool → prune → finish). Your work.solution.py— reference.test_lab.py— the trap (greedy falls in, beam escapes), the width-1 identity, monotonicity in width, and EOS handling.
Run
LAB_IMPL=starter pytest phase-09-sampling-and-decoding-algorithms/labs/lab-03-beam-search -q
pytest phase-09-sampling-and-decoding-algorithms/labs/lab-03-beam-search -q # reference
What the tests prove
| Test | What it pins |
|---|---|
test_greedy_takes_the_garden_path | Greedy picks A (0.6) and lands at joint ≈ 0.31 — locally best, globally wrong, by construction |
test_beam_escapes_the_trap | Width 2 finds [B, C] at 0.36, strictly beating greedy's joint — the algorithm's entire reason to exist, as an inequality |
test_beam_width_one_is_greedy | The degenerate case: same tokens, same score to 1e-12 |
test_wider_beams_never_score_worse | Monotonicity: more width can't hurt the best score (it can only widen the searched set). Useful as a sanity property — and note what it does not say: that wider is better text (see the degeneration note) |
test_eos_finishes_a_beam | A finished beam survives un-extended and wins the final ranking at its natural length |
Hitchhiker's notes
- Why beam search fights continuous batching: a beam step branches (one hypothesis
becomes several sharing a prefix) and prunes (hypotheses die mid-flight). Branching
wants copy-on-write KV sharing (the prefix is common — Phase 2's
ref_cntmachinery handles the read side, but the diverging tails need careful block forking); pruning frees KV at unpredictable times. All solvable — V0 solved it — but it threads special cases through scheduler and cache for a feature few use, which is why V1 evicted it to the API layer (vllm/beam_search.py+ the OpenAI server's emulation): the engine servesbeam_widthparallel sequences, the wrapper does the pool-and-prune. An instructive case study in what belongs in the core. - Length bias is real and the fix is a hack that works: summed log-probs penalize
length (every token adds a negative number), so beams favor short answers; production
systems divide scores by
len^α(α ≈ 0.6–1.0, "length normalization") before the final ranking. Your EOS test dodges this (the short answer is also the most probable) — add normalization in Going Further and construct the case where it flips the winner. - The degeneration result (Holtzman et al. — the same paper that gave you top-p in lab-01): for open-ended text, exact high-probability sequences are repetitive and dull; beam search finds them, and quality drops as width grows. Probability and quality diverge — arguably the single most important empirical fact about decoding. Beam search survives where the output is tightly constrained (translation, ASR, structured rewriting); sampling owns everything open-ended.
- Beam search is the third member of a family you've now built: greedy (argmax), sampling (draw), beam (search). They share logits and differ in the decision rule — which is why vLLM's logits-processor pipeline (lab-01) sits upstream of all three, and why grammar masking (Phase 12) composes with each of them unchanged.
Going further
- Add length normalization (
score / len(tokens)**alpha) to the final ranking and build a fixture where α = 0 and α = 1 disagree about the winner. You've reproduced the knob every production beam implementation exposes. - Track prefix sharing among your beams: at each step, count how many tokens of KV a
real engine would share via Phase 2's blocks (common prefix length × beams). The
number is large — that's the efficiency beam search loses when emulated naively as
independent sequences without prefix caching, and exactly what
enable_prefix_cachingrecovers (lab-02's mechanism, applied to beams). - Implement diverse beam search (penalize candidates already chosen by sibling groups) and watch the trap fixture: diversity trades best-score for coverage — measure both.
References
upstream/vllm/beam_search.pyand the OpenAI server's beam emulation — the V1 design decision discussed above.- Holtzman et al., The Curious Case of Neural Text Degeneration (2019) — why exact search loses to sampling for open-ended text: https://arxiv.org/abs/1904.09751
- Wu et al., Google's Neural Machine Translation System (2016) — §7, the length normalization formula everyone copied: https://arxiv.org/abs/1609.08144
- Lab-01 — the logits pipeline all decision rules share; lab-02 — the prefix-sharing machinery beams want.