Phase 09 — The Hitchhiker's Guide to Sampling & Decoding Algorithms
← Phase 08 · Course home · Phase 10 →
Contents
- Don't Panic
- Step 1: The knobs, from sharpest to softest
- Step 2: Penalties and logit bias
- Step 3: The batching challenge (why this is a systems problem)
- Step 4: Parallel sampling and beam search
- The invariants to memorize
- What you'll do
Don't Panic
The model gives you logits — a score for every token in the vocabulary. You still have to pick one. How you pick is the decoding algorithm: greedy, temperature, top-k, top-p, penalties, parallel sampling, beam search. In a batched engine, all of these must run together, vectorized, on the GPU, for many requests that each chose different settings. This phase is that machinery — small, on the critical path of every single step, and the hook structured output (Phase 12) and penalties plug into.
logits (vocab,) per request
│ logits processors: penalties, bias, bad-words, grammar mask (Phase 12)
│ temperature scaling
│ top-k / top-p / min-p truncation
▼
probability distribution ──sample──► next token id
Step 1: The knobs, from sharpest to softest
- Greedy (temperature 0): always take the argmax. Deterministic. (mini_vllm's default.)
- Temperature
T: divide logits byTbefore softmax.T<1sharpens (more confident),T>1flattens (more random).T→0→ greedy. - Top-k: keep only the
khighest-probability tokens, renormalize, sample from those. - Top-p (nucleus): keep the smallest set of tokens whose cumulative probability ≥
p. Adaptive: few tokens when the model is confident, many when it's unsure. - Min-p: keep tokens with probability ≥
min_p × max_prob. A confidence-relative cutoff.
These compose in a fixed order (penalties → temperature → top-k → top-p/min-p → sample), which
you'll implement in lab-01.
Step 2: Penalties and logit bias
Before sampling you can edit the logits:
- Repetition / frequency / presence penalty: lower the score of tokens already generated, to reduce loops. Frequency scales with count; presence is a flat penalty for any prior occurrence.
- Logit bias: add/subtract a fixed amount for specific token ids (force or ban words).
- Bad words / stop: hard-ban sequences.
All of these are logits processors — the pluggable pre-sampling hook. That same hook is how
structured output (Phase 12) masks illegal tokens to -inf. One clean abstraction, many uses.
Step 3: The batching challenge (why this is a systems problem)
In one decode step you might have 256 requests, each with its own temperature, top-p, penalties.
You can't loop in Python (too slow on the hot path). So vLLM packs per-request params into tensors
aligned with the batch and applies vectorized, branch-free masked ops — every row uses its own
settings in one kernel. That's the real engineering: not the math of top-p, but doing top-p for a
heterogeneous batch in one GPU pass (vllm/v1/sample/ops/topk_topp_sampler.py).
Step 4: Parallel sampling and beam search
- Parallel sampling (
n>1): produce N independent completions for one prompt. The prompt is processed once; the N samples share the prompt's KV blocks via prefix caching (Phase 2/3) and diverge only after the first sampled token. A beautiful reuse of paging. - Beam search: keep the top-N partial sequences by cumulative log-prob, expanding and pruning each step. It's awkward in a continuous-batching engine (beams branch and die, changing the batch shape), so vLLM handles it specially rather than as plain sampling.
The invariants to memorize
- Order: penalties/bias → temperature → top-k → top-p/min-p → sample.
- Greedy = temperature 0 = argmax (deterministic).
- Everything is vectorized across a heterogeneous batch with per-row params.
- Logits processors are the pre-sampling hook (penalties, bias, grammar masks).
n>1shares the prompt's KV via prefix caching; beam search is the special case.
What you'll do
- Read: 01-deep-dive.md —
Sampler.forward, the top-k/top-p ops, penalties, and the logits-processor framework, line-anchored. - Build: 02-mini-build.md — add min-p, repetition penalty, and a logits-processor pipeline.
- Labs (see labs/README.md; recommended order 01 → 04 → 03 → 02):
lab-01-sampling-ops[CPU-OK]— implement temperature/top-k/top-p/min-p + repetition penalty + a logits-processor hook; pin their effects with tests.lab-02-parallel-sampling[GPU-OPT]— runn>1on real vLLM; observe shared prompt KV (captured output).lab-03-beam-search[CPU-OK]— build beam search and spring the garden-path trap where greedy provably loses; EOS-finishes-a-beam; why V1 evicted beams from the engine core.lab-04-seeded-rng-batch-invariance[CPU-OK]— per-request generators: prove a seeded request's tokens survive any batch neighbors (and watch the shared-RNG version fail).
- Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.
← Phase 08 · Course home · Phase 10 →