Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 09 — The Hitchhiker's Guide to Sampling & Decoding Algorithms

Phase 08 · Course home · Phase 10

Contents


Don't Panic

The model gives you logits — a score for every token in the vocabulary. You still have to pick one. How you pick is the decoding algorithm: greedy, temperature, top-k, top-p, penalties, parallel sampling, beam search. In a batched engine, all of these must run together, vectorized, on the GPU, for many requests that each chose different settings. This phase is that machinery — small, on the critical path of every single step, and the hook structured output (Phase 12) and penalties plug into.

logits (vocab,) per request
   │  logits processors: penalties, bias, bad-words, grammar mask (Phase 12)
   │  temperature scaling
   │  top-k / top-p / min-p truncation
   ▼
 probability distribution  ──sample──►  next token id

Step 1: The knobs, from sharpest to softest

  • Greedy (temperature 0): always take the argmax. Deterministic. (mini_vllm's default.)
  • Temperature T: divide logits by T before softmax. T<1 sharpens (more confident), T>1 flattens (more random). T→0 → greedy.
  • Top-k: keep only the k highest-probability tokens, renormalize, sample from those.
  • Top-p (nucleus): keep the smallest set of tokens whose cumulative probability ≥ p. Adaptive: few tokens when the model is confident, many when it's unsure.
  • Min-p: keep tokens with probability ≥ min_p × max_prob. A confidence-relative cutoff.

These compose in a fixed order (penalties → temperature → top-k → top-p/min-p → sample), which you'll implement in lab-01.


Step 2: Penalties and logit bias

Before sampling you can edit the logits:

  • Repetition / frequency / presence penalty: lower the score of tokens already generated, to reduce loops. Frequency scales with count; presence is a flat penalty for any prior occurrence.
  • Logit bias: add/subtract a fixed amount for specific token ids (force or ban words).
  • Bad words / stop: hard-ban sequences.

All of these are logits processors — the pluggable pre-sampling hook. That same hook is how structured output (Phase 12) masks illegal tokens to -inf. One clean abstraction, many uses.


Step 3: The batching challenge (why this is a systems problem)

In one decode step you might have 256 requests, each with its own temperature, top-p, penalties. You can't loop in Python (too slow on the hot path). So vLLM packs per-request params into tensors aligned with the batch and applies vectorized, branch-free masked ops — every row uses its own settings in one kernel. That's the real engineering: not the math of top-p, but doing top-p for a heterogeneous batch in one GPU pass (vllm/v1/sample/ops/topk_topp_sampler.py).


  • Parallel sampling (n>1): produce N independent completions for one prompt. The prompt is processed once; the N samples share the prompt's KV blocks via prefix caching (Phase 2/3) and diverge only after the first sampled token. A beautiful reuse of paging.
  • Beam search: keep the top-N partial sequences by cumulative log-prob, expanding and pruning each step. It's awkward in a continuous-batching engine (beams branch and die, changing the batch shape), so vLLM handles it specially rather than as plain sampling.

The invariants to memorize

  1. Order: penalties/bias → temperature → top-k → top-p/min-p → sample.
  2. Greedy = temperature 0 = argmax (deterministic).
  3. Everything is vectorized across a heterogeneous batch with per-row params.
  4. Logits processors are the pre-sampling hook (penalties, bias, grammar masks).
  5. n>1 shares the prompt's KV via prefix caching; beam search is the special case.

What you'll do

  • Read: 01-deep-dive.mdSampler.forward, the top-k/top-p ops, penalties, and the logits-processor framework, line-anchored.
  • Build: 02-mini-build.md — add min-p, repetition penalty, and a logits-processor pipeline.
  • Labs (see labs/README.md; recommended order 01 → 04 → 03 → 02):
    • lab-01-sampling-ops [CPU-OK] — implement temperature/top-k/top-p/min-p + repetition penalty + a logits-processor hook; pin their effects with tests.
    • lab-02-parallel-sampling [GPU-OPT] — run n>1 on real vLLM; observe shared prompt KV (captured output).
    • lab-03-beam-search [CPU-OK] — build beam search and spring the garden-path trap where greedy provably loses; EOS-finishes-a-beam; why V1 evicted beams from the engine core.
    • lab-04-seeded-rng-batch-invariance [CPU-OK] — per-request generators: prove a seeded request's tokens survive any batch neighbors (and watch the shared-RNG version fail).
  • Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.

Phase 08 · Course home · Phase 10