Phase 09 — Cheatsheet: Sampling & Decoding Algorithms

The one-liner
The knobs
Logits processors
Batching
Parallel sampling & beam search
Key upstream

The one-liner

Logits → pick a token. The pipeline (penalties → temperature → top-k → top-p/min-p → sample) runs vectorized across a heterogeneous batch, every row with its own params.

The knobs

greedy = T=0 = argmax (deterministic)
temperature T: <1 sharper, >1 flatter
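top-k: keep the k highest-probability tokens; top-p: keep the nucleus (smallest top set whose cum prob ≥ p); min-p: keep tokens with prob ≥ min_p × max_prob
penalties: repetition (multiplicative, on seen tokens) / frequency (scales with count) / presence (flat, once seen); logit bias; bad-words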
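
The knobs above can be sketched in plain Python on a single distribution. This is a minimal illustration only, not vLLM's implementation: the helper names (softmax, greedy, top_k_filter, top_p_filter, min_p_filter) are hypothetical, and a real sampler works on batched tensors rather than lists.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: T < 1 sharpens, T > 1 flattens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Greedy decoding: the T -> 0 limit, i.e. plain argmax."""
    return max(range(len(logits)), key=lambda i: logits[i])

def _keep_and_renormalize(probs, keep):
    """Zero every token outside `keep`, then rescale to sum to 1."""
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def top_k_filter(probs, k):
    """Keep the k highest-probability tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _keep_and_renormalize(probs, set(order[:k]))

def top_p_filter(probs, p):
    """Keep the smallest top set whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    return _keep_and_renormalize(probs, keep)

def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the max."""
    threshold = min_p * max(probs)
    keep = {i for i, q in enumerate(probs) if q >= threshold}
    return _keep_and_renormalize(probs, keep)
```

Note that min-p adapts to the shape of the distribution: a confident model (high max_prob) prunes aggressively, while a flat distribution keeps more candidates.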

Logits processors

The pluggable pre-sampling hook. One mechanism for penalties, bias, bad-words, AND grammar masks (Phase 12: illegal tokens → -inf). logits_processor/{interface,builtin,state}.py.
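
To show why one hook covers all four cases, here is a sketch of a processor chain. The callable signature (output_ids, logits) -> logits and the factory names are assumptions made for illustration; they are not the interface defined in logits_processor/interface.py.

```python
import math
from typing import Callable, Dict, List, Sequence, Set

# Hypothetical signature for illustration only, not vLLM's actual interface.
LogitsProcessor = Callable[[Sequence[int], List[float]], List[float]]

def make_presence_penalty(penalty: float) -> LogitsProcessor:
    """Subtract a flat penalty from every token already generated."""
    def process(output_ids, logits):
        seen = set(output_ids)
        return [x - penalty if i in seen else x for i, x in enumerate(logits)]
    return process

def make_logit_bias(bias: Dict[int, float]) -> LogitsProcessor:
    """Add a fixed per-token offset."""
    def process(output_ids, logits):
        return [x + bias.get(i, 0.0) for i, x in enumerate(logits)]
    return process

def make_allowed_mask(allowed: Set[int]) -> LogitsProcessor:
    """Grammar-style mask: tokens outside `allowed` become -inf."""
    def process(output_ids, logits):
        return [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    return process

def apply_processors(processors, output_ids, logits):
    """Run each processor in order; each sees the previous one's output."""
    for proc in processors:
        logits = proc(output_ids, logits)
    return logits
```

Bad-words filtering fits the same shape: it is a mask whose banned set depends on the tokens generated so far.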

Batching

Per-request params packed into tensors (SamplingMetadata); masked branch-free ops apply each row's settings in one pass. No Python loop on the hot path.
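
A rough NumPy sketch of the idea: per-row settings live in arrays of shape [batch], "disabled" is encoded as a neutral value, and masks replace branches. The function name and argument layout are hypothetical and do not mirror the real SamplingMetadata fields; top-p is omitted to keep the sketch short.

```python
import numpy as np

def batched_probs(logits, temperature, top_k, min_p):
    """Apply per-row temperature, top-k and min-p without a Python loop.

    logits: [batch, vocab]; temperature, top_k, min_p: [batch].
    Conventions assumed here: temperature == 0 means greedy,
    top_k <= 0 means top-k disabled, min_p == 0 means min-p disabled.
    """
    batch, vocab = logits.shape
    is_greedy = temperature == 0.0

    # Greedy rows get a dummy temperature of 1 to avoid dividing by zero;
    # their result is overwritten by the one-hot argmax at the end.
    safe_t = np.where(is_greedy, 1.0, temperature)
    scaled = logits / safe_t[:, None]

    # top-k: mask everything below each row's k-th largest logit.
    # Disabled rows use k = vocab, so nothing is masked. Ties are kept.
    k = np.where(top_k <= 0, vocab, np.minimum(top_k, vocab))
    sorted_desc = -np.sort(-scaled, axis=-1)
    kth = np.take_along_axis(sorted_desc, (k - 1)[:, None], axis=-1)
    scaled = np.where(scaled < kth, -np.inf, scaled)

    # Row-wise softmax; exp(-inf) is exactly 0 for masked tokens.
    shifted = scaled - scaled.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)

    # min-p: drop tokens below min_p * max_prob, then renormalize.
    threshold = min_p[:, None] * probs.max(axis=-1, keepdims=True)
    probs = np.where(probs < threshold, 0.0, probs)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Greedy rows collapse to a one-hot on the argmax of the raw logits.
    one_hot = np.eye(vocab)[logits.argmax(axis=-1)]
    return np.where(is_greedy[:, None], one_hot, probs)
```

Every row pays for every op, including the ones it has disabled; that wasted work is the price of keeping the batch in a single pass.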

Parallel sampling & beam search

n>1: one prefill; the n samples share the prompt KV (prefix caching) and diverge once sampling starts (parallel_sampling.py). Beam search: keep the top-N partial seqs by cumulative log-prob at each step; awkward in continuous batching because the active set changes every step, so it is handled specially.
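
One beam search step can be sketched as follows. This is a textbook version under simplifying assumptions (no length normalization, no end-of-sequence handling), not how vLLM implements it; beam_search_step and the log_probs_fn callback are hypothetical names.

```python
import math
from typing import Callable, List, Sequence, Tuple

Beam = Tuple[float, List[int]]  # (cumulative log-prob, token ids)

def beam_search_step(
    beams: Sequence[Beam],
    log_probs_fn: Callable[[List[int]], List[float]],
    beam_width: int,
) -> List[Beam]:
    """Expand every beam by every token, keep the best `beam_width`."""
    candidates: List[Beam] = []
    for score, tokens in beams:
        # log_probs_fn stands in for a model forward pass: it returns
        # one log-probability per vocabulary entry for the next token.
        for token_id, lp in enumerate(log_probs_fn(tokens)):
            candidates.append((score + lp, tokens + [token_id]))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:beam_width]

def toy_log_probs(tokens: List[int]) -> List[float]:
    """Toy 3-token model whose distribution depends on the last token."""
    if not tokens or tokens[-1] == 0:
        probs = [0.6, 0.3, 0.1]
    else:
        probs = [0.1, 0.2, 0.7]
    return [math.log(p) for p in probs]

beams: List[Beam] = [(0.0, [])]
for _ in range(2):
    beams = beam_search_step(beams, toy_log_probs, beam_width=2)
```

The surviving beams can all descend from the same parent, so which sequences are alive changes at every step; that is the mismatch with a scheduler built around stable per-request state.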

Key upstream

v1/sample/sampler.py:20 Sampler · :67 forward · :223 apply_temperature · :238 sample
v1/sample/ops/topk_topp_sampler.py · ops/penalties.py · ops/bad_words.py
v1/sample/logits_processor/ · v1/sample/metadata.py · sampling_params.py:168

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

vLLM Mastery — From Zero to Maintainer