Phase 09 — Cheatsheet: Sampling & Decoding Algorithms
The one-liner
Logits → pick a token. The pipeline (penalties → temperature → top-k → top-p/min-p → sample) runs vectorized across a heterogeneous batch, every row with its own params.
The knobs
- greedy = T=0 = argmax (deterministic)
- temperature T: <1 sharper, >1 flatter
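- top-k: keep the k highest-probability tokens
- top-p: keep the nucleus, i.e. the smallest set of top tokens whose cumulative prob ≥ p
- min-p: keep tokens with prob ≥ min_p × max_prob
- penalties: repetition (multiplicative, applied to tokens already seen), frequency (scales with the token's count), presence (flat, applied once a token has appeared); logit bias; bad-words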
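The filtering knobs above can be sketched for a single row of logits in plain Python. This is only an illustration of the semantics, not the upstream implementation: every function name here (`scale_by_temperature`, `top_k_filter`, `top_p_filter`, `min_p_filter`, `sample`) is hypothetical, `top_k=0` is assumed to mean "disabled", and ties at the top-k threshold are kept rather than broken.

```python
import math
import random

def softmax(logits):
    # Subtract the max for numerical stability; exp(-inf) evaluates to 0.0.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scale_by_temperature(logits, temperature):
    # T < 1 sharpens the distribution, T > 1 flattens it.
    return [x / temperature for x in logits]

def top_k_filter(logits, k):
    # Keep the k highest logits; k <= 0 is treated as "disabled" (assumption).
    if k <= 0 or k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[k - 1]
    return [x if x >= threshold else -math.inf for x in logits]

def top_p_filter(logits, p):
    # Keep the smallest prefix of tokens (by descending prob) with cum prob >= p.
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    return [x if i in keep else -math.inf for i, x in enumerate(logits)]

def min_p_filter(logits, min_p):
    # Keep tokens whose prob is at least min_p times the top token's prob.
    probs = softmax(logits)
    cutoff = min_p * max(probs)
    return [x if pr >= cutoff else -math.inf for x, pr in zip(logits, probs)]

def sample(logits, temperature=1.0, top_k=0, top_p=1.0, min_p=0.0, rng=random):
    if temperature == 0.0:
        # Greedy: deterministic argmax, no filtering needed.
        return max(range(len(logits)), key=lambda i: logits[i])
    logits = scale_by_temperature(logits, temperature)
    logits = top_k_filter(logits, top_k)
    logits = top_p_filter(logits, top_p)
    logits = min_p_filter(logits, min_p)
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

For example, `sample([2.0, 1.0, 0.5, -1.0], temperature=0.0)` takes the greedy path and returns index 0, while passing `top_k=1` with any positive temperature leaves only that same token eligible.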
Logits processors
The pluggable pre-sampling hook. One mechanism for penalties, bias, bad-words, AND grammar masks
(Phase 12: illegal tokens → -inf). logits_processor/{interface,builtin,state}.py.
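A minimal sketch of the "one mechanism" idea: each processor is a function from (tokens generated so far, logits) to new logits, and they are chained before sampling. The `LogitsProcessor` alias and the `make_*` and `run_processors` helpers are hypothetical names for illustration and are not the interface in logits_processor/interface.py. The ban mask here handles single tokens only, which is an assumed simplification of bad-words handling, and the same mask shape stands in for a grammar mask.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence, Set

# Hypothetical interface: (output token ids so far, logits) -> new logits.
LogitsProcessor = Callable[[Sequence[int], List[float]], List[float]]

def make_penalties(presence: float, frequency: float) -> LogitsProcessor:
    def process(output_ids, logits):
        counts = Counter(output_ids)
        # Frequency scales with the count; presence is a flat one-time charge.
        return [
            x - frequency * counts[i] - (presence if counts[i] > 0 else 0.0)
            for i, x in enumerate(logits)
        ]
    return process

def make_logit_bias(bias: Dict[int, float]) -> LogitsProcessor:
    def process(output_ids, logits):
        # Add a fixed offset to selected token ids.
        return [x + bias.get(i, 0.0) for i, x in enumerate(logits)]
    return process

def make_token_mask(banned: Set[int]) -> LogitsProcessor:
    def process(output_ids, logits):
        # Illegal tokens get -inf, so they end up with zero probability.
        return [-math.inf if i in banned else x for i, x in enumerate(logits)]
    return process

def run_processors(processors, output_ids, logits):
    # Apply each processor in order; each sees the previous one's output.
    for proc in processors:
        logits = proc(output_ids, logits)
    return logits
```

Because every processor has the same shape, adding a new constraint means appending one more callable to the list rather than touching the sampler.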
Batching
Per-request params packed into tensors (SamplingMetadata); masked branch-free ops apply each
row's settings in one pass. No Python loop on the hot path.
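The branch-free idea can be sketched with NumPy: per-row temperature and top-k live in arrays, and masks select each row's behaviour without a Python loop over requests. This is an assumption-laden illustration, not the upstream kernel: `batched_sample` is a hypothetical name, `temperature == 0` marks greedy rows, `top_k == 0` means disabled, and the Gumbel-max trick is used here as a convenient way to draw one categorical sample per row.

```python
import numpy as np

def batched_sample(logits, temperature, top_k, rng):
    """logits: [B, V] floats; temperature: [B] floats; top_k: [B] ints."""
    B, V = logits.shape
    greedy = temperature == 0.0

    # Greedy rows get a dummy T=1 to avoid division by zero; their
    # sampled result is discarded by the final np.where.
    safe_t = np.where(greedy, 1.0, temperature)
    scaled = logits / safe_t[:, None]

    # Per-row top-k: the k-th largest value of each row is that row's threshold.
    k = np.where(top_k > 0, np.minimum(top_k, V), V)
    sorted_desc = -np.sort(-scaled, axis=1)
    thresh = np.take_along_axis(sorted_desc, (k - 1)[:, None], axis=1)
    masked = np.where(scaled >= thresh, scaled, -np.inf)

    # Gumbel-max: argmax(logits + Gumbel noise) is a sample from softmax(logits).
    # Masked entries stay at -inf and can never win the argmax.
    noisy = masked + rng.gumbel(size=masked.shape)
    sampled = noisy.argmax(axis=1)

    # Select greedy vs. sampled per row with a mask instead of a branch.
    return np.where(greedy, logits.argmax(axis=1), sampled)
```

A batch mixing a greedy row, a top-k=1 row, and a top-k=2 row goes through the same code path; only the values in `temperature` and `top_k` differ.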
Parallel sampling & beam search
n>1: one prefill, N samples share prompt KV (prefix caching), diverge after token 1
(parallel_sampling.py). Beam search: top-N partial seqs by cum log-prob; awkward in continuous
batching (active set changes), handled specially.
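One beam-search step, as described above, can be sketched as: extend every live beam by every token, then keep the best `beam_width` candidates by cumulative log-prob. `beam_step` and its argument layout are hypothetical, and the sketch ignores EOS handling, length normalisation, and how the KV cache follows the surviving beams.

```python
import heapq
import math

def beam_step(beams, next_logprobs, beam_width):
    """Expand each beam by one token and keep the best `beam_width`.

    beams: list of (token_list, cumulative_logprob)
    next_logprobs: one list of per-token log-probs for each beam
    """
    candidates = []
    for (tokens, score), logprobs in zip(beams, next_logprobs):
        for tok, lp in enumerate(logprobs):
            # Cumulative score is the sum of log-probs along the sequence.
            candidates.append((tokens + [tok], score + lp))
    return heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
```

Note that the survivors may all descend from the same parent, so the set of live sequences changes from step to step, which is the property that makes beam search awkward under continuous batching.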
Key upstream
- v1/sample/sampler.py: :20 Sampler · :67 forward · :223 apply_temperature · :238 sample
- v1/sample/ops/topk_topp_sampler.py · ops/penalties.py · ops/bad_words.py
- v1/sample/logits_processor/ · v1/sample/metadata.py · sampling_params.py:168
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md