Lab 00-03 — From Logits to Token: Sampling Basics [CPU-OK]
A language model does not produce words. It produces logits — one raw score per
vocabulary entry, 257 of them in mini_vllm, 128k+ in Llama-3 — and something has to
collapse that scoreboard into the single token the user sees. That something is the
sampler, and in this lab you build it: greedy, temperature, top-k, and top-p (nucleus),
exactly mirroring mini_vllm/sampler.py — the final test literally checks that your
sampler and the engine's agree token-for-token across a grid of configurations.
This is the last piece of the foundations: lab-01 gave you the loop, lab-02 the memory, lab-04 the speed limits — this one gives you the decision each loop iteration ends with.
Contents
- Why this lab exists
- Background: the knobs, and what they're actually for
- Files
- Run
- What to implement
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Sampling parameters are the most-touched, least-understood interface in all of LLM
serving. Every API request carries them; every "the model got worse" support ticket is
~30% likely to be a sampling change; and every inference engineer eventually debugs an
incident where the answer was "someone set top_k=1 and wondered why outputs got
repetitive." You should know these four knobs the way a DBA knows isolation levels — not
as folklore ("0.7 is creative!") but as the small, exact algorithms they are.
There's an engine-design reason too. The sampler sits at a peculiar spot in the
architecture: it's the only stage that's per-request configurable and stochastic, in
the middle of a pipeline that is otherwise batched and deterministic. Getting determinism
back when you need it (tests! reproducibility! debugging!) takes deliberate design — the
seed parameter, greedy mode — and the entire course's testing strategy (every phase's
"identical output" invariants) leans on the greedy shortcut you'll implement here. When
Phase 9 expands sampling into penalties, logit processors, parallel sampling, and GPU
vectorization, this lab is the kernel of truth it builds on.
Background: the knobs, and what they're actually for
Order matters. This is a pipeline, and each stage reshapes the distribution the next one
sees (your sample function must apply the stages in exactly this order to match the engine):
- Temperature: divide all logits by T before softmax. T<1 sharpens (rich get richer), T>1 flattens (underdogs get a chance), T→0 approaches argmax. It's the only knob that reweights rather than truncates. The T == 0.0 case is special-cased as pure argmax, both because dividing by zero is undefined and because greedy must be exactly deterministic, with no RNG involved at all.
- Top-k: keep the k highest logits, set the rest to −∞ (probability zero after softmax). A blunt truncation: k=1 is greedy-with-extra-steps, k=50 trims the long tail of nonsense tokens. Its weakness: k is fixed while the distribution's actual "width" varies wildly per step (after "The capital of France is" there's one good token; after "My favorite" there are hundreds).
- Top-p (nucleus): keep the smallest set of tokens whose cumulative probability ≥ p. Adaptive where top-k is fixed: confident steps keep few tokens, uncertain steps keep many. The subtle spec detail your implementation must honor: the token that crosses the threshold is included (else p=0.5 over probs [0.4, 0.4, 0.2] would keep only 0.4 < 0.5, an under-full nucleus).
- Softmax + one draw: normalize what survives and draw once with np.random.default_rng(seed). Seeded → reproducible; unseeded → fresh entropy per call.
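The inclusive-crossing rule for top-p is easy to get wrong by one token, so here is a minimal sketch of it next to the off-by-one variant. This is not the reference solution: the helper names top_p_inclusive and top_p_exclusive are hypothetical, and mini_vllm/sampler.py may structure the computation differently. It reruns the [0.4, 0.4, 0.2] example at p=0.5.

```python
import numpy as np

def softmax(logits):
    # Subtract the max first so exp() cannot overflow.
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def top_p_inclusive(logits, p):
    """Mask to -inf everything outside the smallest nucleus with cumulative prob >= p."""
    probs = softmax(logits)
    order = np.argsort(-probs, kind="stable")      # token ids, most probable first
    sorted_probs = probs[order]
    # Probability mass strictly before each token in sorted order.
    mass_before = np.cumsum(sorted_probs) - sorted_probs
    keep = order[mass_before < p]                  # the crossing token is still kept
    out = np.full_like(logits, -np.inf)
    out[keep] = logits[keep]
    return out

def top_p_exclusive(logits, p):
    """The buggy variant: drops the token that crosses the threshold."""
    probs = softmax(logits)
    order = np.argsort(-probs, kind="stable")
    keep = order[np.cumsum(probs[order]) <= p]
    out = np.full_like(logits, -np.inf)
    out[keep] = logits[keep]
    return out

logits = np.log(np.array([0.4, 0.4, 0.2]))
good = top_p_inclusive(logits, p=0.5)
bad = top_p_exclusive(logits, p=0.5)
kept_mass_good = softmax(logits)[np.isfinite(good)].sum()   # about 0.8, nucleus is full
kept_mass_bad = softmax(logits)[np.isfinite(bad)].sum()     # about 0.4, under-full
```

The inclusive version keeps two tokens (mass 0.8 ≥ 0.5); the exclusive one keeps a single token whose mass never reaches p.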
And the stability clause: softmax must subtract the max before exponentiating.
exp(1000) overflows float64; logits in the hundreds are perfectly normal outputs of an
unnormalized final layer. This one line is the difference between a sampler and a NaN
generator, and the test feeds you logits of 1000+ to make sure it's there.
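To see the stability clause in action, the sketch below compares a naive softmax with a max-subtracting one on logits around 1000. The function names naive_softmax and stable_softmax are illustrative only; the lab's own function is called softmax.

```python
import numpy as np

def naive_softmax(logits):
    e = np.exp(logits)             # exp(1000) overflows float64 to inf
    return e / e.sum()             # inf / inf gives nan

def stable_softmax(logits):
    shifted = logits - np.max(logits)   # largest entry becomes exactly 0
    e = np.exp(shifted)                 # every exponent is <= 0, so no overflow
    return e / e.sum()

big = np.array([1000.0, 999.0, 998.0])
with np.errstate(over="ignore", invalid="ignore"):   # silence the overflow warnings
    broken = naive_softmax(big)                      # all nan
probs = stable_softmax(big)                          # a valid distribution
small = stable_softmax(np.array([2.0, 1.0, 0.0]))    # same gaps between logits, same result
```

Subtracting a constant from every logit leaves the softmax unchanged, which is why the shifted version is both safe and exact: only the gaps between logits matter.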
Files
- starter.py: softmax, apply_top_k, apply_top_p, sample, each with its recipe. Your work.
- solution.py: reference (functionally identical to mini_vllm/sampler.py).
- test_lab.py: distribution sanity, each knob's exact semantics, determinism, and the agreement test against the engine's Sampler.
Run
LAB_IMPL=starter pytest phase-00-foundations/labs/lab-03-sampling-basics -q
pytest phase-00-foundations/labs/lab-03-sampling-basics -q # reference (default)
What the tests prove
| Test | What it pins |
|---|---|
| test_softmax_is_a_distribution_and_is_stable | Sums to 1, preserves order, and survives logits of 1000 (the max-subtraction clause) |
| test_greedy_is_argmax_and_ignores_every_other_knob | temperature=0 short-circuits the whole pipeline: even hostile top_k/top_p/seed settings can't perturb greedy. This guarantee is what every deterministic test in this course stands on |
| test_top_k_keeps_exactly_k | Survivors finite, victims −∞, disabled cases (k≤0, k≥vocab) pass through unchanged |
| test_top_p_keeps_the_smallest_sufficient_nucleus | The inclusive-crossing rule, on a hand-built distribution. The test deliberately avoids sitting on the cumsum boundary, because float rounding flips the answer there (read the comment; it's a lesson in itself) |
| test_temperature_sharpens_or_flattens | T's monotone effect on the max probability |
| test_seeded_sampling_is_reproducible | Same logits + same seed = same token, forever |
| test_agrees_with_mini_vllm_sampler | Your sampler ≡ the engine's sampler across 15 configurations: the equivalence that makes this lab "build the real component," not "build a toy like it" |
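Two of these properties, the greedy short-circuit and seeded reproducibility, fit in a few lines. The sketch below is a deliberately reduced sampler (temperature and one draw only, no top-k or top-p); sample_sketch and its signature are hypothetical, and the way the engine turns its generator into a draw may differ from the rng.choice call assumed here.

```python
import numpy as np

def sample_sketch(logits, temperature=1.0, seed=None):
    """Hypothetical, simplified sampler: temperature and one draw only."""
    if temperature == 0.0:
        # Greedy: pure argmax, no RNG touched. np.argmax returns the lowest index on ties.
        return int(np.argmax(logits))
    scaled = logits / temperature
    e = np.exp(scaled - np.max(scaled))        # stable softmax
    probs = e / e.sum()
    rng = np.random.default_rng(seed)          # seed=None pulls fresh OS entropy
    return int(rng.choice(len(probs), p=probs))

logits = np.array([1.0, 3.0, 3.0, 0.5])        # indices 1 and 2 tie for the max

greedy_a = sample_sketch(logits, temperature=0.0, seed=1)
greedy_b = sample_sketch(logits, temperature=0.0, seed=2)    # seed is ignored under greedy
draw_a = sample_sketch(logits, temperature=1.0, seed=1234)
draw_b = sample_sketch(logits, temperature=1.0, seed=1234)   # same seed, same token
```

Because the greedy branch returns before any generator is created, no seed value can change its answer; that early return is the whole guarantee.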
Hitchhiker's notes
- −∞ is the correct "impossible," not 0. Masking logits to −∞ (probability exactly 0 after softmax) composes cleanly: later stages renormalize over survivors automatically. Masking probabilities to 0 without renormalizing — a classic homebrew-sampler bug — leaves you sampling from a distribution that sums to 0.7.
- Order of operations is observable. Top-k-then-top-p (this pipeline, and vLLM's) gives different results than top-p-then-top-k for the same parameters. When two engines "with the same settings" produce different output statistics, pipeline order is suspect #2 (suspect #1 is tokenizer differences). The agreement test pins your order to the engine's.
- Why np.partition instead of sorting in top-k? O(n) vs O(n log n) over the vocab, per token, per request; at 128k vocab × thousands of tokens/s this is real money. Production goes further: vLLM's V1 sampler does top-k/top-p vectorized over the whole batch on the GPU (upstream/vllm/v1/sample/), with exactly the semantics you just wrote scalar. Semantics here, performance there: the course's recurring split.
- Ties under greedy: argmax takes the lowest index. Sounds trivial until two engines break ties differently and a "deterministic" comparison fails at token 947 (the fp16 near-tie problem from Phase 3 lab-02's notes, one layer down). Determinism is a stack of conventions, and you now know one more layer of it.
- seed is per-request state in real engines: vLLM keeps a per-request generator so request A's draws don't perturb request B's stream under batching (Phase 9). Your per-call default_rng(seed) is the single-request simplification; the same idea, one request at a time.
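The first and third notes can be checked together. The sketch below masks with np.partition and −∞, then contrasts that with zeroing probabilities and forgetting to renormalize. top_k_mask is a hypothetical helper, and one assumption is loud: when several logits tie at the cutoff, this version keeps all of them, which a reference that keeps exactly k may not do.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))   # exp(-inf) is 0.0, so masked tokens vanish
    return e / e.sum()

def top_k_mask(logits, k):
    """Keep the k largest logits, set the rest to -inf (ties at the cutoff all survive)."""
    if k <= 0 or k >= logits.size:
        return logits                      # disabled: pass through unchanged
    # np.partition puts the k-th largest value at index size-k without a full sort.
    idx = logits.size - k
    cutoff = np.partition(logits, idx)[idx]
    return np.where(logits >= cutoff, logits, -np.inf)

logits = np.log(np.array([0.5, 0.2, 0.18, 0.12]))

# Correct: mask logits, then softmax renormalizes over the survivors.
masked_probs = softmax(top_k_mask(logits, k=2))    # sums to 1

# Homebrew bug: zero the probabilities and forget to renormalize.
zeroed_probs = softmax(logits).copy()
zeroed_probs[2:] = 0.0                             # sums to about 0.7
```

Masking in logit space means every later stage sees an ordinary logit vector and needs no special renormalization step of its own.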
Going further
- Implement min-p (keep tokens with prob ≥ p × max-prob, an increasingly popular alternative that adapts even better than top-p) and write its boundary test. Then check: vLLM ships it (min_p in SamplingParams).
- Sample 10,000 draws at T ∈ {0.3, 1.0, 2.0} from fixed logits and plot the empirical histograms against your computed distributions: a χ² eyeball test of your own sampler, and a visceral feel for what temperature does.
- Read upstream/vllm/v1/sample/sampler.py and find the four stages of your pipeline in their batched form: the same algorithm, where every operation is a tensor op over [batch, vocab] and the special-casing of greedy becomes an index-select.
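As a starting point for the histogram exercise, here is a sketch that skips the plotting and just compares empirical frequencies with the computed distribution. The logits are arbitrary values chosen for illustration, and drawing all samples from one generator via rng.choice is a convenience assumed here, not how the lab's per-call sample works.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5, 0.0, -1.0])   # arbitrary fixed logits for illustration
n_draws = 10_000
rng = np.random.default_rng(0)                   # fixed seed so the run is repeatable

results = {}
for T in (0.3, 1.0, 2.0):
    probs = softmax(logits / T)                  # temperature is applied before softmax
    draws = rng.choice(len(probs), size=n_draws, p=probs)
    empirical = np.bincount(draws, minlength=len(probs)) / n_draws
    results[T] = (probs, empirical)
    print(f"T={T}: max prob {probs.max():.3f}, "
          f"largest |empirical - computed| = {np.abs(empirical - probs).max():.4f}")
```

With 10,000 draws the per-token sampling error is on the order of 0.005, so gaps much larger than that point at a bug rather than at noise. Swap the print for a bar chart to get the visual version.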
References
- mini_vllm/sampler.py: the component you just rebuilt; diff yours against it.
- upstream/vllm/v1/sample/sampler.py: the batched GPU version (Phase 9 territory).
- Holtzman et al., The Curious Case of Neural Text Degeneration (2019): the paper that introduced nucleus (top-p) sampling and explains why truncation matters. https://arxiv.org/abs/1904.09751
- vLLM docs, Sampling Parameters: the full production knob set your four generalize into. https://docs.vllm.ai/en/latest/api/inference_params.html
- Phase 9: penalties, logit processors, structured-output masking (Phase 12), and why sampling lives on the GPU.