Lab 00-03 — From Logits to Token: Sampling Basics [CPU-OK]

A language model does not produce words. It produces logits — one raw score per vocabulary entry, 257 of them in mini_vllm, 128k+ in Llama-3 — and something has to collapse that scoreboard into the single token the user sees. That something is the sampler, and in this lab you build it: greedy, temperature, top-k, and top-p (nucleus), exactly mirroring mini_vllm/sampler.py — the final test literally checks that your sampler and the engine's agree token-for-token across a grid of configurations.

This is the last piece of the foundations: lab-01 gave you the loop, lab-02 the memory, lab-04 the speed limits — this one gives you the decision each loop iteration ends with.

Why this lab exists

Sampling parameters are the most-touched, least-understood interface in all of LLM serving. Every API request carries them; every "the model got worse" support ticket is ~30% likely to be a sampling change; and every inference engineer eventually debugs an incident where the answer was "someone set top_k=1 and wondered why outputs got repetitive." You should know these four knobs the way a DBA knows isolation levels — not as folklore ("0.7 is creative!") but as the small, exact algorithms they are.

There's an engine-design reason too. The sampler sits at a peculiar spot in the architecture: it's the only stage that's per-request configurable and stochastic, in the middle of a pipeline that is otherwise batched and deterministic. Getting determinism back when you need it (tests! reproducibility! debugging!) takes deliberate design — the seed parameter, greedy mode — and the entire course's testing strategy (every phase's "identical output" invariants) leans on the greedy shortcut you'll implement here. When Phase 9 expands sampling into penalties, logit processors, parallel sampling, and GPU vectorization, this lab is the kernel of truth it builds on.

Background: the knobs, and what they're actually for

Order matters — this is a pipeline, and each stage reshapes the distribution the next one sees (your sample must apply them in exactly this order to match the engine):

  1. Temperature — divide all logits by T before softmax. T<1 sharpens (rich get richer), T>1 flattens (underdogs get a chance), T→0 approaches argmax. It's the only knob that reweights rather than truncates. The T == 0.0 case is special-cased as pure argmax, both because dividing by zero is undefined and because greedy must be exactly deterministic, with no RNG involved at all.
  2. Top-k — keep the k highest logits, set the rest to −∞ (probability zero after softmax). A blunt truncation: k=1 is greedy-with-extra-steps, k=50 trims the long tail of nonsense tokens. Its weakness: k is fixed while the distribution's actual "width" varies wildly per step (after "The capital of France is" there's one good token; after "My favorite" there are hundreds).
  3. Top-p (nucleus) — keep the smallest set of tokens whose cumulative probability ≥ p. Adaptive where top-k is fixed: confident steps keep few tokens, uncertain steps keep many. The subtle spec detail your implementation must honor: the token that crosses the threshold is included (else p=0.5 over probs [0.4, 0.4, 0.2] would keep only 0.4 < 0.5 — an under-full nucleus).
  4. Softmax + one draw — normalize what survives and draw once with np.random.default_rng(seed). Seeded → reproducible; unseeded → fresh entropy per call.
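
The four stages above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the contents of mini_vllm/sampler.py: the function signatures, the default values, and the use of rng.choice for the final draw are guesses made for the example, so it is not guaranteed to agree token-for-token with the engine.

```python
import numpy as np

def softmax(logits):
    # Subtract the max first so exp() never sees a large positive number.
    shifted = logits - np.max(logits)
    e = np.exp(shifted)
    return e / e.sum()

def apply_top_k(logits, k):
    # Disabled cases pass through unchanged.
    if k <= 0 or k >= logits.size:
        return logits
    # argpartition finds the k largest entries without a full sort.
    keep = np.argpartition(logits, -k)[-k:]
    out = np.full_like(logits, -np.inf)
    out[keep] = logits[keep]
    return out

def apply_top_p(logits, p):
    # Assumption for this sketch: p >= 1 disables the filter.
    if p >= 1.0:
        return logits
    probs = softmax(logits)
    order = np.argsort(-probs, kind="stable")  # most probable first
    cum = np.cumsum(probs[order])
    # First position where the running total reaches p, inclusive.
    n_keep = int(np.searchsorted(cum, p, side="left")) + 1
    keep = order[:n_keep]
    out = np.full_like(logits, -np.inf)
    out[keep] = logits[keep]
    return out

def sample(logits, temperature=1.0, top_k=0, top_p=1.0, seed=None):
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0.0:
        # Greedy: pure argmax, no RNG, every other knob ignored.
        return int(np.argmax(logits))
    scaled = logits / temperature          # 1. temperature
    scaled = apply_top_k(scaled, top_k)    # 2. top-k
    scaled = apply_top_p(scaled, top_p)    # 3. top-p
    probs = softmax(scaled)                # 4. softmax + one draw
    rng = np.random.default_rng(seed)
    return int(rng.choice(probs.size, p=probs))

logits = np.array([2.0, 1.0, 0.5, -1.0])
greedy_token = sample(logits, temperature=0.0, top_k=3, top_p=0.1, seed=7)
seeded_a = sample(logits, temperature=0.8, top_k=3, top_p=0.9, seed=123)
seeded_b = sample(logits, temperature=0.8, top_k=3, top_p=0.9, seed=123)
```

Here greedy_token is the argmax index regardless of the other settings, and seeded_a equals seeded_b because both calls build a generator from the same seed.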

And the stability clause: softmax must subtract the max before exponentiating. exp(1000) overflows float64; logits in the hundreds are perfectly normal outputs of an unnormalized final layer. This one line is the difference between a sampler and a NaN generator, and the test feeds you logits of 1000+ to make sure it's there.
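
To see the stability clause in action, the sketch below compares a naive softmax with a max-subtracting one on logits around 1000. The function names are hypothetical and chosen only for this illustration. Subtracting a constant from every logit leaves the result unchanged mathematically, because the common factor cancels between numerator and denominator.

```python
import numpy as np

def naive_softmax(logits):
    # exp(1000) overflows float64 to inf, and inf / inf is nan.
    e = np.exp(logits)
    return e / e.sum()

def stable_softmax(logits):
    # After the shift the largest exponent is exp(0) == 1.
    shifted = logits - np.max(logits)
    e = np.exp(shifted)
    return e / e.sum()

logits = np.array([1000.0, 999.0, 998.0])

# Silence the overflow warnings so the broken result can be inspected.
with np.errstate(over="ignore", invalid="ignore"):
    broken = naive_softmax(logits)

stable = stable_softmax(logits)
# Shift invariance: the same gaps between logits give the same distribution.
small = stable_softmax(np.array([2.0, 1.0, 0.0]))
```

The naive version returns an array of NaN values, while the stable version returns an ordinary distribution that matches the one computed from the small logits [2, 1, 0].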

Files

  • starter.py — softmax, apply_top_k, apply_top_p, sample, each with its recipe. Your work.
  • solution.py — reference (functionally identical to mini_vllm/sampler.py).
  • test_lab.py — distribution sanity, each knob's exact semantics, determinism, and the agreement test against the engine's Sampler.

Run

LAB_IMPL=starter pytest phase-00-foundations/labs/lab-03-sampling-basics -q
pytest phase-00-foundations/labs/lab-03-sampling-basics -q   # reference (default)

What the tests prove

| Test | What it pins |
| --- | --- |
| test_softmax_is_a_distribution_and_is_stable | Sums to 1, preserves order, and survives logits of 1000 — the max-subtraction clause |
| test_greedy_is_argmax_and_ignores_every_other_knob | temperature=0 short-circuits the whole pipeline — even hostile top_k/top_p/seed settings can't perturb greedy. This guarantee is what every deterministic test in this course stands on |
| test_top_k_keeps_exactly_k | Survivors finite, victims −∞, disabled cases (k≤0, k≥vocab) pass through unchanged |
| test_top_p_keeps_the_smallest_sufficient_nucleus | The inclusive-crossing rule, on a hand-built distribution — and the test deliberately avoids sitting on the cumsum boundary, because float rounding flips the answer there (read the comment; it's a lesson in itself) |
| test_temperature_sharpens_or_flattens | T's monotone effect on the max probability |
| test_seeded_sampling_is_reproducible | Same logits + same seed = same token, forever |
| test_agrees_with_mini_vllm_sampler | Your sampler ≡ the engine's sampler across 15 configurations — the equivalence that makes this lab "build the real component," not "build a toy like it" |
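
The cumsum-boundary hazard mentioned for the top-p test can be reproduced in a few lines. This is an illustrative sketch, not the code in test_lab.py: the distribution and the nucleus_size helper are hypothetical. On paper, p = 0.9 over [0.7, 0.2, 0.1] needs two tokens, but in IEEE-754 double precision 0.7 + 0.2 rounds to a value just below 0.9, so an inclusive-crossing rule keeps three.

```python
import numpy as np

# Already sorted from most to least probable.
probs = np.array([0.7, 0.2, 0.1])
cum = np.cumsum(probs)

def nucleus_size(cum, p):
    # Index of the first running total that reaches p, plus one
    # so that the crossing token itself is included.
    return int(np.searchsorted(cum, p, side="left")) + 1

# Sitting exactly on the boundary: rounding decides the answer.
on_boundary = nucleus_size(cum, 0.9)

# Safely inside a gap between running totals: the answer is robust.
off_boundary = nucleus_size(cum, 0.85)

# The second running total is not the 0.9 you would write by hand.
second_total_is_exact = bool(cum[1] == 0.9)
```

Choosing thresholds that fall well between consecutive running totals, as off_boundary does, keeps a test independent of rounding behavior.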

Hitchhiker's notes

  • −∞ is the correct "impossible," not 0. Masking logits to −∞ (probability exactly 0 after softmax) composes cleanly: later stages renormalize over survivors automatically. Masking probabilities to 0 without renormalizing — a classic homebrew-sampler bug — leaves you sampling from a distribution that sums to 0.7.
  • Order of operations is observable. Top-k-then-top-p (this pipeline, and vLLM's) gives different results than top-p-then-top-k for the same parameters. When two engines "with the same settings" produce different output statistics, pipeline order is suspect #2 (suspect #1 is tokenizer differences). The agreement test pins your order to the engine's.
  • Why np.partition instead of sorting in top-k? O(n) vs O(n log n) over the vocab, per token, per request — at 128k vocab × thousands of tokens/s this is real money. Production goes further: vLLM's V1 sampler does top-k/top-p vectorized over the whole batch on the GPU (upstream/vllm/v1/sample/), with exactly the semantics you just wrote scalar. Semantics here, performance there — the course's recurring split.
  • Ties under greedy: argmax takes the lowest index. Sounds trivial until two engines break ties differently and a "deterministic" comparison fails at token 947 — the fp16 near-tie problem from Phase 3 lab-02's notes, one layer down. Determinism is a stack of conventions, and you now know one more layer of it.
  • seed is per-request state in real engines — vLLM keeps a per-request generator so request A's draws don't perturb request B's stream under batching (Phase 9). Your per-call default_rng(seed) is the single-request simplification; the same idea, one request at a time.
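
The first note, on −∞ versus 0, can be checked directly. The sketch below is an illustration with hypothetical variable names: it masks the same two tokens once in logit space and once in probability space, then compares what each approach hands to the random draw.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)
    e = np.exp(shifted)
    return e / e.sum()

# Logits chosen so the unmasked distribution is [0.4, 0.3, 0.2, 0.1].
logits = np.log(np.array([0.4, 0.3, 0.2, 0.1]))
drop = np.array([False, False, True, True])

# Correct: mask logits to -inf, then softmax renormalizes the survivors.
good = softmax(np.where(drop, -np.inf, logits))

# Buggy: zero the probabilities and forget to renormalize.
bad = np.where(drop, 0.0, softmax(logits))

rng = np.random.default_rng(0)
token = int(rng.choice(good.size, p=good))

# NumPy refuses to draw from weights that do not sum to 1.
try:
    rng.choice(bad.size, p=bad)
    rejected = False
except ValueError:
    rejected = True
```

The masked-logit distribution sums to 1 and only ever yields token 0 or 1. The zeroed-probability array sums to about 0.7, and NumPy's Generator.choice raises a ValueError for it; a hand-rolled draw without that check would silently sample from the wrong distribution.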

Going further

  • Implement min-p (keep tokens with prob ≥ p × max-prob — an increasingly popular alternative that adapts even better than top-p) and write its boundary test. Then check: vLLM ships it (min_p in SamplingParams).
  • Sample 10,000 draws at T ∈ {0.3, 1.0, 2.0} from fixed logits and plot the empirical histograms against your computed distributions — a χ² eyeball test of your own sampler, and a visceral feel for what temperature does.
  • Read upstream/vllm/v1/sample/sampler.py and find the four stages of your pipeline in their batched form: the same algorithm, where every operation is a tensor op over [batch, vocab] and the special-casing of greedy becomes an index-select.
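
As a starting point for the min-p exercise, here is one possible shape for the filter. It is a sketch under assumptions: the name apply_min_p, its signature, and the choice to treat min_p ≤ 0 as disabled are invented for this example and are not taken from mini_vllm or vLLM. The boundary test is still left to you.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)
    e = np.exp(shifted)
    return e / e.sum()

def apply_min_p(logits, min_p):
    logits = np.asarray(logits, dtype=np.float64)
    # Assumption for this sketch: non-positive min_p disables the filter.
    if min_p <= 0.0:
        return logits
    probs = softmax(logits)
    # The cutoff scales with the model's confidence at this step.
    threshold = min_p * probs.max()
    return np.where(probs >= threshold, logits, -np.inf)

# Broad distribution: threshold is 0.2 * 0.5 = 0.1, so three tokens survive.
broad = apply_min_p(np.log(np.array([0.5, 0.3, 0.15, 0.05])), 0.2)

# Peaked distribution: threshold is 0.2 * 0.9 = 0.18, so one token survives.
peaked = apply_min_p(np.log(np.array([0.9, 0.05, 0.03, 0.02])), 0.2)
```

The same min_p value keeps three tokens when the distribution is broad and one when it is peaked, which is the adaptive behavior the exercise asks you to explore.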

References

  • mini_vllm/sampler.py — the component you just rebuilt; diff yours against it.
  • upstream/vllm/v1/sample/sampler.py — the batched GPU version (Phase 9 territory).
  • Holtzman et al., The Curious Case of Neural Text Degeneration (2019) — the paper that introduced nucleus (top-p) sampling and explains why truncation matters: https://arxiv.org/abs/1904.09751
  • vLLM docs, Sampling Parameters — the full production knob set your four generalize into: https://docs.vllm.ai/en/latest/api/inference_params.html
  • Phase 9 — penalties, logit processors, structured-output masking (Phase 12), and why sampling lives on the GPU.