Phase 09 Labs — Sampling & Decoding Algorithms

Four labs on the last centimeter of inference: turning logits into tokens, at production grade. The arc: build the full per-request pipeline with its extension hook (lab-01), add the state that makes sampling reproducible under batching (lab-04), meet the search alternative and its garden-path motivation (lab-03), then watch parallel sampling ride three phases of memory machinery on real hardware (lab-02).

Recommended order: 01 → 04 → 03 → 02. (The directory numbers were assigned before labs 03 and 04 existed, so they do not match the teaching order.) CPU labs follow the standard contract: starter.py (your work), solution.py (reference), test_lab.py (the spec). By default the tests run against the solution; set LAB_IMPL=starter to grade your own code.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-09-sampling-and-decoding-algorithms/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-09-sampling-and-decoding-algorithms/labs/lab-01-sampling-ops -q

Contents


Labs

lab-01-sampling-ops [CPU-OK]

The production pipeline: custom processors → repetition penalty → temperature → top-k → top-p → min-p → draw, with each placement justified (the order is a theorem, not a convention) and the logits-processor hook that Phase 12's grammar masking rides. Includes min-p's confidence-relative cutoff and the divide-positive/multiply-negative penalty asymmetry. Skills: pipeline order as API; the hook pattern; why penalties need history.
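A minimal sketch of that pipeline order, stdlib only. It is not the lab's solution.py: every function name, the `proc(history, logits)` hook signature, and the parameter names are assumptions made for illustration, and temperature is assumed to be strictly positive.

```python
import math
import random

NEG_INF = float("-inf")


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) is 0.0, so masked tokens drop out
    total = sum(exps)
    return [e / total for e in exps]


def apply_repetition_penalty(logits, history, penalty):
    # Needs history: only tokens already generated are penalized.
    # Positive logits are divided and negative ones multiplied, so both get less likely.
    out = list(logits)
    for tok in set(history):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def top_k(logits, k):
    if k <= 0 or k >= len(logits):
        return list(logits)
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [x if x >= cutoff else NEG_INF for x in logits]


def top_p(logits, p):
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:  # smallest high-probability set whose mass reaches p
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    return [x if i in keep else NEG_INF for i, x in enumerate(logits)]


def min_p(logits, ratio):
    probs = softmax(logits)
    floor = ratio * max(probs)  # cutoff scales with the model's confidence
    return [x if pr >= floor else NEG_INF for x, pr in zip(logits, probs)]


def sample(logits, history, rng, *, processors=(), penalty=1.0,
           temperature=1.0, k=0, p=1.0, ratio=0.0):
    for proc in processors:  # hypothetical hook: proc(history, logits) -> logits
        logits = proc(history, logits)
    logits = apply_repetition_penalty(logits, history, penalty)
    logits = [x / temperature for x in logits]  # assumes temperature > 0
    logits = top_k(logits, k)
    logits = top_p(logits, p)
    logits = min_p(logits, ratio)
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Usage: one draw with every stage switched on.
rng = random.Random(0)
token = sample([2.0, 1.0, 0.5, -1.0], history=[0], rng=rng,
               penalty=1.2, temperature=0.8, k=3, p=0.95, ratio=0.05)
```

A processor that writes -inf into disallowed positions runs first, so every later stage only ever renormalizes over tokens it permitted. That is the property a grammar mask would rely on.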

lab-02-parallel-sampling [GPU-OPT]

n=4 on real vLLM: the prompt prefills once and all four samples share its KV blocks (ref_cnt=4; 75% hit rate = (n−1)/n, the pioneer effect with n as denominator), diverging from the first sampled token. The cheapest diversity money can buy, priced exactly. Annotated capture included. Skills: the one-prompt-n-tails cost model; self-consistency economics; n vs separate-requests vs beam search.
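The sharing arithmetic can be sketched in a few lines. This is a cost model under one stated reading of the numbers above: a single sample computes the prompt's blocks and the other n − 1 hit them in the cache. The function names are hypothetical, and the "separate" case assumes no prefix caching at all.

```python
def prefix_hit_rate(n):
    # One pioneer computes the prompt blocks; the remaining n - 1 samples hit the cache.
    return (n - 1) / n


def tokens_computed(prompt_len, gen_len, n, shared_prompt=True):
    # One prompt, n tails: the prompt is prefilled once, each sample decodes its own tail.
    if shared_prompt:
        return prompt_len + n * gen_len
    # n independent requests with no cache sharing each pay for the prompt again.
    return n * (prompt_len + gen_len)


print(prefix_hit_rate(4))                                # 0.75
print(tokens_computed(100, 20, 4, shared_prompt=True))   # 180
print(tokens_computed(100, 20, 4, shared_prompt=False))  # 480
```

The longer the prompt is relative to the tails, the larger the gap between the two totals, which is why self-consistency over a long prompt is cheap per extra sample.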

lab-03-beam-search [CPU-OK]

Sequence-level search: build greedy and beam decoding, then spring the garden-path trap — a four-probability fixture where greedy's local optimum (joint 0.31) loses to beam's [B, C] (0.36), provably. EOS-finishes-a-beam bookkeeping, the width-1 = greedy identity, and why V1 evicted beams from the engine core. Skills: search vs sampling; log-prob scoring; length bias; probability ≠ quality (degeneration).
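A sketch of the garden-path trap on a toy model. The fixture below is hypothetical and is not the lab's: its numbers were picked so that greedy ends at joint 0.30 and beam finds [B, C] at 0.36. EOS bookkeeping and length normalization are left out.

```python
import math

# Hypothetical fixture: maps a prefix to next-token probabilities.
MODEL = {
    (): {"A": 0.6, "B": 0.4},       # A looks better locally...
    ("A",): {"C": 0.5, "D": 0.5},   # ...but leads to a flat continuation
    ("B",): {"C": 0.9, "D": 0.1},   # B is followed by a near-certain token
}


def beam_search(model, width, steps):
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(steps):
        candidates = [
            (prefix + (tok,), score + math.log(p))
            for prefix, score in beams
            for tok, p in model[prefix].items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:width]  # keep only the `width` best prefixes
    return beams[0]


def greedy(model, steps):
    # Width-1 beam search keeps only the single best prefix at each step.
    return beam_search(model, 1, steps)


g_seq, g_score = greedy(MODEL, steps=2)
b_seq, b_score = beam_search(MODEL, width=2, steps=2)
print(g_seq, round(math.exp(g_score), 2))
print(b_seq, round(math.exp(b_score), 2))
```

Scores are summed log-probabilities rather than multiplied probabilities, which avoids underflow on long sequences. It also makes the length bias visible: every extra token adds a negative term.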

lab-04-seeded-rng-batch-invariance [CPU-OK]

The reproducibility contract: a seeded request's tokens must not depend on its batch neighbors. Build the per-request-generator sampler, prove invariance with 0/1/5 interleaved neighbors — and watch the natural shared-RNG implementation fail the same scenario (the control test ships with the lab). Skills: randomness as private state; continuity vs re-seeding; isolation claims need broken controls; the kernel layer of nondeterminism.
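A sketch of the contract and its broken control, using the stdlib `random.Random` as a stand-in for whatever generator the lab uses. The class names, the `run` helper, and the uniform 50-token distribution are assumptions for illustration.

```python
import random

VOCAB = 50
UNIFORM = [1.0 / VOCAB] * VOCAB


def draw(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


class PerRequestSampler:
    """Each request owns a generator, created once and kept across steps."""

    def __init__(self):
        self._rngs = {}

    def add_request(self, req_id, seed):
        self._rngs[req_id] = random.Random(seed)

    def step(self, batch):
        return {req_id: draw(probs, self._rngs[req_id]) for req_id, probs in batch}


class SharedRngSampler:
    """Broken control: one generator serves the whole batch."""

    def __init__(self, seed):
        self._rng = random.Random(seed)

    def add_request(self, req_id, seed):
        pass  # the per-request seed is ignored

    def step(self, batch):
        return {req_id: draw(probs, self._rng) for req_id, probs in batch}


def run(sampler, n_neighbors, steps=16):
    sampler.add_request("target", seed=1234)
    for i in range(n_neighbors):
        sampler.add_request(f"nb{i}", seed=i)
    tokens = []
    for _ in range(steps):
        # Neighbors are sampled before the target in every step.
        batch = [(f"nb{i}", UNIFORM) for i in range(n_neighbors)]
        batch.append(("target", UNIFORM))
        tokens.append(sampler.step(batch)["target"])
    return tokens


alone = run(PerRequestSampler(), n_neighbors=0)
crowded = run(PerRequestSampler(), n_neighbors=5)
print(alone == crowded)  # the target's tokens do not depend on its neighbors

shared_alone = run(SharedRngSampler(1234), n_neighbors=0)
shared_crowded = run(SharedRngSampler(1234), n_neighbors=5)
print(shared_alone == shared_crowded)  # neighbors consume draws from the shared stream
```

The per-request generator is advanced step by step and never re-seeded, so the stream stays continuous across steps. This sketch says nothing about kernel-level nondeterminism, which would sit below the sampler.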

What you can do after this phase

Hold the entire logits-to-token path in your head, in order, with reasons; extend it safely through the processor hook (and recognize Phase 12 as one more processor); deliver seeded reproducibility under batching and explain what it does and doesn't promise; choose between sampling, beam search, and best-of-n from their actual cost and quality shapes; and price candidate-generation workloads (self-consistency, RLHF sampling) from the sharing arithmetic. Phase 10 scales the engine across GPUs; the per-request state you isolated here is exactly what has to survive the trip.