Phase 09 — Exercises: Sampling & Decoding Algorithms
Warm-up (explain)
- What is the pipeline order (penalties → ? → ? → ? → sample) and why does order matter?
- Greedy vs temperature 0 vs top-k=1 — are these the same? When?
- Top-p vs top-k vs min-p: describe each, and say which of them adapt to model confidence and how.
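As a reference point for the warm-up questions, here is a minimal pure-Python sketch of the three truncation filters using their commonly cited definitions. The function names (`softmax`, `top_k_keep`, `top_p_keep`, `min_p_keep`) and the toy logits are illustrative assumptions, not code from the files named in this phase. It compares how many tokens each filter keeps for a peaked distribution and for a flat one.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_keep(probs, k):
    """Indices of the k most probable tokens: a fixed count."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def top_p_keep(probs, p):
    """Smallest set of most probable tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return kept

def min_p_keep(probs, min_p):
    """Tokens with probability at least min_p times the top probability."""
    threshold = min_p * max(probs)
    return {i for i, q in enumerate(probs) if q >= threshold}

# Hypothetical toy distributions: one peaked, one nearly flat.
confident = softmax([8.0, 2.0, 1.0, 0.5, 0.0])
uncertain = softmax([1.2, 1.1, 1.0, 0.9, 0.8])

for name, probs in [("confident", confident), ("uncertain", uncertain)]:
    print(
        name,
        "top_k=3 keeps", len(top_k_keep(probs, 3)),
        "| top_p=0.9 keeps", len(top_p_keep(probs, 0.9)),
        "| min_p=0.1 keeps", len(min_p_keep(probs, 0.1)),
    )
```

In this sketch top-k keeps three tokens in both cases, while top-p and min-p keep one token for the peaked distribution and all five for the flat one. With `k=1` the kept set is just the argmax, which is why top-k=1 coincides with greedy decoding when there are no ties.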
Core (trace the code)
- In Sampler.forward (sampler.py:67), where are per-request params read from, and why are they tensors rather than a Python loop?
- What is a logits processor (logits_processor/interface.py)? Name three things it implements.
- How does parallel sampling (parallel_sampling.py) reuse prefix caching for n>1?
Build (your lab)
- In lab-01, why must repetition penalty be applied before temperature?
- Add frequency and presence penalties (count-scaled vs flat) and test their difference.
- Implement a logit_bias logits processor (add a constant to specified token ids) and verify a strongly biased token dominates.
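One possible starting point for the last two Build items, assuming the widely used additive definitions: a frequency penalty subtracts `penalty * count` from a token's logit, and a presence penalty subtracts a flat `penalty` once the token has appeared at all. The helper names `apply_frequency_presence` and `apply_logit_bias` are hypothetical and operate on plain Python lists; your lab code will likely work on batched tensors instead.

```python
from collections import Counter

def apply_frequency_presence(logits, output_token_ids,
                             frequency_penalty=0.0, presence_penalty=0.0):
    """Return a penalized copy of the logits for one request."""
    counts = Counter(output_token_ids)
    out = list(logits)
    for tok, count in counts.items():
        out[tok] -= frequency_penalty * count  # count-scaled: grows with repeats
        out[tok] -= presence_penalty           # flat: same for 1 or 100 repeats
    return out

def apply_logit_bias(logits, bias):
    """Add a constant to the logits of the specified token ids."""
    out = list(logits)
    for tok, delta in bias.items():
        out[tok] += delta
    return out

# Toy setup: four tokens with equal logits; token 0 was generated three
# times and token 1 once.
logits = [2.0, 2.0, 2.0, 2.0]
history = [0, 0, 0, 1]

freq = apply_frequency_presence(logits, history, frequency_penalty=0.5)
pres = apply_frequency_presence(logits, history, presence_penalty=0.5)
print("frequency:", freq)  # token 0 is penalized three times as much as token 1
print("presence: ", pres)  # tokens 0 and 1 are penalized equally

# A large positive bias makes the chosen token the argmax.
biased = apply_logit_bias(logits, {3: 100.0})
winner = max(range(len(biased)), key=biased.__getitem__)
print("argmax after bias:", winner)
```

A useful test for the exercise is exactly this contrast: under the frequency penalty the token seen three times ends up below the token seen once, while under the presence penalty the two end up equal.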
Design (staff-level)
- You must apply 256 different (temperature, top_p, penalties) settings in one decode step. Sketch the data layout and explain why a Python loop is unacceptable on the hot path.
- A user reports repetitive loops at temperature 0. What knobs help, and what's the tradeoff of each (for example, a penalty set too high degrades quality)?
- Beam search is requested for a production endpoint. Explain why it's awkward in continuous batching and how you'd bound its cost.
Self-grading
Questions 4–6 (Core) and 10–12 (Design), counting the bullets above from 1, are interview-grade. Could you whiteboard the batched pipeline and name the files? If not, re-read 01-deep-dive.md.