Phase 09 — Exercises: Sampling & Decoding Algorithms

Warm-up (explain)

  1. What is the pipeline order (penalties → ? → ? → ? → sample) and why does order matter?
  2. Greedy vs temperature 0 vs top-k=1 — are these the same? When?
  3. Top-p vs top-k vs min-p: describe each, and say which of them adapt to model confidence and how.
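As a reference point for the third warm-up question, here is a minimal sketch of the three filters applied to a plain list of probabilities. The function names (`top_k_keep`, `top_p_keep`, `min_p_keep`) and the two example distributions are hypothetical, chosen for illustration only. They are not taken from sampler.py, which works on batched tensors rather than Python lists.

```python
def _by_prob_desc(probs):
    """Token indices sorted from most to least probable."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

def top_k_keep(probs, k):
    """Keep a fixed number of tokens: the k most probable."""
    return set(_by_prob_desc(probs)[:k])

def top_p_keep(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    kept, mass = set(), 0.0
    for i in _by_prob_desc(probs):
        kept.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return kept

def min_p_keep(probs, min_p):
    """Keep tokens with probability at least min_p times the top probability."""
    threshold = min_p * max(probs)
    return {i for i, q in enumerate(probs) if q >= threshold}

# Hypothetical distributions: one peaked, one nearly flat.
confident = [0.90, 0.05, 0.03, 0.02]
uncertain = [0.30, 0.25, 0.25, 0.20]

for name, dist in (("confident", confident), ("uncertain", uncertain)):
    print(name,
          "top_k=2:", sorted(top_k_keep(dist, 2)),
          "top_p=0.7:", sorted(top_p_keep(dist, 0.7)),
          "min_p=0.1:", sorted(min_p_keep(dist, 0.1)))
```

Comparing how many tokens each filter keeps on the peaked versus the flat distribution is one way to check your answer about which filters respond to model confidence.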

Core (trace the code)

  4. In Sampler.forward (sampler.py:67), where are the per-request params read from, and why are they stored as tensors rather than applied one request at a time in a Python loop?
  5. What is a logits processor (logits_processor/interface.py)? Name three things it implements.
  6. How does parallel sampling (parallel_sampling.py) reuse prefix caching for n>1?

Build (your lab)

  7. In lab-01, why must repetition penalty be applied before temperature?
  8. Add frequency and presence penalties (count-scaled vs flat) and test their difference.
  9. Implement a logit_bias logits processor (add a constant to specified token ids) and verify a strongly biased token dominates.
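A possible starting point for the penalty and logit_bias exercises is sketched below. It assumes the common definition in which the frequency penalty is subtracted once per prior occurrence of a token (count-scaled) and the presence penalty is subtracted once if the token has appeared at all (flat). The function names and the list-based interface are hypothetical. lab-01's actual interface is not shown here, so adapt the shapes to whatever your lab uses.

```python
import math
from collections import Counter

def apply_penalties(logits, generated_ids,
                    frequency_penalty=0.0, presence_penalty=0.0):
    """Return a copy of logits with frequency and presence penalties applied."""
    out = list(logits)
    for tok, count in Counter(generated_ids).items():
        out[tok] -= frequency_penalty * count  # grows with each repeat
        out[tok] -= presence_penalty           # same for any seen token
    return out

def apply_logit_bias(logits, bias):
    """Return a copy of logits with a constant added to the given token ids."""
    out = list(logits)
    for tok, delta in bias.items():
        out[tok] += delta
    return out

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 4-token vocabulary; token 2 was generated three times, token 1 once.
base = [1.0, 1.0, 1.0, 1.0]
history = [2, 2, 2, 1]

print("frequency:", apply_penalties(base, history, frequency_penalty=0.5))
print("presence: ", apply_penalties(base, history, presence_penalty=0.5))
print("biased p: ", softmax(apply_logit_bias(base, {3: 20.0})))
```

With the frequency penalty, token 2 is pushed down further than token 1. With the presence penalty, both are pushed down by the same amount. That difference is what your test in the second Build exercise should assert.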

Design (staff-level)

  10. You must apply 256 different (temperature, top_p, penalties) settings, one per request, in a single decode step. Sketch the data layout and explain why a Python loop is unacceptable on the hot path.
  11. A user reports repetitive loops at temperature 0. What knobs help, and what is the tradeoff of each (for example, too high a penalty degrades quality)?
  12. Beam search is requested for a production endpoint. Explain why it's awkward in continuous batching and how you'd bound its cost.

Self-grading

Questions 4–6 and 10–12 are interview-grade. Could you whiteboard the batched pipeline and name the files? If not, re-read 01-deep-dive.md.