Phase 09 — Exercises: Sampling & Decoding Algorithms

Contents

Warm-up (explain)
Core (trace the code)
Build (your lab)
Design (staff-level)
Self-grading

Warm-up (explain)

1. What is the pipeline order (penalties → ? → ? → ? → sample), and why does the order matter?
2. Greedy vs temperature 0 vs top-k=1: are these the same? Under what conditions?
3. Top-p vs top-k vs min-p: describe each, and explain which of them adapt to model confidence and how.
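
A minimal sketch to experiment with while answering the top-p vs top-k vs min-p question. It is plain Python over a toy 5-token vocabulary, not the engine's implementation; the helper names (softmax, top_k_keep, top_p_keep, min_p_keep) and the example logits are made up for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities. Assumes temperature > 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_keep(probs, k):
    """Indices of the k most probable tokens (a fixed count)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def top_p_keep(probs, p):
    """Smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept

def min_p_keep(probs, min_p):
    """Tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs)
    return {i for i, q in enumerate(probs) if q >= threshold}

# Hypothetical logits: one peaked distribution, one nearly flat one.
confident = softmax([8.0, 2.0, 1.0, 0.5, 0.0])
uncertain = softmax([1.0, 0.9, 0.8, 0.7, 0.6])

for name, probs in (("confident", confident), ("uncertain", uncertain)):
    print(
        name,
        "top_k=3 keeps", len(top_k_keep(probs, 3)),
        "| top_p=0.9 keeps", len(top_p_keep(probs, 0.9)),
        "| min_p=0.1 keeps", len(min_p_keep(probs, 0.1)),
    )
```

With these inputs, top-k keeps three tokens in both cases, while top-p and min-p keep a single token for the peaked distribution and all five for the flat one. Try other thresholds before writing your answer.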

Core (trace the code)

4. In Sampler.forward (sampler.py:67), where are the per-request params read from, and why are they stored as tensors rather than iterated over in a Python loop?
5. What is a logits processor (logits_processor/interface.py)? Name three things it implements.
6. How does parallel sampling (parallel_sampling.py) reuse prefix caching for n>1?

Build (your lab)

7. In lab-01, why must repetition penalty be applied before temperature?
8. Add frequency and presence penalties (count-scaled vs flat) and test their difference.
9. Implement a logit_bias logits processor (add a constant to specified token ids) and verify a strongly biased token dominates.
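
A possible starting point for the penalty and logit_bias exercises above, in plain Python. It assumes the common additive convention: the frequency penalty is subtracted once per occurrence, and the presence penalty is subtracted once if the token has appeared at all. The function names and the toy values are hypothetical and are not taken from lab-01 or from the logits processor interface.

```python
from collections import Counter

def apply_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Return a penalized copy of logits for tokens already generated."""
    counts = Counter(generated_ids)
    out = list(logits)
    for token_id, count in counts.items():
        out[token_id] -= frequency_penalty * count  # count-scaled
        out[token_id] -= presence_penalty           # flat, once per seen token
    return out

def apply_logit_bias(logits, bias):
    """Return a copy of logits with a constant added to the given token ids."""
    out = list(logits)
    for token_id, value in bias.items():
        out[token_id] += value
    return out

# Toy case: token 0 was generated three times, token 1 once.
logits = [2.0, 2.0, 2.0, 2.0]
history = [0, 0, 0, 1]

freq = apply_penalties(logits, history, frequency_penalty=0.5)
pres = apply_penalties(logits, history, presence_penalty=0.5)
biased = apply_logit_bias(logits, {3: 100.0})

print("frequency:", freq)  # token 0 is penalized more than token 1
print("presence: ", pres)  # tokens 0 and 1 are penalized equally
print("argmax after bias:", max(range(len(biased)), key=biased.__getitem__))
```

The frequency run separates tokens 0 and 1 by how often each appeared, and the presence run treats them identically. That difference is what your test for this exercise should assert.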

Design (staff-level)

10. You must apply 256 different (temperature, top_p, penalties) settings in one decode step. Sketch the data layout and explain why a Python loop is unacceptable on the hot path.
11. A user reports repetitive loops at temperature 0. Which knobs help, and what is the tradeoff of each (for example, a penalty set too high degrades quality)?
12. Beam search is requested for a production endpoint. Explain why it is awkward under continuous batching and how you would bound its cost.
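
For the batched-parameters question above, one way to picture the layout is a [B, V] logits matrix with one [B] array per sampling parameter, so every row is processed by the same vectorized operations. The sketch below uses NumPy on the CPU as a stand-in for device tensors and covers only temperature and top-p. The function name batched_probs is hypothetical, and temperature is assumed to be strictly positive (greedy rows would need separate handling).

```python
import numpy as np

def batched_probs(logits, temperature, top_p):
    """Apply per-row temperature and top-p without a Python loop over requests.

    logits: float array [B, V]; temperature, top_p: float arrays [B].
    Returns filtered, renormalized probabilities of shape [B, V].
    """
    scaled = logits / temperature[:, None]             # broadcast [B] over [B, V]
    scaled = scaled - scaled.max(axis=1, keepdims=True)
    probs = np.exp(scaled)
    probs /= probs.sum(axis=1, keepdims=True)

    order = np.argsort(-probs, axis=1)                  # descending per row
    sorted_probs = np.take_along_axis(probs, order, axis=1)
    mass_before = np.cumsum(sorted_probs, axis=1) - sorted_probs
    keep_sorted = mass_before < top_p[:, None]          # smallest set reaching top_p

    keep = np.zeros_like(keep_sorted)
    np.put_along_axis(keep, order, keep_sorted, axis=1)  # undo the sort
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum(axis=1, keepdims=True)

# Two requests with the same logits but different top_p values.
logits = np.array([[2.0, 1.0, 0.0, -1.0],
                   [2.0, 1.0, 0.0, -1.0]])
temperature = np.array([1.0, 1.0])
top_p = np.array([0.5, 1.0])

out = batched_probs(logits, temperature, top_p)
print((out > 0).sum(axis=1))  # number of surviving tokens per request
```

The same code runs unchanged for B = 256: only the leading dimension of the three arrays grows, whereas a per-request Python loop would pay interpreter overhead for every row on every decode step.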

Self-grading

Questions 4–6 (Core) and 10–12 (Design) are interview-grade. Can you whiteboard the batched pipeline and name the files involved? If not, re-read 01-deep-dive.md.