Phase 09 — Interview Questions: Sampling & Decoding Algorithms
Q1. Walk through the sampling pipeline.
Model answer
Logits → logits processors (penalties, logit bias, bad-words, grammar mask) → temperature scaling
→ top-k truncation → top-p/min-p truncation → sample (argmax for greedy rows, multinomial
otherwise). Order matters: penalties edit raw logits, temperature reshapes, top-k/p prune the
support, then you draw. (Sampler.forward, sampler.py:67.)
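The ordering above can be sketched for a single row in plain Python. This is a minimal illustration, not the code in sampler.py: `sample_token` and its parameter names are hypothetical, and the repetition penalty stands in for the whole family of logits processors.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0,
                 repetition_penalty=1.0, seen=(), rng=None):
    """Hypothetical single-row sampler following the order described above."""
    rng = rng or random.Random()
    logits = list(logits)

    # 1. Logits processors edit the raw logits (here: a repetition penalty).
    for t in set(seen):
        x = logits[t]
        logits[t] = x / repetition_penalty if x > 0 else x * repetition_penalty

    # Greedy rows: argmax. Positive temperature scaling and top-k/top-p never
    # remove the maximum, so returning here gives the same result.
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)

    # 2. Temperature reshapes the distribution.
    logits = [x / temperature for x in logits]

    # 3. Top-k prunes the support (ties at the k-th value are all kept).
    if 0 < top_k < len(logits):
        kth = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= kth else -math.inf for x in logits]

    # 4. Top-p keeps the smallest prefix of the sorted tokens whose mass >= p.
    if top_p < 1.0:
        probs = softmax(logits)
        order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
        keep, cum = set(), 0.0
        for i in order:
            keep.add(i)
            cum += probs[i]
            if cum >= top_p:
                break
        logits = [x if i in keep else -math.inf for i, x in enumerate(logits)]

    # 5. Draw from what is left.
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Calling `sample_token([1.0, 3.0, 2.0], temperature=0)` takes the greedy branch; any positive temperature goes through steps 2 to 5.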
Q2. How do you apply different sampling params per request in one batched kernel?
Model answer
Pack per-request params (temperature, top_k, top_p, penalties, seeds) into tensors aligned with
the batch (SamplingMetadata), then apply vectorized, branch-free masked ops so each row uses its
own settings in one GPU pass. Greedy rows (temperature 0) take the argmax instead of a multinomial
draw. There is no per-request Python loop on the hot path; that is the systems challenge, not the math.
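A rough sketch of the idea, using NumPy arrays in place of GPU tensors. The function `batched_sample` and the conventions "temperature 0 means greedy" and "top_k 0 means disabled" are assumptions made for this illustration; they are not taken from SamplingMetadata.

```python
import numpy as np

def batched_sample(logits, temperature, top_k, rng):
    """logits: [B, V]; temperature: [B] floats; top_k: [B] ints. Returns [B] token ids."""
    B, V = logits.shape
    greedy = temperature == 0.0

    # Avoid dividing by zero; greedy rows are overwritten by argmax at the end.
    safe_t = np.where(greedy, 1.0, temperature)
    scaled = logits / safe_t[:, None]

    # Per-row top-k without a loop: rows with top-k disabled use k = V.
    k = np.where(top_k > 0, np.minimum(top_k, V), V)
    sorted_desc = -np.sort(-scaled, axis=1)
    kth = np.take_along_axis(sorted_desc, (k - 1)[:, None], axis=1)
    masked = np.where(scaled >= kth, scaled, -np.inf)

    # Row-wise softmax.
    masked = masked - masked.max(axis=1, keepdims=True)
    probs = np.exp(masked)
    probs /= probs.sum(axis=1, keepdims=True)

    # Inverse-CDF draw with one uniform per row. Normalizing by the last
    # column makes the final CDF entry exactly 1.0, so an index always exists.
    cdf = np.cumsum(probs, axis=1)
    cdf /= cdf[:, -1:]
    u = rng.random(B)
    random_ids = (cdf <= u[:, None]).sum(axis=1)

    # Select per row: argmax for greedy rows, the random draw otherwise.
    greedy_ids = logits.argmax(axis=1)
    return np.where(greedy, greedy_ids, random_ids)
```

Every row is processed by the same array operations; the per-request differences live entirely in the `temperature` and `top_k` vectors, and `np.where` plays the role of the masked select.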
Q3. top-k vs top-p vs min-p?
Model answer
top-k keeps a fixed number of highest-prob tokens; top-p (nucleus) keeps the smallest set whose cumulative prob ≥ p (adaptive — few when confident, many when unsure); min-p keeps tokens with prob ≥ min_p × max_prob (a confidence-relative floor). top-p and min-p adapt to the distribution's shape; top-k doesn't.
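To make the contrast concrete, here is a small sketch that applies each rule to a confident and an unsure distribution. The helper names (`top_k_keep`, `top_p_keep`, `min_p_keep`) and the two example distributions are made up for illustration.

```python
def _by_prob_desc(probs):
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

def top_k_keep(probs, k):
    # Fixed-size support: always the k most likely tokens.
    return set(_by_prob_desc(probs)[:k])

def top_p_keep(probs, p):
    # Smallest set of most likely tokens whose cumulative mass reaches p.
    keep, cum = set(), 0.0
    for i in _by_prob_desc(probs):
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    return keep

def min_p_keep(probs, min_p):
    # Keep tokens at least min_p times as likely as the top token.
    floor = min_p * max(probs)
    return {i for i, q in enumerate(probs) if q >= floor}

confident = [0.90, 0.05, 0.03, 0.02]
unsure = [0.30, 0.28, 0.22, 0.20]

for name, probs in [("confident", confident), ("unsure", unsure)]:
    print(name,
          "top-k(2):", sorted(top_k_keep(probs, 2)),
          "top-p(0.85):", sorted(top_p_keep(probs, 0.85)),
          "min-p(0.5):", sorted(min_p_keep(probs, 0.5)))
```

With these inputs top-k keeps two tokens in both cases, while top-p and min-p keep one token for the confident distribution and all four for the unsure one.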
Q4. What is a logits processor and why is it the right abstraction?
Model answer
A hook that transforms logits at a defined point before sampling. It cleanly composes penalties,
logit bias, bad-words, and — crucially — structured-output grammar masks (Phase 12), all without
special-casing the sampler. Build it once and constrained decoding becomes "a processor that sets
illegal tokens to -inf." (logits_processor/interface.py.)
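A minimal sketch of the abstraction, assuming a processor is just a callable from (tokens generated so far, logits) to new logits. This signature and the helper names are hypothetical and are not the interface defined in logits_processor/interface.py; the "grammar" is a toy rule over a three-token vocabulary.

```python
import math
from typing import Callable, Dict, List, Sequence, Set

# Hypothetical signature: (generated token ids, logits) -> new logits.
LogitsProcessor = Callable[[Sequence[int], List[float]], List[float]]

def make_logit_bias(bias: Dict[int, float]) -> LogitsProcessor:
    def process(generated, logits):
        out = list(logits)
        for tok, b in bias.items():
            out[tok] += b
        return out
    return process

def make_allowed_mask(allowed_fn: Callable[[Sequence[int]], Set[int]]) -> LogitsProcessor:
    # Constrained decoding as a processor: illegal tokens are set to -inf.
    def process(generated, logits):
        allowed = allowed_fn(generated)
        return [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    return process

def apply_processors(processors, generated, logits):
    # The sampler only runs the chain; it knows nothing about what each hook does.
    for proc in processors:
        logits = proc(generated, logits)
    return logits

# Toy grammar over the vocabulary {0: "{", 1: "}", 2: "x"}:
# the output must start with "{", after which only "x" or "}" may follow.
def toy_grammar(generated):
    return {0} if not generated else {1, 2}

chain = [make_logit_bias({2: 5.0}), make_allowed_mask(toy_grammar)]
first_step = apply_processors(chain, [], [0.1, 0.2, 0.3])
second_step = apply_processors(chain, [0], [0.1, 0.2, 0.3])
```

Adding a new constraint means appending another callable to `chain`; the sampling code after the hook stays unchanged.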
Q5. How does n>1 parallel sampling work efficiently?
Model answer
The prompt is prefilled once; the N samples share its KV blocks via prefix caching (Phase 2/3) and
diverge from the first sampled token onward, each carrying its own RNG/params. So N completions cost
~one prefill plus N decodes, not N full requests. (parallel_sampling.py.) Beam search can't share
this way because it prunes/branches the active set each step.
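The cost argument can be illustrated with a toy fan-out. Everything here is an assumption made for the sketch: `ChildRequest`, `fan_out`, and `tokens_computed` are hypothetical names, the block list is a stand-in for real KV blocks, and none of it mirrors parallel_sampling.py.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ChildRequest:
    seed: int                    # each sample gets its own RNG seed
    prompt_blocks: List[int]     # reference to the shared prompt KV blocks
    generated: List[int] = field(default_factory=list)  # private continuation

def fan_out(prompt_blocks: List[int], n: int, base_seed: int) -> List[ChildRequest]:
    # All children point at the same block list; nothing is copied.
    return [ChildRequest(seed=base_seed + i, prompt_blocks=prompt_blocks)
            for i in range(n)]

def tokens_computed(prompt_len: int, decode_len: int, n: int, share_prompt: bool) -> int:
    # Count forward-pass token positions: the prefill runs once if shared,
    # n times otherwise; decoding is always per sample.
    prefill = prompt_len if share_prompt else n * prompt_len
    return prefill + n * decode_len

blocks = [0, 1, 2, 3]
children = fan_out(blocks, n=4, base_seed=1234)
shared_cost = tokens_computed(prompt_len=1000, decode_len=100, n=4, share_prompt=True)
separate_cost = tokens_computed(prompt_len=1000, decode_len=100, n=4, share_prompt=False)
```

For a 1000-token prompt and four 100-token completions, the shared version computes 1400 token positions against 4400 for four independent requests.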
Rapid-fire
- Greedy = ? temperature 0 = argmax.
- Pipeline order? penalties → temperature → top-k → top-p/min-p → sample.
- Per-request params live in? SamplingMetadata (tensors).
- The pre-sampling hook? logits processors.
- n>1 reuses? prefix caching (shared prompt KV).