Phase 09 — Interview Questions: Sampling & Decoding Algorithms

Q1. Walk through the sampling pipeline.

Model answer

Logits → logits processors (penalties, logit bias, bad-words, grammar mask) → temperature scaling → top-k truncation → top-p/min-p truncation → sample (argmax for greedy rows, multinomial otherwise). Order matters: penalties edit raw logits, temperature reshapes, top-k/p prune the support, then you draw. (Sampler.forward, sampler.py:67.)
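The ordering above can be sketched for a single row in plain Python. This is a minimal illustration, not the code in sampler.py: the function name `sample_token` and its parameters are hypothetical, min-p is left out for brevity, and lists stand in for tensors.

```python
import math
import random

NEG_INF = float("-inf")


def softmax(logits):
    """Numerically stable softmax; masked (-inf) entries become probability 0."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, *, processors=(), temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Hypothetical single-row pipeline: processors -> temperature -> top-k -> top-p -> draw."""
    # 1. Logits processors edit the raw logits (penalties, bias, masks).
    logits = list(logits)
    for proc in processors:
        logits = proc(logits)

    # Greedy rows: temperature 0 means argmax, no random draw.
    if temperature == 0.0:
        return max(range(len(logits)), key=logits.__getitem__)

    # 2. Temperature reshapes the distribution.
    logits = [x / temperature for x in logits]

    # 3. Top-k prunes the support to the k highest logits (ties at the cutoff are kept).
    if 0 < top_k < len(logits):
        kth = sorted(logits, reverse=True)[top_k - 1]
        logits = [x if x >= kth else NEG_INF for x in logits]

    # 4. Top-p keeps the smallest prefix of the sorted probs whose mass reaches top_p.
    probs = softmax(logits)
    if top_p < 1.0:
        order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
        keep, cum = set(), 0.0
        for i in order:
            keep.add(i)
            cum += probs[i]
            if cum >= top_p:
                break
        probs = [p if i in keep else 0.0 for i, p in enumerate(probs)]

    # 5. Draw from what is left (random.choices renormalizes the weights).
    if rng is None:
        rng = random.Random()
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Swapping steps changes the result: applying top-p before temperature, for example, would prune using the unscaled distribution and keep a different set of tokens.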

Q2. How do you apply different sampling params per request in one batched kernel?

Model answer

Pack per-request params (temperature, top_k, top_p, penalties, seeds) into tensors aligned with the batch (SamplingMetadata), then apply vectorized, branch-free masked ops so each row uses its own settings in one GPU pass. Greedy rows (temperature 0) are resolved with argmax instead of a multinomial draw, and a per-row mask picks which result each row keeps. There is no per-request Python loop on the hot path; that is the systems challenge, not the math.
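A rough NumPy sketch of the idea, assuming NumPy is available and using it as a stand-in for GPU tensors. The function `sample_batch` and its argument layout are hypothetical, not the SamplingMetadata structure itself; top-p, penalties, and per-request seeds are omitted, and one shared generator is used for all rows.

```python
import numpy as np


def sample_batch(logits, temperature, top_k, rng):
    """Hypothetical batched sampler.

    logits:      [B, V] float array
    temperature: [B] float array, 0.0 marks a greedy row
    top_k:       [B] int array, 0 disables top-k for that row
    rng:         numpy.random.Generator shared by the batch (a simplification)
    """
    _, vocab = logits.shape

    # Greedy rows are identified by a mask, not by an if-statement per request.
    greedy = temperature == 0.0
    safe_t = np.where(greedy, 1.0, temperature)  # avoid dividing by zero
    scaled = logits / safe_t[:, None]

    # Per-row top-k: look up each row's own k-th largest value as a threshold.
    k = np.where(top_k > 0, np.minimum(top_k, vocab), vocab)
    sorted_desc = -np.sort(-scaled, axis=1)
    kth = np.take_along_axis(sorted_desc, (k - 1)[:, None], axis=1)
    masked = np.where(scaled >= kth, scaled, -np.inf)

    # Row-wise softmax over the surviving tokens.
    z = masked - masked.max(axis=1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=1, keepdims=True)

    # Categorical draw for every row at once: argmax(p_i / E_i) with E_i ~ Exp(1)
    # selects index i with probability p_i.
    noise = rng.exponential(size=probs.shape)
    random_ids = np.argmax(probs / noise, axis=1)

    # Both paths are computed for all rows; the mask selects per row.
    greedy_ids = np.argmax(logits, axis=1)
    return np.where(greedy, greedy_ids, random_ids)
```

Computing both the greedy and the random result for every row looks wasteful, but it keeps the work uniform across the batch, which is what lets a single pass serve mixed requests.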

Q3. top-k vs top-p vs min-p?

Model answer

top-k keeps a fixed number of highest-prob tokens; top-p (nucleus) keeps the smallest set whose cumulative prob ≥ p (adaptive — few when confident, many when unsure); min-p keeps tokens with prob ≥ min_p × max_prob (a confidence-relative floor). top-p and min-p adapt to the distribution's shape; top-k doesn't.
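To make the contrast concrete, here is a small sketch of the three filters applied to a confident and an unsure distribution. The helper names are illustrative only, and each function returns the set of token indices it keeps.

```python
def top_k_keep(probs, k):
    """Keep the k highest-probability tokens, regardless of shape."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return set(order[:k])


def top_p_keep(probs, p):
    """Keep the smallest high-probability prefix whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    return keep


def min_p_keep(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    floor = min_p * max(probs)
    return {i for i, q in enumerate(probs) if q >= floor}


confident = [0.90, 0.05, 0.03, 0.01, 0.01]
unsure = [0.25, 0.22, 0.20, 0.18, 0.15]

print(len(top_k_keep(confident, 3)), len(top_k_keep(unsure, 3)))      # 3 3
print(len(top_p_keep(confident, 0.8)), len(top_p_keep(unsure, 0.8)))  # 1 4
print(len(min_p_keep(confident, 0.1)), len(min_p_keep(unsure, 0.1)))  # 1 5
```

top-k returns three tokens in both cases, while top-p and min-p shrink to a single token on the confident distribution and widen on the unsure one.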

Q4. What is a logits processor and why is it the right abstraction?

Model answer

A hook that transforms logits at a defined point before sampling. It cleanly composes penalties, logit bias, bad-words, and — crucially — structured-output grammar masks (Phase 12), all without special-casing the sampler. Build it once and constrained decoding becomes "a processor that sets illegal tokens to -inf." (logits_processor/interface.py.)
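One way to picture the abstraction is a plain callable from logits to logits. This is a sketch under that assumption: the type alias and helper names below are hypothetical and are not the interface defined in logits_processor/interface.py. `allow_only` stands in for a grammar mask that receives the set of currently legal tokens.

```python
from typing import Callable, Dict, Iterable, List

NEG_INF = float("-inf")

# Hypothetical processor type: takes one row of logits, returns the edited row.
LogitsProcessor = Callable[[List[float]], List[float]]


def logit_bias(bias: Dict[int, float]) -> LogitsProcessor:
    """Add a fixed offset to selected token ids."""
    def apply(logits: List[float]) -> List[float]:
        return [x + bias.get(i, 0.0) for i, x in enumerate(logits)]
    return apply


def ban_tokens(banned: Iterable[int]) -> LogitsProcessor:
    """Bad-words style filter: banned ids can never be sampled."""
    banned_set = frozenset(banned)
    def apply(logits: List[float]) -> List[float]:
        return [NEG_INF if i in banned_set else x for i, x in enumerate(logits)]
    return apply


def allow_only(allowed: Iterable[int]) -> LogitsProcessor:
    """Grammar-mask stand-in: everything outside the legal set becomes -inf."""
    allowed_set = frozenset(allowed)
    def apply(logits: List[float]) -> List[float]:
        return [x if i in allowed_set else NEG_INF for i, x in enumerate(logits)]
    return apply


def run_processors(logits: List[float], processors: Iterable[LogitsProcessor]) -> List[float]:
    """The sampler only needs this loop; it knows nothing about what each processor does."""
    for proc in processors:
        logits = proc(logits)
    return logits


chain = [logit_bias({0: 5.0}), ban_tokens([2]), allow_only([0, 1, 2])]
print(run_processors([1.0, 2.0, 3.0, 0.5], chain))  # [6.0, 2.0, -inf, -inf]
```

Because every feature has the same shape, adding constrained decoding means adding one more entry to the chain rather than another branch inside the sampler.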

Q5. How does n>1 parallel sampling work efficiently?

Model answer

The prompt is prefilled once; the N samples share its KV blocks via prefix caching (Phase 2/3) and diverge from the first sampled token onward, each carrying its own RNG/params. So N completions cost ~one prefill plus N decodes, not N full requests. (parallel_sampling.py.) Beam search can't be treated as N independent children this way because it prunes and re-branches the active set at every step.
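The sharing can be illustrated with a toy model in which child sequences reference one set of prompt blocks and keep their own tokens and RNG. Everything here is hypothetical (the `ChildSequence` class, `fan_out`, the `base_seed + i` seeding scheme, the block size) and is not taken from parallel_sampling.py; a random token draw stands in for a real forward pass.

```python
import random
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # assumed KV block size, for illustration only


@dataclass
class ChildSequence:
    prompt_blocks: tuple  # the same tuple object is shared by all siblings
    rng: random.Random    # each child draws from its own generator
    tokens: list = field(default_factory=list)


def fan_out(prompt_len, n, base_seed):
    """Prefill once, then create n children that reference the same prompt blocks."""
    num_blocks = -(-prompt_len // BLOCK_SIZE)  # ceiling division
    # Block ids stand in for KV memory written by the single prefill.
    # Simplification: a partially filled last block is shared like any other.
    prompt_blocks = tuple(range(num_blocks))
    return [ChildSequence(prompt_blocks, random.Random(base_seed + i)) for i in range(n)]


def decode_step(seq, vocab_size=1000):
    """Stand-in for forward pass plus sampler: appends one token from the child's own RNG."""
    seq.tokens.append(seq.rng.randrange(vocab_size))


children = fan_out(prompt_len=40, n=3, base_seed=1234)
for _ in range(5):
    for child in children:
        decode_step(child)

# One set of prompt blocks, three independent continuations.
shared = all(c.prompt_blocks is children[0].prompt_blocks for c in children)
print(shared, [len(c.tokens) for c in children])  # True [5, 5, 5]
```

The prompt's blocks are allocated once no matter how large n is, so only the decode work and the per-child generated tokens scale with the number of samples.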

Rapid-fire

  • Greedy = ? temperature 0 = argmax.
  • Pipeline order? penalties → temperature → top-k → top-p/min-p → sample.
  • Per-request params live in? SamplingMetadata (tensors).
  • The pre-sampling hook? logits processors.
  • n>1 reuses? prefix caching (shared prompt KV).