Phase 00 — Interview Questions: Foundations

Cover the answer, attempt out loud, compare. These fundamentals gate everything else — if you fumble them, the interviewer won't trust your scheduler answers.

Q1. Why is autoregressive decoding so much slower per token than prefill?

Model answer

Decode produces one token per step but must still read all the model weights and the whole KV cache from HBM each step while doing only one token's worth of math. Arithmetic intensity (FLOPs per byte moved) is therefore very low, so decode is memory-bandwidth-bound and most of the GPU's compute sits idle. Prefill amortizes the same weight read over all prompt tokens at once, so it is compute-bound and far more efficient per token. Same model, opposite bottlenecks.
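A back-of-the-envelope sketch of that argument. It assumes roughly 2 FLOPs per parameter per token and that every weight is read from HBM once per forward pass; attention FLOPs and KV-cache reads are ignored. The peak-FLOPs and bandwidth figures are illustrative round numbers, not the spec of any particular GPU.

```python
def arithmetic_intensity(tokens_per_pass: int, dtype_bytes: int = 2) -> float:
    """FLOPs per byte of weights read in one forward pass.

    Assumption: ~2 FLOPs (multiply + add) per parameter per token, and each
    weight is read once per pass, so the parameter count cancels out.
    """
    return (2 * tokens_per_pass) / dtype_bytes


def is_compute_bound(intensity: float, peak_flops: float, hbm_bandwidth: float) -> bool:
    """Compute-bound when intensity reaches the hardware's FLOPs-per-byte balance."""
    return intensity >= peak_flops / hbm_bandwidth


# Illustrative hardware numbers (assumed): balance point is ~333 FLOPs/byte.
PEAK_FLOPS = 1.0e15      # FLOP/s
HBM_BANDWIDTH = 3.0e12   # bytes/s

decode_ai = arithmetic_intensity(tokens_per_pass=1)      # one token per step
prefill_ai = arithmetic_intensity(tokens_per_pass=2048)  # whole prompt at once

print(decode_ai, is_compute_bound(decode_ai, PEAK_FLOPS, HBM_BANDWIDTH))
print(prefill_ai, is_compute_bound(prefill_ai, PEAK_FLOPS, HBM_BANDWIDTH))
```

With fp16 weights, a single-sequence decode step lands at about 1 FLOP per byte, far below the balance point, while a 2048-token prefill lands well above it.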

Q2. What is the KV cache and why does it dominate serving memory?

Model answer

It stores the Key and Value vectors of every prior token so attention need not recompute them (under causal attention they never change once computed). Without it, each decode step would recompute attention over the whole prefix, O(N²) work per step; with it, each step computes K and V for the new token only and attends over the N cached entries, O(N). Its size is 2 × layers × kv_heads × head_dim × dtype_bytes per token, and the total grows linearly with batch size and sequence length, so at scale it can exceed the weights and caps how many concurrent requests fit. For Llama-3-8B in fp16 that's ~128 KiB/token; a few thousand tokens × a few dozen users fills tens of GB.
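The Llama-3-8B figure can be checked in a few lines. The configuration used here (32 layers, 8 KV heads, head dimension 128, 2-byte fp16 values) is an assumption about that model, chosen because it matches the ~128 KiB/token figure above; the 4096-token, 32-user workload is purely illustrative.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies; the leading 2 covers Key and Value."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed Llama-3-8B shape with an fp16 KV cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
per_token_kib = per_token / 1024

# Illustrative load: 32 concurrent users, 4096 tokens each.
total_gib = per_token * 4096 * 32 / 2**30

print(per_token_kib, "KiB per token")
print(total_gib, "GiB for 32 sequences of 4096 tokens")
```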

Q3. Walk me through prefill vs decode.

Model answer

Prefill is the first pass over the whole prompt: many tokens, one pass, compute-bound. It fills the prompt's KV cache and determines TTFT (time to first token). Decode is every subsequent single-token step: one token, memory-bandwidth-bound (read all weights + KV). It determines ITL/TPOT (inter-token latency, also called time per output token). The scheduler treats both uniformly as "advance num_computed_tokens toward num_tokens," which is why chunked prefill and continuous batching fall out naturally (Phase 3).
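A toy sketch of that uniform view, not any engine's actual scheduler. The `Request` class, `schedule_step`, `run_step`, and the per-step `token_budget` are hypothetical names for illustration; only `num_tokens` and `num_computed_tokens` come from the text. It ignores stop conditions, preemption, and KV-block allocation.

```python
from dataclasses import dataclass


@dataclass
class Request:
    num_tokens: int               # tokens the request has so far (prompt + generated)
    num_computed_tokens: int = 0  # tokens whose KV is already cached


def schedule_step(requests, token_budget):
    """Give each request up to its remaining tokens, within a shared per-step budget."""
    plan = []
    for req in requests:
        if token_budget == 0:
            break
        remaining = req.num_tokens - req.num_computed_tokens
        n = min(remaining, token_budget)
        if n > 0:
            plan.append((req, n))
            token_budget -= n
    return plan


def run_step(requests, token_budget):
    """Execute one step; a request that has caught up emits one new token."""
    plan = schedule_step(requests, token_budget)
    for req, n in plan:
        req.num_computed_tokens += n
        if req.num_computed_tokens == req.num_tokens:
            req.num_tokens += 1  # sampled token becomes next step's work
    return [n for _, n in plan]


decoding = Request(num_tokens=5, num_computed_tokens=4)  # already generating
arriving = Request(num_tokens=10)                        # fresh 10-token prompt
requests = [decoding, arriving]

step1 = run_step(requests, token_budget=6)  # decode token + first prefill chunk
step2 = run_step(requests, token_budget=6)  # decode token + rest of the prefill
step3 = run_step(requests, token_budget=6)  # both requests now decoding
print(step1, step2, step3)
```

The same loop chunks the new prompt across two steps and lets it join the running batch, with no separate prefill or decode code path.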

Q4. How would you estimate KV-cache memory for a deployment?

Model answer

kv_bytes_per_token = 2 × num_layers × num_kv_heads × head_dim × dtype_bytes; multiply by max sequence length for per-sequence bytes; concurrent capacity ≈ (HBM − weights) / per-sequence bytes. Three things to watch: GQA shrinks the cache (kv_heads ≪ query_heads), an fp8 KV cache halves dtype_bytes relative to fp16/bf16, and real engines reserve some HBM for activations and CUDA-graph buffers, so usable KV is a bit less than the naive free figure.
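A minimal sketch of that estimate. The function name `max_concurrent_sequences` and every number below are assumptions for illustration: an 80 GiB device, 16 GiB of weights, 8 GiB held back for activations and buffers, and a Llama-3-8B-like shape at 8192 tokens per sequence.

```python
GIB = 2**30


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_concurrent_sequences(hbm_bytes, weight_bytes, reserved_bytes,
                             per_token_bytes, max_seq_len):
    """Worst-case count of full-length sequences whose KV cache fits in HBM."""
    usable = hbm_bytes - weight_bytes - reserved_bytes
    if usable <= 0:
        return 0
    return usable // (per_token_bytes * max_seq_len)


# Illustrative deployment (all values assumed).
fp16_per_token = kv_bytes_per_token(32, 8, 128, dtype_bytes=2)
fp8_per_token = kv_bytes_per_token(32, 8, 128, dtype_bytes=1)

fp16_capacity = max_concurrent_sequences(80 * GIB, 16 * GIB, 8 * GIB, fp16_per_token, 8192)
fp8_capacity = max_concurrent_sequences(80 * GIB, 16 * GIB, 8 * GIB, fp8_per_token, 8192)

print(fp16_capacity, fp8_capacity)
```

This is a worst-case bound: it sizes every sequence at the maximum length, so an engine that allocates KV in blocks as sequences grow will usually fit more.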

Q5. Why does batching improve throughput, and what's the cost?

Model answer

In decode, reading the model weights from HBM is the dominant cost and is shared across the batch, so a step over B sequences costs barely more than a step over one (until KV-cache reads or compute start to dominate), multiplying throughput. The cost is latency: each step does more work, and by Little's Law (concurrency = throughput × latency), raising concurrency faster than throughput rises means each request takes longer. The scheduler navigates this; Phase 18 tunes it.
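A rough memory-bound model of that tradeoff. It assumes step time is just bytes moved divided by HBM bandwidth, with weights read once per step and each sequence's KV read once; compute limits at large batch sizes are ignored, and the sizes and bandwidth are illustrative round numbers.

```python
def decode_step_seconds(batch_size, weight_bytes, kv_bytes_per_seq, hbm_bandwidth):
    """Memory-bound step time: weights are shared, KV reads scale with the batch."""
    return (weight_bytes + batch_size * kv_bytes_per_seq) / hbm_bandwidth


def tokens_per_second(batch_size, weight_bytes, kv_bytes_per_seq, hbm_bandwidth):
    """Each sequence in the batch gains one token per step."""
    step = decode_step_seconds(batch_size, weight_bytes, kv_bytes_per_seq, hbm_bandwidth)
    return batch_size / step


# Illustrative values (assumed): 16 GB of weights, 0.25 GB of KV per sequence, 2 TB/s HBM.
WEIGHTS = 16e9
KV_PER_SEQ = 0.25e9
BANDWIDTH = 2e12

for b in (1, 8, 32):
    step_ms = decode_step_seconds(b, WEIGHTS, KV_PER_SEQ, BANDWIDTH) * 1e3
    tps = tokens_per_second(b, WEIGHTS, KV_PER_SEQ, BANDWIDTH)
    print(f"B={b:>2}  step={step_ms:.2f} ms  throughput={tps:.0f} tok/s")
```

Under these assumptions, going from a batch of 1 to 32 lengthens each step by less than 2× while throughput rises by more than 20×, which is the shared-weight-read effect described above.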

Rapid-fire

  • Tokens are roughly? ~¾ of an English word on average; represented as integer ids from a tokenizer.
  • Logits are? Pre-softmax scores over the whole vocabulary for the next token.
  • Decode bottleneck? Memory bandwidth. Prefill bottleneck? Compute.
  • TTFT driven by? Prefill. ITL driven by? Decode.
  • KV bytes/token formula? 2 × layers × kv_heads × head_dim × dtype_bytes.
  • The engine's master variables? num_computed_tokens chasing num_tokens.
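For the logits item above, a minimal sketch of how pre-softmax scores become next-token probabilities. The three-entry vocabulary and the logit values are made up for illustration, and greedy argmax is only one of several ways to pick the next token.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities; subtracting the max keeps exp() stable."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # one score per vocabulary entry (toy values)
probs = softmax(logits)

# Greedy decoding: take the id with the highest probability.
next_token_id = max(range(len(probs)), key=probs.__getitem__)
print(probs, next_token_id)
```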