Phase 00 — Interview Questions: Foundations
Cover the answer, attempt out loud, compare. These fundamentals gate everything else — if you fumble them, the interviewer won't trust your scheduler answers.
Q1. Why is autoregressive decoding so much slower per token than prefill?
Model answer
Decode produces one token per step, yet each step must still read all the model weights and the whole KV cache from HBM while doing only one token's worth of math. Arithmetic intensity is terrible, so decode is memory-bandwidth-bound and most of the GPU's compute sits idle. Prefill amortizes the same weight read over all prompt tokens at once, so it is compute-bound and far more efficient per token. Same model, opposite bottlenecks.
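To put rough numbers on the arithmetic-intensity argument, here is a minimal sketch. It counts weight traffic only (KV-cache and activation reads are ignored), uses the common approximation of about 2 FLOPs per parameter per token, and assumes made-up hardware figures; `arithmetic_intensity`, `bottleneck`, and the constants are illustrative names, not taken from any engine.

```python
def arithmetic_intensity(tokens_per_pass: int, n_params: float, bytes_per_param: int = 2) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Counts weights only; KV-cache and activation traffic are ignored.
    """
    flops = 2 * n_params * tokens_per_pass   # ~2 FLOPs (multiply + add) per parameter per token
    bytes_read = n_params * bytes_per_param  # each weight is read from HBM once per pass
    return flops / bytes_read


def bottleneck(intensity: float, peak_flops: float, hbm_bandwidth: float) -> str:
    """Compare intensity to the hardware ridge point (peak FLOP/s divided by bytes/s)."""
    ridge = peak_flops / hbm_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Assumed, illustrative hardware: 1e15 FLOP/s peak and 3e12 B/s HBM bandwidth.
PEAK_FLOPS = 1e15
HBM_BANDWIDTH = 3e12
N_PARAMS = 8e9  # an 8B-parameter model with 16-bit weights

decode_ai = arithmetic_intensity(tokens_per_pass=1, n_params=N_PARAMS)
prefill_ai = arithmetic_intensity(tokens_per_pass=2048, n_params=N_PARAMS)

print(f"decode : {decode_ai:.0f} FLOPs/byte -> {bottleneck(decode_ai, PEAK_FLOPS, HBM_BANDWIDTH)}")
print(f"prefill: {prefill_ai:.0f} FLOPs/byte -> {bottleneck(prefill_ai, PEAK_FLOPS, HBM_BANDWIDTH)}")
```

With 16-bit weights the intensity works out to roughly one FLOP per byte per token in the pass, which is why a single-token decode step lands far below the ridge point while a long prefill lands above it.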
Q2. What is the KV cache and why does it dominate serving memory?
Model answer
It stores the Key and Value vectors of every prior token so attention need not recompute them
(they never change). Without it, every step reruns attention over the whole prefix, so per-step
work is O(N²); with it, per-step work is O(N). Its size is
2 × layers × kv_heads × head_dim × dtype_bytes per token, and the total grows linearly with batch
size and sequence length, so at scale it can dwarf the weights and caps how many concurrent
requests fit. For Llama-3-8B (32 layers, 8 KV heads, head_dim 128, fp16) that's 128 KiB/token; a
few thousand tokens × a few dozen users fills tens of GB.
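The per-token figure can be checked with a few lines of arithmetic. This sketch assumes the Llama-3-8B shape of 32 layers, 8 KV heads, and head_dim 128, with a 2-byte (fp16/bf16) cache; the 4096-token, 32-user workload is an illustrative choice, and `kv_bytes_per_token` is a helper named here for convenience.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache added by one token: a K and a V vector per KV head, per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed Llama-3-8B shape with a 16-bit KV cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)

# Illustrative workload: 32 concurrent users, each holding 4096 tokens of context.
total_bytes = per_token * 4096 * 32

print(f"{per_token // 1024} KiB per token")        # 128 KiB per token
print(f"{total_bytes / 1024**3:.0f} GiB in total")  # 16 GiB in total
```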
Q3. Walk me through prefill vs decode.
Model answer
Prefill is the first pass over the whole prompt: many tokens, one pass, compute-bound, fills the
prompt's KV cache, determines TTFT. Decode is every subsequent single-token step: one token,
memory-bandwidth-bound (read all weights + KV), determines ITL/TPOT. The scheduler treats both
uniformly as "advance num_computed_tokens toward num_tokens," which is why chunked prefill and
continuous batching fall out naturally (Phase 3).
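A toy sketch of that uniform view follows. It is not any engine's actual scheduler: the `Request` class, `schedule_step`, `apply_step`, and the per-step `token_budget` are hypothetical names used only to show how one rule (give each request the tokens it still needs, up to a shared budget) covers chunked prefill and decode alike.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    num_tokens: int               # prompt tokens plus tokens generated so far
    num_computed_tokens: int = 0  # tokens whose KV entries already exist


def schedule_step(requests: list[Request], token_budget: int) -> dict[str, int]:
    """Assign each request the tokens it still needs, capped by the shared budget."""
    scheduled: dict[str, int] = {}
    for req in requests:
        if token_budget == 0:
            break
        remaining = req.num_tokens - req.num_computed_tokens
        n = min(remaining, token_budget)  # a long prompt is simply cut into a chunk
        if n > 0:
            scheduled[req.req_id] = n
            token_budget -= n
    return scheduled


def apply_step(requests: list[Request], scheduled: dict[str, int]) -> None:
    """Advance the counters as if the forward pass for this step had run."""
    for req in requests:
        n = scheduled.get(req.req_id, 0)
        req.num_computed_tokens += n
        # A request that has caught up samples one new token, so it needs one more next step.
        if n > 0 and req.num_computed_tokens == req.num_tokens:
            req.num_tokens += 1


# B is mid-decode (one token behind); A is a fresh 10-token prompt.
reqs = [Request("B", num_tokens=5, num_computed_tokens=4), Request("A", num_tokens=10)]
steps = []
for _ in range(3):
    plan = schedule_step(reqs, token_budget=8)
    steps.append(plan)
    apply_step(reqs, plan)

print(steps)  # [{'B': 1, 'A': 7}, {'B': 1, 'A': 3}, {'B': 1, 'A': 1}]
```

In the first two steps A's prompt is prefetched in chunks of 7 and 3 tokens alongside B's decode token; from the third step on, both requests are decoding one token per step in the same batch.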
Q4. How would you estimate KV-cache memory for a deployment?
Model answer
kv_bytes_per_token = 2 × num_layers × num_kv_heads × head_dim × dtype_bytes; multiply by max
sequence length for per-sequence bytes; concurrent capacity ≈ (HBM − weights) / per-sequence
bytes. Watch for GQA (kv_heads ≪ query_heads shrinks it), fp8 KV cache (halves dtype_bytes), and
that real engines reserve some HBM for activations and CUDA-graph buffers, so usable KV is a bit
less than the naive free figure.
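As a worked instance of this estimate, the sketch below plugs in assumed numbers: an 80 GiB GPU, 16 GiB of weights, 8 GiB reserved for activations and CUDA-graph buffers, and an 8192-token maximum sequence length. All of these are illustrative inputs, and `max_concurrent_sequences` is a hypothetical helper.

```python
GIB = 1024**3


def max_concurrent_sequences(hbm_bytes: int, weight_bytes: int, reserved_bytes: int,
                             kv_bytes_per_token: int, max_seq_len: int) -> int:
    """Worst-case capacity: how many full-length sequences fit in the HBM left for KV."""
    usable = max(hbm_bytes - weight_bytes - reserved_bytes, 0)
    per_sequence = kv_bytes_per_token * max_seq_len
    return usable // per_sequence


# 2 × 32 layers × 8 KV heads × 128 head_dim × 2 bytes (fp16) = 131072 bytes per token.
fp16_per_token = 2 * 32 * 8 * 128 * 2
fp8_per_token = 2 * 32 * 8 * 128 * 1  # fp8 KV cache halves dtype_bytes

fp16_capacity = max_concurrent_sequences(80 * GIB, 16 * GIB, 8 * GIB, fp16_per_token, 8192)
fp8_capacity = max_concurrent_sequences(80 * GIB, 16 * GIB, 8 * GIB, fp8_per_token, 8192)

print(fp16_capacity)  # 56
print(fp8_capacity)   # 112
```

This is the pessimistic bound where every sequence reaches the maximum length; real traffic with shorter sequences fits more.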
Q5. Why does batching improve throughput, and what's the cost?
Model answer
In decode, reading the model weights from HBM is the dominant cost and is shared across a batch, so while the step stays memory-bound, processing B sequences together costs barely more than processing one, multiplying throughput. The cost is latency: each step does more work, and by Little's Law (concurrency = throughput × latency), once throughput saturates, adding concurrency only makes each request wait longer. The scheduler navigates this; Phase 18 tunes it.
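The trade-off can be seen in a toy step-time model. It assumes decode is purely memory-bound, so a step takes the time to read the weights once plus each sequence's KV cache; compute, kernel launch overhead, and scheduling are ignored. The hardware and model figures are assumed for illustration, and the function names are hypothetical.

```python
def decode_step_seconds(batch_size: int, weight_bytes: float,
                        kv_bytes_per_seq: float, hbm_bandwidth: float) -> float:
    """Toy memory-bound model: one shared weight read plus one KV read per sequence."""
    return (weight_bytes + batch_size * kv_bytes_per_seq) / hbm_bandwidth


def decode_tokens_per_second(batch_size: int, weight_bytes: float,
                             kv_bytes_per_seq: float, hbm_bandwidth: float) -> float:
    """Each step emits one token per sequence in the batch."""
    return batch_size / decode_step_seconds(batch_size, weight_bytes, kv_bytes_per_seq, hbm_bandwidth)


# Assumed figures: 16 GB of weights, 2048 tokens of context at 128 KiB/token, 2 TB/s HBM.
WEIGHT_BYTES = 16e9
KV_BYTES_PER_SEQ = 2048 * 131072
HBM_BANDWIDTH = 2e12

for batch in (1, 8, 32, 128):
    step_ms = 1e3 * decode_step_seconds(batch, WEIGHT_BYTES, KV_BYTES_PER_SEQ, HBM_BANDWIDTH)
    tps = decode_tokens_per_second(batch, WEIGHT_BYTES, KV_BYTES_PER_SEQ, HBM_BANDWIDTH)
    print(f"batch={batch:4d}  step={step_ms:6.2f} ms  throughput={tps:8.0f} tok/s")
```

In this model throughput climbs steeply at small batch sizes because the weight read is shared, while the per-step time (and so every request's inter-token latency) rises with each added sequence's KV traffic.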
Rapid-fire
- Tokens are roughly? ~¾ of a word; integer ids from a tokenizer.
- Logits are? Pre-softmax scores over the whole vocabulary for the next token.
- Decode bottleneck? Memory bandwidth. Prefill bottleneck? Compute.
- TTFT driven by? Prefill. ITL driven by? Decode.
- KV bytes/token formula? 2 × layers × kv_heads × head_dim × dtype_bytes.
- The engine's master variables? num_computed_tokens chasing num_tokens.