Phase 00 — Exercises: Foundations
Contents
- Warm-up (explain)
- Core (the distinctions that matter)
- Build (your labs)
- Design (staff-level)
- Self-grading
Warm-up (explain)
- In one sentence: what does an LLM compute, and what is "autoregressive generation"?
- Define tokens, embeddings, logits. Where in a forward pass do logits appear?
- Why does a token's K and V never change once computed? Why does that justify a cache?
Core (the distinctions that matter)
- Fill the table from memory: prefill vs decode — tokens/pass, bottleneck (compute vs memory bandwidth), and which latency metric (TTFT vs ITL) each drives.
- Explain why decode is memory-bandwidth-bound. What must the GPU read to produce one token, and how much math does it do with it?
- Why does batching help throughput specifically during decode? (Hint: what gets amortized?)
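To sanity-check your reasoning on the Core exercises, here is a rough roofline-style sketch. It rests on simplifying assumptions that are not from the guide: about 2 FLOPs (one multiply, one add) per parameter per sequence in a decode step, every weight read once per step regardless of batch size, and KV-cache reads and attention FLOPs ignored. The function names are hypothetical.

```python
def decode_arithmetic_intensity(batch_size: int, bytes_per_param: float) -> float:
    """Approximate FLOPs per byte of weights read in one decode step.

    Assumption: ~2 FLOPs per parameter per sequence, and the weights are
    read from memory once per step no matter how many sequences share it.
    KV-cache traffic and attention FLOPs are deliberately ignored.
    """
    return 2.0 * batch_size / bytes_per_param


def ridge_point(peak_flops: float, mem_bandwidth_bytes_per_s: float) -> float:
    """FLOPs/byte above which a kernel is compute-bound on given hardware."""
    return peak_flops / mem_bandwidth_bytes_per_s


if __name__ == "__main__":
    # Hypothetical accelerator: 300 TFLOP/s peak, 2 TB/s memory bandwidth.
    ridge = ridge_point(300e12, 2e12)
    for batch in (1, 8, 64):
        ai = decode_arithmetic_intensity(batch, bytes_per_param=2)  # fp16 weights
        bound = "compute" if ai >= ridge else "memory-bandwidth"
        print(f"batch={batch}: {ai:.0f} FLOPs/byte vs ridge {ridge:.0f} -> {bound}-bound")
```

Under these assumptions the intensity scales linearly with batch size, which is one way to see what batching amortizes. Substitute the real peak FLOPs and bandwidth of your hardware before drawing conclusions.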
Build (your labs)
- In lab-01, derive the exact no-cache work sum(P..P+n-1) and the cached work P+n. What's the ratio as n → ∞ for fixed P?
- In lab-02, compute kv_bytes_per_token and max concurrency for a model of your choice (look up its config: layers, kv_heads, head_dim). Then redo it with fp8 KV cache.
- A model uses MHA (num_kv_heads == num_query_heads). Show how switching to GQA with 8 KV heads changes KV memory and thus concurrency.
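For checking hand-derived answers to the Build exercises, a minimal Python sketch may help. It assumes that "work" in lab-01 counts token positions processed per forward pass, and that KV size follows the common formula 2 × layers × kv_heads × head_dim × bytes per element. The function signatures and the example config values are hypothetical and may not match what the labs use.

```python
def no_cache_work(P: int, n: int) -> int:
    """Tokens processed without a KV cache: P + (P+1) + ... + (P+n-1)."""
    return sum(range(P, P + n))


def cached_work(P: int, n: int) -> int:
    """Tokens processed with a KV cache, as stated in the exercise: P + n."""
    return P + n


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    # Factor 2: one K vector and one V vector per layer per KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_concurrency(kv_budget_bytes: int, bytes_per_token: int,
                    context_len: int) -> int:
    # Number of full-length sequences that fit in the memory left for KV cache.
    return kv_budget_bytes // (bytes_per_token * context_len)


if __name__ == "__main__":
    # Hypothetical config, not a real model: 32 layers, head_dim 128.
    mha_fp16 = kv_bytes_per_token(32, kv_heads=32, head_dim=128, bytes_per_elem=2)
    gqa_fp16 = kv_bytes_per_token(32, kv_heads=8, head_dim=128, bytes_per_elem=2)
    gqa_fp8 = kv_bytes_per_token(32, kv_heads=8, head_dim=128, bytes_per_elem=1)
    budget = 16 * 2**30  # assume 16 GiB is left over for KV cache
    for name, b in [("MHA fp16", mha_fp16), ("GQA fp16", gqa_fp16), ("GQA fp8", gqa_fp8)]:
        print(name, b, "bytes/token ->", max_concurrency(budget, b, 4096), "sequences")
    print("no-cache vs cached work:", no_cache_work(512, 256), cached_work(512, 256))
```

Derive the closed forms by hand first, then use the functions to confirm your numbers for the model you picked.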
Design (staff-level)
- You must serve a 70B model at 8k context with TTFT < 1s and ITL < 50ms on 8×A100 (80GB). Estimate KV memory per sequence and reason about how many concurrent users fit. What's the first thing you'd do to fit more?
- A teammate says "let's just recompute attention each step, it's simpler." Quantify what that costs for a 2000-token generation and explain why it's a non-starter.
- Using Little's Law (concurrency = throughput × latency), if you target 1000 tok/s aggregate at 50ms ITL, how many sequences must be in flight? What limits that number?
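The Little's Law relation quoted above (concurrency = throughput × latency) can be sketched as a small helper. This is only an illustration: the function names are hypothetical, and treating ITL as the per-token latency assumes every in-flight sequence emits exactly one token per decode step.

```python
def sequences_in_flight(aggregate_tok_per_s: float, itl_seconds: float) -> float:
    """Little's Law: concurrency = throughput * latency."""
    return aggregate_tok_per_s * itl_seconds


def per_sequence_tok_per_s(itl_seconds: float) -> float:
    """Decode rate seen by a single user at a given inter-token latency."""
    return 1.0 / itl_seconds


if __name__ == "__main__":
    # Hypothetical target, deliberately different from the exercise's numbers.
    target_tok_per_s = 2400.0
    itl = 0.025  # 25 ms between tokens
    print("per-user rate (tok/s):", per_sequence_tok_per_s(itl))
    print("sequences needed in flight:", sequences_in_flight(target_tok_per_s, itl))
```

With these example numbers the helper reports about 60 sequences in flight. Whether that many actually fit is a separate question, which is what the last part of the exercise asks.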
Self-grading
Exercises 4, 5, and 10–12 (counting from the top: the first two Core exercises and all three Design exercises) are interview-grade. Could you whiteboard each in 5 minutes? If not, re-read the guide's prefill/decode and memory sections, then drill INTERVIEW.md.