Phase 00 — Exercises: Foundations

Warm-up (explain)

  1. In one sentence: what does an LLM compute, and what is "autoregressive generation"?
  2. Define tokens, embeddings, logits. Where in a forward pass do logits appear?
  3. Why does a token's K and V never change once computed? Why does that justify a cache?

Core (the distinctions that matter)

  1. Fill the table from memory: prefill vs decode — tokens/pass, bottleneck (compute vs memory bandwidth), and which latency metric (TTFT vs ITL) each drives.
  2. Explain why decode is memory-bandwidth-bound. What must the GPU read to produce one token, and how much math does it do with it?
  3. Why does batching help throughput specifically during decode? (Hint: what gets amortized?)
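
One way to sanity-check your answers to the decode questions above is a roofline-style estimate. The sketch below is a simplification, not a measurement: it counts only the weight read (KV cache traffic is ignored), assumes about 2 FLOPs per parameter per token, and uses placeholder hardware numbers (`mem_bw_bytes_per_s`, `peak_flops`) that do not describe any specific GPU. The function names are hypothetical.

```python
def decode_step_time_s(param_count, batch_size, bytes_per_param=2,
                       mem_bw_bytes_per_s=2.0e12, peak_flops=3.0e14):
    """Roofline-style estimate of one decode step (weights only, KV reads ignored).

    Hardware defaults are illustrative placeholders, not real device specs.
    """
    # The weights are streamed from memory once per step, however many
    # sequences share that step.
    weight_bytes = param_count * bytes_per_param
    # Assumed cost: ~2 FLOPs per parameter for each token in the batch.
    flops = 2 * param_count * batch_size
    t_memory = weight_bytes / mem_bw_bytes_per_s
    t_compute = flops / peak_flops
    # The step takes as long as the slower of the two.
    return max(t_memory, t_compute)


def tokens_per_second(param_count, batch_size, **hw):
    """Aggregate decode throughput: one token per sequence per step."""
    return batch_size / decode_step_time_s(param_count, batch_size, **hw)


if __name__ == "__main__":
    for b in (1, 8, 64, 300):
        t = decode_step_time_s(7e9, b)
        print(f"batch={b:4d}  step={t * 1e3:7.2f} ms  tok/s={b / t:10.1f}")
```

Under these assumptions the step time stays flat while the batch is small, because the same weight read is shared by every sequence in the batch; throughput only stops scaling once the compute term overtakes the memory term.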

Build (your labs)

  1. In lab-01, derive the exact no-cache work sum(P..P+n-1) and the cached work P+n, where P is the prompt length and n is the number of generated tokens. What's the ratio as n → ∞ for fixed P?
  2. In lab-02, compute kv_bytes_per_token and max concurrency for a model of your choice (look up its config: layers, kv_heads, head_dim). Then redo it with fp8 KV cache.
  3. A model uses MHA (num_kv_heads == num_query_heads). Show how switching to GQA with 8 KV heads changes KV memory and thus concurrency.
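
The quantities in the Build exercises can be checked with a few small helpers. This is a minimal sketch, not the labs' own code: the function names and the example config (32 layers, head_dim 128, a 16 GiB KV budget, 4096-token context) are assumptions chosen for round numbers, and "work" is counted in token positions processed, as in the exercise.

```python
def no_cache_work(prompt_len, new_tokens):
    """Token positions processed when each step re-runs the whole sequence:
    P + (P+1) + ... + (P+n-1)."""
    return sum(range(prompt_len, prompt_len + new_tokens))


def cached_work(prompt_len, new_tokens):
    """Token positions processed with a KV cache: each token is handled once."""
    return prompt_len + new_tokens


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache bytes for one token; bytes_per_elem=2 is fp16/bf16, 1 is fp8."""
    # Factor of 2: one K vector and one V vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrency(kv_budget_bytes, context_len, per_token_bytes):
    """Full-context sequences whose KV cache fits in the given budget."""
    return kv_budget_bytes // (context_len * per_token_bytes)


if __name__ == "__main__":
    budget = 16 * 2**30  # hypothetical 16 GiB left for KV after weights
    mha = kv_bytes_per_token(32, 32, 128)      # MHA: 32 KV heads
    gqa = kv_bytes_per_token(32, 8, 128)       # GQA: 8 KV heads
    gqa_fp8 = kv_bytes_per_token(32, 8, 128, bytes_per_elem=1)
    for name, per_tok in (("MHA fp16", mha), ("GQA fp16", gqa), ("GQA fp8", gqa_fp8)):
        print(name, per_tok, "bytes/token ->",
              max_concurrency(budget, 4096, per_tok), "sequences")
```

Swap in the real `layers`, `kv_heads`, and `head_dim` from the config of the model you picked; the example values are only there to make the arithmetic easy to follow.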

Design (staff-level)

  1. You must serve a 70B model at 8k context with TTFT < 1s and ITL < 50ms on 8×A100 (80GB). Estimate KV memory per sequence and reason about how many concurrent users fit. What's the first thing you'd do to fit more?
  2. A teammate says "let's just recompute attention each step, it's simpler." Quantify what that costs for a 2000-token generation and explain why it's a non-starter.
  3. Using Little's Law (concurrency = throughput × latency), if you target 1000 tok/s aggregate at 50ms ITL, how many sequences must be in flight? What limits that number?
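
For the Little's Law question, the arithmetic can be sketched as below. The helper names are hypothetical, and the KV budget check is a deliberately crude stand-in for "what limits that number": it assumes every sequence holds a full-context KV cache and ignores everything else that competes for memory.

```python
def sequences_in_flight(aggregate_tok_per_s, itl_ms):
    """Little's Law: concurrency = throughput x latency.

    Each in-flight sequence emits one token per ITL, so the aggregate
    rate divided by the per-sequence rate gives the required concurrency.
    """
    return aggregate_tok_per_s * itl_ms / 1000.0


def fits_in_kv_budget(required_seqs, kv_bytes_per_seq, kv_budget_bytes):
    """True if the KV caches of all in-flight sequences fit in the budget."""
    return required_seqs * kv_bytes_per_seq <= kv_budget_bytes


if __name__ == "__main__":
    need = sequences_in_flight(1000, 50)
    print("sequences in flight:", need)
    # Hypothetical numbers: 2.5 GiB of KV per sequence, two candidate budgets.
    per_seq = 5 * 2**29
    for budget_gib in (100, 200):
        ok = fits_in_kv_budget(need, per_seq, budget_gib * 2**30)
        print(f"budget {budget_gib} GiB -> fits: {ok}")
```

The same two functions can be reused for the first Design question by plugging in your own estimate of KV bytes per sequence.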

Self-grading

Core 1 and 2 and Design 1–3 are interview-grade (items 4, 5, and 10–12 when the exercises are counted consecutively). Could you whiteboard each in 5 minutes? If not, re-read the guide's prefill/decode and memory sections, then drill INTERVIEW.md.