Phase 00 — Exercises: Foundations

Warm-up (explain)
Core (the distinctions that matter)
Build (your labs)
Design (staff-level)
Self-grading

Warm-up (explain)

1. In one sentence: what does an LLM compute, and what is "autoregressive generation"?
2. Define tokens, embeddings, and logits. Where in a forward pass do logits appear?
3. Why do a token's K and V never change once computed? Why does that justify a cache?

Core (the distinctions that matter)

4. Fill the table from memory: prefill vs decode, covering tokens per pass, bottleneck (compute vs memory bandwidth), and which latency metric (TTFT vs ITL) each drives.
5. Explain why decode is memory-bandwidth-bound. What must the GPU read to produce one token, and how much math does it do with it?
6. Why does batching help throughput specifically during decode? (Hint: what gets amortized?)
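To check your reasoning on the two decode questions above, here is a minimal sketch of a weights-only arithmetic-intensity estimate. It assumes each parameter costs one multiply and one add per sequence per decode step, and that the weights are read from memory once per step regardless of batch size. It ignores KV-cache and activation traffic, so treat it as a rough bound rather than a measurement. The function name is hypothetical.

```python
def decode_arithmetic_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one decode step (weights only).

    Assumption: 2 FLOPs (multiply + add) per parameter per sequence, and the
    weights are streamed from memory once per step for the whole batch.
    """
    flops_per_param = 2 * batch_size
    return flops_per_param / bytes_per_param


# With fp16 weights (2 bytes per parameter), intensity scales with batch size.
for batch in (1, 8, 64):
    print(batch, decode_arithmetic_intensity(batch))
```

Compare the numbers against the FLOPs-per-byte ratio of your GPU (peak compute divided by memory bandwidth) to see why a single-sequence decode step leaves most of the compute idle, and what batching amortizes.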

Build (your labs)

7. In lab-01, derive the exact no-cache work sum(P..P+n-1) and the cached work P+n. What's the ratio as n → ∞ for fixed P?
8. In lab-02, compute kv_bytes_per_token and max concurrency for a model of your choice (look up its config: layers, kv_heads, head_dim). Then redo it with an fp8 KV cache.
9. A model uses MHA (num_kv_heads == num_query_heads). Show how switching to GQA with 8 KV heads changes KV memory and thus concurrency.
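The KV-sizing exercises above can be sketched as follows. This is a minimal sketch, not the lab's reference solution: the config values (80 layers, 64 query heads, head_dim 128), the 40 GiB KV budget, and the `max_concurrency` helper are hypothetical placeholders. Substitute the numbers from the model you looked up.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache stored per token across all layers.

    The leading factor of 2 counts one K and one V vector per layer.
    bytes_per_elem is 2 for fp16/bf16 and 1 for fp8.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_concurrency(kv_budget_bytes: int, context_len: int, per_token_bytes: int) -> int:
    """Full-length sequences that fit in the KV budget (hypothetical helper)."""
    return kv_budget_bytes // (context_len * per_token_bytes)


# Hypothetical config and budget, for illustration only.
LAYERS, QUERY_HEADS, HEAD_DIM = 80, 64, 128
CONTEXT = 8192
KV_BUDGET = 40 * 1024**3  # assumed memory left for KV after weights

mha = kv_bytes_per_token(LAYERS, QUERY_HEADS, HEAD_DIM)          # MHA: kv_heads == query_heads
gqa = kv_bytes_per_token(LAYERS, 8, HEAD_DIM)                    # GQA with 8 KV heads
gqa_fp8 = kv_bytes_per_token(LAYERS, 8, HEAD_DIM, bytes_per_elem=1)

for name, per_token in (("MHA fp16", mha), ("GQA-8 fp16", gqa), ("GQA-8 fp8", gqa_fp8)):
    print(name, per_token, max_concurrency(KV_BUDGET, CONTEXT, per_token))
```

Under these assumed numbers, going from 64 to 8 KV heads cuts per-token KV memory by 8×, and fp8 halves it again. Concurrency at a fixed context length scales by the same factors.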

Design (staff-level)

10. You must serve a 70B model at 8k context with TTFT < 1s and ITL < 50ms on 8×A100 (80GB). Estimate KV memory per sequence and reason about how many concurrent users fit. What's the first thing you'd do to fit more?
11. A teammate says "let's just recompute attention each step, it's simpler." Quantify what that costs for a 2000-token generation and explain why it's a non-starter.
12. Using Little's Law (concurrency = throughput × latency), if you target 1000 tok/s aggregate at 50ms ITL, how many sequences must be in flight? What limits that number?
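For the recompute and Little's Law questions above, a small sketch can confirm your arithmetic. It reuses the work definitions from lab-01 (no-cache work as sum(P..P+n-1), cached work as P+n), counting "work" in token positions processed. The 512-token prompt is a hypothetical value, and the function names are illustrative only.

```python
def no_cache_work(prompt_len: int, new_tokens: int) -> int:
    """Token positions processed when every step reprocesses the full sequence.

    Step i (0-based) processes prompt_len + i positions, giving sum(P..P+n-1).
    """
    return sum(range(prompt_len, prompt_len + new_tokens))


def cached_work(prompt_len: int, new_tokens: int) -> int:
    """Token positions processed with a KV cache, using the lab's P + n definition."""
    return prompt_len + new_tokens


def sequences_in_flight(target_tok_per_s: float, itl_ms: float) -> float:
    """Little's Law: concurrency = throughput × latency, with latency given in ms."""
    return target_tok_per_s * itl_ms / 1000


P, n = 512, 2000  # hypothetical prompt length; n from the 2000-token generation question
print(no_cache_work(P, n), cached_work(P, n), no_cache_work(P, n) / cached_work(P, n))
print(sequences_in_flight(1000, 50))
```

The ratio grows roughly like n/2 for fixed P, which is the quantitative core of the argument against recomputing. The Little's Law result is only a lower bound on required concurrency; whether that many sequences actually fit is the KV-memory question.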

Self-grading

Exercises 4, 5, and 10–12 are interview-grade. Could you whiteboard each in 5 minutes? If not, re-read the guide's prefill/decode and memory sections, then drill INTERVIEW.md.
