Phase 00 — Mini-Build: feel the KV-cache win
This phase doesn't add a new mini_vllm module — it has you understand the three you already
have and measure the foundational result the whole course rests on.
Contents
- Part A — read the three pieces of "next-token prediction"
- Part B — the lab: O(N²) vs O(N)
- Definition of done
- Map to the real engine
Part A — read the three pieces of "next-token prediction"
Open and read (they're tiny):
mini_vllm/tokenizer.py—encode(str) -> list[int],decode(list[int]) -> str. Same interface as a real HF tokenizer.mini_vllm/model.py—ToyModel.forward(last_tokens, positions) -> logits. Batched, autoregressive, deterministic. (It ignores KV values — honest simplification noted in the file — because this course cares about KV memory management, not the toy's numerics.)mini_vllm/sampler.py—Sampler.sample(logits, params) -> token_id. Greedy at temperature 0.
Trace one LLMEngine.generate(["hi"]) call in your head: tokenize → loop step() → sample →
detokenize. Confirm it matches EngineCore.step from the deep-dive.
Part B — the lab: O(N²) vs O(N)
lab-01-kv-cache-speedup is the build. You implement a toy attention twice:
- No cache: every step recomputes K/V for all prior tokens → work grows with the prefix.
- Cached: K/V computed once, reused → constant work per step.
You'll count "K/V computations" and show the no-cache version does 1+2+...+N = O(N²) while the
cached version does O(N) — and that both produce the identical token sequence (caching is an
optimization, never a correctness change — the same invariant you'll prove for chunked prefill
and preemption in Phase 3).
lab-02-kv-memory-calculator has you write the kv_bytes_per_token formula and compute how many
concurrent sequences fit on a given GPU — the number that caps your serving capacity.
Definition of done
pytest phase-00-foundations/labs -q
Then answer in your notebook:
- What is the asymptotic work ratio (no-cache / cached) for generating N tokens? (≈ N/2.)
- For Llama-3-8B at 8k context on a 24 GB GPU (say 16 GB weights), roughly how many concurrent full-length sequences fit? (You'll compute it in lab-02 — the answer is "surprisingly few," which is the entire motivation for Phase 2.)
Map to the real engine
| your understanding | real vLLM |
|---|---|
| the no-cache vs cache experiment | why vllm.attention.layer.Attention caches K/V at all |
num_computed vs num_tokens | Request counters (request.py:239) |
| tokenize→loop→sample→detokenize | EngineCore.step (core.py:428) |
| kv_bytes_per_token formula | how get_kv_cache_configs sizes the block pool (Phase 2) |