Phase 00 — Mini-Build: feel the KV-cache win

This phase doesn't add a new mini_vllm module — it has you understand the three you already have and measure the foundational result the whole course rests on.

Part A — read the three pieces of "next-token prediction"
Part B — the lab: O(N²) vs O(N)
Definition of done
Map to the real engine

Part A — read the three pieces of "next-token prediction"

Open and read (they're tiny):

mini_vllm/tokenizer.py — encode(str) -> list[int], decode(list[int]) -> str. Same interface as a real HF tokenizer.
mini_vllm/model.py — ToyModel.forward(last_tokens, positions) -> logits. Batched, autoregressive, deterministic. (It ignores KV values — honest simplification noted in the file — because this course cares about KV memory management, not the toy's numerics.)
mini_vllm/sampler.py — Sampler.sample(logits, params) -> token_id. Greedy at temperature 0.

Trace one LLMEngine.generate(["hi"]) call in your head: tokenize → loop step() → sample → detokenize. Confirm it matches EngineCore.step from the deep-dive.

Part B — the lab: O(N²) vs O(N)

lab-01-kv-cache-speedup is the build. You implement a toy attention twice:

No cache: every step recomputes K/V for all prior tokens → work grows with the prefix.
Cached: K/V computed once, reused → constant work per step.

You'll count "K/V computations" and show the no-cache version does 1+2+...+N = O(N²) while the cached version does O(N) — and that both produce the identical token sequence (caching is an optimization, never a correctness change — the same invariant you'll prove for chunked prefill and preemption in Phase 3).

lab-02-kv-memory-calculator has you write the kv_bytes_per_token formula and compute how many concurrent sequences fit on a given GPU — the number that caps your serving capacity.

Definition of done

pytest phase-00-foundations/labs -q

Then answer in your notebook:

What is the asymptotic work ratio (no-cache / cached) for generating N tokens? (≈ N/2.)
For Llama-3-8B at 8k context on a 24 GB GPU (say 16 GB weights), roughly how many concurrent full-length sequences fit? (You'll compute it in lab-02 — the answer is "surprisingly few," which is the entire motivation for Phase 2.)

Map to the real engine

your understanding	real vLLM
the no-cache vs cache experiment	why `vllm.attention.layer.Attention` caches K/V at all
`num_computed` vs `num_tokens`	`Request` counters (`request.py:239`)
tokenize→loop→sample→detokenize	`EngineCore.step` (`core.py:428`)
kv_bytes_per_token formula	how `get_kv_cache_configs` sizes the block pool (Phase 2)

vLLM Mastery — From Zero to Maintainer

Phase 00 — Mini-Build: feel the KV-cache win

Contents

Part A — read the three pieces of "next-token prediction"

Part B — the lab: O(N²) vs O(N)

Definition of done

Map to the real engine

Keyboard shortcuts

vLLM Mastery — From Zero to Maintainer

Phase 00 — Mini-Build: feel the KV-cache win

Contents

Part A — read the three pieces of "next-token prediction"

Part B — the lab: O(N²) vs O(N)

Definition of done

Map to the real engine