Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 00 — Mini-Build: feel the KV-cache win

This phase doesn't add a new mini_vllm module — it has you understand the three you already have and measure the foundational result the whole course rests on.

Contents


Part A — read the three pieces of "next-token prediction"

Open and read (they're tiny):

  • mini_vllm/tokenizer.pyencode(str) -> list[int], decode(list[int]) -> str. Same interface as a real HF tokenizer.
  • mini_vllm/model.pyToyModel.forward(last_tokens, positions) -> logits. Batched, autoregressive, deterministic. (It ignores KV values — honest simplification noted in the file — because this course cares about KV memory management, not the toy's numerics.)
  • mini_vllm/sampler.pySampler.sample(logits, params) -> token_id. Greedy at temperature 0.

Trace one LLMEngine.generate(["hi"]) call in your head: tokenize → loop step() → sample → detokenize. Confirm it matches EngineCore.step from the deep-dive.

Part B — the lab: O(N²) vs O(N)

lab-01-kv-cache-speedup is the build. You implement a toy attention twice:

  1. No cache: every step recomputes K/V for all prior tokens → work grows with the prefix.
  2. Cached: K/V computed once, reused → constant work per step.

You'll count "K/V computations" and show the no-cache version does 1+2+...+N = O(N²) while the cached version does O(N)and that both produce the identical token sequence (caching is an optimization, never a correctness change — the same invariant you'll prove for chunked prefill and preemption in Phase 3).

lab-02-kv-memory-calculator has you write the kv_bytes_per_token formula and compute how many concurrent sequences fit on a given GPU — the number that caps your serving capacity.

Definition of done

pytest phase-00-foundations/labs -q

Then answer in your notebook:

  • What is the asymptotic work ratio (no-cache / cached) for generating N tokens? (≈ N/2.)
  • For Llama-3-8B at 8k context on a 24 GB GPU (say 16 GB weights), roughly how many concurrent full-length sequences fit? (You'll compute it in lab-02 — the answer is "surprisingly few," which is the entire motivation for Phase 2.)

Map to the real engine

your understandingreal vLLM
the no-cache vs cache experimentwhy vllm.attention.layer.Attention caches K/V at all
num_computed vs num_tokensRequest counters (request.py:239)
tokenize→loop→sample→detokenizeEngineCore.step (core.py:428)
kv_bytes_per_token formulahow get_kv_cache_configs sizes the block pool (Phase 2)