Phase 00 Labs — Foundations

Four labs that install the four facts everything else stands on: generation is autoregressive and caching makes it linear (lab-01), the cache is memory and memory is the binding constraint (lab-02), logits become tokens through a small exact algorithm (lab-03), and prefill and decode live in opposite performance regimes (lab-04). No GPU, no model downloads — counters, formulas, and numpy. Do them in order; each ends where the next begins.

Every lab follows the standard contract: starter.py with TODOs (your work), solution.py (the reference), test_lab.py (the spec, executable). The default test run uses solution.py so the suite is always green; set LAB_IMPL=starter to grade yourself.

# Whole phase:
pytest phase-00-foundations/labs -m "not gpu"

# Grade your own work on one lab:
LAB_IMPL=starter pytest phase-00-foundations/labs/lab-01-kv-cache-speedup -q


Labs

lab-01-kv-cache-speedup `[CPU-OK]`

The experiment that motivates the course: implement generation with and without a KV cache, count the work exactly (95 vs 15 units; >100× by n=1000), and prove both produce identical tokens. The O(N²) → O(N) trade that converts compute into memory — and creates the prefill/decode split as a side effect. Skills: why the cache exists; causality makes K/V cacheable; counting beats clocking; the master "optimization changes nothing" invariant.
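The counting argument can be sketched with a toy model, assuming one "unit" is the computation of a single token's K/V entry. Every name here (`project_kv`, `pick_next`, `WorkCounter`) is hypothetical and not taken from the lab's starter code. Under this assumed cost model, a 5-token prompt with 10 generated tokens gives 95 vs 15 units, which matches the counts quoted above; whether the lab uses that exact setup is an assumption.

```python
from dataclasses import dataclass

VOCAB_SIZE = 50  # toy vocabulary size (hypothetical)


@dataclass
class WorkCounter:
    units: int = 0


def project_kv(token: int, pos: int, counter: WorkCounter) -> int:
    """Stand-in for one token's K/V projection; costs one unit of work.

    It depends only on the token and its position, never on later tokens.
    That causality is what makes the result safe to cache.
    """
    counter.units += 1
    return (31 * token + 7 * pos) % 101


def pick_next(kv_entries: list[int]) -> int:
    """Stand-in for attention + LM head: deterministic in all K/V so far."""
    return sum((i + 1) * e for i, e in enumerate(kv_entries)) % VOCAB_SIZE


def generate_no_cache(prompt: list[int], n_new: int, counter: WorkCounter) -> list[int]:
    tokens = list(prompt)
    for _ in range(n_new):
        # Recompute K/V for the whole sequence at every step: O(N^2) total.
        kv = [project_kv(t, i, counter) for i, t in enumerate(tokens)]
        tokens.append(pick_next(kv))
    return tokens[len(prompt):]


def generate_with_cache(prompt: list[int], n_new: int, counter: WorkCounter) -> list[int]:
    tokens = list(prompt)
    # Prefill: compute K/V for the prompt once and keep it.
    cache = [project_kv(t, i, counter) for i, t in enumerate(tokens)]
    for _ in range(n_new):
        tokens.append(pick_next(cache))
        # Decode: one new K/V entry per step, so O(N) total work.
        cache.append(project_kv(tokens[-1], len(tokens) - 1, counter))
    return tokens[len(prompt):]


prompt = [3, 1, 4, 1, 5]
slow, fast = WorkCounter(), WorkCounter()
# The optimization must change nothing about the output.
assert generate_no_cache(prompt, 10, slow) == generate_with_cache(prompt, 10, fast)
print(slow.units, fast.units)  # 95 15
```

Counting units instead of timing the loop keeps the result exact and machine-independent, and the cached version's growing `cache` list is the compute-for-memory trade made visible.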

lab-02-kv-memory-calculator `[CPU-OK]`

Write the three-line formula behind every capacity decision in LLM serving and apply it to Llama-3-8B: 128 KiB of KV cache per token, 256 MiB per 2,048-token sequence, and ~32 concurrent users on a 24 GiB GPU once the fp16 weights are resident. Then read FP8-KV and GQA as factors of the formula. Memory, not compute, is the constraint: derived, not asserted. Skills: back-of-envelope capacity planning; the formula as an optimization roadmap; weights are rent, KV is traffic.
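A minimal sketch of that arithmetic follows. The function names are hypothetical. The architecture constants (32 layers, 8 KV heads, head dimension 128) are Llama-3-8B's published configuration; the 2,048-token context and the 16 GiB weight footprint are assumptions chosen because they reproduce the 256 MiB and ~32-user figures quoted above.

```python
KIB, MIB, GIB = 2**10, 2**20, 2**30


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    # Factor of 2: one K vector and one V vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_bytes_per_sequence(per_token: int, seq_len: int) -> int:
    return per_token * seq_len


def max_sequences(gpu_bytes: int, weight_bytes: int, per_sequence: int) -> int:
    # Weights are paid once ("rent"); what is left is shared by sequences ("traffic").
    return (gpu_bytes - weight_bytes) // per_sequence


per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)
per_seq = kv_bytes_per_sequence(per_token, seq_len=2048)   # assumed context length
users = max_sequences(24 * GIB, 16 * GIB, per_seq)         # assumed ~16 GiB of fp16 weights

print(per_token // KIB, "KiB/token")    # 128 KiB/token
print(per_seq // MIB, "MiB/sequence")   # 256 MiB/sequence
print(users, "concurrent sequences")    # 32 concurrent sequences

# Each optimization is a factor in the formula:
fp8 = kv_bytes_per_token(32, 8, 128, bytes_per_elem=1)     # FP8 KV halves bytes_per_elem
no_gqa = kv_bytes_per_token(32, 32, 128, bytes_per_elem=2) # 32 KV heads, i.e. no GQA
print(fp8 // KIB, no_gqa // KIB)        # 64 512
```

Reading the last two lines against the first result shows the roadmap: FP8 KV halves the per-token cost, and GQA with 8 KV heads instead of 32 already bought a 4× reduction.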

lab-03-sampling-basics `[CPU-OK]`

Build the sampler: greedy, temperature, top-k, top-p — with the stability clause (softmax max-subtraction), the inclusive nucleus boundary, and seeded reproducibility. The final test proves your sampler agrees token-for-token with mini_vllm's engine sampler across 15 configurations. Skills: the four knobs as exact algorithms; −∞ masking; why greedy mode anchors every deterministic test in this course.
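The four knobs can be sketched in a few lines of numpy. This is an illustrative version under stated assumptions, not the lab's reference solution and not mini_vllm's sampler: the `sample` signature, the convention that `temperature == 0` means greedy, the `top_k=0` "disabled" value, and the tie handling are all choices made for this example.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    # Stability clause: subtracting the max leaves the result unchanged
    # mathematically but keeps exp() from overflowing on large logits.
    x = x - np.max(x)
    e = np.exp(x)  # exp(-inf) == 0, so masked tokens get zero probability
    return e / e.sum()


def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None) -> int:
    logits = np.asarray(logits, dtype=np.float64)

    # Greedy: deterministic argmax (assumed convention: temperature 0).
    if temperature == 0.0:
        return int(np.argmax(logits))

    logits = logits / temperature

    # Top-k: mask everything below the k-th largest logit with -inf.
    # Tokens tied with the k-th value are all kept in this sketch.
    if 0 < top_k < logits.size:
        kth = np.sort(logits)[-top_k]
        logits = np.where(logits < kth, -np.inf, logits)

    probs = softmax(logits)

    # Top-p: keep the smallest prefix of the sorted distribution whose
    # cumulative mass reaches top_p, including the token that crosses it.
    if top_p < 1.0:
        order = np.argsort(-probs, kind="stable")
        cumulative = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cumulative, top_p, side="left")) + 1
        masked = np.full_like(logits, -np.inf)
        masked[order[:cutoff]] = logits[order[:cutoff]]
        probs = softmax(masked)  # renormalize over the nucleus

    if rng is None:
        rng = np.random.default_rng()
    return int(rng.choice(probs.size, p=probs))


# Seeded reproducibility: the same seed yields the same token sequence.
rng = np.random.default_rng(0)
tokens = [sample([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_p=0.9, rng=rng) for _ in range(5)]
```

Passing an explicit `numpy.random.Generator` rather than relying on global state is what makes a sampling run repeatable, and the greedy path never touches the generator at all, which is why it is a convenient anchor for deterministic tests.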

lab-04-prefill-vs-decode `[CPU-OK]`

Six one-line functions and an A100 spec sheet: the ridge point (156 FLOPs/byte), single-stream decode at 0.6% compute utilization, the 125 tok/s physical speed limit for 8B/fp16, and the critical batch size where decode becomes compute-bound. The roofline worldview that sorts every optimization into "helps my regime" or "doesn't." Skills: compute-bound vs memory-bound as a reflex; the intensity cancellation (model size doesn't matter — tokens per weight-trip does); why batching is free money and quantization is a decode feature.
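The headline numbers fall out of a handful of one-liners. The sketch below uses hypothetical function names and is not the lab's six functions. It assumes an A100 peak of 312 TFLOP/s (dense FP16 tensor-core) and a rounded 2.0 TB/s of memory bandwidth, which is the pair that reproduces the 156 FLOPs/byte ridge quoted above, and it models decode as 2 FLOPs per parameter per token with every weight read once per step.

```python
PEAK_FLOPS = 312e12   # assumed A100 dense FP16 tensor-core peak, FLOP/s
BANDWIDTH = 2.0e12    # assumed (rounded) A100 HBM bandwidth, bytes/s


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    # Arithmetic intensity at which compute and memory take equal time.
    return peak_flops / bandwidth


def decode_intensity(batch: int, bytes_per_param: float) -> float:
    # 2 FLOPs per parameter per token; one weight read serves the whole batch.
    # The parameter count cancels, so model size does not appear.
    return 2 * batch / bytes_per_param


def compute_utilization(intensity: float, ridge: float) -> float:
    return min(1.0, intensity / ridge)


def max_decode_tok_per_s(n_params: float, bytes_per_param: float, bandwidth: float) -> float:
    # Single-stream ceiling: every token requires one full pass over the weights.
    return bandwidth / (n_params * bytes_per_param)


def critical_batch(ridge: float, bytes_per_param: float) -> float:
    # Batch size at which decode_intensity reaches the ridge.
    return ridge * bytes_per_param / 2


ridge = ridge_point(PEAK_FLOPS, BANDWIDTH)
print(ridge)                                                          # 156.0
print(round(100 * compute_utilization(decode_intensity(1, 2), ridge), 1))  # 0.6 (percent)
print(max_decode_tok_per_s(8e9, 2, BANDWIDTH))                        # 125.0
print(critical_batch(ridge, 2))                                       # 156.0
```

Two consequences are visible directly in the functions: `decode_intensity` grows linearly with `batch` at no extra weight traffic, and lowering `bytes_per_param` raises the `max_decode_tok_per_s` ceiling, which is why quantization pays off in the memory-bound decode regime.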

What you can do after this phase

Derive, on a whiteboard with no notes: why every inference engine caches KV (and what it costs in bytes); how many users fit on a given GPU for a given model (and which knob to turn when the answer is too small); what temperature=0.7, top_p=0.9 actually computes; and whether a proposed optimization can possibly help a given workload (which side of the ridge is it on?). These four reflexes are the entrance exam for Phase 1, where the loop you simulated becomes a real engine with a scheduler, and for every phase after it.

vLLM Mastery — From Zero to Maintainer