Phase 00 Labs — Foundations
Four labs that install the four facts everything else stands on: generation is autoregressive and caching makes it linear (lab-01), the cache is memory and memory is the binding constraint (lab-02), logits become tokens through a small exact algorithm (lab-03), and prefill and decode live in opposite performance regimes (lab-04). No GPU, no model downloads — counters, formulas, and numpy. Do them in order; each ends where the next begins.
Every lab follows the standard contract: starter.py with TODOs (your work),
solution.py (the reference), test_lab.py (the spec, executable). The default test run
uses solution.py so the suite is always green; set LAB_IMPL=starter to grade yourself.
# Whole phase:
pytest phase-00-foundations/labs -m "not gpu"
# Grade your own work on one lab:
LAB_IMPL=starter pytest phase-00-foundations/labs/lab-01-kv-cache-speedup -q
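For orientation, here is a minimal sketch of how a test file could honor LAB_IMPL. The helper names `resolve_impl_name` and `load_impl` are hypothetical, and the course's actual test_lab.py may select the implementation differently.

```python
import importlib
import os


def resolve_impl_name(env=os.environ):
    """Return the module to test: 'solution' by default, 'starter' on request."""
    name = env.get("LAB_IMPL", "solution")
    if name not in ("solution", "starter"):
        raise ValueError(f"LAB_IMPL must be 'solution' or 'starter', got {name!r}")
    return name


def load_impl(env=os.environ):
    """Import the selected module (assumes it sits next to the test file)."""
    return importlib.import_module(resolve_impl_name(env))
```

Under this sketch, a test would call `load_impl()` once and run every assertion against the returned module, so the same spec grades both files.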
Contents
- lab-01-kv-cache-speedup [CPU-OK]
- lab-02-kv-memory-calculator [CPU-OK]
- lab-03-sampling-basics [CPU-OK]
- lab-04-prefill-vs-decode [CPU-OK]
- What you can do after this phase
Labs
lab-01-kv-cache-speedup [CPU-OK]
The experiment that motivates the course: implement generation with and without a KV cache, count the work exactly (95 vs 15 units; >100× by n=1000), and prove both produce identical tokens. This is the O(N²) → O(N) trade: it spends memory to save compute, and it creates the prefill/decode split as a side effect. Skills: why the cache exists; causality makes K/V cacheable; counting beats clocking; the master "optimization changes nothing" invariant.
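The counting argument can be sketched in a few lines. This is one convention that reproduces the quoted 95 vs 15 figures, assuming a 5-token prompt and 10 generated tokens. The prompt length, the function names, and the "one unit per token pushed through the model" rule are assumptions for illustration, so the lab's own counters may be defined differently.

```python
def work_without_cache(prompt_len, new_tokens):
    """Each step re-processes the whole sequence built so far."""
    return sum(prompt_len + step for step in range(new_tokens))


def work_with_cache(prompt_len, new_tokens):
    """Each token is pushed through the model once; its K/V are reused after that."""
    return prompt_len + new_tokens


if __name__ == "__main__":
    for n in (10, 100, 1000):
        slow = work_without_cache(5, n)
        fast = work_with_cache(5, n)
        print(f"n={n}: {slow} vs {fast} units ({slow / fast:.1f}x)")
```

The uncached count grows quadratically in the number of generated tokens while the cached count grows linearly, which is why the ratio keeps widening as n increases.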
lab-02-kv-memory-calculator [CPU-OK]
Write the three-line formula behind every capacity decision in LLM serving and apply it to Llama-3-8B: 128 KiB per token, 256 MiB per sequence, ~32 concurrent users on a 24 GiB GPU. Then read FP8-KV and GQA as factors of the formula. Memory, not compute, is the constraint — derived, not asserted. Skills: back-of-envelope capacity planning; the formula as an optimization roadmap; weights are rent, KV is traffic.
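As a rough sketch, the formula and the Llama-3-8B numbers above can be checked as follows. The model shape (32 layers, 8 KV heads, head dimension 128) is the published Llama-3-8B configuration. The 2048-token sequence length and the 16 GiB weight budget are assumptions, chosen because they reproduce the quoted 256 MiB and ~32 users. The function names are illustrative, not the lab's.

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # Factor 2: one K vector and one V vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def kv_bytes_per_sequence(per_token, seq_len):
    return per_token * seq_len


def max_sequences(gpu_bytes, weight_bytes, per_sequence):
    # Whatever memory the weights leave free is divided among sequences.
    return (gpu_bytes - weight_bytes) // per_sequence


per_token = kv_bytes_per_token(32, 8, 128, bytes_per_elem=2)  # fp16 cache
per_seq = kv_bytes_per_sequence(per_token, seq_len=2048)      # assumed length
users = max_sequences(24 * GIB, 16 * GIB, per_seq)            # assumed weights

print(per_token // KIB, "KiB per token")
print(per_seq // MIB, "MiB per sequence")
print(users, "concurrent sequences")

# FP8-KV and GQA are single factors of the same formula.
fp8_per_token = kv_bytes_per_token(32, 8, 128, bytes_per_elem=1)  # half of fp16
mha_per_token = kv_bytes_per_token(32, 32, 128, bytes_per_elem=2) # no GQA: 4x more
```

Halving `bytes_per_elem` or dividing `n_kv_heads` by four scales the per-token cost by the same factor, which is the sense in which the formula doubles as an optimization roadmap.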
lab-03-sampling-basics [CPU-OK]
Build the sampler: greedy, temperature, top-k, top-p — with the stability clause (softmax
max-subtraction), the inclusive nucleus boundary, and seeded reproducibility. The final
test proves your sampler agrees token-for-token with mini_vllm's engine sampler across
15 configurations. Skills: the four knobs as exact algorithms; −∞ masking; why greedy
mode anchors every deterministic test in this course.
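The four knobs fit in one short function. The following is a minimal numpy sketch, not the reference solution. The function signature, the order of the filters, the convention that temperature 0 means greedy, and the tie handling in top-k are all assumptions, and nothing here is checked against mini_vllm's sampler.

```python
import numpy as np


def softmax(logits):
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()


def sample(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0.0:
        return int(np.argmax(logits))  # greedy: deterministic, no RNG involved
    logits = logits / temperature
    if 0 < top_k < logits.size:
        kth_largest = np.sort(logits)[-top_k]
        # Mask everything below the k-th largest logit (ties at the cut survive).
        logits = np.where(logits < kth_largest, -np.inf, logits)
    if top_p < 1.0:
        probs = softmax(logits)
        order = np.argsort(-probs, kind="stable")
        cumulative = np.cumsum(probs[order])
        # Inclusive boundary: keep the token whose mass first reaches top_p.
        keep = order[: int(np.searchsorted(cumulative, top_p)) + 1]
        masked = np.full(logits.shape, -np.inf)
        masked[keep] = logits[keep]
        logits = masked
    if rng is None:
        rng = np.random.default_rng()
    return int(rng.choice(logits.size, p=softmax(logits)))


if __name__ == "__main__":
    example = [2.0, 1.0, 0.5, -1.0]
    print(sample(example, temperature=0.0))
    print(sample(example, temperature=0.7, top_p=0.9, rng=np.random.default_rng(0)))
```

Masked tokens get a logit of −∞, so the final softmax assigns them probability exactly zero. Passing a seeded `np.random.default_rng` makes a run repeatable.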
lab-04-prefill-vs-decode [CPU-OK]
Six one-line functions and an A100 spec sheet: the ridge point (156 FLOPs/byte), single-stream decode at 0.6% compute utilization, the 125 tok/s physical speed limit for 8B/fp16, and the critical batch size where decode becomes compute-bound. The roofline worldview that sorts every optimization into "helps my regime" or "doesn't." Skills: compute-bound vs memory-bound as a reflex; the intensity cancellation (model size doesn't matter — tokens per weight-trip does); why batching is free money and quantization is a decode feature.
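The quoted numbers follow from a handful of one-liners. The sketch below assumes rounded A100 figures of 312 TFLOP/s dense fp16 and 2.0 TB/s memory bandwidth, and a simple decode model in which each weight is read once per step and contributes 2 FLOPs per sequence in the batch. The function names are illustrative and may not match the lab's six.

```python
A100_PEAK_FLOPS = 312e12  # assumed: dense fp16 tensor-core peak, FLOP/s
A100_BANDWIDTH = 2.0e12   # assumed: HBM bandwidth rounded to 2.0 TB/s


def ridge_point(peak_flops, bandwidth):
    return peak_flops / bandwidth  # FLOPs per byte where the two limits meet


def decode_intensity(batch_size, bytes_per_param):
    return 2 * batch_size / bytes_per_param  # model size cancels out


def compute_utilization(intensity, ridge):
    return min(1.0, intensity / ridge)


def max_decode_tokens_per_s(bandwidth, n_params, bytes_per_param):
    return bandwidth / (n_params * bytes_per_param)  # one weight read per token


def critical_batch_size(ridge, bytes_per_param):
    return ridge * bytes_per_param / 2  # batch at which intensity hits the ridge


def regime(intensity, ridge):
    return "compute-bound" if intensity >= ridge else "memory-bound"


ridge = ridge_point(A100_PEAK_FLOPS, A100_BANDWIDTH)
single_stream = decode_intensity(batch_size=1, bytes_per_param=2)
print(ridge, "FLOPs/byte")
print(f"{compute_utilization(single_stream, ridge):.1%} utilization at batch 1")
print(max_decode_tokens_per_s(A100_BANDWIDTH, 8e9, 2), "tok/s ceiling, 8B fp16")
print(critical_batch_size(ridge, 2), "critical batch size, fp16")
```

Note that `decode_intensity` has no model-size term: parameters appear in both FLOPs and bytes and cancel, leaving only the batch size and the bytes per parameter. That is why batching and weight quantization are the levers that move decode toward the ridge.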
What you can do after this phase
Derive, on a whiteboard with no notes: why every inference engine caches KV (and what it
costs in bytes); how many users fit on a given GPU for a given model (and which knob to
turn when the answer is too small); what temperature=0.7, top_p=0.9 actually computes;
and whether a proposed optimization can possibly help a given workload (which side of the
ridge is it on?). These four reflexes are the entrance exam for Phase 1, where the loop
you simulated becomes a real engine with a scheduler, and for every phase after it.