Phase 00 — Cheatsheet: Foundations
Contents
- The one-liner
- The loop
- Prefill vs decode
- KV cache
- The master model
- Throughput vs latency
- Key upstream
The one-liner
An LLM predicts the next token; generation loops that. Serving = doing it fast for many users. Memory (the KV cache), not compute, usually caps how many requests fit at once.
The loop
tokenize → (prefill the prompt) → loop[ forward → sample → append ] → detokenize.
In vLLM: EngineCore.step = schedule → execute → sample → update (vllm/v1/engine/core.py:428).
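The loop above can be sketched in a few lines. This is a minimal illustration, not vLLM code: `generate` and `toy_forward` are hypothetical names, the "model" is a stand-in that returns fake logits, sampling is plain greedy argmax, and there is no KV cache, so every pass re-reads the whole sequence.

```python
from typing import Callable, List


def generate(prompt_ids: List[int],
             forward: Callable[[List[int]], List[float]],
             max_new_tokens: int,
             eos_id: int) -> List[int]:
    """Greedy decoding: forward -> sample -> append, until EOS or the budget runs out."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = forward(ids)  # the first call covers the whole prompt (prefill)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy "sample"
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids[len(prompt_ids):]  # only the newly generated tokens


def toy_forward(ids: List[int]) -> List[float]:
    """Hypothetical stand-in for a model: always favors (last_id + 1) mod 5."""
    vocab_size = 5
    target = (ids[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


out = generate([0, 1], toy_forward, max_new_tokens=10, eos_id=4)
print(out)  # [2, 3, 4]
```

Tokenize and detokenize are left out; in a real stack they wrap this loop on either side.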
Prefill vs decode
| | prefill | decode |
|---|---|---|
| tokens/pass | many | one |
| bound by | compute (FLOPs) | memory bandwidth |
| latency | TTFT | ITL/TPOT |
| fills | prompt KV | one KV/step |
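
The two latency rows can be measured from token arrival times. The sketch below assumes you have a request start time and one timestamp per emitted token; `latency_metrics` is a hypothetical helper, not part of any library.

```python
from typing import List, Tuple


def latency_metrics(request_start: float, token_times: List[float]) -> Tuple[float, float]:
    """Return (TTFT, mean ITL) in the same time unit as the inputs."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    # TTFT: time until the first token, dominated by prefill.
    ttft = token_times[0] - request_start
    # ITL: gap between consecutive tokens, one decode step each.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return ttft, mean_itl


ttft, itl = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25])
print(ttft, itl)  # 0.5 0.25
```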
KV cache
- Exists because a token's K/V never change once computed → cache them → each decode step does O(N) attention work instead of recomputing the whole prefix at O(N²).
- kv_bytes_per_token = 2 × layers × kv_heads × head_dim × dtype_bytes.
- Llama-3-8B fp16 ≈ 128 KiB/token. Concurrency ≈ (HBM − weights) / (per_token × seq_len).
- Shrink it: GQA (fewer kv_heads), fp8 KV (half dtype), shorter context, paging (Phase 2).
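The formulas above are easy to check numerically. The sketch below uses the Llama-3-8B shape (32 layers, 8 KV heads, head_dim 128). The 80 GiB HBM size, the rounded 16 GiB weight footprint, and the 8192-token sequence length are illustrative assumptions, and the estimate ignores activations and allocator overhead.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """2 (K and V) x layers x kv_heads x head_dim x bytes per element."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_concurrency(hbm_bytes: int, weight_bytes: int, per_token_bytes: int, seq_len: int) -> int:
    """Whole sequences that fit in the memory left over after the weights."""
    return (hbm_bytes - weight_bytes) // (per_token_bytes * seq_len)


GiB = 1024 ** 3

per_token_fp16 = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)
per_token_fp8 = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=1)
print(per_token_fp16 // 1024, "KiB/token")  # 128 KiB/token

# Assumed hardware and workload numbers, for illustration only.
seqs_fp16 = max_concurrency(80 * GiB, 16 * GiB, per_token_fp16, seq_len=8192)
seqs_fp8 = max_concurrency(80 * GiB, 16 * GiB, per_token_fp8, seq_len=8192)
print(seqs_fp16, seqs_fp8)  # 64 128
```

Halving dtype_bytes doubles the sequences that fit, which is the same lever GQA pulls through kv_heads.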
The master model
A request = num_computed_tokens racing num_tokens. Prefill = far behind; decode = one behind.
(vllm/v1/request.py:239; mirrored in mini_vllm/request.py.)
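A toy version of the two counters makes the race concrete. `ToyRequest` below is a hypothetical illustration that reuses the field names from the text; it is not the vLLM `Request` class, and the `phase` label is only a reading aid derived from the gap between the counters.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    num_tokens: int                # prompt tokens plus tokens generated so far
    num_computed_tokens: int = 0   # tokens whose KV is already in the cache

    @property
    def num_pending(self) -> int:
        return self.num_tokens - self.num_computed_tokens

    @property
    def phase(self) -> str:
        # Far behind = prefill, exactly one behind = decode, caught up = waiting on a sample.
        if self.num_pending == 0:
            return "caught_up"
        return "decode" if self.num_pending == 1 else "prefill"

    def advance(self, token_budget: int) -> int:
        """Compute up to token_budget pending tokens; return how many were computed."""
        n = min(token_budget, self.num_pending)
        self.num_computed_tokens += n
        return n

    def append_token(self) -> None:
        """A newly sampled token extends the sequence by one."""
        self.num_tokens += 1


req = ToyRequest(num_tokens=10)
trace = [req.phase]        # 10 behind
req.advance(4)
trace.append(req.phase)    # 6 behind (chunked prefill)
req.advance(8)
trace.append(req.phase)    # 0 behind
req.append_token()
trace.append(req.phase)    # 1 behind
print(trace)  # ['prefill', 'prefill', 'caught_up', 'decode']
```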
Throughput vs latency
Bigger batch → more throughput (amortize weight reads), worse per-request latency. Little's Law: concurrency = throughput × latency. The scheduler (Phase 3) and tuning (Phase 18) live here.
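Little's Law can be rearranged to estimate any one of the three quantities from the other two. The numbers below are made up for illustration, and the helper names are hypothetical.

```python
def concurrency_from(throughput_rps: float, latency_s: float) -> float:
    """Little's Law: average requests in flight = throughput x latency."""
    return throughput_rps * latency_s


def throughput_from(concurrency: float, latency_s: float) -> float:
    """Rearranged: throughput = concurrency / latency."""
    return concurrency / latency_s


# Assumed: 64 requests in flight, each taking 8 s end to end.
rps = throughput_from(concurrency=64, latency_s=8.0)
print(rps)  # 8.0

# Round trip: 8 req/s at 8 s latency implies 64 requests in flight.
in_flight = concurrency_from(throughput_rps=rps, latency_s=8.0)
print(in_flight)  # 64.0
```

If a bigger batch raises latency from 8 s to 10 s while holding 80 requests, throughput stays at 8 req/s; the batch only pays off when throughput grows faster than latency.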
Key upstream
- vllm/model_executor/models/llama.py (a real forward pass; Q/K/V at LlamaAttention.forward)
- vllm/v1/request.py:239 (the counters)
- vllm/v1/engine/core.py:428 (EngineCore.step)
Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md