Phase 00 — Cheatsheet: Foundations

The one-liner
The loop
Prefill vs decode
KV cache
The master model
Throughput vs latency
Key upstream

The one-liner

An LLM predicts the next token; generation repeats that prediction in a loop. Serving = doing it fast for many users at once. Memory (the KV cache), not compute, usually caps how many requests fit.

The loop

tokenize → (prefill the prompt) → loop[ forward → sample → append ] → detokenize. In vLLM: EngineCore.step = schedule → execute → sample → update (core.py:428).
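
The loop above can be sketched in a few lines. This is a toy illustration, not vLLM code: `generate` and `toy_forward` are hypothetical names, the "model" is a stand-in that returns fake logits, sampling is plain greedy argmax, and tokenize/detokenize are left out. It also has no KV cache, so every pass re-reads the whole sequence.

```python
from typing import Callable, List


def generate(prompt_ids: List[int],
             forward: Callable[[List[int]], List[float]],
             max_new_tokens: int,
             eos_id: int) -> List[int]:
    """Greedy generation: forward -> sample -> append, until EOS or budget."""
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        # The first call sees the whole prompt (prefill); later calls add one token.
        logits = forward(tokens)
        # Greedy "sampling": pick the index of the largest logit.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens[len(prompt_ids):]


def toy_forward(tokens: List[int]) -> List[float]:
    """Hypothetical stand-in model: always prefers (last_token + 1) mod 5."""
    vocab_size = 5
    target = (tokens[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


new_tokens = generate([0, 1], toy_forward, max_new_tokens=10, eos_id=4)
print(new_tokens)  # [2, 3, 4]
```

Generation stops at token 4 because it is the assumed EOS id; with a smaller `max_new_tokens` the budget ends the loop first.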

Prefill vs decode

	prefill	decode
tokens/pass	many	one
bound by	compute (FLOPs)	memory bandwidth
latency	TTFT	ITL/TPOT
fills	prompt KV	one KV/step
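
A rough way to see why the "bound by" row comes out this way is to compare arithmetic intensity against a hardware ridge point. The sketch below is a back-of-envelope estimate under loud assumptions: about 2 FLOPs per parameter per token, every weight read once per pass, KV cache reads and attention FLOPs ignored, and a made-up ridge of 300 FLOPs/byte for a hypothetical accelerator. The function names are illustrative only.

```python
def arithmetic_intensity(tokens_per_pass: int, dtype_bytes: int = 2) -> float:
    """Approximate FLOPs per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and that each parameter
    (dtype_bytes wide) is read from memory once per pass.
    """
    return 2.0 * tokens_per_pass / dtype_bytes


def bound_by(tokens_per_pass: int,
             ridge_flops_per_byte: float,
             dtype_bytes: int = 2) -> str:
    """Classify a pass as compute-bound or memory-bandwidth-bound."""
    intensity = arithmetic_intensity(tokens_per_pass, dtype_bytes)
    return "compute" if intensity >= ridge_flops_per_byte else "memory bandwidth"


RIDGE = 300.0  # assumed FLOPs/byte; not a measured figure for any real GPU

print(bound_by(2048, RIDGE))  # prefill of a 2048-token prompt -> compute
print(bound_by(1, RIDGE))     # single-request decode -> memory bandwidth
```

Under the same assumptions, batching decode raises `tokens_per_pass` to the batch size, which is why larger batches move decode toward the compute-bound side.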

KV cache

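Exists because a token's K/V never change once computed → cache them → each decode step runs one new query over N cached keys, so per-step work drops from O(N²) (recompute the whole sequence) to O(N).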
kv_bytes_per_token = 2 × layers × kv_heads × head_dim × dtype_bytes.
Llama-3-8B fp16 = 2 × 32 × 8 × 128 × 2 B = 128 KiB/token. Concurrency ≈ (HBM − weights) / (per_token × seq_len).
Shrink it: GQA (fewer kv_heads), fp8 KV (halves dtype_bytes vs fp16), shorter context, paging (Phase 2).
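
The sizing formulas above work out as follows. The Llama-3-8B shape (32 layers, 8 KV heads, head_dim 128) is the published config; the 80 GiB of HBM, 16 GiB of weights, and 8192-token sequences are assumptions chosen for round numbers, and the estimate ignores activations and fragmentation. Function names are illustrative.

```python
GiB = 1024 ** 3


def kv_bytes_per_token(layers: int, kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache per token; the leading 2 covers K and V."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_concurrency(hbm_bytes: int, weight_bytes: int,
                    per_token_bytes: int, seq_len: int) -> int:
    """How many full-length sequences fit in the memory left after weights."""
    return (hbm_bytes - weight_bytes) // (per_token_bytes * seq_len)


fp16 = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)
fp8 = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=1)
print(fp16 // 1024, "KiB/token in fp16")  # 128 KiB/token in fp16
print(fp8 // 1024, "KiB/token in fp8")    # 64 KiB/token in fp8

# Assumed hardware and workload, not measurements.
print(max_concurrency(80 * GiB, 16 * GiB, fp16, seq_len=8192))  # 64
print(max_concurrency(80 * GiB, 16 * GiB, fp8, seq_len=8192))   # 128
```

Halving dtype_bytes doubles the number of sequences that fit, which is the arithmetic behind the fp8 KV item in the list above.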

The master model

A request = num_computed_tokens racing num_tokens. Prefill = far behind (the prompt is still uncomputed); decode = one behind (only the newest sampled token is uncomputed). (vllm/v1/request.py:239; mirrored in mini_vllm/request.py.)
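
The two-counter view can be simulated directly. `ToyRequest` below is a hypothetical class written for this cheatsheet: it borrows the counter names num_tokens and num_computed_tokens from the text, but it is not the vLLM Request class and its `step` method is an assumed simplification of what a scheduler plus sampler would do.

```python
class ToyRequest:
    """Toy model of a request as two counters (not vLLM's Request class)."""

    def __init__(self, num_prompt_tokens: int) -> None:
        self.num_prompt_tokens = num_prompt_tokens
        self.num_tokens = num_prompt_tokens   # prompt + tokens sampled so far
        self.num_computed_tokens = 0          # tokens whose KV is already cached

    def gap(self) -> int:
        """How far num_computed_tokens trails num_tokens."""
        return self.num_tokens - self.num_computed_tokens

    def phase(self) -> str:
        done_prompt = self.num_computed_tokens >= self.num_prompt_tokens
        return "decode" if done_prompt else "prefill"

    def step(self, token_budget: int) -> None:
        """Compute up to token_budget tokens; once caught up, sample one more."""
        self.num_computed_tokens += min(token_budget, self.gap())
        if self.num_computed_tokens == self.num_tokens:
            self.num_tokens += 1  # newly sampled token, not yet computed


req = ToyRequest(num_prompt_tokens=10)
for _ in range(5):
    print(req.phase(), "gap =", req.gap())
    req.step(token_budget=4)
```

With a budget of 4 tokens per step, the gap shrinks 10 → 6 → 2 during prefill and then stays at 1 for every decode step.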

Throughput vs latency

Bigger batch → more throughput (weight reads are amortized over more tokens), worse per-request latency. Little's Law: concurrency (requests in flight) = throughput (requests/s) × latency (s). The scheduler (Phase 3) and tuning (Phase 18) live here.
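
Little's Law is easy to sanity-check with numbers. The figures below are invented for illustration, and the helper names are hypothetical; the only thing taken from the text is the identity itself.

```python
def concurrency(throughput_rps: float, latency_s: float) -> float:
    """Little's Law: average requests in flight = arrival rate x time in system."""
    return throughput_rps * latency_s


def throughput(concurrency_limit: float, latency_s: float) -> float:
    """Same identity rearranged: the rate a fixed concurrency can sustain."""
    return concurrency_limit / latency_s


# Assumed workload: 5 requests/s, each taking 4 s end to end.
print(concurrency(5.0, 4.0))   # 20.0 requests in flight

# Assumed cap: the KV cache fits 64 requests, each taking 8 s.
print(throughput(64, 8.0))     # 8.0 requests/s at most
```

The second call shows why the KV cache cap matters: with concurrency fixed by memory, throughput can only rise if per-request latency falls.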

Key upstream

vllm/model_executor/models/llama.py — a real forward pass (Q/K/V at LlamaAttention.forward)
vllm/v1/request.py:239 — the counters
vllm/v1/engine/core.py:428 — EngineCore.step

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md

vLLM Mastery — From Zero to Maintainer