Phase 00 — Cheatsheet: Foundations

The one-liner

An LLM predicts the next token; generation repeats that prediction in a loop. Serving = doing it fast for many users at once. Memory (the KV cache), not compute, is usually what caps concurrency.

The loop

tokenize → (prefill the prompt) → loop[ forward → sample → append ] → detokenize. In vLLM: EngineCore.step = schedule → execute → sample → update (core.py:428).
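The loop above can be sketched with a toy stand-in for the model. Everything here (the four-word vocabulary, `toy_forward`, `greedy_sample`, `generate`) is hypothetical and only illustrates the control flow; it is not vLLM code and does not use a KV cache.

```python
from typing import List

# Hypothetical toy vocabulary; a real tokenizer works on subword units.
VOCAB = ["<eos>", "hello", "world", "!"]
EOS_ID = 0


def tokenize(text: str) -> List[int]:
    """Map whitespace-separated words to token ids."""
    return [VOCAB.index(word) for word in text.split()]


def detokenize(ids: List[int]) -> str:
    """Map token ids back to text."""
    return " ".join(VOCAB[i] for i in ids)


def toy_forward(ids: List[int]) -> List[float]:
    """Stand-in for a forward pass: one score per vocabulary entry.

    It deterministically favors the id after the last token, wrapping
    around to <eos>, so the loop always terminates.
    """
    favored = (ids[-1] + 1) % len(VOCAB)
    return [1.0 if i == favored else 0.0 for i in range(len(VOCAB))]


def greedy_sample(logits: List[float]) -> int:
    """Pick the highest-scoring token id."""
    return max(range(len(logits)), key=logits.__getitem__)


def generate(prompt: str, max_new_tokens: int = 8) -> str:
    ids = tokenize(prompt)               # tokenize
    for _ in range(max_new_tokens):      # loop[ ... ]
        logits = toy_forward(ids)        # forward (first pass plays the role of prefill)
        next_id = greedy_sample(logits)  # sample
        if next_id == EOS_ID:
            break
        ids.append(next_id)              # append
    return detokenize(ids)               # detokenize
```

With this toy model, `generate("hello")` returns `"hello world !"`: two tokens are appended, then `<eos>` ends the loop.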

Prefill vs decode

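|             | prefill         | decode           |
|-------------|-----------------|------------------|
| tokens/pass | many            | one              |
| bound by    | compute (FLOPs) | memory bandwidth |
| latency     | TTFT            | ITL/TPOT         |
| fills       | prompt KV       | one KV/step      |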
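A rough roofline-style estimate shows why the "bound by" row comes out this way. The sketch below assumes only weight reads matter (it ignores activations and KV-cache traffic), and the hardware numbers in the usage note are hypothetical, not taken from any specific GPU.

```python
def arithmetic_intensity(tokens_per_pass: int, dtype_bytes: int = 2) -> float:
    """FLOPs per byte of weights read in one forward pass.

    Each weight contributes 2 FLOPs (multiply + add) per token, and is
    read from memory once per pass regardless of how many tokens share it.
    """
    return 2 * tokens_per_pass / dtype_bytes


def bound_by(
    tokens_per_pass: int,
    peak_flops: float,
    peak_bandwidth: float,
    dtype_bytes: int = 2,
) -> str:
    """Compare the pass's intensity to the hardware ridge point (FLOPs/byte)."""
    ridge = peak_flops / peak_bandwidth
    intensity = arithmetic_intensity(tokens_per_pass, dtype_bytes)
    return "compute" if intensity >= ridge else "memory bandwidth"
```

For a hypothetical accelerator with 1e15 FLOP/s and 2e12 B/s (ridge point 500 FLOPs/byte), `bound_by(2048, 1e15, 2e12)` gives `"compute"` (a prefill-sized pass) and `bound_by(1, 1e15, 2e12)` gives `"memory bandwidth"` (a single decode token).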

KV cache

  • Exists because K/V never change once computed → cache them → total K/V projection work over N tokens drops from O(N²) to O(N). (Attention over the cached keys is still O(N) per step.)
  • kv_bytes_per_token = 2 × layers × kv_heads × head_dim × dtype_bytes (the 2 covers K and V).
  • Llama-3-8B fp16: 2 × 32 × 8 × 128 × 2 B = 128 KiB/token. Concurrency ≈ (HBM − weights) / (per_token × seq_len).
  • Shrink it: GQA (fewer kv_heads), fp8 KV (half dtype), shorter context, paging (Phase 2).
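The two formulas in the list above can be checked with a few lines of arithmetic. The Llama-3-8B shape (32 layers, 8 KV heads, head_dim 128) reproduces the 128 KiB figure; the 80 GiB of HBM, 16 GiB of weights, and 8192-token sequences in the usage note are illustrative assumptions, and the estimate ignores activations, fragmentation, and runtime overhead.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies; the factor 2 is K plus V."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def max_concurrency(hbm_bytes: int, weight_bytes: int, per_token_bytes: int, seq_len: int) -> int:
    """Upper bound on full-length sequences whose KV cache fits beside the weights."""
    return (hbm_bytes - weight_bytes) // (per_token_bytes * seq_len)


GIB = 2**30

# Llama-3-8B attention shape, fp16 (2 bytes) vs fp8 (1 byte) KV cache.
fp16_per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2)
fp8_per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=1)

# Assumed budget: 80 GiB HBM, 16 GiB of weights, 8192-token sequences.
fp16_slots = max_concurrency(80 * GIB, 16 * GIB, fp16_per_token, seq_len=8192)
fp8_slots = max_concurrency(80 * GIB, 16 * GIB, fp8_per_token, seq_len=8192)
```

Under these assumptions `fp16_per_token` is 131072 bytes (128 KiB), each 8192-token sequence needs 1 GiB of cache, and `fp16_slots` is 64; switching to an fp8 KV cache doubles that to 128.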

The master model

A request = num_computed_tokens racing num_tokens. Prefill = far behind; decode = one behind. (vllm/v1/request.py:239; mirrored in mini_vllm/request.py.)
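A minimal sketch of the two counters, assuming a simplified scheduler that gives a request a token budget each step. Only the names `num_computed_tokens` and `num_tokens` come from the text above; `ToyRequest`, `tokens_behind`, `step`, and `run` are hypothetical and do not mirror the real class.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyRequest:
    prompt_token_ids: List[int]
    output_token_ids: List[int] = field(default_factory=list)
    num_computed_tokens: int = 0  # tokens whose KV has already been computed

    @property
    def num_tokens(self) -> int:
        """Every token the request currently holds: prompt plus generated."""
        return len(self.prompt_token_ids) + len(self.output_token_ids)

    @property
    def tokens_behind(self) -> int:
        """How far num_computed_tokens trails num_tokens."""
        return self.num_tokens - self.num_computed_tokens


def step(req: ToyRequest, budget: int) -> None:
    """Compute up to `budget` pending tokens; once caught up, sample one more."""
    req.num_computed_tokens += min(budget, req.tokens_behind)
    if req.num_computed_tokens == req.num_tokens:
        req.output_token_ids.append(0)  # placeholder for a sampled token id


def run(req: ToyRequest, steps: int, budget: int) -> List[int]:
    """Record the gap seen at the start of each step."""
    gaps = []
    for _ in range(steps):
        gaps.append(req.tokens_behind)
        step(req, budget)
    return gaps
```

For a 5-token prompt and a budget of 8, `run(ToyRequest([1, 2, 3, 4, 5]), steps=3, budget=8)` returns `[5, 1, 1]`: the first step is far behind (prefill), and every later step is exactly one behind (decode).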

Throughput vs latency

Bigger batch → more throughput (amortize weight reads), worse per-request latency. Little's Law: concurrency = throughput × latency. The scheduler (Phase 3) and tuning (Phase 18) live here.
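The tradeoff can be made concrete with a toy cost model. It assumes each decode step pays a fixed cost to read the weights plus a small per-sequence cost; the constants (20 ms and 0.5 ms) are made up for illustration and are not measurements.

```python
def little_concurrency(throughput: float, latency_s: float) -> float:
    """Little's Law: items in flight = arrival/completion rate x time in system."""
    return throughput * latency_s


def step_time_s(batch_size: int, weight_read_s: float = 0.020, per_seq_s: float = 0.0005) -> float:
    """Hypothetical decode step time: fixed weight read plus per-sequence work."""
    return weight_read_s + per_seq_s * batch_size


def decode_throughput(batch_size: int) -> float:
    """Tokens per second across the batch: one token per sequence per step."""
    return batch_size / step_time_s(batch_size)


for batch in (1, 8, 64):
    print(
        f"batch={batch:3d}  "
        f"per-token latency={step_time_s(batch) * 1000:.1f} ms  "
        f"throughput={decode_throughput(batch):.0f} tok/s"
    )
```

In this model the weight read is shared by the whole batch, so throughput climbs with batch size while each request waits longer per token. Multiplying throughput by step time recovers the batch size, which is Little's Law applied to a single decode step.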

Key upstream

  • vllm/model_executor/models/llama.py — a real forward pass (Q/K/V at LlamaAttention.forward)
  • vllm/v1/request.py:239 — the counters
  • vllm/v1/engine/core.py:428 — EngineCore.step

Full: 00-guide.md · 01-deep-dive.md · INTERVIEW.md