vLLM Mastery — From Zero to Maintainer

A deep, lab-driven journey through the internals of the world's most popular open-source LLM inference engine.

This is not a tutorial. It is a 20-phase apprenticeship. If you start at Phase 0 knowing nothing about how language models run, and you finish every lab, you will be able to:

  • Read and modify any part of the vLLM codebase — the scheduler, the KV-cache manager, attention backends, quantization, speculative decoding, distributed execution.
  • Land real pull requests upstream and reason about them like a maintainer.
  • Operate as a principal / staff LLM-inference engineer — design serving systems, debug throughput cliffs, and make the architectural calls that decide whether a model serves 10 or 10,000 users per GPU.
  • Found or join a startup in the inference space and know exactly where the moats are.

Everything you need is in this repository. You will never need an outside book.



The two things that make this work

1. You read the real engine

Every concept is anchored to the actual vLLM source code, frozen at a single commit (see UPSTREAM_PIN.md: v0.22.1 @ 0decac0). When a phase says

vllm/v1/core/block_pool.py:333 BlockPool.get_new_blocks()

that line really exists in ./upstream/ and you are expected to open it. We do not paraphrase the engine. We quote it and explain it line by line.

⚠️ vLLM moves fast (dozens of merged PRs per day). Line numbers are valid only at the pinned commit. The named class/function is always given so you can re-find it in any version. Re-create the exact tree with the command in UPSTREAM_PIN.md.
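Because line numbers drift while names stay stable, re-finding a reference can be automated. The following is a minimal sketch, not part of the course tooling: `find_definition` is a hypothetical helper, and `SAMPLE` is a made-up stand-in for a source file, not a quote from vLLM. It uses only the standard-library `ast` module to resolve a `Class.method` name to its current line.

```python
import ast
from typing import Optional


def find_definition(source: str, qualified_name: str) -> Optional[int]:
    """Return the 1-based line where `Class.method` (or a top-level
    function or class) is defined, or None if it cannot be found."""
    body = ast.parse(source).body
    node = None
    for name in qualified_name.split("."):
        # Look for a class or function with this name at the current nesting level.
        node = next(
            (
                n
                for n in body
                if isinstance(n, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))
                and n.name == name
            ),
            None,
        )
        if node is None:
            return None
        body = node.body  # descend into the class or function body
    return node.lineno


# Hypothetical stand-in for a source file; NOT the real vLLM code.
SAMPLE = '''\
class BlockPool:
    def __init__(self):
        self.free = []

    def get_new_blocks(self, num_blocks):
        return []
'''

if __name__ == "__main__":
    print(find_definition(SAMPLE, "BlockPool.get_new_blocks"))
```

To use it on a real checkout, read the file named in the reference (for example with `pathlib.Path(...).read_text()`) and pass its text as `source`.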

2. You build a small engine

Reading is not understanding. So in parallel you build mini_vllm/ — a deliberately small, dependency-light reimplementation of vLLM's core ideas that runs on a laptop CPU, no GPU required. By the end you will have written, with your own hands:

  • a paged KV-cache block allocator (Phase 2),
  • a continuous-batching scheduler with prefix caching (Phase 3),
  • a sampler, an n-gram speculative decoder, a batched-LoRA matmul, a grammar mask, …

The real engine teaches you what production looks like. The mini engine teaches you why every decision was made. You need both, and that pairing is the anchor this course is built on.
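To give a feel for the scale of a mini_vllm/ component, here is a rough sketch of what a paged KV-cache block allocator could look like. It is an illustration under assumptions, not the course's Phase 2 solution and not vLLM's `BlockPool`: the class name `BlockAllocator`, its methods, and the FIFO free list are all hypothetical choices made for this example.

```python
from collections import deque


class BlockAllocator:
    """Toy paged allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size          # tokens per block
        self.free = deque(range(num_blocks))  # physical block ids not in use
        self.tables = {}                      # seq_id -> list of physical block ids
        self.lengths = {}                     # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> int:
        """Reserve room for one more token and return the block that holds it."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        # A new block is needed only when every owned block is full.
        if n == len(table) * self.block_size:
            if not self.free:
                raise MemoryError("out of KV-cache blocks")
            table.append(self.free.popleft())
        self.lengths[seq_id] = n + 1
        return table[-1]

    def release(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free list."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


if __name__ == "__main__":
    alloc = BlockAllocator(num_blocks=4, block_size=2)
    for _ in range(3):
        alloc.append_token("a")   # three tokens span two blocks
    alloc.append_token("b")
    print(alloc.tables)           # per-sequence block tables
    alloc.release("a")
    print(list(alloc.free))       # blocks available again
```

Blocks are handed out one at a time as a sequence grows, so at most one partially filled block per sequence is wasted. A production allocator adds reference counting, prefix-cache hashing, and eviction on top of this idea.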


How each phase is structured

Every phase-NN-*/ folder has the same shape:

| File | What it is |
|---|---|
| 00-guide.md | The Hitchhiker's Guide to the topic. Don't Panic. Pure intuition, analogies, ASCII diagrams. Assumes you know nothing. Read this first. |
| 01-deep-dive.md | The real implementation. Upstream path:line references, quoted excerpts, line-by-line explanation, data structures, edge cases. |
| 02-mini-build.md | Build or extend the mini_vllm/ component for this topic. |
| labs/lab-NN-*/ | Hands-on labs: README.md + starter.py + solution.py + test_lab.py. |
| EXERCISES.md | Graded challenges, easy → staff-level, with hints and solutions. |
| INTERVIEW.md | Real staff/principal interview questions on the topic, with model answers. |
| CHEATSHEET.md | One page: APIs, invariants, performance knobs, gotchas. |

Lab hardware tags

Not everyone has a GPU. Every lab is tagged:

  • [CPU-OK] — runs anywhere, including the CI on your laptop. Most labs.
  • [GPU-OPT] — better on a GPU but has a CPU fallback; expected GPU output is captured in the README so you can follow along without one.
  • [GPU-REQ] — genuinely needs an NVIDIA GPU (real CUDA kernels). The README includes captured output and a step-by-step so you learn even if you only rent a GPU later.

See SETUP.md for environment setup and cheap cloud-GPU options.


The curriculum (20 phases)

| # | Phase | One-line goal |
|---|---|---|
| 00 | Foundations | What an LLM forward pass is; prefill vs decode; why the KV cache exists. |
| 01 | Architecture & Request Lifecycle | Trace one request from LLM.generate() to tokens out. |
| 02 | PagedAttention | How vLLM stores KV memory in fixed-size pages and avoids external fragmentation. |
| 03 | Continuous Batching & Scheduler | Iteration-level scheduling, chunked prefill, prefix caching, preemption. |
| 04 | Attention Backends | FlashAttention, FlashInfer, TRTLLM-GEN, FlashMLA, Triton. |
| 05 | CUDA Graphs & torch.compile | Piecewise vs full graphs; the compilation pipeline. |
| 06 | Quantization | FP8/MXFP4/NVFP4/INT8/INT4, GPTQ/AWQ/GGUF/compressed-tensors. |
| 07 | GEMM & MoE Kernels | CUTLASS GEMM; MoE routing & grouped GEMM; expert parallelism. |
| 08 | Speculative Decoding | n-gram, suffix, EAGLE, DFlash; draft/verify & rejection sampling. |
| 09 | Sampling & Decoding Algorithms | top-k/p, penalties, parallel sampling, beam search, logits processors. |
| 10 | Distributed Inference | Tensor / Pipeline / Data / Expert / Context parallelism. |
| 11 | Multi-LoRA | Batched adapters, punica/SGMV, dense + MoE LoRA. |
| 12 | Structured Outputs | Grammar-constrained decoding via xgrammar / guidance. |
| 13 | Multimodal Models | Vision encoders, image-token merging, processor cache. |
| 14 | Model Architectures | Add a model: decoder-only, MoE, hybrid/SSM, embedding/reward. |
| 15 | Disaggregated Serving | Prefill/decode/encode split; KV transfer connectors. |
| 16 | Serving APIs & Parsers | OpenAI & Anthropic APIs, gRPC, streaming, tool/reasoning parsers. |
| 17 | Hardware Backends & Plugins | The platform abstraction; NVIDIA/AMD/CPU/TPU plugins. |
| 18 | Performance Engineering | Profiling, benchmarking, roofline thinking, tuning knobs. |
| 19 | Capstone — Maintainer & Startup | Land a real PR; the staff competency map; the startup playbook. |

⭐ = the original flagship phases that set the template. Every phase now has fully written labs — 60+ in total, each with an in-depth guide-style README, and (for the CPU labs) a tested starter.py / solution.py / test_lab.py triplet. Run the whole suite with pytest -m "not gpu" from the repo root; every phase's labs/README.md gives the recommended order and the skills each lab delivers.
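The `-m "not gpu"` filter implies that GPU-only tests carry a pytest marker named `gpu`. How the labs actually apply it is not shown here, so treat the following as a sketch under that assumption: the test names and the `ceil_div` helper are hypothetical, while `pytest.mark`, `pytest.importorskip`, `pytest.skip`, and `config.addinivalue_line` are standard pytest APIs.

```python
import pytest


def pytest_configure(config):
    # Normally placed in conftest.py: registers the marker so pytest
    # does not warn about an unknown mark.
    config.addinivalue_line("markers", "gpu: test needs an NVIDIA GPU")


def ceil_div(a: int, b: int) -> int:
    """Number of size-b blocks needed to hold a items."""
    return -(-a // b)


@pytest.mark.gpu  # deselected by: pytest -m "not gpu"
def test_tensor_sum_on_gpu():
    torch = pytest.importorskip("torch")  # skip cleanly if torch is absent
    if not torch.cuda.is_available():
        pytest.skip("no CUDA device visible")
    assert torch.ones(4, device="cuda").sum().item() == 4.0


def test_ceil_div_on_cpu():
    # Unmarked, so it still runs under: pytest -m "not gpu"
    assert ceil_div(5, 2) == 3
```

With this layout, `pytest -m "not gpu"` collects only the unmarked test, and `pytest -m gpu` selects only the GPU one.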


  1. Do them in order, 0 → 19. Each builds on the last; mini_vllm/ grows phase by phase.
  2. For each phase: read 00-guide.md → read 01-deep-dive.md with upstream/ open in a second window → do 02-mini-build.md → run the labs → attempt EXERCISES.md → self-test with INTERVIEW.md.
  3. Run the tests constantly: pytest -m "not gpu" from the repo root.
  4. Keep a lab notebook. When you finish, your notebook + mini_vllm/ + a merged upstream PR is your portfolio.

Start here: SETUP.md, then phase-00-foundations/00-guide.md.

See also: GLOSSARY.md (every term defined once) and CAREER.md (the maintainer path, the staff competency map, the startup playbook).

This repo also builds as a website (mdBook → Cloudflare Pages): see PUBLISHING.md.