vLLM Mastery — From Zero to Maintainer

A deep, lab-driven journey through the internals of the world's most popular open-source LLM inference engine.

This is not a tutorial. It is a 20-phase apprenticeship. If you start at Phase 0 knowing nothing about how language models run, and you finish every lab, you will be able to:

- Read and modify any part of the vLLM codebase: the scheduler, the KV-cache manager, attention backends, quantization, speculative decoding, distributed execution.
- Land real pull requests upstream and reason about them like a maintainer.
- Operate as a principal / staff LLM-inference engineer: design serving systems, debug throughput cliffs, and make the architectural calls that decide whether a model serves 10 or 10,000 users per GPU.
- Found or join a startup in the inference space and know exactly where the moats are.

Everything you need is in this repository. You will never need an outside book.

The two things that make this work
- 1. You read the real engine
- 2. You build a small engine
How each phase is structured
- Lab hardware tags
The curriculum (20 phases)
Recommended path

The two things that make this work

1. You read the real engine

Every concept is anchored to the actual vLLM source code, frozen at a single commit (see UPSTREAM_PIN.md: v0.22.1 @ 0decac0). When a phase says

vllm/v1/core/block_pool.py:333 — BlockPool.get_new_blocks()

that line really exists in ./upstream/ and you are expected to open it. We do not paraphrase the engine. We quote it and explain it line by line.

⚠️ vLLM moves fast (dozens of merged PRs per day). Line numbers are valid only at the pinned commit. The named class/function is always given so you can re-find it in any version. Re-create the exact tree with the command in UPSTREAM_PIN.md.
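
To make the `path:line` convention concrete, here is a minimal sketch of a helper that resolves such a reference against a local checkout and prints the surrounding lines. The function names `parse_ref` and `show_ref`, and the default root of `upstream`, are assumptions made for illustration; they are not part of vLLM or of this repository's tooling.

```python
from pathlib import Path


def parse_ref(ref: str) -> tuple[str, int]:
    """Split 'some/file.py:333' into ('some/file.py', 333)."""
    path, _, line = ref.rpartition(":")
    if not path or not line.isdigit():
        raise ValueError(f"not a path:line reference: {ref!r}")
    return path, int(line)


def show_ref(ref: str, root: str = "upstream", context: int = 2) -> str:
    """Return the referenced line with a few numbered lines of context.

    `root` is assumed to be the directory holding the pinned checkout.
    """
    rel_path, lineno = parse_ref(ref)
    lines = (Path(root) / rel_path).read_text(encoding="utf-8").splitlines()
    if not 1 <= lineno <= len(lines):
        raise IndexError(f"{rel_path} has {len(lines)} lines, asked for {lineno}")
    start = max(1, lineno - context)
    end = min(len(lines), lineno + context)
    out = []
    for n in range(start, end + 1):
        marker = ">" if n == lineno else " "  # '>' flags the referenced line
        out.append(f"{marker}{n:5d}  {lines[n - 1]}")
    return "\n".join(out)
```

With the pinned tree in place, `print(show_ref("vllm/v1/core/block_pool.py:333"))` would print that line and its neighbours. On any other commit the number may point somewhere else, which is why the class and function name accompany every reference.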

2. You build a small engine

Reading is not understanding. So in parallel you build mini_vllm/, a deliberately small, dependency-light reimplementation of vLLM's core ideas that runs on a laptop CPU, no GPU required. By the end you will have written, with your own hands:

- a paged KV-cache block allocator (Phase 2),
- a continuous-batching scheduler with prefix caching (Phase 3),
- a sampler, an n-gram speculative decoder, a batched-LoRA matmul, a grammar mask, …
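
To give a feel for the first item in the list above, here is a rough sketch of what a paged block allocator can look like. It is not the course's `mini_vllm/` code and not vLLM's `BlockPool`; the class name, method names, and the string request IDs are all hypothetical. It shows only the core idea: KV memory is handed out in fixed-size blocks, and each request keeps a table of block IDs that need not be contiguous.

```python
from collections import deque


class BlockAllocator:
    """Toy paged allocator (hypothetical API, for illustration only)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: deque[int] = deque(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # request id -> block ids
        self.num_tokens: dict[str, int] = {}          # request id -> tokens stored

    def blocks_needed(self, num_tokens: int) -> int:
        # Ceiling division: a partly filled block still occupies a whole block.
        return -(-num_tokens // self.block_size)

    def append_tokens(self, req_id: str, num_new: int) -> list[int]:
        """Reserve room for `num_new` more tokens; return the block table."""
        table = self.block_tables.get(req_id, [])
        new_len = self.num_tokens.get(req_id, 0) + num_new
        extra = self.blocks_needed(new_len) - len(table)
        if extra > len(self.free_blocks):
            # A real scheduler would preempt or queue here instead of failing.
            raise MemoryError("out of KV-cache blocks")
        table = table + [self.free_blocks.popleft() for _ in range(extra)]
        self.block_tables[req_id] = table
        self.num_tokens[req_id] = new_len
        return list(table)

    def release(self, req_id: str) -> None:
        """Return all of a finished request's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.num_tokens.pop(req_id, None)
```

For example, with `BlockAllocator(num_blocks=4, block_size=16)`, a request that stores 17 tokens occupies two blocks, and a new block is taken only when the 33rd token arrives. Because any free block can serve any request, at most the last block of each request is partly empty.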

The real engine teaches you what production looks like. The mini engine teaches you why every decision was made. You need both, and the whole course is anchored on both.

How each phase is structured

Every phase-NN-*/ folder has the same shape:

File	What it is
`00-guide.md`	The Hitchhiker's Guide to the topic. Don't Panic. Pure intuition, analogies, ASCII diagrams. Assumes you know nothing. Read this first.
`01-deep-dive.md`	The real implementation. Upstream `path:line` references, quoted excerpts, line-by-line explanation, data structures, edge cases.
`02-mini-build.md`	Build or extend the `mini_vllm/` component for this topic.
`labs/lab-NN-*/`	Hands-on labs: `README.md` + `starter.py` + `solution.py` + `test_lab.py`.
`EXERCISES.md`	Graded challenges, easy → staff-level, with hints and solutions.
`INTERVIEW.md`	Real staff/principal interview questions on the topic, with model answers.
`CHEATSHEET.md`	One page: APIs, invariants, performance knobs, gotchas.

Lab hardware tags

Not everyone has a GPU. Every lab is tagged:

- [CPU-OK]: runs anywhere, including a laptop and CI. Most labs.
- [GPU-OPT]: better on a GPU but has a CPU fallback; expected GPU output is captured in the README so you can follow along without one.
- [GPU-REQ]: genuinely needs an NVIDIA GPU (real CUDA kernels). The README includes captured output and a step-by-step walkthrough, so you learn even if you only rent a GPU later.

See SETUP.md for environment setup and cheap cloud-GPU options.

The curriculum (20 phases)

#	Phase	One-line goal
00	Foundations	What an LLM forward pass is; prefill vs decode; why the KV cache exists.
01	Architecture & Request Lifecycle	Trace one request from `LLM.generate()` to tokens out.
02	PagedAttention ⭐	How vLLM stores KV memory in pages and never fragments.
03	Continuous Batching & Scheduler ⭐	Iteration-level scheduling, chunked prefill, prefix caching, preemption.
04	Attention Backends	FlashAttention, FlashInfer, TRTLLM-GEN, FlashMLA, Triton.
05	CUDA Graphs & torch.compile	Piecewise vs full graphs; the compilation pipeline.
06	Quantization	FP8/MXFP4/NVFP4/INT8/INT4, GPTQ/AWQ/GGUF/compressed-tensors.
07	GEMM & MoE Kernels	CUTLASS GEMM; MoE routing & grouped GEMM; expert parallelism.
08	Speculative Decoding	n-gram, suffix, EAGLE, DFlash; draft/verify & rejection sampling.
09	Sampling & Decoding Algorithms	top-k/p, penalties, parallel sampling, beam search, logits processors.
10	Distributed Inference	Tensor / Pipeline / Data / Expert / Context parallelism.
11	Multi-LoRA	Batched adapters, punica/SGMV, dense + MoE LoRA.
12	Structured Outputs	Grammar-constrained decoding via xgrammar / guidance.
13	Multimodal Models	Vision encoders, image-token merging, processor cache.
14	Model Architectures	Add a model: decoder-only, MoE, hybrid/SSM, embedding/reward.
15	Disaggregated Serving	Prefill/decode/encode split; KV transfer connectors.
16	Serving APIs & Parsers	OpenAI & Anthropic APIs, gRPC, streaming, tool/reasoning parsers.
17	Hardware Backends & Plugins	The platform abstraction; NVIDIA/AMD/CPU/TPU plugins.
18	Performance Engineering	Profiling, benchmarking, roofline thinking, tuning knobs.
19	Capstone — Maintainer & Startup	Land a real PR; the staff competency map; the startup playbook.

⭐ = the original flagship phases that set the template.

Every phase now has fully written labs (60+ in total), each with an in-depth guide-style README and, for the CPU labs, a tested starter.py / solution.py / test_lab.py triplet. Run the whole suite with pytest -m "not gpu" from the repo root; every phase's labs/README.md gives the recommended order and the skills each lab delivers.

Recommended path

- Do them in order, 0 → 19. Each builds on the last; mini_vllm/ grows phase by phase.
- For each phase: read 00-guide.md → read 01-deep-dive.md with upstream/ open in a second window → do 02-mini-build.md → run the labs → attempt EXERCISES.md → self-test with INTERVIEW.md.
- Run the tests constantly: pytest -m "not gpu" from the repo root.
- Keep a lab notebook. When you finish, your notebook + mini_vllm/ + a merged upstream PR is your portfolio.

Start here: SETUP.md, then phase-00-foundations/00-guide.md.

See also: GLOSSARY.md (every term defined once) and CAREER.md (the maintainer path, the staff competency map, the startup playbook).

This repo also builds as a website (mdBook → Cloudflare Pages): see PUBLISHING.md.
