vLLM Mastery — From Zero to Maintainer
A deep, lab-driven journey through the internals of the world's most popular open-source LLM inference engine.
This is not a tutorial. It is a 20-phase apprenticeship. If you start at Phase 0 knowing nothing about how language models run, and you finish every lab, you will be able to:
- Read and modify any part of the vLLM codebase — the scheduler, the KV-cache manager, attention backends, quantization, speculative decoding, distributed execution.
- Land real pull requests upstream and reason about them like a maintainer.
- Operate as a principal / staff LLM-inference engineer — design serving systems, debug throughput cliffs, and make the architectural calls that decide whether a model serves 10 or 10,000 users per GPU.
- Found or join a startup in the inference space and know exactly where the moats are.
Everything you need is in this repository. You will never need an outside book.
Contents
- The two things that make this work
- How each phase is structured
- The curriculum (20 phases)
- Recommended path
The two things that make this work
1. You read the real engine
Every concept is anchored to the actual vLLM source code, frozen at a single commit
(see UPSTREAM_PIN.md: v0.22.1 @ 0decac0). When a phase says
vllm/v1/core/block_pool.py:333—BlockPool.get_new_blocks()
that line really exists in ./upstream/ and you are expected to open it. We do not
paraphrase the engine. We quote it and explain it line by line.
⚠️ vLLM moves fast (dozens of merged PRs per day). Line numbers are valid only at the pinned commit. The named class/function is always given so you can re-find it in any version. Re-create the exact tree with the command in UPSTREAM_PIN.md.
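Because line numbers drift between versions, the class and function name is the stable handle. The following is a minimal sketch of re-finding a named method with Python's standard `ast` module. It is not part of the course tooling: `find_method_line` is a hypothetical helper, and the sample source is an invented stand-in rather than a quote from the engine.

```python
import ast
from typing import Optional


def find_method_line(source: str, class_name: str, method_name: str) -> Optional[int]:
    """Return the 1-based line of `class_name.method_name` in `source`, or None."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            for item in node.body:
                is_func = isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
                if is_func and item.name == method_name:
                    return item.lineno
    return None


# Invented stand-in source; in practice, read the real file from ./upstream/.
SAMPLE = '''
class BlockPool:
    def __init__(self):
        self.free = []

    def get_new_blocks(self, num_blocks):
        return self.free[:num_blocks]
'''

line = find_method_line(SAMPLE, "BlockPool", "get_new_blocks")
print(line)
```

Against a real checkout, one would pass the text of `upstream/vllm/v1/core/block_pool.py` as `source` and compare the result with the line number quoted in the phase.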
2. You build a small engine
Reading is not understanding. So in parallel you build mini_vllm/ — a deliberately
small, dependency-light reimplementation of vLLM's core ideas that runs on a laptop CPU,
no GPU required. By the end you will have written, with your own hands:
- a paged KV-cache block allocator (Phase 2),
- a continuous-batching scheduler with prefix caching (Phase 3),
- a sampler, an n-gram speculative decoder, a batched-LoRA matmul, a grammar mask, …
The real engine teaches you what production looks like. The mini engine teaches you why every decision was made. You need both, and this course is anchored on that pairing.
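To give a feel for the first item in the list above, here is a rough sketch of a paged block allocator. It is an illustration under simple assumptions (a fixed block size, a plain free list, no prefix caching or reference counting) and is neither vLLM's `BlockPool` nor the course's `mini_vllm/` code. The class and method names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List


class ToyBlockAllocator:
    """Hands out fixed-size KV-cache blocks from a free list (illustrative only)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free: Deque[int] = deque(range(num_blocks))
        self.tables: Dict[str, List[int]] = {}  # request id -> block ids

    def blocks_needed(self, num_tokens: int) -> int:
        return -(-num_tokens // self.block_size)  # ceiling division

    def allocate(self, request_id: str, num_tokens: int) -> List[int]:
        """Grow the request's block table to cover `num_tokens` tokens in total."""
        table = self.tables.get(request_id, [])
        extra = max(self.blocks_needed(num_tokens) - len(table), 0)
        if extra > len(self.free):
            raise MemoryError("out of KV-cache blocks")
        table = table + [self.free.popleft() for _ in range(extra)]
        self.tables[request_id] = table
        return table

    def release(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the free list."""
        self.free.extend(self.tables.pop(request_id, []))


alloc = ToyBlockAllocator(num_blocks=4, block_size=16)
alloc.allocate("req-a", 20)   # 20 tokens need 2 blocks
alloc.allocate("req-b", 5)    # 5 tokens need 1 block
alloc.allocate("req-a", 33)   # growing to 33 tokens needs a 3rd block
alloc.release("req-b")        # req-b's block becomes reusable
print(alloc.tables, list(alloc.free))
```

Note that `req-a` ends up with non-contiguous block ids. A per-request block table is what lets paged storage avoid needing one contiguous region per sequence.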
How each phase is structured
Every phase-NN-*/ folder has the same shape:
| File | What it is |
|---|---|
| 00-guide.md | The Hitchhiker's Guide to the topic. Don't Panic. Pure intuition, analogies, ASCII diagrams. Assumes you know nothing. Read this first. |
| 01-deep-dive.md | The real implementation. Upstream path:line references, quoted excerpts, line-by-line explanation, data structures, edge cases. |
| 02-mini-build.md | Build or extend the mini_vllm/ component for this topic. |
| labs/lab-NN-*/ | Hands-on labs: README.md + starter.py + solution.py + test_lab.py. |
| EXERCISES.md | Graded challenges, easy → staff-level, with hints and solutions. |
| INTERVIEW.md | Real staff/principal interview questions on the topic, with model answers. |
| CHEATSHEET.md | One page: APIs, invariants, performance knobs, gotchas. |
Lab hardware tags
Not everyone has a GPU. Every lab is tagged:
- [CPU-OK]: runs anywhere, including the CI on your laptop. Most labs.
- [GPU-OPT]: better on a GPU but has a CPU fallback; expected GPU output is captured in the README so you can follow along without one.
- [GPU-REQ]: genuinely needs an NVIDIA GPU (real CUDA kernels). The README includes captured output and a step-by-step so you learn even if you only rent a GPU later.
See SETUP.md for environment setup and cheap cloud-GPU options.
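As an illustration of what a CPU fallback can look like, the sketch below picks a device at run time. It assumes the lab uses PyTorch, which this README does not state, and `pick_device` is a hypothetical helper. The actual labs may select devices differently.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch sees a CUDA GPU, otherwise "cpu"."""
    try:
        import torch  # optional dependency in this sketch
    except ImportError:
        return "cpu"  # no PyTorch installed: stay on the CPU path
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"running on {device}")
```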
The curriculum (20 phases)
| # | Phase | One-line goal |
|---|---|---|
| 00 | Foundations | What an LLM forward pass is; prefill vs decode; why the KV cache exists. |
| 01 | Architecture & Request Lifecycle | Trace one request from LLM.generate() to tokens out. |
| 02 | PagedAttention ⭐ | How vLLM stores KV memory in pages and never fragments. |
| 03 | Continuous Batching & Scheduler ⭐ | Iteration-level scheduling, chunked prefill, prefix caching, preemption. |
| 04 | Attention Backends | FlashAttention, FlashInfer, TRTLLM-GEN, FlashMLA, Triton. |
| 05 | CUDA Graphs & torch.compile | Piecewise vs full graphs; the compilation pipeline. |
| 06 | Quantization | FP8/MXFP4/NVFP4/INT8/INT4, GPTQ/AWQ/GGUF/compressed-tensors. |
| 07 | GEMM & MoE Kernels | CUTLASS GEMM; MoE routing & grouped GEMM; expert parallelism. |
| 08 | Speculative Decoding | n-gram, suffix, EAGLE, DFlash; draft/verify & rejection sampling. |
| 09 | Sampling & Decoding Algorithms | top-k/p, penalties, parallel sampling, beam search, logits processors. |
| 10 | Distributed Inference | Tensor / Pipeline / Data / Expert / Context parallelism. |
| 11 | Multi-LoRA | Batched adapters, punica/SGMV, dense + MoE LoRA. |
| 12 | Structured Outputs | Grammar-constrained decoding via xgrammar / guidance. |
| 13 | Multimodal Models | Vision encoders, image-token merging, processor cache. |
| 14 | Model Architectures | Add a model: decoder-only, MoE, hybrid/SSM, embedding/reward. |
| 15 | Disaggregated Serving | Prefill/decode/encode split; KV transfer connectors. |
| 16 | Serving APIs & Parsers | OpenAI & Anthropic APIs, gRPC, streaming, tool/reasoning parsers. |
| 17 | Hardware Backends & Plugins | The platform abstraction; NVIDIA/AMD/CPU/TPU plugins. |
| 18 | Performance Engineering | Profiling, benchmarking, roofline thinking, tuning knobs. |
| 19 | Capstone — Maintainer & Startup | Land a real PR; the staff competency map; the startup playbook. |
⭐ = the original flagship phases that set the template. Every phase now has fully
written labs — 60+ in total, each with an in-depth guide-style README, and (for the
CPU labs) a tested starter.py / solution.py / test_lab.py triplet. Run the whole
suite with pytest -m "not gpu" from the repo root; every phase's labs/README.md
gives the recommended order and the skills each lab delivers.
Recommended path
- Do them in order, 0 → 19. Each builds on the last; mini_vllm/ grows phase by phase.
- For each phase: read 00-guide.md → read 01-deep-dive.md with upstream/ open in a second window → do 02-mini-build.md → run the labs → attempt EXERCISES.md → self-test with INTERVIEW.md.
- Run the tests constantly: pytest -m "not gpu" from the repo root.
- Keep a lab notebook. When you finish, your notebook + mini_vllm/ + a merged upstream PR is your portfolio.
Start here: SETUP.md, then phase-00-foundations/00-guide.md.
See also: GLOSSARY.md (every term defined once) and CAREER.md (the maintainer path, the staff competency map, the startup playbook).
This repo also builds as a website (mdBook → Cloudflare Pages): see PUBLISHING.md.