Phase 14 — Model Architectures (Adding a Model)
← Phase 13 · Course home · Phase 15 →
Contents
- Don't Panic
- Why this phase matters
- What you'll learn
- The map: where this lives in the real code
- Labs in this phase
- How to work this phase
- Where you are
Don't Panic
vLLM supports 200+ architectures because adding one is a well-trodden recipe: write an nn.Module that uses vLLM's parallel layers and attention, map the checkpoint weights onto it, register it, done. This phase teaches that recipe — the single most valuable maintainer skill — across decoder-only, MoE, hybrid/SSM, and pooling models.
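The recipe above can be sketched in plain Python. This is a toy under stated assumptions, not vLLM's code: `MODEL_REGISTRY`, `register_model`, `resolve_architecture`, and `ToyForCausalLM` are hypothetical names, tensors are nested lists, and the "model" is only an embedding lookup. It shows the three moving parts: a class that satisfies the contract, a weight loader, and a name-to-class registry.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# Hypothetical stand-in for an architecture registry: name -> model class.
MODEL_REGISTRY: Dict[str, type] = {}


def register_model(arch: str) -> Callable[[type], type]:
    """Class decorator that records a model class under an architecture name."""
    def wrap(cls: type) -> type:
        MODEL_REGISTRY[arch] = cls
        return cls
    return wrap


@register_model("ToyForCausalLM")
class ToyForCausalLM:
    """Toy model: the whole 'network' is a single embedding table."""

    def __init__(self, config: Dict[str, int]) -> None:
        self.vocab_size = config["vocab_size"]
        self.hidden_size = config["hidden_size"]
        # Zero-initialised; real values arrive through load_weights.
        self.embed = [[0.0] * self.hidden_size for _ in range(self.vocab_size)]

    def forward(self, input_ids: List[int], positions: List[int]) -> List[List[float]]:
        # positions is accepted to mirror the contract; this toy ignores it.
        assert len(input_ids) == len(positions)
        return [list(self.embed[t]) for t in input_ids]

    def load_weights(self, weights: Iterable[Tuple[str, List[List[float]]]]) -> Set[str]:
        """Copy matching checkpoint tensors in; return the names that were used."""
        loaded: Set[str] = set()
        for name, tensor in weights:
            if name != "embed_tokens.weight":
                continue  # unknown tensors are skipped in this toy
            ok = len(tensor) == self.vocab_size and all(
                len(row) == self.hidden_size for row in tensor
            )
            assert ok, f"shape mismatch for {name}"
            self.embed = [list(row) for row in tensor]
            loaded.add(name)
        return loaded


def resolve_architecture(architectures: List[str]) -> type:
    """Return the first registered class among the candidate names."""
    for arch in architectures:
        if arch in MODEL_REGISTRY:
            return MODEL_REGISTRY[arch]
    raise ValueError(f"no registered model for {architectures}")
```

Returning the set of loaded names from `load_weights` makes it cheap for a caller to detect checkpoint tensors that were silently ignored.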
Why this phase matters
'Add support for model X' is the most common high-value vLLM contribution. Doing it well — correct weight mapping, TP-sharded layers, the right attention, tests — is exactly what earns maintainer trust.
What you'll learn
- The model contract: __init__(vllm_config), forward(input_ids, positions, ...) -> hidden states
- vLLM building blocks: VocabParallelEmbedding, {Column,Row}ParallelLinear, Attention, RMSNorm
- Weight loading: load_weights + the name-remapping from HF checkpoints
- The model registry and how a name resolves to a class
- Families: decoder-only (Llama), MoE (Mixtral), hybrid/SSM (Mamba/Jamba), pooling/reward
- get_input_embeddings, tie_word_embeddings, LoRA/quant compatibility hooks
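To make the weight-loading and tie_word_embeddings items concrete, here is a minimal sketch of name remapping from per-projection checkpoint names to fused parameters. The mapping table is modeled on the stacked-parameter pattern used by Llama-style loaders; treat its exact rows as an assumption and check them against llama.py at the pinned commit. `remap_name` and `should_skip` are hypothetical helpers written for this example.

```python
from typing import List, Optional, Tuple, Union

ShardId = Union[str, int]

# Assumed table: (fused parameter fragment, checkpoint fragment, shard id).
STACKED_PARAMS_MAPPING: List[Tuple[str, str, ShardId]] = [
    (".qkv_proj", ".q_proj", "q"),
    (".qkv_proj", ".k_proj", "k"),
    (".qkv_proj", ".v_proj", "v"),
    (".gate_up_proj", ".gate_proj", 0),
    (".gate_up_proj", ".up_proj", 1),
]


def remap_name(ckpt_name: str) -> Tuple[str, Optional[ShardId]]:
    """Map a checkpoint tensor name to (model parameter name, shard id).

    A shard id of None means the tensor is loaded as-is, with no fusion.
    """
    for fused, source, shard_id in STACKED_PARAMS_MAPPING:
        if source in ckpt_name:
            return ckpt_name.replace(source, fused), shard_id
    return ckpt_name, None


def should_skip(ckpt_name: str, tie_word_embeddings: bool) -> bool:
    """With tied embeddings the output head reuses the input embedding matrix,
    so a separate lm_head tensor in the checkpoint is redundant."""
    return tie_word_embeddings and ckpt_name.startswith("lm_head.")
```

For example, `remap_name("model.layers.0.self_attn.k_proj.weight")` yields the fused name `model.layers.0.self_attn.qkv_proj.weight` with shard id `"k"`, which tells the loader which slice of the fused parameter to fill.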
The map: where this lives in the real code
Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md) walks through the important ones line by line.
- vllm/model_executor/models/llama.py: The reference decoder-only implementation.
- vllm/model_executor/models/registry.py: The architecture registry.
- vllm/model_executor/model_loader/: Weight loading + checkpoint format handling.
- vllm/model_executor/models/mamba.py: A state-space (non-attention) model.
- vllm/model_executor/models/interfaces.py: Mixins: SupportsLoRA, SupportsPP, SupportsMultiModal, ...
- tests/models/: How model correctness is tested upstream (logit/greedy equality).
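As a rough illustration of what "logit/greedy equality" can mean, the sketch below compares a reference run with a candidate run in two ways: the argmax token at each step must agree, and the raw logits must agree within a tolerance. The helper names and the default tolerance are assumptions for this example; the actual upstream test utilities in tests/models/ may be structured differently.

```python
from typing import List, Sequence

Logits = Sequence[Sequence[float]]  # one row of vocabulary logits per decode step


def greedy_tokens(logits: Logits) -> List[int]:
    """Token id with the highest logit at each step (greedy decoding)."""
    return [max(range(len(row)), key=lambda i: row[i]) for row in logits]


def max_abs_diff(a: Logits, b: Logits) -> float:
    """Largest element-wise absolute difference between two logit grids."""
    assert len(a) == len(b), "different number of decode steps"
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def outputs_match(reference: Logits, candidate: Logits, atol: float = 1e-3) -> bool:
    """True when greedy tokens are identical and logits agree within atol."""
    same_tokens = greedy_tokens(reference) == greedy_tokens(candidate)
    return same_tokens and max_abs_diff(reference, candidate) <= atol
```

Checking tokens and logits separately is a deliberate choice in this sketch: token equality catches behavioral divergence, while the logit tolerance catches numerical drift that has not yet flipped an argmax.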
Labs in this phase
- lab-01-add-a-toy-architecture [CPU-OK]: implement a new architecture against the mini_vllm model contract, serve it through the unchanged engine, and prove with a tripwire proxy that the contract is exactly one method.
- lab-02-trace-weight-loading [GPU-OPT]: trace 5 tensors through llama.py's load_weights: name → mapping row → fused param → slice, with live shape verification. Captured mapping included.
- lab-03-weight-mapping [CPU-OK]: implement the translation: q/k/v→qkv_proj renaming, GQA-aware slices, the loud shape-assert, and the fusion-legality theorem as a 1e-12 test.
See labs/README.md for the recommended order (01 → 03 → 02) and how to run them.
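The ideas behind lab-03 (GQA-aware slices, a loud shape assert, and fusion legality) can be previewed with a small pure-Python sketch. It assumes weights are stored as rows of shape [out_features][in_features], the layout used by PyTorch's nn.Linear, and that the fused parameter stacks q, then k, then v along the output dimension. The function names are hypothetical and this is not the lab solution.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """y = W x for a weight stored as [out_features][in_features]."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def qkv_slices(num_heads: int, num_kv_heads: int, head_dim: int) -> Dict[str, Tuple[int, int]]:
    """Row ranges of q, k, v inside the fused weight.

    Under grouped-query attention, k and v have fewer heads than q,
    so their slices are narrower than the q slice.
    """
    q_rows = num_heads * head_dim
    kv_rows = num_kv_heads * head_dim
    return {
        "q": (0, q_rows),
        "k": (q_rows, q_rows + kv_rows),
        "v": (q_rows + kv_rows, q_rows + 2 * kv_rows),
    }


def fuse_qkv(wq: Matrix, wk: Matrix, wv: Matrix,
             num_heads: int, num_kv_heads: int, head_dim: int) -> Matrix:
    """Stack q, k, v along the output dimension, failing loudly on bad shapes."""
    slices = qkv_slices(num_heads, num_kv_heads, head_dim)
    for name, w in (("q", wq), ("k", wk), ("v", wv)):
        start, end = slices[name]
        assert len(w) == end - start, (
            f"{name}_proj has {len(w)} rows, expected {end - start}"
        )
    return wq + wk + wv  # list concatenation == concatenation along rows
```

Fusion is legal because each output row of a linear layer depends only on its own weight row: `matvec(fuse_qkv(...), x)` sliced at the `q`, `k`, and `v` ranges reproduces `matvec(wq, x)`, `matvec(wk, x)`, and `matvec(wv, x)`.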
How to work this phase
- Read this guide for intuition.
- Read 01-deep-dive.md with the upstream/ files open.
- Do 02-mini-build.md: build the mini_vllm piece yourself.
- Run the labs, then attempt EXERCISES.md.
- Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.
Where you are
This is one of the scaffolded phases: the guide, anchors, labs, exercises, and interview prompts are real and ready to study. The fully worked, line-by-line treatment (with starter/solution/test code in every lab) follows the gold standard set by the flagship phases, Phase 02 · PagedAttention and Phase 03 · Continuous Batching. Use those two as the template for the depth to bring here.