Phase 14 — Model Architectures (Adding a Model)


Don't Panic
Why this phase matters
What you'll learn
The map: where this lives in the real code
Labs in this phase
How to work this phase
Where you are

Don't Panic

vLLM supports 200+ architectures because adding one is a well-trodden recipe: write an nn.Module that uses vLLM's parallel layers and attention, map the checkpoint weights onto it, register it, done. This phase teaches that recipe — the single most valuable maintainer skill — across decoder-only, MoE, hybrid/SSM, and pooling models.

Why this phase matters

'Add support for model X' is the most common high-value vLLM contribution. Doing it well — correct weight mapping, TP-sharded layers, the right attention, tests — is exactly what earns maintainer trust.

What you'll learn

The model contract: __init__(vllm_config), forward(input_ids, positions, ...) -> hidden states
vLLM building blocks: VocabParallelEmbedding, {Column,Row}ParallelLinear, Attention, RMSNorm
Weight loading: load_weights + the name-remapping from HF checkpoints
The model registry and how a name resolves to a class
Families: decoder-only (Llama), MoE (Mixtral), hybrid/SSM (Mamba/Jamba), pooling/reward
get_input_embeddings, tie_word_embeddings, LoRA/quant compatibility hooks
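To make the contract and registry items above concrete, here is a minimal pure-Python sketch. It is a toy, not vLLM code: ToyConfig, register_model, resolve_model_cls, and ToyForCausalLM are hypothetical names invented for illustration, and the real classes take a vllm_config object and operate on tensors. The only idea it is meant to show is that an architecture name found in the checkpoint config selects a class, and that the engine then needs only the constructor and forward.

```python
from dataclasses import dataclass


@dataclass
class ToyConfig:
    # Hypothetical stand-in for the config object handed to a model.
    architectures: list[str]
    hidden_size: int = 4
    vocab_size: int = 8


# Toy registry: architecture name -> model class (illustration only).
_REGISTRY: dict[str, type] = {}


def register_model(arch: str):
    """Class decorator that records a model class under an architecture name."""
    def wrap(cls):
        _REGISTRY[arch] = cls
        return cls
    return wrap


def resolve_model_cls(config: ToyConfig) -> type:
    """Return the first registered class named in config.architectures."""
    for arch in config.architectures:
        if arch in _REGISTRY:
            return _REGISTRY[arch]
    raise ValueError(f"unsupported architectures: {config.architectures}")


@register_model("ToyForCausalLM")
class ToyForCausalLM:
    def __init__(self, config: ToyConfig):
        self.config = config
        # Deterministic fake embedding table: row t is [t, t, ..., t].
        self.embed = [
            [float(t)] * config.hidden_size for t in range(config.vocab_size)
        ]

    def forward(self, input_ids: list[int], positions: list[int]) -> list[list[float]]:
        # One hidden vector per token: the embedding plus the position index.
        return [
            [x + p for x in self.embed[t]]
            for t, p in zip(input_ids, positions)
        ]


config = ToyConfig(architectures=["ToyForCausalLM"])
model = resolve_model_cls(config)(config)
hidden = model.forward(input_ids=[3, 5], positions=[0, 1])
print(hidden)
```

The caller never imports ToyForCausalLM by name; it only knows the string in the config. That indirection is what lets an engine stay unchanged while new architectures are added.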

The map: where this lives in the real code

Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md) walks through the important ones line by line.

vllm/model_executor/models/llama.py — The reference decoder-only implementation.
vllm/model_executor/models/registry.py — The architecture registry.
vllm/model_executor/model_loader/ — Weight loading + checkpoint format handling.
vllm/model_executor/models/mamba.py — A state-space (non-attention) model.
vllm/model_executor/models/interfaces.py — Mixins: SupportsLoRA, SupportsPP, SupportsMultiModal, ...
tests/models/ — How model correctness is tested upstream (logit/greedy equality).
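As a rough illustration of what "logit/greedy equality" can mean, the sketch below decodes greedily from two toy logit functions and checks that the chosen tokens match even though the raw logits differ slightly. It does not reproduce upstream's test helpers: greedy_decode, reference_logits, candidate_logits, and the 1e-3 tolerance are all assumptions made for this example.

```python
def greedy_decode(logits_fn, prompt, max_new_tokens):
    """Repeatedly append the argmax token; return only the generated tokens."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = logits_fn(tokens)
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens[len(prompt):]


def reference_logits(tokens):
    # Hypothetical reference model over a 5-token vocabulary:
    # it always prefers (last token + 1) mod 5.
    last = tokens[-1]
    return [1.0 if i == (last + 1) % 5 else 0.0 for i in range(5)]


def candidate_logits(tokens):
    # Hypothetical new implementation: same ranking, small numeric noise.
    return [v + 1e-4 * i for i, v in enumerate(reference_logits(tokens))]


prompt = [0]
ref_tokens = greedy_decode(reference_logits, prompt, 4)
new_tokens = greedy_decode(candidate_logits, prompt, 4)

# Greedy equality: the generated token ids must match exactly.
assert ref_tokens == new_tokens

# Logit closeness: values may differ, but only within a tolerance.
max_diff = max(
    abs(a - b)
    for a, b in zip(reference_logits(prompt), candidate_logits(prompt))
)
assert max_diff < 1e-3
```

Comparing token ids is the stricter end-to-end check, while comparing logits within a tolerance helps localize where two implementations begin to drift.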

Labs in this phase

lab-01-add-a-toy-architecture [CPU-OK] — implement a new architecture against the mini_vllm model contract, serve it through the unchanged engine, and prove with a tripwire proxy that the contract is exactly one method.
lab-02-trace-weight-loading [GPU-OPT] — trace 5 tensors through llama.py's load_weights: name → mapping row → fused param → slice, with live shape verification. Captured mapping included.
lab-03-weight-mapping [CPU-OK] — implement the translation: q/k/v→qkv_proj renaming, GQA-aware slices, the loud shape-assert, and the fusion-legality theorem as a 1e-12 test.
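The translation that lab-02 traces and lab-03 implements can be previewed with a small pure-Python sketch. It assumes a single rank (no tensor-parallel sharding), uses nested lists in place of tensors, and borrows the shape of the mapping rows from the q/k/v → qkv_proj renaming described above. The helper names (remap, qkv_row_slices, load_shard, matvec) are hypothetical, not the lab's or upstream's API.

```python
# Rows: (fused parameter suffix, checkpoint suffix, shard id).
STACKED_PARAMS_MAPPING = [
    (".qkv_proj", ".q_proj", "q"),
    (".qkv_proj", ".k_proj", "k"),
    (".qkv_proj", ".v_proj", "v"),
]


def remap(name):
    """Translate a checkpoint tensor name into (fused name, shard id)."""
    for fused_suffix, src_suffix, shard_id in STACKED_PARAMS_MAPPING:
        if src_suffix in name:
            return name.replace(src_suffix, fused_suffix), shard_id
    return name, None  # Not a stacked parameter: load under its own name.


def qkv_row_slices(num_heads, num_kv_heads, head_dim):
    """Row ranges of q, k, v in the fused weight; k and v shrink under GQA."""
    q_rows = num_heads * head_dim
    kv_rows = num_kv_heads * head_dim
    return {
        "q": (0, q_rows),
        "k": (q_rows, q_rows + kv_rows),
        "v": (q_rows + kv_rows, q_rows + 2 * kv_rows),
    }


def load_shard(fused, shard, bounds):
    """Copy one checkpoint tensor into its slice; fail loudly on a bad shape."""
    start, stop = bounds
    assert len(shard) == stop - start, (
        f"expected {stop - start} rows, got {len(shard)}"
    )
    fused[start:stop] = [row[:] for row in shard]


def matvec(w, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


# Toy GQA layout: 4 query heads, 2 KV heads, head_dim 1, hidden size 3.
num_heads, num_kv_heads, head_dim = 4, 2, 1
checkpoint = {
    "model.layers.0.self_attn.q_proj.weight": [
        [1.0, 0.0, 2.0], [0.0, 1.0, 0.5], [3.0, -1.0, 0.0], [0.25, 0.5, 1.0],
    ],
    "model.layers.0.self_attn.k_proj.weight": [[1.0, 1.0, 1.0], [0.0, 2.0, -2.0]],
    "model.layers.0.self_attn.v_proj.weight": [[0.5, 0.5, 0.0], [4.0, 0.0, 1.0]],
}

slices = qkv_row_slices(num_heads, num_kv_heads, head_dim)
fused = [[0.0] * 3 for _ in range(slices["v"][1])]
renamed = set()
for name, tensor in checkpoint.items():
    fused_name, shard_id = remap(name)
    renamed.add(fused_name)
    load_shard(fused, tensor, slices[shard_id])

# Fusion legality: one fused matmul equals the three separate ones, concatenated.
x = [0.5, -1.0, 2.0]
separate = [y for w in checkpoint.values() for y in matvec(w, x)]
max_err = max(abs(a - b) for a, b in zip(matvec(fused, x), separate))
assert max_err < 1e-12
```

Fusion is legal because each output row of a linear layer depends only on its own weight row, so stacking the q, k, and v rows into one matrix changes nothing but the layout. The shape assert in load_shard is what catches a wrong head count or a mislabeled shard before it silently corrupts a slice.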

See labs/README.md for the recommended order (01 → 03 → 02) and how to run them.

How to work this phase

Read this guide for intuition.
Read 01-deep-dive.md with the upstream/ files open.
Do 02-mini-build.md — build the mini_vllm piece yourself.
Run the labs, then attempt EXERCISES.md.
Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.

Where you are

This is one of the scaffolded phases: the guide, anchors, labs, exercises, and interview prompts are real and ready to study. The fully worked, line-by-line treatment (with starter/solution/test code in every lab) follows the gold standard set by the two flagship phases, Phase 02 · PagedAttention and Phase 03 · Continuous Batching. Use those two as the template for the depth to bring here.