Phase 14 — Model Architectures (Adding a Model)


Don't Panic

vLLM supports 200+ architectures because adding one is a well-trodden recipe: write an nn.Module that uses vLLM's parallel layers and attention, map the checkpoint weights onto it, register it, done. This phase teaches that recipe — the single most valuable maintainer skill — across decoder-only, MoE, hybrid/SSM, and pooling models.
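The recipe in the paragraph above can be sketched in plain Python. This is a toy under stated assumptions, not vLLM code: `TOY_REGISTRY`, `register`, `resolve`, and `ToyDecoder` are hypothetical names, the config is a bare dict, and the parallel layers and attention are replaced by trivial arithmetic so that only the shape of the contract remains visible.

```python
# Toy sketch of the "write, map, register" recipe.
# Every name here is a hypothetical stand-in, not vLLM's API.
from typing import Callable, Dict, Iterable, List, Set, Tuple

TOY_REGISTRY: Dict[str, type] = {}


def register(arch_name: str) -> Callable[[type], type]:
    """Class decorator: map an architecture name to its class."""
    def wrap(cls: type) -> type:
        TOY_REGISTRY[arch_name] = cls
        return cls
    return wrap


def resolve(arch_name: str) -> type:
    """Resolve a name (for example, from a checkpoint config) to a class."""
    try:
        return TOY_REGISTRY[arch_name]
    except KeyError:
        raise ValueError(f"unsupported architecture: {arch_name!r}") from None


@register("ToyDecoderForCausalLM")
class ToyDecoder:
    def __init__(self, vllm_config: dict) -> None:
        self.hidden_size = vllm_config["hidden_size"]
        self.embed: Dict[int, List[float]] = {}

    def load_weights(self, weights: Iterable[Tuple[str, List[float]]]) -> Set[str]:
        """Consume (name, tensor) pairs; fail loudly on anything unexpected."""
        loaded: Set[str] = set()
        for name, tensor in weights:
            prefix, _, token_id = name.rpartition(".")
            if prefix != "embed":
                raise KeyError(f"unexpected weight name: {name}")
            assert len(tensor) == self.hidden_size, f"bad shape for {name}"
            self.embed[int(token_id)] = list(tensor)
            loaded.add(name)
        return loaded

    def forward(self, input_ids: List[int], positions: List[int]) -> List[List[float]]:
        # One hidden vector per token; the position is added as a toy signal.
        return [[w + p for w in self.embed[t]] for t, p in zip(input_ids, positions)]


model = resolve("ToyDecoderForCausalLM")({"hidden_size": 2})
model.load_weights([("embed.0", [0.0, 1.0]), ("embed.1", [2.0, 3.0])])
hidden = model.forward(input_ids=[1, 0], positions=[0, 1])
print(hidden)  # [[2.0, 3.0], [1.0, 2.0]]
```

The caller only ever touches the registry lookup, the constructor, `load_weights`, and `forward`, which is why a new architecture can be added without changing the engine.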

Why this phase matters

'Add support for model X' is the most common high-value vLLM contribution. Doing it well — correct weight mapping, TP-sharded layers, the right attention, tests — is exactly what earns maintainer trust.

What you'll learn

  • The model contract: __init__(vllm_config), forward(input_ids, positions, ...) -> hidden states
  • vLLM building blocks: VocabParallelEmbedding, {Column,Row}ParallelLinear, Attention, RMSNorm
  • Weight loading: load_weights + the name-remapping from HF checkpoints
  • The model registry and how a name resolves to a class
  • Families: decoder-only (Llama), MoE (Mixtral), hybrid/SSM (Mamba/Jamba), pooling/reward
  • get_input_embeddings, tie_word_embeddings, LoRA/quant compatibility hooks
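The name-remapping item in the list above is mostly a small lookup table. The sketch below shows the idea for a Llama-style checkpoint in which separate q/k/v and gate/up projections are loaded into fused parameters. The table rows and the `remap` helper are assumptions for illustration; check the actual rows against the pinned upstream file before relying on them.

```python
# Sketch of checkpoint-name remapping for fused parameters.
# The table contents and remap() are illustrative assumptions.
from typing import List, Optional, Tuple, Union

ShardId = Union[str, int]

# (fragment in the fused param name, fragment in the checkpoint name, shard id)
STACKED_PARAMS_MAPPING: List[Tuple[str, str, ShardId]] = [
    (".qkv_proj", ".q_proj", "q"),
    (".qkv_proj", ".k_proj", "k"),
    (".qkv_proj", ".v_proj", "v"),
    (".gate_up_proj", ".gate_proj", 0),
    (".gate_up_proj", ".up_proj", 1),
]


def remap(ckpt_name: str) -> Tuple[str, Optional[ShardId]]:
    """Return (model param name, shard id); shard id is None if not fused."""
    for fused, separate, shard_id in STACKED_PARAMS_MAPPING:
        if separate in ckpt_name:
            return ckpt_name.replace(separate, fused), shard_id
    return ckpt_name, None


for name in (
    "model.layers.0.self_attn.k_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.norm.weight",
):
    print(name, "->", remap(name))
```

The shard id tells the loader which slice of the fused parameter a checkpoint tensor belongs in; names that match no row pass through unchanged and load one-to-one.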

The map: where this lives in the real code

Open these in upstream/ (pinned to v0.22.1 @ 0decac0, see UPSTREAM_PIN.md). The deep-dive (01-deep-dive.md) walks through the important ones line by line.

Labs in this phase

  • lab-01-add-a-toy-architecture [CPU-OK] — implement a new architecture against the mini_vllm model contract, serve it through the unchanged engine, and prove with a tripwire proxy that the contract is exactly one method.
  • lab-02-trace-weight-loading [GPU-OPT] — trace 5 tensors through llama.py's load_weights: name → mapping row → fused param → slice, with live shape verification. Captured mapping included.
  • lab-03-weight-mapping [CPU-OK] — implement the translation: q/k/v→qkv_proj renaming, GQA-aware slices, the loud shape-assert, and the fusion-legality theorem as a 1e-12 test.
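The ideas behind lab-03 can be previewed in a few lines of dependency-free Python. This is a minimal sketch, not the lab's solution: `qkv_slices`, `fuse`, and `matvec` are hypothetical helpers, weights are stored as rows of output features (the `[out_features, in_features]` layout), and tensor-parallel sharding is left out. It checks that one fused projection gives the same result as three separate ones.

```python
# Sketch: GQA-aware slices into a fused qkv weight, a loud shape check,
# and a numerical check that fusion does not change the result.
# Helper names are hypothetical; tensor parallelism is omitted.
import random
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def qkv_slices(num_heads: int, num_kv_heads: int, head_dim: int) -> Dict[str, Tuple[int, int]]:
    """Row ranges of q, k, v inside the fused weight (k/v are smaller under GQA)."""
    q_size = num_heads * head_dim
    kv_size = num_kv_heads * head_dim
    return {
        "q": (0, q_size),
        "k": (q_size, q_size + kv_size),
        "v": (q_size + kv_size, q_size + 2 * kv_size),
    }


def fuse(shards: Dict[str, Matrix], slices: Dict[str, Tuple[int, int]], hidden: int) -> Matrix:
    """Copy each shard into its slice, refusing any shape mismatch."""
    fused: Matrix = [[] for _ in range(slices["v"][1])]
    for shard_id, (start, end) in slices.items():
        w = shards[shard_id]
        if len(w) != end - start or any(len(row) != hidden for row in w):
            raise ValueError(f"shape mismatch for shard {shard_id!r}")
        fused[start:end] = [list(row) for row in w]
    return fused


def matvec(w: Matrix, x: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, x)) for row in w]


rng = random.Random(0)
hidden_size, num_heads, num_kv_heads, head_dim = 8, 4, 2, 2
slices = qkv_slices(num_heads, num_kv_heads, head_dim)
shards = {
    sid: [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(end - start)]
    for sid, (start, end) in slices.items()
}
x = [rng.uniform(-1, 1) for _ in range(hidden_size)]

fused = fuse(shards, slices, hidden_size)
fused_out = matvec(fused, x)
separate_out = matvec(shards["q"], x) + matvec(shards["k"], x) + matvec(shards["v"], x)
max_err = max(abs(a - b) for a, b in zip(fused_out, separate_out))
assert max_err < 1e-12
print(slices)
```

Fusion is legal because each output row of a linear layer depends only on its own weight row, so stacking rows and slicing the output afterwards reproduces the separate projections.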

See labs/README.md for the recommended order (01 → 03 → 02) and how to run them.

How to work this phase

  1. Read this guide for intuition.
  2. Read 01-deep-dive.md with the upstream/ files open.
  3. Do 02-mini-build.md — build the mini_vllm piece yourself.
  4. Run the labs, then attempt EXERCISES.md.
  5. Self-test with INTERVIEW.md; keep CHEATSHEET.md handy.

Where you are

This is one of the scaffolded phases: the guide, anchors, labs, exercises, and interview prompts are real and ready to study. The fully worked, line-by-line treatment (with starter, solution, and test code in every lab) follows the gold standard set by the flagship phases: Phase 02 · PagedAttention and Phase 03 · Continuous Batching. Use those two as the template for the depth to bring here.
