Phase 14 — Cheatsheet: Model Architectures (Adding a Model)

  • Model recipe: tensor-parallel layers + vLLM Attention -> register the architecture -> remap checkpoint names in load_weights -> test against the Hugging Face reference.
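  • Fused weights (qkv_proj, gate_up_proj) are the usual load_weights gotcha: checkpoints typically store q/k/v and gate/up as separate tensors, so each one must be routed into its slice of the fused parameter.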
  • Interfaces/mixins declare support for LoRA, pipeline parallelism (PP), multimodal inputs, and pooling.
  • Families: decoder-only, MoE, hybrid/SSM (Mamba), embedding/reward.
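To make the fused-weight gotcha concrete, here is a self-contained toy of the name remapping. It is loosely modeled on the stacked-parameter table that decoder-only models such as llama.py use in load_weights, but plain Python lists stand in for tensors. `FusedParam`, `STACKED_PARAMS`, and the parameter names are hypothetical, and a real vLLM weight loader also slices each shard for tensor parallelism, which this sketch ignores.

```python
from typing import Any, Dict, Iterable, List, Tuple, Union

Rows = List[List[float]]  # toy "tensor": a list of rows
ShardId = Union[str, int]

# (fused param name, checkpoint name, shard id): a hypothetical table.
STACKED_PARAMS = [
    (".qkv_proj", ".q_proj", "q"),
    (".qkv_proj", ".k_proj", "k"),
    (".qkv_proj", ".v_proj", "v"),
    (".gate_up_proj", ".gate_proj", 0),
    (".gate_up_proj", ".up_proj", 1),
]


class FusedParam:
    """Toy fused parameter: one row buffer split into named row ranges."""

    def __init__(self, shard_sizes: Dict[ShardId, int], width: int) -> None:
        self.offsets: Dict[ShardId, Tuple[int, int]] = {}
        start = 0
        for shard_id, size in shard_sizes.items():
            self.offsets[shard_id] = (start, size)
            start += size
        self.data: Rows = [[0.0] * width for _ in range(start)]

    def weight_loader(self, loaded: Rows, shard_id: ShardId) -> None:
        # Copy one checkpoint tensor into its slice of the fused buffer.
        start, size = self.offsets[shard_id]
        if len(loaded) != size:
            raise ValueError(f"shard {shard_id!r}: expected {size} rows, got {len(loaded)}")
        self.data[start:start + size] = [list(row) for row in loaded]


def load_weights(params: Dict[str, Any], weights: Iterable[Tuple[str, Rows]]) -> None:
    for name, loaded in weights:
        for fused_name, ckpt_name, shard_id in STACKED_PARAMS:
            if ckpt_name not in name:
                continue
            # e.g. "...self_attn.k_proj.weight" -> "...self_attn.qkv_proj.weight"
            params[name.replace(ckpt_name, fused_name)].weight_loader(loaded, shard_id)
            break
        else:
            # Unfused parameter (o_proj here): copy rows in place; unknown names raise KeyError.
            params[name][:] = [list(row) for row in loaded]


params = {
    "layers.0.self_attn.qkv_proj.weight": FusedParam({"q": 4, "k": 2, "v": 2}, width=4),
    "layers.0.self_attn.o_proj.weight": [[0.0] * 4 for _ in range(4)],
    "layers.0.mlp.gate_up_proj.weight": FusedParam({0: 3, 1: 3}, width=4),
}
checkpoint = [
    ("layers.0.self_attn.q_proj.weight", [[1.0] * 4 for _ in range(4)]),
    ("layers.0.self_attn.k_proj.weight", [[2.0] * 4 for _ in range(2)]),
    ("layers.0.self_attn.v_proj.weight", [[3.0] * 4 for _ in range(2)]),
    ("layers.0.self_attn.o_proj.weight", [[4.0] * 4 for _ in range(4)]),
    ("layers.0.mlp.up_proj.weight", [[6.0] * 4 for _ in range(3)]),
    ("layers.0.mlp.gate_proj.weight", [[5.0] * 4 for _ in range(3)]),
]
load_weights(params, checkpoint)
print([row[0] for row in params["layers.0.self_attn.qkv_proj.weight"].data])
# [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
print([row[0] for row in params["layers.0.mlp.gate_up_proj.weight"].data])
# [5.0, 5.0, 5.0, 6.0, 6.0, 6.0]
```

Because routing is by name rather than by checkpoint order, the gate and up shards land in the right slices even though up_proj arrives first; the smaller k/v shards mimic grouped-query attention.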
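The "register" step boils down to a lookup from the names in a checkpoint's config.json `architectures` field to a model class. The sketch below is a hypothetical, minimal registry, not the code in registry.py: it accepts either a class or a lazy "module:ClassName" string and resolves the first supported name. `register_model`, `resolve_model_cls`, and the Toy* names are illustrative assumptions; for out-of-tree models, vLLM itself exposes `ModelRegistry.register_model`, which also accepts a "module:ClassName" string.

```python
import importlib
from typing import Dict, List, Union

# Hypothetical mini registry: architecture name -> class or "module:ClassName".
_REGISTRY: Dict[str, Union[type, str]] = {}


def register_model(arch: str, target: Union[type, str]) -> None:
    if arch in _REGISTRY:
        raise ValueError(f"{arch} is already registered")
    _REGISTRY[arch] = target


def resolve_model_cls(architectures: List[str]) -> type:
    """Return the class for the first registered name in `architectures`."""
    for arch in architectures:
        target = _REGISTRY.get(arch)
        if target is None:
            continue
        if isinstance(target, str):
            # Lazy entry: import the module only when this architecture is used.
            module_name, _, class_name = target.partition(":")
            target = getattr(importlib.import_module(module_name), class_name)
            _REGISTRY[arch] = target  # cache the resolved class
        return target
    raise ValueError(f"none of {architectures} are registered; known: {sorted(_REGISTRY)}")


class ToyLlamaForCausalLM:  # hypothetical model class
    pass


register_model("ToyLlamaForCausalLM", ToyLlamaForCausalLM)
# Stand-in lazy entry pointing at a stdlib class, only to exercise the import path.
register_model("ToyOrderedForCausalLM", "collections:OrderedDict")

print(resolve_model_cls(["UnknownArch", "ToyLlamaForCausalLM"]).__name__)  # ToyLlamaForCausalLM
print(resolve_model_cls(["ToyOrderedForCausalLM"]).__name__)  # OrderedDict
```

Keeping entries as strings means that importing the registry does not import every model module; the import cost is paid only for the architecture actually being loaded.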
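Capability declaration can be pictured as marker mixins that carry class-level flags, which the engine reads before enabling a feature. The classes below are hypothetical stand-ins that only mimic the general shape of the interfaces in interfaces.py (the real ones declare more than a flag). `capabilities`, `check_lora`, and the Toy* models are illustrative helpers, and pooling support would follow the same pattern.

```python
from typing import ClassVar, List


class SupportsLoRA:
    """Hypothetical marker mixin: the model can serve LoRA adapters."""
    supports_lora: ClassVar[bool] = True


class SupportsPP:
    """Hypothetical marker mixin: layers can be split across pipeline stages."""
    supports_pp: ClassVar[bool] = True


class SupportsMultiModal:
    """Hypothetical marker mixin: the model accepts non-text inputs."""
    supports_multimodal: ClassVar[bool] = True


class ToyDecoderForCausalLM(SupportsLoRA, SupportsPP):
    pass


class ToyEmbeddingModel:  # declares nothing, so every flag reads as False
    pass


FLAGS = ("supports_lora", "supports_pp", "supports_multimodal")


def capabilities(model_cls: type) -> List[str]:
    # A missing flag means "not supported"; a subclass may also opt out with False.
    return [flag for flag in FLAGS if getattr(model_cls, flag, False)]


def check_lora(model_cls: type, enable_lora: bool) -> None:
    # Fail early instead of discovering the gap halfway through model setup.
    if enable_lora and not getattr(model_cls, "supports_lora", False):
        raise ValueError(f"{model_cls.__name__} does not support LoRA")


print(capabilities(ToyDecoderForCausalLM))  # ['supports_lora', 'supports_pp']
print(capabilities(ToyEmbeddingModel))  # []
```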

Key upstream files

  • vllm/model_executor/models/llama.py
  • vllm/model_executor/models/registry.py
  • vllm/model_executor/model_loader/
  • vllm/model_executor/models/mamba.py
  • vllm/model_executor/models/interfaces.py
  • tests/models/

Full reference: 00-guide.md · 01-deep-dive.md