Lab 14-01 — Add an Architecture Without Touching the Engine [CPU-OK]

vLLM serves hundreds of architectures — Llama, Mixtral, DeepSeek, Mamba hybrids, embedding models — through one engine, and the trick is a discipline, not a miracle: models implement a narrow contract, and the engine calls nothing else. This lab makes you live that discipline in miniature. mini_vllm's contract is one method — forward(last_tokens, positions) → logits — and you'll implement a genuinely new architecture against it (a bigram model: logits from a per-token table, positions ignored), swap it into a running engine, and prove every engine feature works unchanged. The capstone test is a tripwire proxy that fails on any attribute access beyond forward — and the engine passes a full generate() through it, proving the contract is exactly one method, not asserting it.

Why this lab exists

"Add support for model X" is the single most common vLLM contribution — the on-ramp through which most maintainers arrived — and the task is approachable precisely because of the contract this lab teaches. A model integrator never touches the scheduler, the KV manager, or the sampler; they write a model class that honors the interface and a weight loader that fills it (labs 02/03). Knowing where the boundary sits — what you must provide, what you may ignore, what you must never reach around — is the difference between a weekend PR and a month of confusion.

The lab's sneaky-deep test is test_engine_invariants_hold_for_the_new_model: Phase 3's chunked-equals-unchunked is an engine property, and it must hold for any contract-honoring model. Run it against your new architecture and you're doing what vLLM's CI does across its whole model zoo — verifying that engine invariants and model implementations are independent axes. When an invariant breaks only for one model, the leak is in whoever crossed the boundary, and this test design localizes the suspect instantly.
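
The "independent axes" idea can be sketched without mini_vllm at all. Everything in the block below is a hypothetical stand-in: `TableModel`, `PositionalModel`, `toy_generate`, and `check_chunking_invariant` are illustrative names, and the toy loop only imitates the catch-up rule (no sample until the prompt is fully computed). It is not the lab's engine or its test. The point is the shape of the check: one invariant, run unchanged against every contract-honoring model.

```python
import random


class TableModel:
    """Bigram-style stand-in: ignores positions."""

    def __init__(self, vocab_size, seed):
        rng = random.Random(seed)
        self.table = [[rng.random() for _ in range(vocab_size)]
                      for _ in range(vocab_size)]

    def forward(self, last_tokens, positions):
        return [list(self.table[t]) for t in last_tokens]


class PositionalModel:
    """Stand-in that does use positions, so both kinds of model are covered."""

    def __init__(self, vocab_size, seed):
        self.vocab_size, self.seed = vocab_size, seed

    def forward(self, last_tokens, positions):
        rows = []
        for t, p in zip(last_tokens, positions):
            rng = random.Random(self.seed * 1_000_003 + t * 1_009 + p)
            rows.append([rng.random() for _ in range(self.vocab_size)])
        return rows


def toy_generate(model, prompt, max_new_tokens, token_budget):
    """Toy greedy loop (assumed design): prefill at most token_budget tokens per step."""
    if token_budget < 1 or not prompt:
        raise ValueError("need a non-empty prompt and token_budget >= 1")
    tokens, num_computed, generated = list(prompt), 0, []
    while len(generated) < max_new_tokens:
        num_computed += min(token_budget, len(tokens) - num_computed)
        if num_computed < len(tokens):
            continue  # still catching up on the prompt: no sample this step
        logits = model.forward([tokens[-1]], [len(tokens) - 1])[0]
        next_token = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_token)
        generated.append(next_token)
    return generated


def check_chunking_invariant(model, prompt, max_new_tokens=6):
    """The engine property: chunked prefill must not change the output."""
    unchunked = toy_generate(model, prompt, max_new_tokens, len(prompt))
    for budget in (1, 2, 3):
        chunked = toy_generate(model, prompt, max_new_tokens, budget)
        assert chunked == unchunked, (
            f"{type(model).__name__} broke at budget={budget}")
    return unchunked


# Same invariant, every model: a failure names the model that crossed the line.
for model in (TableModel(16, seed=7), PositionalModel(16, seed=7)):
    check_chunking_invariant(model, prompt=[3, 1, 4, 1, 5])
```

Because the check never looks inside the model, a failure for exactly one model points at that model (or at engine code that special-cases it), which is the localization property described above.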

Background: the narrow waist

The contract's anatomy, and why each piece is what it is:

  • forward(last_tokens, positions) → (batch, vocab) logits — the engine guarantees row i of the output corresponds to entry i of the inputs (Phase 1 lab-03's positional contract), and that only requests passing needs_sample appear (the catch-up rule). The model guarantees deterministic logits given its inputs. Neither knows anything else about the other.
  • Positions are offered, not mandated — your BigramModel ignores them entirely and the engine cannot tell. That's the proof that the contract over-supplies on purpose: it carries what the most demanding architecture needs (positional information for RoPE-style models), and simpler models discard the surplus. Real vLLM's contract is wider for the same reason (KV caches, attention metadata, intermediate states for EAGLE — Phase 8), and most models use a subset.
  • The registry is the production version of your install_model: config's architectures field → ModelRegistry lookup → class constructed with the vLLM config. Swapping a model is data, not code — which is also how out-of-tree models plug in (Phase 17's plugin machinery registers into the same table).
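
To make the first two bullets concrete, here is a minimal sketch of a bigram model written against the one-method contract. It is an illustration under stated assumptions, not the lab's reference solution: the constructor signature (`vocab_size`, `seed`) and the use of plain Python lists instead of tensors are assumptions made so the block runs with the standard library alone.

```python
import random


class BigramModel:
    """Sketch of a bigram 'brain': logits depend only on the last token."""

    def __init__(self, vocab_size: int, seed: int = 0) -> None:
        # A seeded RNG keeps logits a pure function of (seed, inputs).
        rng = random.Random(seed)
        self.vocab_size = vocab_size
        # table[t] is the logit row emitted whenever the last token is t.
        self.table = [
            [rng.gauss(0.0, 1.0) for _ in range(vocab_size)]
            for _ in range(vocab_size)
        ]

    def forward(self, last_tokens, positions):
        # The contract supplies positions; a bigram model has no use for them.
        if len(last_tokens) != len(positions):
            raise ValueError("last_tokens and positions must have equal length")
        # Row i of the output corresponds to entry i of the inputs.
        return [list(self.table[token]) for token in last_tokens]


model = BigramModel(vocab_size=8, seed=7)
logits = model.forward([1, 3, 1], [0, 5, 9])
print(len(logits), len(logits[0]))   # 3 8
print(logits[0] == logits[2])        # True: same last token, different positions
```

Rows 0 and 2 are identical even though their positions differ, which is exactly the "offered, not mandated" behavior: the caller cannot tell that the surplus input was discarded.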

Files

  • starter.py: BigramModel (the new brain) and install_model (the swap). Your work.
  • solution.py: reference.
  • test_lab.py: serving works, the brain is genuinely different, determinism, the engine-invariant check, and the contract tripwire.

Run

LAB_IMPL=starter pytest phase-14-model-architectures/labs/lab-01-add-a-toy-architecture -q
pytest phase-14-model-architectures/labs/lab-01-add-a-toy-architecture -q   # reference

What the tests prove

| Test | What it pins |
| --- | --- |
| test_new_architecture_serves_through_the_unchanged_engine | The integration: batching, scheduling, sampling, stopping; all engine code, all untouched, all working |
| test_it_really_is_a_different_model | The outputs differ from ToyModel's: you changed the brain, not the plumbing (a lab that accidentally re-implemented the old model would pass everything else) |
| test_determinism_across_engine_instances | The new architecture honors the course's testability convention: logits as a pure function of (seed, inputs) |
| test_engine_invariants_hold_for_the_new_model | Chunked ≡ unchunked for the new model: engine invariants are model-independent, verified rather than hoped |
| test_engine_touches_only_the_contract | The tripwire proxy: a full generate() with every attribute except forward booby-trapped. The contract's width, measured: one method |
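
One way a tripwire proxy like the one in the last test could be built is sketched below. `ContractTripwire` and `TinyModel` are hypothetical names for illustration; the lab's actual proxy in test_lab.py may differ in both name and mechanism.

```python
class ContractTripwire:
    """Proxy that forwards only the allowed names and fails on anything else."""

    def __init__(self, model, allowed=("forward",)):
        # Store state through object.__setattr__ so the guard below
        # never has to expose these names.
        object.__setattr__(self, "_model", model)
        object.__setattr__(self, "_allowed", frozenset(allowed))

    def __getattribute__(self, name):
        if name in object.__getattribute__(self, "_allowed"):
            return getattr(object.__getattribute__(self, "_model"), name)
        # AssertionError rather than AttributeError: hasattr() and
        # getattr(obj, name, default) would swallow AttributeError silently.
        raise AssertionError(f"reached around the contract: .{name}")


class TinyModel:
    """Stand-in model with one attribute outside the contract."""

    vocab_size = 4

    def forward(self, last_tokens, positions):
        return [[0.0] * self.vocab_size for _ in last_tokens]


proxy = ContractTripwire(TinyModel())
proxy.forward([1, 2], [0, 0])      # allowed: delegates to the wrapped model
try:
    proxy.vocab_size               # not part of the contract
except AssertionError as exc:
    print(exc)                     # reached around the contract: .vocab_size
```

A caveat on this design: Python looks up implicitly invoked special methods (for example via `len(proxy)` or `repr(proxy)`) on the type, bypassing `__getattribute__`, so a proxy of this shape catches explicit attribute access only.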

Hitchhiker's notes

  • The real contract, for comparison: a vLLM model implements forward(input_ids, positions, …) → hidden_states, compute_logits, and load_weights, composed from the layer library — VocabParallelEmbedding, QKVParallelLinear, RowParallelLinear (Phase 10 lab-01's classes!), Attention (which hides the entire Phase 2/4 machinery behind one call), RMSNorm. Building from these gives you TP, quantization, LoRA, and paged attention for free — the layer library is where the engine's features and the model's architecture meet, and using bare nn.Linear instead is the classic new-contributor mistake (works single-GPU, breaks under TP, bypasses quantization).
  • The tripwire-proxy test pattern generalizes: any time a design claims "X only uses interface Y," wrap Y's provider in a proxy that fails on everything else and run the full workload. Interfaces rot by accretion — someone reaches around for "just one attribute" — and a tripwire in CI is the only durable fence. (Compare Phase 9 lab-04's broken-control pattern: both are executable architecture documentation.)
  • Why a bigram model, of all things? Because ignoring positions is the point: the most instructive new architecture is one that uses less than the contract offers, proving the contract doesn't secretly require everything it carries. Hybrid/SSM models (Mamba-style) are the production version of this lesson — they need different state than KV caches, and vLLM's contract grew (state managers, hybrid allocators) precisely where their needs exceeded it.
  • mini_vllm's engine constructs its own model (no registry), and install_model papers over that with assignment. The gap is deliberate lab surface: notice how a registry (construct-from-config) beats post-hoc swapping the moment configs, checkpoints, and TP enter. The real registry lives in upstream/vllm/model_executor/models/registry.py.
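
The contrast in the last note, post-hoc assignment versus construct-from-config, can be sketched side by side. This is a toy under stated assumptions: the `Engine` stub, the `engine.model` attribute name, and the `seed` keyword are all hypothetical, and the two model classes are placeholders rather than the lab's ToyModel and BigramModel.

```python
class ToyModel:
    """Placeholder for the existing model."""

    def __init__(self, seed=0):
        self.seed = seed

    def forward(self, last_tokens, positions):
        return [[1.0, 0.0] for _ in last_tokens]


class BigramModel:
    """Placeholder for the new architecture."""

    def __init__(self, seed=0):
        self.seed = seed

    def forward(self, last_tokens, positions):
        return [[0.0, 1.0] for _ in last_tokens]


class Engine:
    """Stub engine that just holds a model (attribute name assumed)."""

    def __init__(self, model):
        self.model = model


def install_model(engine, model):
    """Post-hoc swap: replace the model on an engine that already exists."""
    engine.model = model
    return engine


# Registry route: the architecture is chosen by data, before construction.
registry = {"bigram": BigramModel, "toy": ToyModel}


def engine_from_config(config):
    arch = config["architecture"]
    if arch not in registry:
        raise ValueError(
            f"unknown architecture {arch!r}; known: {sorted(registry)}")
    return Engine(registry[arch](seed=config.get("seed", 0)))


engine = engine_from_config({"architecture": "bigram", "seed": 7})
print(type(engine.model).__name__)  # BigramModel
```

With the registry, nothing is ever built for the wrong architecture and an unknown name fails before any engine exists. The assignment route constructs a model first and discards it, which is harmless here and costly once a checkpoint has been loaded.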

Going further

  • Add a second architecture — a RepeaterModel that strongly biases toward the last token (logits = one-hot-ish on last_token) — and watch greedy decoding produce aaaa...: a two-line model that generates the repetition pathology Phase 9 lab-01's penalties exist to fight. Then apply the penalty and watch it break the loop. Three labs, one demo.
  • Build a registry = {"bigram": BigramModel, "toy": ToyModel} and an engine_from_config({"architecture": "bigram", "seed": 7}) constructor: the real registration pattern in about 10 lines, and now your lab-02/03 weight knowledge has a place to plug in.
  • Write the negative test: a model whose forward returns the wrong batch size, and assert the engine fails loudly rather than mis-assigning tokens (it fails in the sampler's indexing — would you ship a clearer assert upstream?).
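
For the negative test in the last item, a clearer failure could come from a shape check at the contract boundary instead of deep inside the sampler. The sketch below is one possible answer, not mini_vllm code: `checked_forward` and `WrongBatchModel` are hypothetical names, and where such a check would sit in the real engine is left open.

```python
def checked_forward(model, last_tokens, positions):
    """Call the model, then verify the row-order contract's batch size."""
    logits = model.forward(last_tokens, positions)
    if len(logits) != len(last_tokens):
        raise ValueError(
            f"{type(model).__name__}.forward returned {len(logits)} logit "
            f"rows for {len(last_tokens)} requests")
    return logits


class WrongBatchModel:
    """Deliberately broken model: silently drops the last row."""

    def forward(self, last_tokens, positions):
        return [[0.0, 1.0] for _ in last_tokens[:-1]]


try:
    checked_forward(WrongBatchModel(), [4, 2, 9], [0, 1, 2])
except ValueError as exc:
    print(exc)  # WrongBatchModel.forward returned 2 logit rows for 3 requests
```

The error names the offending model and both counts, so the failure is attributed at the boundary where the contract was broken, not wherever the missing row is first indexed.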

References

  • upstream/vllm/model_executor/models/registry.py — the architecture → class table.
  • upstream/vllm/model_executor/models/llama.py — the canonical model implementation; read it as "the contract, honored" (and lab-02/03's subject).
  • vLLM docs, Adding a New Model — the official integrator's guide this lab is the warm-up for: https://docs.vllm.ai/en/latest/contributing/model/
  • Phase 1 lab-03 — the row-order contract this lab's forward inherits; Phase 10 lab-01 — the layer library that makes real models TP-able by construction.