Lab 14-01 — Add an Architecture Without Touching the Engine [CPU-OK]
vLLM serves hundreds of architectures — Llama, Mixtral, DeepSeek, Mamba hybrids,
embedding models — through one engine, and the trick is a discipline, not a
miracle: models implement a narrow contract, and the engine calls nothing else.
This lab makes you live that discipline in miniature. mini_vllm's contract is one
method — forward(last_tokens, positions) → logits — and you'll implement a genuinely
new architecture against it (a bigram model: logits from a per-token table, positions
ignored), swap it into a running engine, and prove every engine feature works
unchanged. The capstone test is a tripwire proxy that fails on any attribute access
beyond forward — and the engine passes a full generate() through it, proving the
contract is exactly one method, not asserting it.
Contents
- Why this lab exists
- Background: the narrow waist
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
"Add support for model X" is the single most common vLLM contribution — the on-ramp through which most maintainers arrived — and the task is approachable precisely because of the contract this lab teaches. A model integrator never touches the scheduler, the KV manager, or the sampler; they write a model class that honors the interface and a weight loader that fills it (labs 02/03). Knowing where the boundary sits — what you must provide, what you may ignore, what you must never reach around — is the difference between a weekend PR and a month of confusion.
The lab's sneaky-deep test is test_engine_invariants_hold_for_the_new_model:
Phase 3's chunked-equals-unchunked is an engine property, and it must hold for any
contract-honoring model. Run it against your new architecture and you're doing what
vLLM's CI does across its whole model zoo — verifying that engine invariants and
model implementations are independent axes. When an invariant breaks only for one
model, the leak is in whoever crossed the boundary, and this test design localizes
the suspect instantly.
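The shape of that check can be sketched without the real engine. What follows is a minimal sketch under stated assumptions: `SuccessorModel`, `TableModel`, `generate`, and `check_chunking_invariant` are hypothetical names rather than mini_vllm's API, and the toy loop only mimics the catch-up rule (no sample until the prompt is fully prefilled).

```python
import random
from typing import List, Sequence


class SuccessorModel:
    """Hypothetical model: logits favor (last_token + 1) % vocab_size."""

    def __init__(self, vocab_size: int = 6) -> None:
        self.vocab_size = vocab_size

    def forward(self, last_tokens: Sequence[int], positions: Sequence[int]) -> List[List[float]]:
        return [
            [1.0 if v == (t + 1) % self.vocab_size else 0.0 for v in range(self.vocab_size)]
            for t in last_tokens
        ]


class TableModel:
    """Hypothetical model: logits come from a seeded per-token table."""

    def __init__(self, vocab_size: int = 6, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.table = [[rng.random() for _ in range(vocab_size)] for _ in range(vocab_size)]

    def forward(self, last_tokens: Sequence[int], positions: Sequence[int]) -> List[List[float]]:
        return [list(self.table[t]) for t in last_tokens]


def generate(model, prompt: Sequence[int], max_new_tokens: int, chunk_size: int) -> List[int]:
    """Toy engine loop (an assumption, not mini_vllm's): chunked prefill, then greedy decode."""
    tokens = list(prompt)
    processed = 0  # how many tokens have been scheduled so far
    out: List[int] = []
    while len(out) < max_new_tokens:
        processed = min(len(tokens), processed + chunk_size)
        if processed < len(tokens):
            continue  # still catching up on the prompt: nothing to sample this step
        row = model.forward([tokens[-1]], [len(tokens) - 1])[0]
        next_token = max(range(len(row)), key=row.__getitem__)  # greedy argmax
        tokens.append(next_token)
        out.append(next_token)
    return out


def check_chunking_invariant(model) -> None:
    """Engine property: output must not depend on how the prompt was chunked."""
    prompt = [1, 2, 3, 4, 5]
    unchunked = generate(model, prompt, max_new_tokens=8, chunk_size=len(prompt))
    for chunk_size in (1, 2, 3):
        assert generate(model, prompt, 8, chunk_size) == unchunked


# One invariant, run across the whole (tiny) model zoo.
for zoo_model in (SuccessorModel(), TableModel(seed=7)):
    check_chunking_invariant(zoo_model)
```

The check takes the model as a parameter and knows nothing about it beyond `forward`, which is what lets a failure for a single model point at that model rather than at the engine.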
Background: the narrow waist
The contract's anatomy, and why each piece is what it is:
- `forward(last_tokens, positions) → (batch, vocab) logits`: the engine guarantees row i of the output corresponds to entry i of the inputs (Phase 1 lab-03's positional contract), and that only requests passing `needs_sample` appear (the catch-up rule). The model guarantees deterministic logits given its inputs. Neither knows anything else about the other.
- Positions are offered, not mandated: your `BigramModel` ignores them entirely and the engine cannot tell. That's the proof that the contract over-supplies on purpose: it carries what the most demanding architecture needs (positional information for RoPE-style models), and simpler models discard the surplus. Real vLLM's contract is wider for the same reason (KV caches, attention metadata, intermediate states for EAGLE; see Phase 8), and most models use a subset.
- The registry is the production version of your `install_model`: config's `architectures` field → `ModelRegistry` lookup → class constructed with the vLLM config. Swapping a model is data, not code, which is also how out-of-tree models plug in (Phase 17's plugin machinery registers into the same table).
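To make the first two points concrete, here is a minimal sketch of a bigram model that honors the one-method contract. It is not the lab's reference solution: the constructor signature, the use of plain Python lists instead of tensors, and the seeded random table are all assumptions made for illustration.

```python
import random
from typing import List, Sequence


class BigramModel:
    """Illustrative stand-in: logits depend only on the last token."""

    def __init__(self, vocab_size: int, seed: int = 0) -> None:
        # Seeded RNG so logits are a pure function of (seed, inputs).
        rng = random.Random(seed)
        # table[t] holds the logits for "what follows token t".
        self.table: List[List[float]] = [
            [rng.uniform(-1.0, 1.0) for _ in range(vocab_size)]
            for _ in range(vocab_size)
        ]

    def forward(self, last_tokens: Sequence[int], positions: Sequence[int]) -> List[List[float]]:
        # positions is accepted to honor the contract, then ignored.
        # Row i of the result corresponds to entry i of last_tokens.
        return [list(self.table[t]) for t in last_tokens]


model = BigramModel(vocab_size=8, seed=7)
logits_a = model.forward([3, 5, 3], positions=[0, 1, 2])
logits_b = model.forward([3, 5, 3], positions=[40, 41, 42])
assert logits_a == logits_b  # the caller cannot tell that positions were discarded
```

Rows 0 and 2 are identical because both come from token 3, which is the row-order guarantee and the "positions are surplus" point in one assertion.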
Files
- `starter.py`: `BigramModel` (the new brain) and `install_model` (the swap). Your work.
- `solution.py`: reference.
- `test_lab.py`: serving works, the brain is genuinely different, determinism, the engine-invariant check, and the contract tripwire.
Run
LAB_IMPL=starter pytest phase-14-model-architectures/labs/lab-01-add-a-toy-architecture -q
pytest phase-14-model-architectures/labs/lab-01-add-a-toy-architecture -q # reference
What the tests prove
| Test | What it pins |
|---|---|
| `test_new_architecture_serves_through_the_unchanged_engine` | The integration: batching, scheduling, sampling, stopping. All engine code, all untouched, all working |
| `test_it_really_is_a_different_model` | The outputs differ from `ToyModel`'s: you changed the brain, not the plumbing (a lab that accidentally re-implemented the old model would pass everything else) |
| `test_determinism_across_engine_instances` | The new architecture honors the course's testability convention: logits as a pure function of (seed, inputs) |
| `test_engine_invariants_hold_for_the_new_model` | Chunked ≡ unchunked for the new model: engine invariants are model-independent, verified rather than hoped |
| `test_engine_touches_only_the_contract` | The tripwire proxy: a full `generate()` with every attribute except `forward` booby-trapped. The contract's width, measured: one method |
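The tripwire in the last row can be sketched in a few lines. This is an illustration of the pattern, not the code in `test_lab.py`: `TripwireProxy`, `CounterModel`, and `greedy_generate` are hypothetical names, and the toy generate loop stands in for the real engine.

```python
from typing import List, Sequence


class TripwireProxy:
    """Forwards `forward` to the wrapped model and fails on any other attribute."""

    def __init__(self, model) -> None:
        object.__setattr__(self, "_model", model)

    def __getattribute__(self, name: str):
        if name == "forward":
            return object.__getattribute__(self, "_model").forward
        raise AssertionError(f"caller reached around the contract: {name!r}")


class CounterModel:
    """Hypothetical model with an extra attribute the engine must never read."""

    vocab_size = 5

    def forward(self, last_tokens: Sequence[int], positions: Sequence[int]) -> List[List[float]]:
        return [
            [1.0 if v == (t + 1) % self.vocab_size else 0.0 for v in range(self.vocab_size)]
            for t in last_tokens
        ]


def greedy_generate(model, prompt: Sequence[int], max_new_tokens: int) -> List[int]:
    """Toy stand-in for the engine: it calls model.forward and nothing else."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        row = model.forward([tokens[-1]], [len(tokens) - 1])[0]
        tokens.append(max(range(len(row)), key=row.__getitem__))
    return tokens[len(prompt):]


plain = CounterModel()
proxied = TripwireProxy(plain)

# A full generate through the proxy succeeds and matches the unwrapped model.
assert greedy_generate(proxied, [2], 4) == greedy_generate(plain, [2], 4)
```

Any code path that reads `proxied.vocab_size` (or any attribute other than `forward`) raises `AssertionError`, so a passing run measures the contract's width instead of asserting it.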
Hitchhiker's notes
- The real contract, for comparison: a vLLM model implements `forward(input_ids, positions, …) → hidden_states`, `compute_logits`, and `load_weights`, composed from the layer library: `VocabParallelEmbedding`, `QKVParallelLinear`, `RowParallelLinear` (Phase 10 lab-01's classes!), `Attention` (which hides the entire Phase 2/4 machinery behind one call), `RMSNorm`. Building from these gives you TP, quantization, LoRA, and paged attention for free. The layer library is where the engine's features and the model's architecture meet, and using bare `nn.Linear` instead is the classic new-contributor mistake (works single-GPU, breaks under TP, bypasses quantization).
- The tripwire-proxy test pattern generalizes: any time a design claims "X only uses interface Y," wrap Y's provider in a proxy that fails on everything else and run the full workload. Interfaces rot by accretion (someone reaches around for "just one attribute"), and a tripwire in CI is the only durable fence. (Compare Phase 9 lab-04's broken-control pattern: both are executable architecture documentation.)
- Why a bigram model, of all things? Because ignoring `positions` is the point: the most instructive new architecture is one that uses less than the contract offers, proving the contract doesn't secretly require everything it carries. Hybrid/SSM models (Mamba-style) are the production version of this lesson: they need different state than KV caches, and vLLM's contract grew (state managers, hybrid allocators) precisely where their needs exceeded it.
- mini_vllm's engine constructs its own model (no registry); `install_model` papers over that with assignment. The gap is deliberate lab surface: notice how a registry (construct-from-config) beats post-hoc swapping the moment configs, checkpoints, and TP enter. The real registry lives in `upstream/vllm/model_executor/models/registry.py`.
Going further
- Add a second architecture, a `RepeaterModel` that strongly biases toward the last token (logits = one-hot-ish on `last_token`), and watch greedy decoding produce `aaaa...`: a two-line model that generates the repetition pathology Phase 9 lab-01's penalties exist to fight. Then apply the penalty and watch it break the loop. Three labs, one demo.
- Build a `registry = {"bigram": BigramModel, "toy": ToyModel}` and an `engine_from_config({"architecture": "bigram", "seed": 7})` constructor: the real registration pattern, 10 lines, and now your lab-02/03 weight knowledge has a place to plug in.
- Write the negative test: a model whose `forward` returns the wrong batch size, and assert the engine fails loudly rather than mis-assigning tokens (it fails in the sampler's indexing; would you ship a clearer assert upstream?).
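As a starting point for the registry exercise, one possible shape is sketched below. Everything beyond the `registry` dict and the `engine_from_config` call quoted above is an assumption: the `Engine` dataclass is a hypothetical placeholder that only holds a model, and both model classes are stand-ins rather than the course's real `BigramModel` and `ToyModel`.

```python
import random
from dataclasses import dataclass
from typing import Any, Dict, List, Sequence


class BigramModel:
    """Stand-in: logits from a seeded per-token table, positions ignored."""

    def __init__(self, seed: int, vocab_size: int = 8) -> None:
        rng = random.Random(seed)
        self.table = [[rng.uniform(-1.0, 1.0) for _ in range(vocab_size)] for _ in range(vocab_size)]

    def forward(self, last_tokens: Sequence[int], positions: Sequence[int]) -> List[List[float]]:
        return [list(self.table[t]) for t in last_tokens]


class ToyModel:
    """Stand-in for the course's ToyModel (not reproduced here); it does use positions."""

    def __init__(self, seed: int, vocab_size: int = 8) -> None:
        self.seed = seed
        self.vocab_size = vocab_size

    def forward(self, last_tokens: Sequence[int], positions: Sequence[int]) -> List[List[float]]:
        return [
            [float((self.seed + t * 31 + p * 17 + v * 7) % 13) for v in range(self.vocab_size)]
            for t, p in zip(last_tokens, positions)
        ]


# Architecture name -> class: swapping a model becomes data, not code.
registry = {"bigram": BigramModel, "toy": ToyModel}


@dataclass
class Engine:
    """Hypothetical minimal engine: it only holds the model it was built with."""

    model: Any


def engine_from_config(config: Dict[str, Any]) -> Engine:
    name = config["architecture"]
    if name not in registry:
        raise ValueError(f"unknown architecture {name!r}; known: {sorted(registry)}")
    return Engine(model=registry[name](seed=config["seed"]))


engine = engine_from_config({"architecture": "bigram", "seed": 7})
```

Raising on an unknown architecture name, instead of falling back to a default, keeps a typo in the config from silently serving the wrong model.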
References
- `upstream/vllm/model_executor/models/registry.py`: the architecture → class table.
- `upstream/vllm/model_executor/models/llama.py`: the canonical model implementation; read it as "the contract, honored" (and lab-02/03's subject).
- vLLM docs, Adding a New Model: the official integrator's guide this lab is the warm-up for. https://docs.vllm.ai/en/latest/contributing/model/
- Phase 1 lab-03 (the row-order contract this lab's `forward` inherits); Phase 10 lab-01 (the layer library that makes real models TP-able by construction).