Phase 14 Labs — Model Architectures (Adding a Model)
Three labs on the most common vLLM contribution: adding a model. The arc: honor the
contract — implement a new architecture and prove the engine never looks past one
method (lab-01), translate the checkpoint — HF names to fused vLLM params, with
the GQA slice arithmetic and the fusion-legality proof (lab-03), then trace the
real thing — five tensors through llama.py's load_weights, shapes reconciled
live (lab-02).
Recommended order: 01 → 03 → 02. CPU labs follow the standard contract —
starter.py (your work), solution.py (reference), test_lab.py (the spec); default
runs the solution, LAB_IMPL=starter grades yours.
# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-14-model-architectures/labs -m "not gpu"
# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-14-model-architectures/labs/lab-01-add-a-toy-architecture -q
Contents
- lab-01-add-a-toy-architecture [CPU-OK]
- lab-02-trace-weight-loading [GPU-OPT]
- lab-03-weight-mapping [CPU-OK]
- What you can do after this phase
Labs
lab-01-add-a-toy-architecture [CPU-OK]
Implement a genuinely new architecture (a bigram model that ignores positions)
against mini_vllm's one-method contract and serve it through the unchanged engine —
with the capstone tripwire test: a proxy that fails on any attribute access beyond
forward survives a full generate(), measuring the contract's width. Plus the
deep one: Phase 3's chunked-prefill invariant verified for the new model — engine
invariants are model-independent. Skills: the narrow waist; over-supplying
contracts; tripwire proxies as executable architecture docs; the layer library as
where features meet models.
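A minimal sketch of the tripwire idea, under stated assumptions: `ToyBigram`, `TripwireProxy`, `ContractViolation`, `greedy_generate`, and the `forward(token_ids, positions)` signature are all hypothetical stand-ins, not mini_vllm's actual API. The proxy forwards `forward` and raises on any other attribute access, so a generate loop that completes has shown it never looked past that one method.

```python
class ContractViolation(Exception):
    """Raised when a caller touches anything other than `forward`."""


class ToyBigram:
    """Hypothetical toy model: next token depends only on the current token."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size

    def forward(self, token_ids, positions):
        # Positions are accepted but deliberately ignored (bigram model).
        # Returns one row of logits per input token; the "learned" rule here
        # is simply next = (token + 1) % vocab_size.
        return [
            [1.0 if v == (t + 1) % self.vocab_size else 0.0
             for v in range(self.vocab_size)]
            for t in token_ids
        ]


class TripwireProxy:
    """Wraps a model and fails loudly on any attribute access except `forward`."""

    def __init__(self, model):
        # Bypass our own __getattribute__ guard when storing the wrapped model.
        object.__setattr__(self, "_model", model)

    def __getattribute__(self, name):
        if name == "forward":
            return object.__getattribute__(self, "_model").forward
        # A custom exception (not AttributeError) so hasattr() cannot hide it.
        raise ContractViolation(f"engine looked past the contract: {name!r}")


def greedy_generate(model, prompt, max_new_tokens):
    """Stand-in for an engine loop: only ever calls model.forward."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = model.forward(tokens, list(range(len(tokens))))
        last = logits[-1]
        tokens.append(max(range(len(last)), key=last.__getitem__))
    return tokens
```

Wrapping the model as `TripwireProxy(ToyBigram(5))` and running `greedy_generate` on it completes normally, while any engine code that reads, say, `model.vocab_size` would raise `ContractViolation` at the exact line that widened the contract.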
lab-02-trace-weight-loading [GPU-OPT]
Five tensors traced through the real load_weights: safetensors name → mapping row →
fused parameter → slice, with live shape verification ((6144, 4096) qkv = 32 q-heads
+ 2×8 kv-heads, halving under TP=2) and checkpoint forensics from shapes alone.
Captured mapping table included. Skills: reading load_weights as a peer;
--load-format dummy; diagnosing loads-but-garbage; shapes as architecture fingerprints.
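The shape arithmetic above can be checked by hand. This sketch assumes a head dimension of 128, which is inferred from 6144 / (32 + 2×8) rather than stated in the lab, and it only covers the case where both head counts divide evenly by the TP size. The helper names `fused_qkv_shape` and `infer_kv_heads` are illustrative, not vLLM functions.

```python
def fused_qkv_shape(hidden_size, num_q_heads, num_kv_heads, head_dim, tp_size=1):
    """Per-rank shape of a fused qkv weight: rows are q, then k, then v."""
    # Assumption: heads split evenly across ranks (no kv-head replication).
    assert num_q_heads % tp_size == 0, "q heads must divide by tp_size"
    assert num_kv_heads % tp_size == 0, "kv heads must divide by tp_size"
    q_rows = (num_q_heads // tp_size) * head_dim
    kv_rows = (num_kv_heads // tp_size) * head_dim
    return (q_rows + 2 * kv_rows, hidden_size)


def infer_kv_heads(k_proj_rows, head_dim):
    """Checkpoint forensics: recover the kv-head count from a k_proj shape."""
    assert k_proj_rows % head_dim == 0, "k_proj rows are not a multiple of head_dim"
    return k_proj_rows // head_dim


if __name__ == "__main__":
    print(fused_qkv_shape(4096, 32, 8, 128))             # (6144, 4096)
    print(fused_qkv_shape(4096, 32, 8, 128, tp_size=2))  # (3072, 4096)
    print(infer_kv_heads(1024, 128))                     # 8
```

Only the output dimension is halved under TP=2; the input dimension stays at the hidden size because the fused projection is split along its rows.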
lab-03-weight-mapping [CPU-OK]
The translation implemented: q_proj/k_proj/v_proj → qkv_proj name rewriting with
shard tags, GQA-aware slice arithmetic (k/v narrower than q — the off-by-one
habitat), the loud shape-assert that catches MHA-checkpoint-meets-GQA-config at load
time, and the legality theorem as a test: fused output slices ≡ separate projections
to 1e-12. Skills: stacked_params_mapping; fusion is layout, not math; load-time
asserts beat serve-time hallucinations.
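The three pieces of this lab fit in a short pure-Python sketch: name rewriting with shard tags, GQA-aware slice offsets with a loud shape assert, and the legality check. The mapping rows are modeled on the (fused name, checkpoint name, shard tag) convention; every function name here is hypothetical, weights are plain lists of rows, and the tiny dimensions in `check_fusion_legality` are chosen only to keep the example fast.

```python
import random

# (fused param name, checkpoint name, shard tag); only the qkv rows are shown.
STACKED_PARAMS_MAPPING = [
    ("qkv_proj", "q_proj", "q"),
    ("qkv_proj", "k_proj", "k"),
    ("qkv_proj", "v_proj", "v"),
]


def rewrite_name(hf_name):
    """Map a checkpoint tensor name to (fused name, shard tag)."""
    for fused, separate, shard_id in STACKED_PARAMS_MAPPING:
        if separate in hf_name:
            return hf_name.replace(separate, fused), shard_id
    return hf_name, None  # not a stacked parameter


def shard_slice(shard_id, num_q_heads, num_kv_heads, head_dim):
    """Return (start_row, num_rows) of a shard inside the fused weight."""
    q_rows = num_q_heads * head_dim
    kv_rows = num_kv_heads * head_dim  # k and v are narrower than q under GQA
    return {
        "q": (0, q_rows),
        "k": (q_rows, kv_rows),
        "v": (q_rows + kv_rows, kv_rows),
    }[shard_id]


def load_shard(fused, weight, shard_id, num_q_heads, num_kv_heads, head_dim):
    """Copy one checkpoint tensor into its slice, asserting the shape first."""
    start, size = shard_slice(shard_id, num_q_heads, num_kv_heads, head_dim)
    assert len(weight) == size, (
        f"shard {shard_id!r}: checkpoint has {len(weight)} rows, config expects {size}"
    )
    fused[start:start + size] = weight


def matvec(w, x):
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def check_fusion_legality(seed=0):
    """Max |fused slice - separate projection| over q, k, v for a random input."""
    rng = random.Random(seed)
    q_heads, kv_heads, head_dim, hidden = 4, 2, 2, 3
    rand = lambda rows: [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(rows)]
    parts = {"q": rand(q_heads * head_dim), "k": rand(kv_heads * head_dim),
             "v": rand(kv_heads * head_dim)}
    fused = [None] * ((q_heads + 2 * kv_heads) * head_dim)
    for shard_id, w in parts.items():
        load_shard(fused, w, shard_id, q_heads, kv_heads, head_dim)
    x = [rng.uniform(-1, 1) for _ in range(hidden)]
    y = matvec(fused, x)
    worst = 0.0
    for shard_id, w in parts.items():
        start, size = shard_slice(shard_id, q_heads, kv_heads, head_dim)
        for got, want in zip(y[start:start + size], matvec(w, x)):
            worst = max(worst, abs(got - want))
    return worst
```

Because fusion only stacks rows, each fused output row is computed from the same numbers as its separate counterpart, which is why the difference sits comfortably inside 1e-12. Feeding an MHA-sized `k_proj` into a GQA config trips the assert in `load_shard` at load time rather than producing garbage at serve time.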
What you can do after this phase
Walk the full integrator's path: implement a model against the contract using the
layer library (getting TP/quant/LoRA for free), write its mapping table, load a real
checkpoint, and verify with the discipline these labs drilled (touched-exactly-once,
loud shape asserts, invariant tests against the new model). Read any file in
vllm/model_executor/models/ as a variation on machinery you've built — and
recognize "KeyError loading model X" issues as missing mapping rows you can fix.
That's the on-ramp to Phase 19's real upstream PR.