Phase 14 Labs — Model Architectures (Adding a Model)

Three labs on the most common vLLM contribution: adding a model. The arc: honor the contract — implement a new architecture and prove the engine never looks past one method (lab-01), translate the checkpoint — HF names to fused vLLM params, with the GQA slice arithmetic and the fusion-legality proof (lab-03), then trace the real thing — five tensors through llama.py's load_weights, shapes reconciled live (lab-02).

Recommended order: 01 → 03 → 02. CPU labs follow the standard contract — starter.py (your work), solution.py (reference), test_lab.py (the spec); default runs the solution, LAB_IMPL=starter grades yours.

# Whole phase (GPU tests auto-skip without CUDA):
pytest phase-14-model-architectures/labs -m "not gpu"

# Grade yourself on one lab:
LAB_IMPL=starter pytest phase-14-model-architectures/labs/lab-01-add-a-toy-architecture -q

Labs

lab-01-add-a-toy-architecture [CPU-OK]

Implement a genuinely new architecture (a bigram model that ignores positions) against mini_vllm's one-method contract and serve it through the unchanged engine — with the capstone tripwire test: a proxy that fails on any attribute access beyond forward survives a full generate(), measuring the contract's width. Plus the deep one: Phase 3's chunked-prefill invariant verified for the new model — engine invariants are model-independent. Skills: the narrow waist; over-supplying contracts; tripwire proxies as executable architecture docs; the layer library as where features meet models.
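The tripwire idea can be sketched in a few lines. This is a minimal illustration, not mini_vllm's actual code: `TripwireProxy`, `ContractViolation`, `BigramModel`, the `forward(token_ids, positions)` signature, and the toy `generate` loop are all hypothetical stand-ins for the lab's real engine and contract.

```python
class ContractViolation(Exception):
    """Raised when a caller touches anything beyond the allowed surface."""


class TripwireProxy:
    """Forwards only the allowed attributes of a model (hypothetical helper)."""

    def __init__(self, model, allowed=("forward",)):
        self._model = model
        self._allowed = frozenset(allowed)

    def __getattr__(self, name):
        # __getattr__ runs only when normal lookup fails, so _model and
        # _allowed never reach this point once __init__ has finished.
        if name.startswith("_"):
            raise AttributeError(name)
        if name not in self._allowed:
            # Deliberately not AttributeError: hasattr() would swallow that.
            raise ContractViolation(f"engine touched model.{name}")
        return getattr(self._model, name)


class BigramModel:
    """Toy model: the prediction at each slot depends only on that token."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size

    def forward(self, token_ids, positions):
        # positions is accepted (the contract supplies it) but ignored.
        return [(t + 1) % self.vocab_size for t in token_ids]


def generate(model, prompt, max_new_tokens):
    """Stand-in greedy loop that uses nothing but model.forward."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        positions = list(range(len(tokens)))
        tokens.append(model.forward(tokens, positions)[-1])
    return tokens


proxy = TripwireProxy(BigramModel(vocab_size=10))
result = generate(proxy, [3], max_new_tokens=3)

try:
    proxy.vocab_size
    tripped = False
except ContractViolation:
    tripped = True
```

The proxy raises a dedicated exception rather than `AttributeError` so that a defensive `hasattr` check in the caller cannot hide a contract violation. The model accepting `positions` and never reading it is the over-supplying contract in miniature.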

lab-02-trace-weight-loading [GPU-OPT]

Five tensors traced through the real load_weights: safetensors name → mapping row → fused parameter → slice, with live shape verification ((6144, 4096) qkv = 32 q-heads + 2×8 kv-heads, halving under TP=2) and checkpoint forensics from shapes alone. Captured mapping table included. Skills: reading load_weights as a peer; --load-format dummy; diagnosing loads-but-garbage; shapes as architecture fingerprints.
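The shape arithmetic behind that check can be worked through directly. The sketch below assumes a head dimension of 128 (hidden size 4096 divided by 32 query heads) and that both head counts divide evenly by the tensor-parallel size. The function names are illustrative, not vLLM's.

```python
def qkv_shape(num_q_heads, num_kv_heads, head_dim, hidden_size, tp_size=1):
    """Per-rank shape of a fused qkv weight, assuming heads split evenly."""
    assert num_q_heads % tp_size == 0, "q heads must divide by tp_size"
    assert num_kv_heads % tp_size == 0, "kv heads must divide by tp_size"
    q_rows = (num_q_heads // tp_size) * head_dim
    kv_rows = (num_kv_heads // tp_size) * head_dim
    # One q block plus one k block plus one v block, stacked along rows.
    return (q_rows + 2 * kv_rows, hidden_size)


def infer_kv_heads(qkv_rows, num_q_heads, head_dim):
    """Forensics from shape alone: recover the kv head count of a fused qkv."""
    kv_rows_total = qkv_rows - num_q_heads * head_dim
    assert kv_rows_total > 0 and kv_rows_total % (2 * head_dim) == 0
    return kv_rows_total // (2 * head_dim)


full = qkv_shape(32, 8, head_dim=128, hidden_size=4096)             # (6144, 4096)
half = qkv_shape(32, 8, head_dim=128, hidden_size=4096, tp_size=2)  # (3072, 4096)
kv_heads = infer_kv_heads(6144, num_q_heads=32, head_dim=128)       # 8
```

Only the row count halves under TP=2 in this sketch, because the fused projection is split along its output dimension. The input dimension stays at the full hidden size.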

lab-03-weight-mapping [CPU-OK]

The translation implemented: q_proj/k_proj/v_proj → qkv_proj name rewriting with shard tags, GQA-aware slice arithmetic (k/v narrower than q — the off-by-one habitat), the loud shape-assert that catches MHA-checkpoint-meets-GQA-config at load time, and the legality theorem as a test: fused output slices ≡ separate projections to 1e-12. Skills: stacked_params_mapping; fusion is layout, not math; load-time asserts beat serve-time hallucinations.
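A small pure-Python sketch of the slice arithmetic, the shape assert, and the legality check. The toy dimensions, the `SHARDS` table, and `load_shard` are assumptions made for illustration. The lab's own helpers will differ, and real weights are tensors rather than nested lists.

```python
import random

HIDDEN, HEAD_DIM, N_Q, N_KV = 16, 4, 4, 2          # toy GQA config
Q_ROWS, KV_ROWS = N_Q * HEAD_DIM, N_KV * HEAD_DIM  # 16 and 8 rows

# shard tag -> (row offset, row count) inside the fused qkv weight
SHARDS = {
    "q": (0, Q_ROWS),
    "k": (Q_ROWS, KV_ROWS),
    "v": (Q_ROWS + KV_ROWS, KV_ROWS),
}


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(weight, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def load_shard(fused, shard_id, weight):
    """Copy one checkpoint projection into its slice, failing loudly on mismatch."""
    offset, size = SHARDS[shard_id]
    assert len(weight) == size, (
        f"shard {shard_id!r}: checkpoint has {len(weight)} rows, "
        f"config expects {size}"
    )
    fused[offset:offset + size] = weight


rng = random.Random(0)
separate = {
    "q": rand_matrix(Q_ROWS, HIDDEN, rng),
    "k": rand_matrix(KV_ROWS, HIDDEN, rng),
    "v": rand_matrix(KV_ROWS, HIDDEN, rng),
}

fused = [None] * (Q_ROWS + 2 * KV_ROWS)
for tag, weight in separate.items():
    load_shard(fused, tag, weight)

# Legality: each slice of the fused output equals the separate projection.
x = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)]
fused_out = matvec(fused, x)
max_diff = 0.0
for tag, weight in separate.items():
    offset, size = SHARDS[tag]
    expected = matvec(weight, x)
    got = fused_out[offset:offset + size]
    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(got, expected)))

# An MHA-shaped k (as many rows as q) must be rejected at load time.
try:
    load_shard(fused, "k", rand_matrix(Q_ROWS, HIDDEN, rng))
    rejected = False
except AssertionError:
    rejected = True
```

Each row of the fused weight is a row of exactly one separate projection, so the fused product repeats the same dot products in a different layout. That is why the difference stays within tolerance, and it is the sense in which fusion is layout rather than math.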

What you can do after this phase

Walk the full integrator's path: implement a model against the contract using the layer library (getting TP/quant/LoRA for free), write its mapping table, load a real checkpoint, and verify with the discipline these labs drilled (touched-exactly-once, loud shape asserts, invariant tests against the new model). Read any file in vllm/model_executor/models/ as a variation on machinery you've built — and recognize "KeyError loading model X" issues as missing mapping rows you can fix. That's the on-ramp to Phase 19's real upstream PR.
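The touched-exactly-once discipline, and the way a missing mapping row surfaces as a KeyError, can be sketched as follows. The mapping rows follow the (fused fragment, checkpoint fragment, shard tag) pattern described above. The parameter names, the slot representation, and `load_names` are hypothetical.

```python
from collections import Counter

# (fused name fragment, checkpoint name fragment, shard tag)
MAPPING = [
    (".qkv_proj", ".q_proj", "q"),
    (".qkv_proj", ".k_proj", "k"),
    (".qkv_proj", ".v_proj", "v"),
]


def rewrite(name, mapping):
    """Translate a checkpoint name into (model parameter name, shard tag)."""
    for fused, src, shard_id in mapping:
        if src in name:
            return name.replace(src, fused), shard_id
    return name, None  # unfused parameter: name passes through unchanged


def load_names(slots, checkpoint_names, mapping):
    """Route every checkpoint name to a slot and check the accounting."""
    touched = Counter()
    for name in checkpoint_names:
        slot = rewrite(name, mapping)
        if slot not in slots:
            # The model has no parameter by this name: a mapping row is missing.
            raise KeyError(slot[0])
        touched[slot] += 1
    wrong = {slot: touched[slot] for slot in slots if touched[slot] != 1}
    assert not wrong, f"slots not touched exactly once: {wrong}"
    return touched


SLOTS = {
    ("attn.qkv_proj.weight", "q"),
    ("attn.qkv_proj.weight", "k"),
    ("attn.qkv_proj.weight", "v"),
    ("norm.weight", None),
}
CHECKPOINT = [
    "attn.q_proj.weight",
    "attn.k_proj.weight",
    "attn.v_proj.weight",
    "norm.weight",
]

touched = load_names(SLOTS, CHECKPOINT, MAPPING)

# Drop the v row: the loader now looks for a parameter the model never defined.
missing = None
try:
    load_names(SLOTS, CHECKPOINT, MAPPING[:2])
except KeyError as err:
    missing = err.args[0]
```

In this sketch the KeyError names the untranslated checkpoint key, which points straight at the mapping row to add.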