Phase 14 — Interview Questions: Model Architectures (Adding a Model)
Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)
Q1. Walk me through adding support for a new decoder-only model to vLLM.
Model answer
Implement the model as an nn.Module built from vLLM's parallel layers and its Attention layer; implement load_weights to remap the HF checkpoint names onto vLLM's parameters (especially the fused QKV and gate-up projections); register it; add it to the supported list; and add a correctness test that compares greedy outputs and logits against HF. Along the way, handle TP sharding, tied embeddings, and any quantization/LoRA hooks.
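The remapping step is the part interviewers tend to probe, so it helps to have a concrete picture of it. The sketch below is a simplified, pure-Python illustration of the name-remapping idea only: the mapping table follows the (fused name, checkpoint name, shard id) pattern seen in vLLM model files, but the helper names remap_checkpoint_name and plan_weight_loading are hypothetical, and no tensors are actually loaded.

```python
from typing import Dict, Iterable, List, Optional, Tuple, Union

ShardId = Union[str, int]

# (fused parameter substring, HF checkpoint substring, shard id).
# Modeled on the stacked-parameter pattern; treat the exact entries as an
# assumption that must be checked against the target architecture.
STACKED_PARAMS_MAPPING: List[Tuple[str, str, ShardId]] = [
    (".qkv_proj", ".q_proj", "q"),
    (".qkv_proj", ".k_proj", "k"),
    (".qkv_proj", ".v_proj", "v"),
    (".gate_up_proj", ".gate_proj", 0),
    (".gate_up_proj", ".up_proj", 1),
]


def remap_checkpoint_name(hf_name: str) -> Tuple[str, Optional[ShardId]]:
    """Map an HF checkpoint name to (target parameter name, shard id).

    Weights that are not part of a fused projection keep their name and
    get a shard id of None.
    """
    for fused, separate, shard_id in STACKED_PARAMS_MAPPING:
        if separate in hf_name:
            return hf_name.replace(separate, fused), shard_id
    return hf_name, None


def plan_weight_loading(
    hf_names: Iterable[str], tie_word_embeddings: bool = False
) -> Dict[str, List[Optional[ShardId]]]:
    """Group checkpoint names by the fused parameter they load into.

    With tied embeddings the lm_head weight is skipped, because the output
    projection reuses the input embedding matrix.
    """
    plan: Dict[str, List[Optional[ShardId]]] = {}
    for name in hf_names:
        if tie_word_embeddings and name.startswith("lm_head."):
            continue
        target, shard_id = remap_checkpoint_name(name)
        plan.setdefault(target, []).append(shard_id)
    return plan


if __name__ == "__main__":
    names = [
        "model.layers.0.self_attn.q_proj.weight",
        "model.layers.0.self_attn.k_proj.weight",
        "model.layers.0.self_attn.v_proj.weight",
        "model.layers.0.mlp.gate_proj.weight",
        "model.layers.0.mlp.up_proj.weight",
        "model.layers.0.mlp.down_proj.weight",
        "lm_head.weight",
    ]
    for target, shards in plan_weight_loading(names, tie_word_embeddings=True).items():
        print(target, shards)
```

In a real load_weights, each (target, shard id) pair would be handed to the parameter's weight loader so the slice lands in the right region of the fused tensor on each TP rank; this sketch stops at the naming plan.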
Q2. Why must the model use vLLM's Linear/Attention layers instead of plain torch?
Model answer
Those layers carry tensor-parallel sharding, paged-attention metadata, quantization dispatch, and CUDA-graph/compile compatibility. Plain torch layers would bypass paging, TP, and quantization, breaking the contract the rest of the engine relies on.
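To make the tensor-parallel part of that answer concrete, here is a toy illustration of column-parallel sharding in pure Python. It is not vLLM's implementation: the function names are hypothetical, the weight is laid out as [in][out] for readability (torch's nn.Linear stores [out, in]), and the "all-gather" is just list concatenation in a single process.

```python
from typing import List

Matrix = List[List[int]]
Vector = List[int]


def matmul(x: Vector, w: Matrix) -> Vector:
    """Compute y = x @ w for x of shape [in] and w of shape [in][out]."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def shard_columns(w: Matrix, tp_size: int, rank: int) -> Matrix:
    """Return the slice of output columns owned by one tensor-parallel rank."""
    out_features = len(w[0])
    assert out_features % tp_size == 0, "output dim must divide evenly across ranks"
    per_rank = out_features // tp_size
    return [row[rank * per_rank:(rank + 1) * per_rank] for row in w]


def column_parallel_forward(x: Vector, w: Matrix, tp_size: int) -> Vector:
    """Each rank multiplies by its own column shard; results are concatenated.

    The concatenation stands in for the all-gather a real TP layer performs.
    """
    gathered: Vector = []
    for rank in range(tp_size):
        gathered.extend(matmul(x, shard_columns(w, tp_size, rank)))
    return gathered


if __name__ == "__main__":
    x = [1, 2]
    w = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
    # The sharded result matches the unsharded matmul.
    print(matmul(x, w))
    print(column_parallel_forward(x, w, tp_size=2))
```

A plain torch linear layer holds the full weight on every rank and knows nothing about this slicing, which is why the checkpoint loader and the layer have to agree on the shard layout.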
Going deeper
The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.