Phase 14 — Interview Questions: Model Architectures (Adding a Model)

Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)

Q1. Walk me through adding support for a new decoder-only model to vLLM.

Model answer

Implement the model as an nn.Module built from vLLM's parallel layers and its Attention layer; implement load_weights to remap HF checkpoint names onto vLLM's parameters (especially the fused QKV and gate-up projections); register the architecture so the engine can resolve it; add it to the supported-models list; and add a correctness test that compares greedy outputs or logits against HF. Also handle TP sharding, tied embeddings, and any quantization or LoRA hooks.
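The checkpoint-remapping step is the one interviewers tend to probe, so it helps to have a concrete picture. Below is a minimal sketch of the name-remapping half of load_weights, assuming Llama-style HF parameter names. The mapping table follows the stacked-parameter pattern used in vLLM's Llama model; the helper name remap_checkpoint_name is hypothetical and exists only for illustration. A real load_weights would also copy each tensor into the right slice of the fused parameter.

```python
from typing import Optional, Tuple, Union

ShardId = Union[str, int]

# (fused parameter suffix, HF checkpoint suffix, shard id)
# Assumes Llama-style names: separate q/k/v and gate/up projections
# in the HF checkpoint, fused qkv_proj and gate_up_proj in the engine.
STACKED_PARAMS_MAPPING = [
    (".qkv_proj", ".q_proj", "q"),
    (".qkv_proj", ".k_proj", "k"),
    (".qkv_proj", ".v_proj", "v"),
    (".gate_up_proj", ".gate_proj", 0),
    (".gate_up_proj", ".up_proj", 1),
]


def remap_checkpoint_name(hf_name: str) -> Tuple[str, Optional[ShardId]]:
    """Hypothetical helper: map an HF parameter name to the fused name.

    Returns (engine_param_name, shard_id). shard_id is None when the
    parameter is not fused and can be loaded under its own name.
    """
    for fused_suffix, hf_suffix, shard_id in STACKED_PARAMS_MAPPING:
        if hf_suffix in hf_name:
            return hf_name.replace(hf_suffix, fused_suffix), shard_id
    return hf_name, None
```

The shard id tells the fused layer's weight loader which slice of the stacked tensor the incoming checkpoint tensor belongs to. Parameters such as down_proj or the embeddings fall through unchanged.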

Q2. Why must the model use vLLM's Linear/Attention layers instead of plain torch?

Model answer

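Those layers carry tensor-parallel sharding, paged-attention metadata, quantization dispatch, and CUDA-graph/compile compatibility. A plain nn.Linear holds the full, unsharded weight on every rank, and a plain torch attention call knows nothing about the paged KV cache. Plain torch layers would therefore bypass paging, TP, and quantization, breaking the contract the rest of the engine relies on.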
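To make the TP part of that answer concrete, here is a toy sketch of what a column-parallel linear layer does. It is pure Python and is not vLLM code. All function names are hypothetical, and the ranks are simulated in a loop instead of running on separate devices with a real all-gather. It assumes the PyTorch weight layout [out_features, in_features], so splitting the output dimension means slicing rows of the stored weight.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(weight: Matrix, x: Vector) -> Vector:
    """y = W @ x, with W stored as [out_features][in_features]."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def shard_rows(weight: Matrix, tp_rank: int, tp_size: int) -> Matrix:
    """Return the slice of output rows owned by one tensor-parallel rank."""
    out_features = len(weight)
    if out_features % tp_size != 0:
        raise ValueError("out_features must be divisible by tp_size")
    per_rank = out_features // tp_size
    return weight[tp_rank * per_rank:(tp_rank + 1) * per_rank]


def column_parallel_forward(weight: Matrix, x: Vector, tp_size: int) -> Vector:
    """Simulate every rank computing its slice, then gathering the outputs."""
    gathered: Vector = []
    for rank in range(tp_size):
        local_weight = shard_rows(weight, rank, tp_size)  # 1/tp_size of W
        gathered.extend(matvec(local_weight, x))          # stand-in for all-gather
    return gathered
```

Each simulated rank stores only 1/tp_size of the weight, yet the gathered result equals the unsharded matvec. A plain torch layer has no notion of tp_rank, so it would load and compute the full matrix on every device.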

Going deeper

The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.