Phase 13 — Interview Questions: Multimodal Models
Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)
Q1. How does a decoder-only LLM 'see' an image in vLLM?
Model answer
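A vision encoder turns the image into a sequence of embeddings, which take the place of a run of placeholder token positions reserved in the prompt. The number of placeholders depends on the model and, for dynamic-resolution models, on the image size. The language model then attends over text and image positions alike, with no separate attention path for the image. vLLM's multimodal input processing handles preprocessing and placeholder expansion, and the encoder output is cached so it isn't recomputed on every scheduling step.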
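If asked to whiteboard this, a minimal sketch of placeholder expansion and embedding merge might look like the following. It is plain Python over lists, not vLLM's actual code: `IMAGE_TOKEN_ID`, `expand_placeholders`, and `merge_embeddings` are hypothetical names chosen for illustration, and the encoder output is assumed to arrive as one embedding per placeholder slot.

```python
from typing import Callable, List

# Hypothetical placeholder id; real models reserve a specific vocabulary id.
IMAGE_TOKEN_ID = -1

Embedding = List[float]


def expand_placeholders(token_ids: List[int], num_image_tokens: int) -> List[int]:
    """Replace each single image placeholder with num_image_tokens copies."""
    out: List[int] = []
    for t in token_ids:
        if t == IMAGE_TOKEN_ID:
            out.extend([IMAGE_TOKEN_ID] * num_image_tokens)
        else:
            out.append(t)
    return out


def merge_embeddings(
    token_ids: List[int],
    text_embed: Callable[[int], Embedding],
    image_embeds: List[Embedding],
) -> List[Embedding]:
    """Build the decoder's input embeddings: text positions use the text
    embedding lookup, placeholder positions take encoder outputs in order."""
    num_slots = sum(1 for t in token_ids if t == IMAGE_TOKEN_ID)
    if num_slots != len(image_embeds):
        raise ValueError(
            f"{num_slots} placeholder slots but {len(image_embeds)} image embeddings"
        )
    image_iter = iter(image_embeds)
    return [
        next(image_iter) if t == IMAGE_TOKEN_ID else text_embed(t)
        for t in token_ids
    ]


# Usage: one image that the (assumed) encoder maps to 3 embeddings.
prompt = expand_placeholders([1, IMAGE_TOKEN_ID, 2], num_image_tokens=3)
merged = merge_embeddings(
    prompt,
    text_embed=lambda t: [float(t)],        # stand-in for an embedding table
    image_embeds=[[0.1], [0.2], [0.3]],     # stand-in for vision encoder output
)
```

The point worth making out loud is the slot-count check: if the processor expands to a different number of placeholders than the encoder emits, the merge cannot be done, which is why the two steps have to agree on the per-image token count.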
Q2. What new bottlenecks do multimodal models add?
Model answer
The vision encoder adds compute and memory before prefill; image tokens inflate sequence length and therefore KV-cache usage; and the encoder cache and input processing have to be profiled and budgeted for when batching, especially for dynamic-resolution models, where the image token count varies per request.
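A quick back-of-the-envelope calculation helps make the sequence-length and KV point concrete. The sketch below assumes a ViT-style encoder that emits one token per patch (optionally merging neighbouring patches) and a decoder with grouped-query attention; the function names and every numeric value are illustrative assumptions, not the configuration of any particular model.

```python
def image_tokens(height: int, width: int, patch: int = 14, merge: int = 1) -> int:
    """Tokens from a patch grid, assuming height and width are multiples of
    patch * merge. merge > 1 models merge x merge spatial token merging."""
    grid_h = height // patch
    grid_w = width // patch
    return (grid_h // merge) * (grid_w // merge)


def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    """KV-cache bytes for one token: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed example: 336x336 image, 14-pixel patches, no merging.
tokens = image_tokens(336, 336, patch=14)            # 24 * 24 = 576

# Assumed decoder: 32 layers, 8 KV heads, head_dim 128, 16-bit KV cache.
per_token = kv_bytes_per_token(32, 8, 128)           # 131072 bytes (128 KiB)

image_kv_mib = tokens * per_token / (1024 * 1024)    # 72.0 MiB for one image
```

Under these assumptions a single image costs as much KV cache as 576 text tokens, and for a dynamic-resolution model the same arithmetic gives a different answer per request, which is what makes batching and memory profiling harder.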
Going deeper
The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.