Phase 13 — Interview Questions: Multimodal Models

Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)

Q1. How does a decoder-only LLM 'see' an image in vLLM?

Model answer

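A vision encoder (typically followed by a projection layer) turns the image into a sequence of embeddings in the language model's hidden dimension. Those embeddings fill placeholder token positions in the prompt, one position per image token; the count is fixed for fixed-resolution models and depends on image size for dynamic-resolution ones. The language model then attends over text and image positions uniformly. vLLM's input processing handles image preprocessing and placeholder expansion, and the encoder output is cached so it isn't recomputed each step.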
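The placeholder mechanism is easier to explain at a whiteboard with a toy version in hand. The sketch below is pure Python and is not vLLM's API: `IMAGE_TOKEN_ID`, `expand_placeholders`, `merge_embeddings`, `EncoderCache`, and the toy encoder are all hypothetical names chosen for illustration. It assumes one image per prompt and a fixed number of image tokens per image.

```python
import hashlib

IMAGE_TOKEN_ID = -1      # hypothetical placeholder id, not a real vocabulary entry
TOKENS_PER_IMAGE = 3     # toy value; real models use hundreds or more


def expand_placeholders(token_ids, tokens_per_image):
    """Replace each single image placeholder with one slot per image token."""
    out = []
    for t in token_ids:
        if t == IMAGE_TOKEN_ID:
            out.extend([IMAGE_TOKEN_ID] * tokens_per_image)
        else:
            out.append(t)
    return out


def merge_embeddings(token_ids, text_embed, image_embeds):
    """Text slots come from the embedding lookup, image slots from the encoder."""
    n_slots = sum(1 for t in token_ids if t == IMAGE_TOKEN_ID)
    if n_slots != len(image_embeds):
        raise ValueError(f"{n_slots} placeholder slots, {len(image_embeds)} image embeddings")
    image_iter = iter(image_embeds)
    return [next(image_iter) if t == IMAGE_TOKEN_ID else text_embed(t)
            for t in token_ids]


class EncoderCache:
    """Toy cache keyed by a content hash, so an image is encoded at most once."""

    def __init__(self, encode_fn):
        self._encode_fn = encode_fn
        self._store = {}
        self.misses = 0

    def get(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._encode_fn(image_bytes)
        return self._store[key]


def toy_encoder(image_bytes):
    # Stand-in for a vision encoder plus projector: one 2-d vector per image token.
    return [[float(i), float(len(image_bytes))] for i in range(TOKENS_PER_IMAGE)]


def toy_text_embed(token_id):
    # Stand-in for the language model's embedding table.
    return [float(token_id), 0.0]


cache = EncoderCache(toy_encoder)
prompt = [101, IMAGE_TOKEN_ID, 102]
expanded = expand_placeholders(prompt, TOKENS_PER_IMAGE)
merged = merge_embeddings(expanded, toy_text_embed, cache.get(b"fake-image"))
cache.get(b"fake-image")  # second lookup hits the cache; the encoder does not run again
```

After expansion the prompt has five positions, and the three middle rows of `merged` come from the encoder rather than the embedding table. The language model itself never needs to know which rows came from where.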

Q2. What new bottlenecks do multimodal models add?

Model answer

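Three things. First, the vision encoder adds compute and memory before prefill. Second, image tokens inflate sequence length, and with it the KV cache footprint. Third, the encoder cache and input processing must be profiled and batched carefully, especially for dynamic-resolution models, where the image token count varies from request to request.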
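It helps to put rough numbers on the sequence-length point. The sketch below is back-of-envelope arithmetic, not a measurement. It assumes a ViT-style encoder that emits one token per 14x14 patch with no token merging or pooling, and a language model with 32 layers, 32 KV heads, head dimension 128, and an fp16 KV cache. Those figures are illustrative assumptions, and the function names are hypothetical.

```python
def image_tokens(height, width, patch):
    """Image tokens for a patch-based encoder with no pooling or merging (assumption)."""
    return (height // patch) * (width // patch)


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache per sequence position: one K and one V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128, dtype_bytes=2)

# Fixed-resolution case: a 336x336 image with 14x14 patches.
fixed_tokens = image_tokens(336, 336, 14)        # 24 * 24 = 576
fixed_mib = fixed_tokens * per_token / 2**20     # 288.0 MiB of KV cache for one image

# Dynamic-resolution case: a larger 672x1008 image under the same patch size.
dynamic_tokens = image_tokens(672, 1008, 14)     # 48 * 72 = 3456
dynamic_mib = dynamic_tokens * per_token / 2**20 # 1728.0 MiB
```

Under these assumptions a single larger image costs six times the KV cache of the small one, which is why dynamic-resolution models make the scheduler's token budget and memory profiling harder to reason about.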

Going deeper

The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.