Phase 13 — Deep Dive: multimodal in real vLLM

Paths relative to upstream/ at v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number drifts in a newer tree, search for the named symbol.

vllm/multimodal/inputs.py                 PlaceholderRange + input dataclasses (read first)
vllm/multimodal/hasher.py                 MultiModalHasher — content identity
vllm/model_executor/models/llava.py       the canonical VLM (encoder+projector+splice)
vllm/model_executor/models/utils.py       _merge_multimodal_embeddings (the splice itself)
vllm/v1/core/encoder_cache_manager.py     the second cache
vllm/v1/core/sched/scheduler.py           _try_schedule_encoder_inputs (the hook)
vllm/model_executor/models/qwen2_vl.py    dynamic resolution (contrast case)

1. The contract: PlaceholderRange
2. The processor: prompt rewriting, LLaVA-style
3. The splice: _merge_multimodal_embeddings
4. Identity: MultiModalHasher
5. The second cache: EncoderCacheManager
6. The scheduler hook: _try_schedule_encoder_inputs
7. Contrast case: Qwen2-VL dynamic resolution
Reading checklist

1. The contract: `PlaceholderRange`

vllm/multimodal/inputs.py:119 — class PlaceholderRange(offset, length, is_embed). The docstring example is the whole idea: prompt AAAA BBBB What is… gives image A PlaceholderRange(offset=0, length=4), image B (offset=5, length=4). is_embed is the subtlety: some models put structure tokens inside the range (Pixtral inserts a row-break token after each patch row — see llava.py:390, ([image_token_id] * ncols + [image_break_id]) * nrows), so a boolean mask says which positions actually receive embeddings. Everything downstream — scheduler windowing, embedding merge, profiling — is arithmetic over these ranges.
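To make the role of is_embed concrete, here is a minimal sketch of the idea. ToyPlaceholderRange is a hypothetical stand-in, not the real vLLM class (which stores the mask as a tensor), and the token IDs are invented for illustration. It builds a Pixtral-like layout and derives which prompt positions receive embeddings.

```python
from dataclasses import dataclass
from typing import List, Optional

IMAGE, BREAK = 10, 12  # hypothetical token IDs, chosen only for this example


@dataclass
class ToyPlaceholderRange:
    """Illustrative stand-in for PlaceholderRange (not the vLLM class)."""
    offset: int                           # first prompt position of the item
    length: int                           # number of prompt positions it spans
    is_embed: Optional[List[bool]] = None  # None: every position gets an embedding

    def num_embeds(self) -> int:
        # Structure tokens inside the range do not count as embeddings.
        return self.length if self.is_embed is None else sum(self.is_embed)

    def embed_positions(self) -> List[int]:
        # Absolute prompt indices that will be overwritten by encoder output.
        mask = self.is_embed or [True] * self.length
        return [self.offset + i for i, keep in enumerate(mask) if keep]


def pixtral_like_tokens(ncols: int, nrows: int) -> List[int]:
    # Same shape as the expression quoted above: a break token after each row.
    return ([IMAGE] * ncols + [BREAK]) * nrows


tokens = pixtral_like_tokens(ncols=3, nrows=2)
rng = ToyPlaceholderRange(
    offset=5,
    length=len(tokens),
    is_embed=[t == IMAGE for t in tokens],
)
print(rng.num_embeds(), rng.embed_positions())
```

The range spans eight positions but only six of them receive embeddings, which is why downstream accounting counts embeddings rather than range length.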

2. The processor: prompt rewriting, LLaVA-style

vllm/model_executor/models/llava.py is the layout to internalize, because Phase 14's "add a model" recipe reuses every piece:

BaseLlavaProcessingInfo.get_num_image_tokens (:188) — asks the vision-encoder info object how many tokens an H×W image becomes. This number is model math, not a constant.
LlavaDummyInputsBuilder (:222) — builds worst-case fake inputs (image_token * num_images) so startup profiling (Phase 1's memory measurement) sees the most expensive possible multimodal request before any real one arrives.
BaseLlavaMultiModalProcessor._get_prompt_updates (:264) — the rewrite rule: replace one image_token_id with [image_token_id] * num_image_tokens (:297). This is where <image> becomes 576 dummies and the PlaceholderRange is born.
The registry (vllm/multimodal/registry.py) binds processor classes to model classes via the @MULTIMODAL_REGISTRY.register_processor decorator on the model (:308 region).
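The rewrite rule and the token-count math can be sketched in a few lines. This is a simplified illustration, not the processor's code: the function names are hypothetical, the image token ID is made up, and the 336-pixel image with 14-pixel patches is an assumed configuration that happens to reproduce the 576 figure mentioned above.

```python
from typing import List, Tuple


def llava_like_num_image_tokens(image_size: int, patch_size: int) -> int:
    # Assumption: a square image split into non-overlapping square patches,
    # one token per patch.
    side = image_size // patch_size
    return side * side


def expand_image_tokens(
    token_ids: List[int], image_token_id: int, num_image_tokens: int
) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Replace each single image token with N copies.

    Returns the rewritten prompt and one (offset, length) pair per image.
    """
    out: List[int] = []
    ranges: List[Tuple[int, int]] = []
    for tok in token_ids:
        if tok == image_token_id:
            ranges.append((len(out), num_image_tokens))  # the range is born here
            out.extend([image_token_id] * num_image_tokens)
        else:
            out.append(tok)
    return out, ranges


IMAGE_TOKEN_ID = 32000  # hypothetical ID for this example
n = llava_like_num_image_tokens(image_size=336, patch_size=14)
prompt = [1, IMAGE_TOKEN_ID, 5, 6]
expanded, ranges = expand_image_tokens(prompt, IMAGE_TOKEN_ID, n)
print(n, len(expanded), ranges)
```

The offsets are computed against the rewritten prompt, not the original one, which is what lets later stages treat them as plain indices into the token sequence.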

3. The splice: `_merge_multimodal_embeddings`

vllm/model_executor/models/utils.py:456. After embed_multimodal (llava.py:661) runs encoder + projector (LlavaMultiModalProjector, :128 — two linears and an activation), the merge is one line:

inputs_embeds[is_multimodal] = mm_embeds_flat.to(dtype=input_dtype)

An in-place masked scatter — pixels become "words" by assignment. Read the except RuntimeError block (:478): the count-mismatch error ("Attempted to assign X multimodal tokens to Y placeholders") is the canonical symptom of a broken processor↔model contract, and the first thing you'll debug when adding a VLM. Note also the comment about keeping is_multimodal on CPU to avoid a device sync — model-runner hot path discipline.
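The invariant behind that one line is that the number of True entries in the mask equals the number of flattened embedding rows. The sketch below shows it with plain Python lists standing in for tensors. The function name is hypothetical and the exception type is an illustrative choice; only the message wording follows the text quoted above.

```python
from typing import List

Vector = List[float]


def toy_merge_mm_embeddings(
    inputs_embeds: List[Vector],
    is_multimodal: List[bool],
    mm_embeds: List[List[Vector]],
) -> List[Vector]:
    """List-based stand-in for a masked scatter (not the vLLM function)."""
    # Flatten per-item embeddings into one sequence of rows.
    flat = [row for item in mm_embeds for row in item]
    num_placeholders = sum(is_multimodal)
    if len(flat) != num_placeholders:
        # Processor and model disagree about how many tokens an item becomes.
        raise ValueError(
            f"Attempted to assign {len(flat)} multimodal tokens "
            f"to {num_placeholders} placeholders"
        )
    rows = iter(flat)
    for i, flag in enumerate(is_multimodal):
        if flag:
            inputs_embeds[i] = next(rows)  # in-place overwrite, in order
    return inputs_embeds


text_embeds = [[0.0, 0.0] for _ in range(5)]
mask = [False, True, True, False, True]
image_embeds = [[[1.0, 1.0], [2.0, 2.0]], [[3.0, 3.0]]]  # two items: 2 rows + 1 row
merged = toy_merge_mm_embeddings(text_embeds, mask, image_embeds)
print(merged)
```

Passing one row too few or too many raises immediately, which mirrors the count-mismatch symptom described above.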

4. Identity: `MultiModalHasher`

vllm/multimodal/hasher.py:50. hash_kwargs (:154) serializes each item (images go via serialize_item, :52 — raw bytes, not object identity) through blake3-style hashing into an mm_hash string. One hash, two jobs:

Encoder-cache key — same image in any request hits the same cached embeddings.
Prefix-cache ingredient — the hash is folded into KV block hashes covering the placeholder span (Phase 3's kv_cache_utils block hasher takes extra_keys for exactly this), so identical dummy token IDs with different pixels cannot alias.
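A small sketch of both jobs, using hashlib.sha256 from the standard library as a stand-in hash. The function names and the way extra keys are folded in are assumptions for illustration and do not reproduce vLLM's serialization or block hasher.

```python
import hashlib
from typing import Sequence


def toy_mm_hash(item_bytes: bytes) -> str:
    # Content identity: equal bytes give equal hashes, regardless of which
    # Python object or which request carried them.
    return hashlib.sha256(item_bytes).hexdigest()


def toy_block_hash(
    parent_hash: str, token_ids: Sequence[int], extra_keys: Sequence[str] = ()
) -> str:
    # A block hash chains the parent hash, the block's token IDs, and any
    # extra keys (here: hashes of multimodal items overlapping the block).
    h = hashlib.sha256()
    h.update(parent_hash.encode())
    h.update(repr(tuple(token_ids)).encode())
    h.update(repr(tuple(extra_keys)).encode())
    return h.hexdigest()


cat = toy_mm_hash(b"cat pixels")
dog = toy_mm_hash(b"dog pixels")

# Two prompts whose placeholder spans contain identical dummy token IDs.
block_tokens = [32000] * 16  # hypothetical image token ID
h_cat = toy_block_hash("", block_tokens, extra_keys=(cat,))
h_dog = toy_block_hash("", block_tokens, extra_keys=(dog,))
print(h_cat != h_dog)
```

Without the extra key the two blocks would hash identically, and a prefix-cache hit would serve KV computed from the wrong image.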

5. The second cache: `EncoderCacheManager`

vllm/v1/core/encoder_cache_manager.py:17. Read the class docstring — it is unusually complete. The structure, mapped to Phase 2 vocabulary:

cached: dict[mm_hash, set[request_id]] — ref-counting by named references instead of an integer ref_cnt (you can ask who holds it).
freeable: OrderedDict[mm_hash, num_embeds] — the LRU free-queue analogue: entries with zero referencing requests, evictable oldest-first, reclaimed lazily at allocation time (can_allocate, :119) exactly like Phase 2's cached-block eviction.
num_free_slots vs num_freeable_slots — actual free space vs free-after-evictions; the allocate path decides how much eviction it must perform.
Units are encoder embeddings, not blocks or bytes (see the NOTE in the docstring: in-between break/text tokens don't count) — the budget that sized this cache comes from compute_mm_encoder_budget (:269) at startup.
get_freed_mm_hashes (:255) is drained each step into SchedulerOutput (scheduler.py:901) so workers drop their copies. The manager is scheduler-side bookkeeping only; the tensors live in the runner's encoder_cache dict (gpu_model_runner.py:3065). This is the same division of labor as KV: scheduler owns accounting, worker owns memory.
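The bookkeeping described in the list above can be modeled in a short toy class. This is a sketch under simplifying assumptions: the class name and method signatures are hypothetical (the real methods take request objects and input indices), and sizes are passed in explicitly rather than looked up.

```python
from collections import OrderedDict
from typing import Dict, List, Set


class ToyEncoderCache:
    """Illustrative accounting only; it holds no tensors."""

    def __init__(self, cache_size: int) -> None:
        self.num_free_slots = cache_size      # space free right now
        self.num_freeable_slots = cache_size  # free now + reclaimable by eviction
        self.cached: Dict[str, Set[str]] = {}  # mm_hash -> referencing request IDs
        self.freeable: "OrderedDict[str, int]" = OrderedDict()  # mm_hash -> num_embeds
        self.freed: List[str] = []             # evicted hashes awaiting drain

    def check_and_update_cache(self, req_id: str, mm_hash: str) -> bool:
        if mm_hash not in self.cached:
            return False
        if not self.cached[mm_hash]:
            # Entry was unreferenced but not yet evicted: revive it.
            self.num_freeable_slots -= self.freeable.pop(mm_hash)
        self.cached[mm_hash].add(req_id)
        return True

    def can_allocate(self, num_embeds: int) -> bool:
        if num_embeds > self.num_freeable_slots:
            return False
        # Lazy reclamation: evict oldest unreferenced entries only as needed.
        while num_embeds > self.num_free_slots:
            victim, size = self.freeable.popitem(last=False)
            del self.cached[victim]
            self.freed.append(victim)
            self.num_free_slots += size
        return True

    def allocate(self, req_id: str, mm_hash: str, num_embeds: int) -> None:
        self.cached[mm_hash] = {req_id}
        self.num_free_slots -= num_embeds
        self.num_freeable_slots -= num_embeds

    def free(self, req_id: str, mm_hash: str, num_embeds: int) -> None:
        holders = self.cached[mm_hash]
        holders.discard(req_id)
        if not holders:
            # De-reference only: the entry lingers until space is needed.
            self.freeable[mm_hash] = num_embeds
            self.num_freeable_slots += num_embeds

    def get_freed_mm_hashes(self) -> List[str]:
        freed, self.freed = self.freed, []
        return freed


cache = ToyEncoderCache(cache_size=10)
assert cache.can_allocate(6)
cache.allocate("r1", "A", 6)
hit = cache.check_and_update_cache("r2", "A")  # second request, same image
cache.free("r1", "A", 6)
cache.free("r2", "A", 6)                       # now unreferenced, still cached
ok = cache.can_allocate(8)                     # forces eviction of "A"
cache.allocate("r3", "B", 8)
evicted = cache.get_freed_mm_hashes()
print(hit, ok, evicted)
```

Tracing the second request through this toy answers the checklist question in miniature: a hit adds a name to the reference set, and nothing moves to the freeable queue until the last holder is gone.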

6. The scheduler hook: `_try_schedule_encoder_inputs`

vllm/v1/core/sched/scheduler.py:1096. Called for both running (:410) and waiting (:679) requests. The docstring lists the four conditions (overlap with the computed window; not already cached; encoder compute budget; encoder cache space). The mechanism to study is the fallback: when an encoder input fails a check, the function truncates num_new_tokens so the chunk ends just before the placeholder — the request still makes progress on text, and the image waits for a step with budget. Consequences worth saying out loud:

Encoder work rides the same step as the decoder chunk that first overlaps the image — there is no separate "encoder phase" (contrast Phase 15's encode-disaggregated serving, where there is).
The per-step encoder_compute_budget bounds step-time inflation; the cache-space check prevents an admission deadlock. An image that can never fit is rejected at the front door, and the sizing done by compute_mm_encoder_budget guarantees that the worst-case item does fit.
On allocation (:524/:810), the manager records the request as a referent; on request finish, free (:939) just de-references — the embeddings linger, freeable, for reuse.
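The truncation fallback is easier to follow as code. The sketch below is a simplified model of the four checks, not the scheduler's implementation: the function name, the tuple layout for placeholders, and the use of range length as the embedding count are all assumptions made for the example.

```python
from typing import List, Set, Tuple

Placeholder = Tuple[str, int, int]  # (mm_hash, offset, length), sorted by offset


def toy_schedule_encoder_inputs(
    num_computed_tokens: int,
    num_new_tokens: int,
    placeholders: List[Placeholder],
    cached: Set[str],
    compute_budget: int,
    cache_free: int,
) -> Tuple[List[str], int, int]:
    """Return (hashes to encode this step, possibly truncated chunk, budget left)."""
    to_encode: List[str] = []
    for mm_hash, offset, length in placeholders:
        window_end = num_computed_tokens + num_new_tokens
        if offset >= window_end:
            break       # item starts after this chunk: not needed yet
        if offset + length <= num_computed_tokens:
            continue    # item already fully consumed by earlier chunks
        if mm_hash in cached:
            continue    # embeddings already available
        if length > compute_budget or length > cache_free:
            # Fallback: end the chunk just before the placeholder so the
            # request still advances on the text that precedes the image.
            num_new_tokens = max(offset - num_computed_tokens, 0)
            break
        to_encode.append(mm_hash)
        compute_budget -= length
        cache_free -= length
    return to_encode, num_new_tokens, compute_budget


image = [("img", 10, 50)]  # a 50-token image starting at prompt position 10
tight = toy_schedule_encoder_inputs(0, 32, image, set(), compute_budget=40, cache_free=100)
roomy = toy_schedule_encoder_inputs(0, 32, image, set(), compute_budget=64, cache_free=100)
print(tight, roomy)
```

In the tight case the chunk shrinks from 32 tokens to 10 and no encoder work is scheduled; in the roomy case the image is encoded in the same step as the chunk that first overlaps it.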

7. Contrast case: Qwen2-VL dynamic resolution

vllm/model_executor/models/qwen2_vl.py. Where LLaVA emits a fixed 576 tokens per image, here the token count is a function of the actual image (grid_thw: patch counts along time, height, and width), so get_num_image_tokens does real arithmetic, video adds a time dimension, and M-RoPE (multimodal rotary position encoding, which interleaves text positions with 2-D image positions) replaces vanilla RoPE. You don't need every detail; you need to recognize which parts of the Phase-13 machinery flex (token counting, dummy-input profiling, position encoding) and which stay identical (placeholder contract, encoder cache, scheduler hook).
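To see why the count is no longer a constant, here is a rough sketch of the arithmetic. It assumes a 14-pixel patch and a 2x2 spatial merge, and it rounds each side to the nearest multiple of patch_size * merge_size. The real preprocessing also enforces pixel-count bounds, which this sketch omits, and the function names are hypothetical.

```python
from typing import Tuple


def toy_image_grid(
    height: int, width: int, patch_size: int = 14, merge_size: int = 2
) -> Tuple[int, int, int]:
    """Return a (t, h, w) patch grid for a single image (simplified resize)."""
    unit = patch_size * merge_size
    # Snap each side to a multiple of `unit`, with at least one unit per side.
    h = max(unit, round(height / unit) * unit)
    w = max(unit, round(width / unit) * unit)
    return (1, h // patch_size, w // patch_size)


def toy_num_tokens(grid_thw: Tuple[int, int, int], merge_size: int = 2) -> int:
    # Each merge_size x merge_size group of patches becomes one LLM token.
    t, h, w = grid_thw
    return t * h * w // (merge_size * merge_size)


small = toy_num_tokens(toy_image_grid(224, 224))
large = toy_num_tokens(toy_image_grid(448, 672))
clip = toy_num_tokens((4, 32, 48))  # a video-like grid with four temporal steps
print(small, large, clip)
```

Two images of different sizes yield different placeholder lengths, so the dummy inputs used for profiling have to assume the largest grid the model will accept rather than a fixed number.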

Reading checklist

PlaceholderRange — what is is_embed for? Find the Pixtral line that makes it necessary (llava.py:390).
_get_prompt_updates in llava.py — where exactly does 1 token become N?
_merge_multimodal_embeddings — what's the invariant, and what error message do you get when it breaks?
EncoderCacheManager.check_and_update_cache / can_allocate — walk a second request arriving with the same image: which dict/list transitions happen?
_try_schedule_encoder_inputs — all four scheduling conditions, and what happens to num_new_tokens when one fails?
In scheduler.py:901, how do workers learn an encoder entry was evicted?

Now build it: 02-mini-build.md, then the labs.
