Phase 13 — Deep Dive: multimodal in real vLLM
Paths are relative to upstream/ at v0.22.1 @ 0decac0 (UPSTREAM_PIN.md). If a line number drifts in a newer tree, search for the named symbol.
- vllm/multimodal/inputs.py: PlaceholderRange + input dataclasses (read first)
- vllm/multimodal/hasher.py: MultiModalHasher, content identity
- vllm/model_executor/models/llava.py: the canonical VLM (encoder + projector + splice)
- vllm/model_executor/models/utils.py: _merge_multimodal_embeddings (the splice itself)
- vllm/v1/core/encoder_cache_manager.py: the second cache
- vllm/v1/core/sched/scheduler.py: _try_schedule_encoder_inputs (the hook)
- vllm/model_executor/models/qwen2_vl.py: dynamic resolution (contrast case)
Contents
- 1. The contract: PlaceholderRange
- 2. The processor: prompt rewriting, LLaVA-style
- 3. The splice: _merge_multimodal_embeddings
- 4. Identity: MultiModalHasher
- 5. The second cache: EncoderCacheManager
- 6. The scheduler hook: _try_schedule_encoder_inputs
- 7. Contrast case: Qwen2-VL dynamic resolution
- Reading checklist
1. The contract: PlaceholderRange
vllm/multimodal/inputs.py:119 — class PlaceholderRange(offset, length, is_embed). The
docstring example is the whole idea: prompt AAAA BBBB What is… gives image A
PlaceholderRange(offset=0, length=4), image B (offset=5, length=4). is_embed is the
subtlety: some models put structure tokens inside the range (Pixtral inserts a row-break
token after each patch row — see llava.py:390, ([image_token_id] * ncols + [image_break_id]) * nrows), so a boolean mask says which positions actually receive
embeddings. Everything downstream — scheduler windowing, embedding merge, profiling — is
arithmetic over these ranges.
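To make the range arithmetic concrete, here is a minimal sketch in plain Python. It is a simplified stand-in, not the vLLM class: it mirrors only the three fields named above and uses a list of bools for the mask. The helper names (pixtral_layout, range_for, num_embeds) and the token IDs are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

IMAGE_TOKEN_ID = 10   # hypothetical token IDs, for illustration only
IMAGE_BREAK_ID = 12


@dataclass(frozen=True)
class PlaceholderRange:
    """Simplified stand-in: where one multimodal item sits in the prompt."""
    offset: int
    length: int
    is_embed: Optional[list] = None  # None means every position gets an embedding

    def num_embeds(self) -> int:
        """Count the positions that actually receive an embedding."""
        if self.is_embed is None:
            return self.length
        return sum(self.is_embed)


def pixtral_layout(ncols: int, nrows: int) -> list:
    """Patch tokens in row-major order, with a break token closing each row."""
    return ([IMAGE_TOKEN_ID] * ncols + [IMAGE_BREAK_ID]) * nrows


def range_for(tokens: list, offset: int) -> PlaceholderRange:
    """Build the range for a placeholder span that starts at `offset`."""
    mask = [tok == IMAGE_TOKEN_ID for tok in tokens]
    return PlaceholderRange(offset=offset, length=len(tokens), is_embed=mask)


span = pixtral_layout(ncols=3, nrows=2)  # 2 rows of 3 patches, each row followed by a break
rng = range_for(span, offset=5)
print(rng.length, rng.num_embeds())      # span length vs positions that get embeddings
```

The range covers all eight positions, but only six of them are embedding slots. That gap between length and the mask count is what is_embed exists to express.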
2. The processor: prompt rewriting, LLaVA-style
vllm/model_executor/models/llava.py is the layout to internalize, because Phase 14's
"add a model" recipe reuses every piece:
- BaseLlavaProcessingInfo.get_num_image_tokens (:188) asks the vision-encoder info object how many tokens an H×W image becomes. This number is model math, not a constant.
- LlavaDummyInputsBuilder (:222) builds worst-case fake inputs (image_token * num_images) so startup profiling (Phase 1's memory measurement) sees the most expensive possible multimodal request before any real one arrives.
- BaseLlavaMultiModalProcessor._get_prompt_updates (:264) is the rewrite rule: replace one image_token_id with [image_token_id] * num_image_tokens (:297). This is where <image> becomes 576 dummies and the PlaceholderRange is born.
- The registry (vllm/multimodal/registry.py) binds processor classes to model classes via the @MULTIMODAL_REGISTRY.register_processor decorator on the model (:308 region).
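The rewrite rule can be sketched in a few lines. This is an illustration of the idea only, not the vLLM processor API: expand_image_tokens is a hypothetical helper, the token ID is made up, and the per-image counts are passed in directly rather than computed by get_num_image_tokens.

```python
IMAGE_TOKEN_ID = 99  # hypothetical ID standing in for the "<image>" token


def expand_image_tokens(prompt_ids, num_image_tokens_per_image):
    """Replace each single image token with N copies and record (offset, length)."""
    out = []
    ranges = []
    counts = iter(num_image_tokens_per_image)
    for tok in prompt_ids:
        if tok == IMAGE_TOKEN_ID:
            n = next(counts)               # token count for this image
            ranges.append((len(out), n))   # offset is measured in the rewritten prompt
            out.extend([IMAGE_TOKEN_ID] * n)
        else:
            out.append(tok)
    return out, ranges


prompt = [1, IMAGE_TOKEN_ID, 5, 6, IMAGE_TOKEN_ID, 7]
new_ids, ranges = expand_image_tokens(prompt, [4, 3])
print(len(new_ids), ranges)
```

Offsets are measured in the rewritten prompt, not the original one, so the second image's offset already accounts for the expansion of the first.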
3. The splice: _merge_multimodal_embeddings
vllm/model_executor/models/utils.py:456. After embed_multimodal (llava.py:661) runs
encoder + projector (LlavaMultiModalProjector, :128 — two linears and an activation),
the merge is one line:
inputs_embeds[is_multimodal] = mm_embeds_flat.to(dtype=input_dtype)
An in-place masked scatter — pixels become "words" by assignment. Read the except RuntimeError block (:478): the count-mismatch error ("Attempted to assign X multimodal
tokens to Y placeholders") is the canonical symptom of a broken processor↔model contract,
and the first thing you'll debug when adding a VLM. Note also the comment about keeping
is_multimodal on CPU to avoid a device sync — model-runner hot path discipline.
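The invariant behind that one line is that the number of encoder outputs must equal the number of True positions in the mask. The sketch below is a plain-Python stand-in for the tensor assignment, with lists in place of tensors. The function body is an assumption written for illustration, and only the error wording is modeled on the message quoted above.

```python
def merge_multimodal_embeddings(inputs_embeds, mm_embeds, is_multimodal):
    """Overwrite the masked rows of inputs_embeds with mm_embeds, in place.

    Each "embedding" is a list of floats; is_multimodal is a list of bools
    with one entry per prompt position.
    """
    positions = [i for i, flag in enumerate(is_multimodal) if flag]
    if len(mm_embeds) != len(positions):
        # The processor promised one count and the encoder produced another.
        raise ValueError(
            f"Attempted to assign {len(mm_embeds)} multimodal tokens "
            f"to {len(positions)} placeholders"
        )
    for pos, emb in zip(positions, mm_embeds):
        inputs_embeds[pos] = emb
    return inputs_embeds


text = [[0.0, 0.0] for _ in range(5)]      # embeddings of 5 prompt positions
mask = [False, True, True, False, False]   # positions 1 and 2 are image dummies
image = [[1.0, 1.5], [2.0, 2.5]]           # two encoder outputs
merged = merge_multimodal_embeddings(text, image, mask)
print(merged)
```

Passing one image embedding against a mask with two True entries raises the mismatch error, which is the situation a processor that miscounts tokens produces.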
4. Identity: MultiModalHasher
vllm/multimodal/hasher.py:50. hash_kwargs (:154) serializes each item (images go via
serialize_item, :52 — raw bytes, not object identity) through blake3-style hashing into
an mm_hash string. One hash, two jobs:
- Encoder-cache key: the same image in any request hits the same cached embeddings.
- Prefix-cache ingredient: the hash is folded into KV block hashes covering the placeholder span (Phase 3's kv_cache_utils block hasher takes extra_keys for exactly this), so identical dummy token IDs with different pixels cannot alias.
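Both jobs can be shown with the standard library. This sketch makes several assumptions: it uses hashlib.blake2b only because it ships with Python (it is not the hash vLLM uses), and mm_hash and block_hash are hypothetical helpers, not the functions in hasher.py or kv_cache_utils.

```python
import hashlib


def mm_hash(image_bytes: bytes) -> str:
    """Content hash of one item: depends on the bytes, not on object identity."""
    return hashlib.blake2b(image_bytes, digest_size=16).hexdigest()


def block_hash(parent_hash: str, token_ids: tuple, extra_keys: tuple = ()) -> str:
    """Hash one KV block from its parent hash, its token IDs, and any extra keys."""
    payload = repr((parent_hash, token_ids, extra_keys)).encode()
    return hashlib.blake2b(payload, digest_size=16).hexdigest()


IMAGE_TOKEN_ID = 99               # hypothetical placeholder ID
dummies = (IMAGE_TOKEN_ID,) * 4   # the token IDs look the same for every image

cat, dog = b"cat pixels", b"dog pixels"
cat_copy = bytes(bytearray(cat))  # a separately built object with the same content

block_cat = block_hash("root", dummies, (mm_hash(cat),))
block_dog = block_hash("root", dummies, (mm_hash(dog),))
block_cat_again = block_hash("root", dummies, (mm_hash(cat_copy),))

print(block_cat == block_cat_again, block_cat == block_dog)
```

Without the extra key, the two blocks would hash identically because their token IDs are identical. The extra key is what keeps a cat's KV blocks from being served for a dog.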
5. The second cache: EncoderCacheManager
vllm/v1/core/encoder_cache_manager.py:17. Read the class docstring — it is unusually
complete. The structure, mapped to Phase 2 vocabulary:
- cached: dict[mm_hash, set[request_id]]. Ref-counting by named references instead of an integer ref_cnt (you can ask who holds it).
- freeable: OrderedDict[mm_hash, num_embeds]. The LRU free-queue analogue: entries with zero referencing requests, evictable oldest-first, reclaimed lazily at allocation time (can_allocate, :119) exactly like Phase 2's cached-block eviction.
- num_free_slots vs num_freeable_slots. Actual free space vs free-after-evictions; the allocate path decides how much eviction it must perform.
- Units are encoder embeddings, not blocks or bytes (see the NOTE in the docstring: in-between break/text tokens don't count). The budget that sized this cache comes from compute_mm_encoder_budget (:269) at startup.
- get_freed_mm_hashes (:255). Drained each step into SchedulerOutput (scheduler.py:901) so workers drop their copies: the manager is scheduler-side bookkeeping; the tensors live in the runner's encoder_cache dict (gpu_model_runner.py:3065). Same split-brain pattern as KV: scheduler owns accounting, worker owns memory.
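The bookkeeping described above fits in a small class. The following is a simplified sketch, not the upstream implementation: the attribute names follow the list above, but the method signatures are assumptions (they take mm_hash, request_id and num_embeds directly instead of request objects), and MiniEncoderCache is a hypothetical name.

```python
from collections import OrderedDict


class MiniEncoderCache:
    """Scheduler-side accounting only; no tensors live here."""

    def __init__(self, cache_size: int):
        self.num_free_slots = cache_size      # free right now
        self.num_freeable_slots = cache_size  # free once everything evictable is evicted
        self.cached = {}                      # mm_hash -> set of request IDs holding it
        self.freeable = OrderedDict()         # mm_hash -> num_embeds, oldest first
        self.freed = []                       # evicted hashes to report to workers

    def check_and_update_cache(self, mm_hash, request_id) -> bool:
        if mm_hash not in self.cached:
            return False
        if not self.cached[mm_hash]:          # unreferenced but still resident: revive it
            self.num_freeable_slots -= self.freeable.pop(mm_hash)
        self.cached[mm_hash].add(request_id)
        return True

    def can_allocate(self, num_embeds: int) -> bool:
        if num_embeds > self.num_freeable_slots:
            return False
        while num_embeds > self.num_free_slots:   # lazy eviction, oldest entry first
            victim, size = self.freeable.popitem(last=False)
            del self.cached[victim]
            self.freed.append(victim)
            self.num_free_slots += size
        return True

    def allocate(self, mm_hash, request_id, num_embeds: int) -> None:
        self.cached.setdefault(mm_hash, set()).add(request_id)
        self.num_free_slots -= num_embeds
        self.num_freeable_slots -= num_embeds

    def free(self, mm_hash, request_id, num_embeds: int) -> None:
        holders = self.cached.get(mm_hash)
        if not holders:
            return
        holders.discard(request_id)
        if not holders:                       # last reference gone: evictable, not evicted
            self.freeable[mm_hash] = num_embeds
            self.num_freeable_slots += num_embeds

    def get_freed_mm_hashes(self) -> list:
        freed, self.freed = self.freed, []
        return freed


cache = MiniEncoderCache(cache_size=10)
cache.can_allocate(6)
cache.allocate("img-A", "req-1", 6)
hit = cache.check_and_update_cache("img-A", "req-2")  # second request, same image
cache.free("img-A", "req-1", 6)                       # still held by req-2
cache.free("img-A", "req-2", 6)                       # now freeable, still resident
fits = cache.can_allocate(8)                          # forces eviction of img-A
cache.allocate("img-B", "req-3", 8)
evicted = cache.get_freed_mm_hashes()
print(hit, fits, evicted)
```

Freeing the last reference moves no memory. The entry only leaves the cache when a later can_allocate call needs the space, and the evicted hash is reported once through get_freed_mm_hashes.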
6. The scheduler hook: _try_schedule_encoder_inputs
vllm/v1/core/sched/scheduler.py:1096. Called for both running (:410) and waiting
(:679) requests. The docstring lists the four conditions (overlap with the computed
window; not already cached; encoder compute budget; encoder cache space). The mechanism to
study is the fallback: when an encoder input fails a check, the function truncates
num_new_tokens so the chunk ends just before the placeholder — the request still makes
progress on text, and the image waits for a step with budget. Consequences worth saying
out loud:
- Encoder work rides the same step as the decoder chunk that first overlaps the image — there is no separate "encoder phase" (contrast Phase 15's encode-disaggregated serving, where there is).
- The per-step encoder_compute_budget bounds step-time inflation; the cache-space check prevents an admission deadlock (an image that can never fit is rejected at the front door, and compute_mm_encoder_budget sizing guarantees the worst case fits).
- On allocation (:524/:810), the manager records the request as a referent; on request finish, free (:939) just de-references, so the embeddings linger, freeable, for reuse.
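The truncation fallback is easier to follow with numbers. This is a minimal sketch of the control flow described above, under stated assumptions: the signature is hypothetical, the cache checks are passed in as callables, and the space check ignores embeddings already scheduled earlier in the same step.

```python
def try_schedule_encoder_inputs(
    placeholders,         # list of (offset, length), in prompt order
    num_computed_tokens,  # tokens already processed for this request
    num_new_tokens,       # tokens the scheduler wants to run this step
    encoder_budget,       # encoder embeddings still allowed this step
    is_cached,            # callable: input index -> bool
    has_cache_space,      # callable: length -> bool
):
    """Return (input indices to encode, adjusted num_new_tokens, remaining budget)."""
    to_schedule = []
    for i, (start, length) in enumerate(placeholders):
        if start >= num_computed_tokens + num_new_tokens:
            break          # placeholder starts after this chunk's window
        if start + length <= num_computed_tokens:
            continue       # placeholder already fully consumed
        if is_cached(i):
            continue       # embeddings already available
        if length > encoder_budget or not has_cache_space(length):
            # Fallback: end the chunk just before this placeholder.
            num_new_tokens = max(start - num_computed_tokens, 0)
            break
        encoder_budget -= length
        to_schedule.append(i)
    return to_schedule, num_new_tokens, encoder_budget


images = [(4, 6), (20, 6)]  # two 6-token images at offsets 4 and 20

# Step 1: budget 8 covers the first image but not the second.
step1 = try_schedule_encoder_inputs(images, 0, 32, 8, lambda i: False, lambda n: True)

# Step 2: resume at token 20 with a fresh budget.
step2 = try_schedule_encoder_inputs(images, 20, 12, 8, lambda i: False, lambda n: True)
print(step1, step2)
```

In step 1 the chunk shrinks from 32 tokens to 20, so the request still advances through the text before the second image. That image is encoded in step 2, in the same step as the decoder chunk that overlaps it.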
7. Contrast case: Qwen2-VL dynamic resolution
vllm/model_executor/models/qwen2_vl.py. Versus LLaVA's fixed 576: token count is a
function of the actual image (grid_thw — patches per height/width/time), so
get_num_image_tokens does real arithmetic, video adds a time dimension, and M-RoPE
(multimodal rotary position encoding — text positions and 2-D image positions interleaved)
replaces vanilla RoPE. You don't need every detail; you need to recognize which parts of
the Phase-13 machinery flex (token counting, dummy-input profiling, position encoding)
and which don't (placeholder contract, encoder cache, scheduler hook — identical).
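A short sketch of the token-counting difference. Both functions are illustrative assumptions rather than the upstream code. The fixed case assumes a 336-pixel square input cut into 14-pixel patches, which is one way to arrive at 576. The dynamic case assumes that merge_size × merge_size neighbouring patches are merged into one token, which should be checked against get_num_image_tokens in qwen2_vl.py.

```python
def fixed_num_image_tokens(image_size: int = 336, patch_size: int = 14) -> int:
    """Fixed-resolution case: every image is resized to the same square."""
    side = image_size // patch_size
    return side * side


def dynamic_num_image_tokens(grid_thw: tuple, merge_size: int = 2) -> int:
    """Dynamic case: the count depends on the (time, height, width) patch grid."""
    t, h, w = grid_thw
    return (t * h * w) // (merge_size * merge_size)


print(fixed_num_image_tokens())               # the same for every image
print(dynamic_num_image_tokens((1, 16, 16)))  # small image
print(dynamic_num_image_tokens((1, 32, 48)))  # larger image, more tokens
print(dynamic_num_image_tokens((4, 16, 16)))  # video: time multiplies the count
```

Whatever the count turns out to be, it still ends up as the length of a PlaceholderRange, which is why the downstream machinery does not change.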
Reading checklist
- PlaceholderRange: what is is_embed for? Find the Pixtral line that makes it necessary (llava.py:390).
- _get_prompt_updates in llava.py: where exactly does 1 token become N?
- _merge_multimodal_embeddings: what's the invariant, and what error message do you get when it breaks?
- EncoderCacheManager.check_and_update_cache / can_allocate: walk a second request arriving with the same image. Which dict/list transitions happen?
- _try_schedule_encoder_inputs: all four scheduling conditions, and what happens to num_new_tokens when one fails?
- In scheduler.py:901, how do workers learn an encoder entry was evicted?
Now build it: 02-mini-build.md, then the labs.