Phase 13 — Cheatsheet: Multimodal Models
- Image -> vision encoder -> image embeddings -> merged in at placeholder token positions -> normal LLM forward pass.
- EncoderCacheManager lets image features be reused across steps; don't recompute them per step.
- Image tokens inflate sequence length and KV-cache usage; profile input processing too.
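
The first two bullets can be illustrated with a toy sketch. This is not vLLM's code or API: `IMAGE_PLACEHOLDER_ID`, `merge_image_embeddings`, `ToyEncoderCache`, and the toy encoder/embedding functions are all hypothetical names made up for illustration, and plain Python lists stand in for tensors. It only shows the idea of filling placeholder positions with image embeddings and running the encoder once per distinct image.

```python
from typing import Callable, Dict, List

Vector = List[float]

# Hypothetical placeholder token id; real models define their own.
IMAGE_PLACEHOLDER_ID = -1


def merge_image_embeddings(
    token_ids: List[int],
    text_embed: Callable[[int], Vector],
    image_embeds: List[Vector],
) -> List[Vector]:
    """Build the input embedding sequence: text tokens are embedded
    normally, each placeholder position takes the next image row."""
    n_placeholders = sum(1 for t in token_ids if t == IMAGE_PLACEHOLDER_ID)
    if n_placeholders != len(image_embeds):
        raise ValueError(
            f"{n_placeholders} placeholders but {len(image_embeds)} image rows"
        )
    image_rows = iter(image_embeds)
    return [
        next(image_rows) if t == IMAGE_PLACEHOLDER_ID else text_embed(t)
        for t in token_ids
    ]


class ToyEncoderCache:
    """Hash-keyed cache so the encoder runs once per distinct image."""

    def __init__(self, encoder: Callable[[bytes], List[Vector]]) -> None:
        self._encoder = encoder
        self._cache: Dict[str, List[Vector]] = {}
        self.encoder_calls = 0

    def get(self, image_hash: str, image: bytes) -> List[Vector]:
        if image_hash not in self._cache:
            self.encoder_calls += 1  # only incremented on a cache miss
            self._cache[image_hash] = self._encoder(image)
        return self._cache[image_hash]


def toy_encoder(image: bytes) -> List[Vector]:
    # Stand-in for a vision encoder: 3 embedding rows of width 2 per image.
    return [[float(len(image)), float(i)] for i in range(3)]


def toy_text_embed(token_id: int) -> Vector:
    # Stand-in for the LLM's token embedding table.
    return [float(token_id), 0.0]


cache = ToyEncoderCache(toy_encoder)
# One image occupies 3 placeholder positions, so the prompt grows by 3 tokens.
token_ids = [101, IMAGE_PLACEHOLDER_ID, IMAGE_PLACEHOLDER_ID,
             IMAGE_PLACEHOLDER_ID, 102]

for _step in range(4):  # simulate several steps touching the same image
    feats = cache.get("img-abc", b"fake-bytes")
    merged = merge_image_embeddings(token_ids, toy_text_embed, feats)
```

After the loop, `cache.encoder_calls` is 1 even though the image was requested four times, and `merged` has one row per token, which is why each image adds its full placeholder count to the sequence length and the KV cache.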
Key upstream files
- vllm/multimodal/
- vllm/multimodal/processing.py
- vllm/v1/core/encoder_cache_manager.py
- vllm/model_executor/models/llava.py
- vllm/model_executor/models/qwen2_vl.py
Full reference: 00-guide.md · 01-deep-dive.md