Phase 13 — Cheatsheet: Multimodal Models

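  • Pipeline: image -> vision encoder -> image embeddings -> written into the prompt's placeholder token positions -> normal LLM forward pass.
  • EncoderCacheManager reuses computed image features, so the vision encoder does not have to rerun on every step.
  • Each image expands into many tokens, which inflates sequence length and KV cache usage; profile input processing.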
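
The three points above can be illustrated with a small, self-contained sketch. This is not vLLM code: `ToyEncoderCache`, `toy_vision_encoder`, `merge_embeddings`, and `IMAGE_PLACEHOLDER_ID` are hypothetical names invented for this example, and the "encoder" is a hash-based stand-in. The sketch only shows the idea of caching encoder output by image content and writing it into placeholder positions.

```python
import hashlib
from collections import OrderedDict

IMAGE_PLACEHOLDER_ID = -1  # hypothetical placeholder token id (assumption)


def toy_vision_encoder(image_bytes, num_patches=4, dim=3):
    """Stand-in for a vision encoder: one small vector per image patch."""
    digest = hashlib.sha256(image_bytes).digest()
    return [
        [digest[p * dim + d] / 255.0 for d in range(dim)]
        for p in range(num_patches)
    ]


class ToyEncoderCache:
    """Toy LRU cache keyed by image content hash (not vLLM's implementation)."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self._store = OrderedDict()
        self.encoder_calls = 0  # counts real encoder invocations

    def get_or_encode(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]
        self.encoder_calls += 1
        feats = toy_vision_encoder(image_bytes)
        self._store[key] = feats
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used
        return feats


def merge_embeddings(token_ids, text_embed, image_feats):
    """Fill placeholder positions with image features, in order."""
    n_slots = sum(1 for t in token_ids if t == IMAGE_PLACEHOLDER_ID)
    if n_slots != len(image_feats):
        raise ValueError(
            f"{n_slots} placeholders but {len(image_feats)} image features"
        )
    feats = iter(image_feats)
    return [
        next(feats) if t == IMAGE_PLACEHOLDER_ID else text_embed(t)
        for t in token_ids
    ]


# Usage: one image occupies 4 placeholder positions in a 7-token prompt.
cache = ToyEncoderCache()
image = b"fake image bytes"
token_ids = [101, 7] + [IMAGE_PLACEHOLDER_ID] * 4 + [42]


def text_embed(token_id):
    return [float(token_id)] * 3  # toy text embedding


for _step in range(3):  # simulate three steps touching the same image
    feats = cache.get_or_encode(image)
    merged = merge_embeddings(token_ids, text_embed, feats)

print(cache.encoder_calls, len(merged))
```

In this toy run the encoder is invoked once across three steps, and the single image accounts for four of the seven sequence positions, which is the sequence-length and KV cache inflation the last bullet warns about.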

Key upstream files

  • vllm/multimodal/
  • vllm/multimodal/processing.py
  • vllm/v1/core/encoder_cache_manager.py
  • vllm/model_executor/models/llava.py
  • vllm/model_executor/models/qwen2_vl.py

Full reference: 00-guide.md · 01-deep-dive.md