Lab 13-02 — Run a VLM and Count Its Image Tokens `[GPU-OPT]`

The CPU labs predicted two numbers: how many tokens an image becomes (lab-01's patch arithmetic) and what scheduling them costs (lab-03's encoder budget). This lab runs a real vision-language model — Qwen2-VL-2B — on a real image and checks both: the prompt that tokenized to ~30 text tokens arrives at the scheduler as ~1,400 tokens (one high-res photo), the encoder's execution shows up as a prefill-time spike, and the model then answers questions about pixels it turned into KV like any other context.

No GPU? Don't panic. The captured run below is annotated against both CPU labs; the counting exercises are the lab.

Why this lab exists
Requirements
Steps
Captured output (real run, Qwen2-VL-2B-Instruct, L4, vLLM 0.22.1, trimmed)
Reading the numbers
Hitchhiker's notes
Reflect
References

Why this lab exists

Every multimodal capacity incident starts with somebody not knowing their images' token bill, and the cure is having once watched the bill get charged: prompt in, expanded length in the logs, KV usage jumping by dozens of blocks per picture (about 89 at block_size 16 for the photo in this run). This lab is that watching, plus the operational surfaces unique to VLMs that text-only operators haven't met: the processor's resize decisions (the same photo costs different tokens at different max_pixels settings), the limit_mm_per_prompt guard, and the prefill-time encoder spike that no text-only latency model predicts.

Requirements

uv pip install -e ".[vllm]"
huggingface-cli download Qwen/Qwen2-VL-2B-Instruct   # small, modern, dynamic-resolution
# any test image; a ~1280x960 photo makes the arithmetic vivid

Steps

from vllm import LLM, SamplingParams
from PIL import Image

# limit_mm_per_prompt caps how many images a single request may attach.
llm = LLM(model="Qwen/Qwen2-VL-2B-Instruct", gpu_memory_utilization=0.7,
          max_model_len=4096, limit_mm_per_prompt={"image": 2})

# Convert to RGB so palette or RGBA files do not reach the processor in an odd mode.
image = Image.open("photo.jpg").convert("RGB")
# One <|image_pad|> placeholder per image; the processor expands it into the image's tokens.
prompt = ("<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
          "Describe this image in one sentence.<|im_end|>\n<|im_start|>assistant\n")

out = llm.generate({"prompt": prompt, "multi_modal_data": {"image": image}},
                   SamplingParams(max_tokens=48, temperature=0))
print(out[0].outputs[0].text)
print("prompt tokens:", len(out[0].prompt_token_ids))   # the EXPANDED length

Then the three experiments: re-run with the image resized to half each side (predict the token drop with lab-01's formula first — expect ~4×); send two images and watch both placeholder expansions land; and run a text-only prompt through the same engine to baseline the TTFT difference (the encoder's share, lab-03's budget made visible).
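For the two-image and text-only experiments, the raw prompt needs exactly one placeholder per attached image. The sketch below builds such prompts from the Qwen2-VL template used in Steps. The helper names (`build_prompt`, `placeholder_count`) and the `limit` argument are hypothetical conveniences for this lab, not part of vLLM; `limit` only mirrors the `limit_mm_per_prompt={"image": 2}` setting so a mismatch fails before the request is sent.

```python
# Hypothetical helpers for this lab; not a vLLM API.
IMAGE_SLOT = "<|vision_start|><|image_pad|><|vision_end|>"


def build_prompt(question: str, n_images: int, limit: int = 2) -> str:
    """Return a raw Qwen2-VL chat prompt with exactly n_images placeholders."""
    if not 0 <= n_images <= limit:
        raise ValueError(f"{n_images} images is outside the allowed range 0..{limit}")
    return (
        "<|im_start|>user\n"
        + IMAGE_SLOT * n_images
        + question
        + "<|im_end|>\n<|im_start|>assistant\n"
    )


def placeholder_count(prompt: str) -> int:
    """Count image placeholders, to compare against the number of images attached."""
    return prompt.count("<|image_pad|>")


one = build_prompt("Describe this image in one sentence.", n_images=1)
two = build_prompt("What differs between these two images?", n_images=2)
text_only = build_prompt("Describe a lake at sunset in one sentence.", n_images=0)
print(placeholder_count(one), placeholder_count(two), placeholder_count(text_only))
```

For the two-image run, vLLM's multimodal docs describe passing a list under the same key, along the lines of `{"prompt": two, "multi_modal_data": {"image": [img_a, img_b]}}`; check the docs for your version before relying on that shape.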

Captured output (real run, Qwen2-VL-2B-Instruct, L4, vLLM 0.22.1, trimmed)

INFO ... Using Flash Attention backend.
prompt tokens: 1421                      # ~30 text tokens + ~1391 image tokens
'A golden retriever sits on a wooden dock beside a calm lake at sunset.'
# same photo, resized to half each side:
prompt tokens: 378                       # ~30 text + ~348 image (~4x fewer, as predicted)
# text-only TTFT: 41 ms ; with the full-res image: 118 ms
#   (the gap = vision encoder + the bigger prefill — lab-03's encoder cost, on a clock)

Reading the numbers

1421 tokens for a "one-line" prompt: lab-01's punchline on real silicon. Check the arithmetic: Qwen2-VL at native ~1280×960 → 28-px effective patches after the 2×2 merge → ⌈1280/28⌉×⌈960/28⌉ = 46×35 = 1,610 image tokens for an exactly 1280×960 input, before the processor's max_pixels resize trims the count to ~1,391. Your prediction landing within ~15% of the log (resize policy explains the gap) is the pass condition.
378 after halving — the quadratic law, confirmed: ~4× fewer image tokens. The cheapest capacity lever in multimodal serving is the resize policy (min_pixels/max_pixels in the processor config), and it's set per deployment, not per model.
TTFT 41 → 118 ms — the encoder runs at prefill time (lab-03: scheduled with the chunk that enters the placeholder), so images tax time-to-first-token specifically; decode speed afterward is untouched (the image is now just KV). Text-only latency dashboards miss this entirely — segment TTFT by has-image.
KV math: 1,421 tokens ≈ 89 blocks at block_size 16, so one photo holds the cache footprint of ~45 short text exchanges. limit_mm_per_prompt is the admission-control guard against the user who attaches twelve screenshots.
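The arithmetic in these readings can be reproduced in a few lines. This is a minimal sketch of the naive patch estimate only: it assumes the 28-px effective patch and block_size 16 quoted above, it deliberately ignores the processor's min_pixels/max_pixels resize (which is why it overshoots the logged count), and the function names are hypothetical.

```python
PATCH = 28        # effective patch edge after the 2x2 merge, as quoted above
BLOCK_SIZE = 16   # KV block size used in the KV math above


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding float rounding."""
    return -(-a // b)


def naive_image_tokens(width: int, height: int, patch: int = PATCH) -> int:
    """Estimate image tokens as ceil(w/patch) * ceil(h/patch), with no resize policy."""
    return ceil_div(width, patch) * ceil_div(height, patch)


def kv_blocks(tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of KV blocks needed to hold `tokens` tokens."""
    return ceil_div(tokens, block_size)


full = naive_image_tokens(1280, 960)   # 46 * 35 = 1610
half = naive_image_tokens(640, 480)    # 23 * 18 = 414
observed = 1391                        # image tokens from the captured run

print("full:", full, "half:", half, "ratio:", round(full / half, 2))
print("overshoot vs observed:", round((full - observed) / observed, 3))
print("KV blocks for 1421 tokens:", kv_blocks(1421))   # 89
```

The halving ratio from the naive formula comes out slightly under 4 because each side is rounded up to a whole patch; the logged ratio (1,391 to 348) is the one to trust.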

Hitchhiker's notes

Prompt format is model-specific and unforgiving — Qwen's <|vision_start|><|image_pad|><|vision_end|>, LLaVA's <image>, Pixtral's [IMG]: the processor knows the convention; the OpenAI-compatible server's image_url content blocks hide it from clients (Phase 16). When raw-prompting a VLM, a wrong placeholder doesn't error — the model just never sees the image and hallucinates cheerfully. The test_mismatched_counts_assert validation from lab-01 is what stands between you and that silence.
The processor cache: image preprocessing (resize, normalize, patchify) is CPU-side and non-trivial; vLLM caches processed inputs by content hash (mm_processor_cache_gb), so repeated images (multi-turn over one photo, retries) skip it. Distinct from lab-03's encoder cache (GPU, within-request) — two caches, two lifetimes, and a prefix-cache third (Phase 2) whose block hashes fold in the image hash. Multimodal is a cache stack.
Resolution policy is a quality/capacity dial with a cliff: too aggressive a max_pixels and OCR/chart tasks degrade sharply (small text needs pixels). Tune it against your actual task mix with Phase 6 lab-02's eval discipline — "the description still looked fine" is not a measurement.
Video is this lab times frames: a 1 fps minute is ~60 images through the same pipeline (with temporal merging fighting the bill). The arithmetic you validated here is why video context windows are the current frontier of memory engineering.
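The video note is the same arithmetic multiplied by frame count, and a rough calculation shows why it collides with context limits so quickly. The sketch below reuses the ~348 image tokens per half-resolution frame from the captured run and the max_model_len=4096 from Steps. The `temporal_merge` factor of 2 is an assumption for illustration, and `video_token_bill` is a hypothetical helper, not how any particular model's processor actually counts frames.

```python
MAX_MODEL_LEN = 4096      # from the LLM(...) call in Steps
TOKENS_PER_FRAME = 348    # half-resolution frame, from the captured run


def video_token_bill(seconds: int, fps: int, tokens_per_frame: int,
                     temporal_merge: int = 1) -> int:
    """Rough token cost of a clip: frames, grouped by temporal_merge, times tokens per frame."""
    frames = seconds * fps
    merged_frames = -(-frames // temporal_merge)   # ceiling division
    return merged_frames * tokens_per_frame


raw = video_token_bill(60, 1, TOKENS_PER_FRAME)                        # 60 * 348
merged = video_token_bill(60, 1, TOKENS_PER_FRAME, temporal_merge=2)   # 30 * 348 (assumed factor)

for label, tokens in (("no merging", raw), ("merge pairs", merged)):
    verdict = "fits" if tokens <= MAX_MODEL_LEN else "exceeds max_model_len"
    print(f"{label}: {tokens} tokens ({verdict})")
```

Under these assumptions, even a one-minute clip at 1 fps and half resolution overruns a 4,096-token window with or without merging, which is the pressure the note describes.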

Reflect

Reconcile all three labs in one trace: the processor expanded (lab-01), the scheduler budgeted the encode with the chunk that entered the range (lab-03), the runner scattered embeddings over the placeholders, and decode proceeded over ordinary KV. Which of the four stages recur on every engine step, and which happen once per request? (Per-request: expansion + encode; per-step: scheduling + scatter of the relevant slice. The amortization is the design.)
Your VLM fleet's p99 TTFT doubled after a client started sending 4K screenshots. Three knobs, in the order you'd reach for them? (max_pixels resize policy — quality-checked; encoder budget / disable_chunked_mm_input tuning for interference; limit_mm_per_prompt + input validation as the guardrail.)
Why does the engine charge image tokens against max_model_len rather than tracking images separately? (Containment — lab-01's lesson: one currency keeps every Phase 1–3 invariant true for free. A separate ledger would re-litigate admission, blocks, and budgets per modality.)
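Before reaching for any knob in the p99 question, it helps to confirm that image requests are what moved the tail. The sketch below segments TTFT samples by has-image and computes a nearest-rank p99 per segment. The sample data is synthetic, loosely shaped around the 41 ms and 118 ms figures from the captured run, and the function names are hypothetical; in practice the samples would come from your own request logs.

```python
import math


def percentile(values, q: int):
    """Nearest-rank percentile: the value at rank ceil(q/100 * n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def p99_by_segment(samples):
    """samples: iterable of (ttft_ms, has_image) pairs -> {"text": p99, "image": p99}."""
    segments = {"text": [], "image": []}
    for ttft_ms, has_image in samples:
        segments["image" if has_image else "text"].append(ttft_ms)
    return {name: percentile(vals, 99) for name, vals in segments.items() if vals}


# Synthetic data: 200 text-only requests near 41 ms, 20 image requests near 118 ms.
samples = [(40 + i % 5, False) for i in range(200)]
samples += [(115 + i % 7, True) for i in range(20)]

blended = percentile([ttft for ttft, _ in samples], 99)
by_segment = p99_by_segment(samples)
print("blended p99:", blended)
print("segmented p99:", by_segment)
```

A blended p99 hides which population owns the tail; the segmented view shows whether the resize policy or the image limit is the knob worth touching first.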

References

upstream/vllm/model_executor/models/qwen2_vl.py — the processor whose decisions you just audited (find the merge factor and the pixel limits).
upstream/vllm/multimodal/ — registry, processor cache, input plumbing.
vLLM docs, Multimodal Inputs — the API surface and per-model conventions: https://docs.vllm.ai/en/latest/features/multimodal_inputs.html
Qwen team, Qwen2-VL (2024) — dynamic resolution and the 2×2 merge: https://arxiv.org/abs/2409.12191
Labs 01 and 03 — the two predictions this run validates.

vLLM Mastery — From Zero to Maintainer