Phase 13 — Mini-Build: a fake-image pipeline for `mini_vllm`

Your task
Why build it (and not just read it)
The spec
Method
Definition of done
Map back to the real engine

Your task

Teach mini_vllm to serve a request that carries a fake "image": expand a placeholder into N synthetic image tokens, run a toy encoder (deterministic function of the image bytes), splice the embeddings, and cache encoder outputs by content hash so the same image is never encoded twice.

Why build it (and not just read it)

Reading the real feature tells you what production does. Re-implementing a tiny version tells you why every decision was made — which is the understanding that survives into an interview or a 2 a.m. incident. Keep it small; keep it tested.

The spec

Request extension: Request may carry images: list[bytes] and a prompt containing the marker token <IMG>. A processing step expands each marker to num_image_tokens(image) placeholder token IDs and records PlaceholderRange(offset, length) — your own tiny dataclass. Make num_image_tokens = (len(image_bytes) // 64) + 1 so "resolution" varies (the dynamic-resolution lesson in one line).
Toy encoder: encode(image_bytes) -> np.ndarray of shape [num_image_tokens(image_bytes), d], deterministic (seed an RNG from the content hash). Pretend it's expensive: count invocations.
Encoder cache: dict keyed by sha256(image_bytes), with per-request reference sets and an LRU freeable list, with capacity measured in embeddings — a 40-line EncoderCacheManager mirroring upstream's cached/freeable/freed trio.
The splice: in the (fake) forward, inputs_embeds[is_image_position] = cached_embeddings — assert the count contract and raise the upstream-style "X multimodal tokens to Y placeholders" error on mismatch.
Scheduler touch: image tokens must pass through your Phase-3 scheduler as ordinary tokens (KV blocks allocated, token budget consumed). If you did lab-03, optionally bolt on the per-step encoder budget + truncate-at-the-doorstep rule.

Method

Re-read encoder_cache_manager.py:17 (its docstring is the design doc) and models/utils.py:456 (the splice).
Build processor → encoder → cache → splice in that order; test each before the next.
Run pytest mini_vllm -q and keep it green.
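For the cache step in that build order, a minimal sketch of the bookkeeping might look like the following. It is an assumption-laden toy: the method names lookup, allocate, release, and take_freed are hypothetical (they are not upstream's API), the manager tracks only accounting, and the actual embedding arrays are assumed to live in a separate dict keyed by the same hash.

```python
from collections import OrderedDict


class EncoderCacheManager:
    """Toy accounting for cached encoder outputs (hypothetical method names)."""

    def __init__(self, capacity: int):
        self.num_free = capacity      # embedding slots not held by any entry
        self.cached = {}              # hash -> set of request IDs referencing it
        self.sizes = {}               # hash -> number of embeddings it occupies
        self.freeable = OrderedDict() # unreferenced entries, oldest first
        self.freed = []               # hashes evicted since the last drain

    def lookup(self, mm_hash: str, request_id: str) -> bool:
        """Return True on a hit and add a reference for request_id."""
        if mm_hash not in self.cached:
            return False
        # A hit on an unreferenced entry rescues it from eviction.
        self.freeable.pop(mm_hash, None)
        self.cached[mm_hash].add(request_id)
        return True

    def allocate(self, mm_hash: str, request_id: str, num_embeds: int) -> bool:
        """Reserve space for a new entry; call only after lookup() missed."""
        if num_embeds > self.num_free + sum(self.freeable.values()):
            return False  # even evicting everything unreferenced is not enough
        while num_embeds > self.num_free:
            victim, size = self.freeable.popitem(last=False)  # oldest first
            del self.cached[victim]
            del self.sizes[victim]
            self.freed.append(victim)
            self.num_free += size
        self.cached[mm_hash] = {request_id}
        self.sizes[mm_hash] = num_embeds
        self.num_free -= num_embeds
        return True

    def release(self, mm_hash: str, request_id: str) -> None:
        """Drop one reference; an unreferenced entry becomes evictable."""
        refs = self.cached.get(mm_hash)
        if refs is None:
            return
        refs.discard(request_id)
        if not refs:
            self.freeable[mm_hash] = self.sizes[mm_hash]

    def take_freed(self) -> list:
        """Return and clear the hashes evicted since the last call."""
        freed, self.freed = self.freed, []
        return freed
```

Note that finishing a request only moves an entry to the freeable list; the entry is actually dropped when a later allocation needs the space, so a repeat of the same image in between is still a hit.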

Definition of done

CPU only, numpy only.
A test proves expansion arithmetic: a prompt with 2 images of different sizes yields correct total length and two correct PlaceholderRanges.
A test proves cache sharing: two requests, same image bytes → encoder invoked once; different bytes (same length!) → invoked twice. This is the content-hash lesson.
A test proves the contract: corrupt the expansion count and assert the mismatch error fires.
A test proves eviction: with capacity for one image's embeddings, finish request A, then admit B with a new image → A's entry is evicted (and its hash reported as freed) rather than B being rejected.
You can say out loud where yours simplifies: no real ViT, no projector dim-matching, no is_embed masks, no chunked-prefill interaction unless you added it.
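The contract test above needs a splice that fails loudly. One way it could look is sketched below, assuming placeholders are marked by a hypothetical IMAGE_TOKEN_ID and that image_embeds is a list of per-image arrays in prompt order. The function name merge_image_embeddings and the exact error wording are illustrative; only the "X multimodal tokens to Y placeholders" shape comes from the spec.

```python
import numpy as np

IMAGE_TOKEN_ID = -1  # hypothetical placeholder ID used by the processor


def merge_image_embeddings(inputs_embeds, token_ids, image_embeds):
    """Write image embeddings into the rows that hold placeholder tokens."""
    is_image = np.asarray(token_ids) == IMAGE_TOKEN_ID
    num_placeholders = int(is_image.sum())
    if image_embeds:
        flat = np.concatenate(image_embeds, axis=0)
    else:
        flat = np.empty((0, inputs_embeds.shape[1]), dtype=inputs_embeds.dtype)
    # The count contract: one embedding row per placeholder, no more, no less.
    if flat.shape[0] != num_placeholders:
        raise ValueError(
            f"Attempted to assign {flat.shape[0]} multimodal tokens "
            f"to {num_placeholders} placeholders"
        )
    out = inputs_embeds.copy()  # leave the caller's text embeddings untouched
    out[is_image] = flat        # boolean-mask assignment fills rows in order
    return out
```

In a pytest test you would typically wrap the corrupted call in pytest.raises(ValueError, match="multimodal tokens") and feed it an embedding list with one row too many or too few.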

Map back to the real engine

Yours	Upstream
marker expansion + range	`_get_prompt_updates` (`llava.py:264`) + `PlaceholderRange` (`inputs.py:119`)
`sha256(image_bytes)`	`MultiModalHasher.hash_kwargs` (`hasher.py:154`)
cache dict + refs + LRU	`EncoderCacheManager` (`encoder_cache_manager.py:17`)
splice + count assert	`_merge_multimodal_embeddings` (`models/utils.py:456`)
encoder budget rule (optional)	`_try_schedule_encoder_inputs` (`scheduler.py:1096`)

vLLM Mastery — From Zero to Maintainer