Phase 13 — Mini-Build: a fake-image pipeline for mini_vllm
Contents
- Your task
- Why build it (and not just read it)
- The spec
- Method
- Definition of done
- Map back to the real engine
Your task
Teach mini_vllm to serve a request that carries a fake "image": expand a placeholder into
N synthetic image tokens, run a toy encoder (deterministic function of the image bytes),
splice the embeddings, and cache encoder outputs by content hash so the same image is never
encoded twice.
Why build it (and not just read it)
Reading the real feature tells you what production does. Re-implementing a tiny version tells you why every decision was made — which is the understanding that survives into an interview or a 2 a.m. incident. Keep it small; keep it tested.
The spec
- Request extension: `Request` may carry `images: list[bytes]` and a prompt containing the marker token `<IMG>`. A processing step expands each marker to `num_image_tokens(image)` placeholder token IDs and records `PlaceholderRange(offset, length)`, your own tiny dataclass. Make `num_image_tokens = (len(image_bytes) // 64) + 1` so "resolution" varies (the dynamic-resolution lesson in one line).
- Toy encoder: `encode(image_bytes) -> np.ndarray[length, d]`, deterministic (seed a RNG from the content hash). Pretend it's expensive: count invocations.
- Encoder cache: dict keyed by `sha256(image_bytes)`, with per-request reference sets and an LRU `freeable` list with a capacity in embeddings: a 40-line `EncoderCacheManager` mirroring upstream's `cached`/`freeable`/`freed` trio.
- The splice: in the (fake) forward, `inputs_embeds[is_image_position] = cached_embeddings`; assert the count contract and raise the upstream-style "X multimodal tokens to Y placeholders" error on mismatch.
- Scheduler touch: image tokens must pass through your Phase-3 scheduler as ordinary tokens (KV blocks allocated, token budget consumed). If you did lab-03, optionally bolt on the per-step encoder budget + truncate-at-the-doorstep rule.
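
The first two spec items (marker expansion and the toy encoder) can be sketched as below. This is one possible shape, not a reference solution: the token IDs `IMG_MARKER` and `IMG_PLACEHOLDER`, the width `D_MODEL`, the function name `expand_prompt`, and the module-level counter `encode_calls` are all hypothetical names chosen for illustration. Only `PlaceholderRange(offset, length)`, `encode`, and the `num_image_tokens` formula come from the spec.

```python
import hashlib
from dataclasses import dataclass

import numpy as np

IMG_MARKER = -1       # hypothetical ID standing in for the <IMG> marker token
IMG_PLACEHOLDER = -2  # hypothetical ID written at every expanded image position
D_MODEL = 8           # hypothetical embedding width for the toy encoder


@dataclass(frozen=True)
class PlaceholderRange:
    offset: int  # index of the first placeholder token in the expanded prompt
    length: int  # number of placeholder tokens for this image


def num_image_tokens(image: bytes) -> int:
    # Formula from the spec: longer byte strings act like higher resolution.
    return (len(image) // 64) + 1


def expand_prompt(token_ids, images):
    """Replace each marker with placeholder tokens and record where they land."""
    expanded, ranges = [], []
    remaining = iter(images)
    for tok in token_ids:
        if tok != IMG_MARKER:
            expanded.append(tok)
            continue
        image = next(remaining, None)
        if image is None:
            raise ValueError("more <IMG> markers than images")
        n = num_image_tokens(image)
        ranges.append(PlaceholderRange(offset=len(expanded), length=n))
        expanded.extend([IMG_PLACEHOLDER] * n)
    if next(remaining, None) is not None:
        raise ValueError("more images than <IMG> markers")
    return expanded, ranges


encode_calls = 0  # invocation counter: the encoder is pretend-expensive


def encode(image: bytes) -> np.ndarray:
    """Deterministic toy encoder: the RNG seed is derived from the content hash."""
    global encode_calls
    encode_calls += 1
    seed = int.from_bytes(hashlib.sha256(image).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    rows = num_image_tokens(image)
    return rng.standard_normal((rows, D_MODEL)).astype(np.float32)
```

With images of 10 and 130 bytes, `expand_prompt([1, IMG_MARKER, 2, IMG_MARKER, 3], [small, large])` produces 7 tokens and the ranges `(offset=1, length=1)` and `(offset=3, length=3)`, which is the arithmetic the first test in the definition of done should pin down.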
Method
- Re-read `encoder_cache_manager.py:17` (docstring is the design doc) and `models/utils.py:456` (the splice).
- Build processor → encoder → cache → splice in that order; test each before the next.
- `pytest mini_vllm -q` and keep it green.
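
The last stage in that build order, the splice with its count contract, might look like the sketch below. The function name `merge_image_embeddings` and the placeholder ID are hypothetical, and the exact error wording is an assumption modelled on the "X multimodal tokens to Y placeholders" phrase rather than copied from upstream.

```python
import numpy as np

IMG_PLACEHOLDER = -2  # hypothetical placeholder token ID


def merge_image_embeddings(inputs_embeds, token_ids, image_embeds):
    """Overwrite rows at image positions with encoder outputs, in prompt order.

    inputs_embeds: array of shape [num_tokens, d]
    image_embeds:  list of arrays, each of shape [length_i, d]
    """
    is_image_position = np.asarray(token_ids) == IMG_PLACEHOLDER
    num_placeholders = int(is_image_position.sum())
    if image_embeds:
        flat = np.concatenate(image_embeds, axis=0)
    else:
        flat = np.empty((0, inputs_embeds.shape[1]), dtype=inputs_embeds.dtype)
    # Count contract: one embedding row per placeholder token, no more, no less.
    if flat.shape[0] != num_placeholders:
        raise ValueError(
            f"Attempted to assign {flat.shape[0]} multimodal tokens "
            f"to {num_placeholders} placeholders"
        )
    merged = inputs_embeds.copy()  # leave the caller's text embeddings untouched
    merged[is_image_position] = flat
    return merged
```

Checking the count before the assignment is a deliberate choice in this sketch: the mismatch then surfaces as a message naming both numbers instead of a numpy shape error.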
Definition of done
- CPU only, numpy only.
- A test proves expansion arithmetic: a prompt with 2 images of different sizes yields correct total length and two correct `PlaceholderRange`s.
- A test proves cache sharing: two requests, same image bytes → encoder invoked once; different bytes (same length!) → invoked twice. This is the content-hash lesson.
- A test proves the contract: corrupt the expansion count and assert the mismatch error fires.
- A test proves eviction: capacity for one image's embeddings; finish request A, admit B with a new image → A's entry evicted (and its hash reported freed), not B rejected.
- You can say out loud where yours simplifies: no real ViT, no projector dim-matching, no `is_embed` masks, no chunked-prefill interaction unless you added it.
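
One possible cache that would satisfy the sharing and eviction checks above is sketched here. The method names `get_or_encode`, `free_request`, and `drain_freed`, and the choice to pass `num_tokens` in so capacity can be checked before encoding, are assumptions of this sketch; upstream's `EncoderCacheManager` has its own interface. Only the `cached`/`freeable`/`freed` trio and the content-hash key are taken from the spec.

```python
import hashlib
from collections import OrderedDict

import numpy as np


class EncoderCacheManager:
    """Toy content-addressed cache; capacity is counted in embedding rows."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.used = 0
        self.embeds = {}               # hash -> np.ndarray of encoder outputs
        self.cached = {}               # hash -> set of request IDs still using it
        self.freeable = OrderedDict()  # hash -> rows; unreferenced, oldest first
        self.freed = []                # hashes evicted since the last drain

    def get_or_encode(self, request_id, image, num_tokens, encode_fn):
        """Return embeddings, encoding only on a miss; None if it cannot fit."""
        key = hashlib.sha256(image).hexdigest()
        if key in self.cached:
            self.freeable.pop(key, None)  # referenced again, so not evictable
            self.cached[key].add(request_id)
            return self.embeds[key]
        reclaimable = sum(self.freeable.values())
        if num_tokens > self.capacity - self.used + reclaimable:
            return None  # would not fit even after evicting everything freeable
        while self.used + num_tokens > self.capacity:
            old_key, old_rows = self.freeable.popitem(last=False)
            del self.embeds[old_key], self.cached[old_key]
            self.used -= old_rows
            self.freed.append(old_key)
        out = encode_fn(image)
        assert out.shape[0] == num_tokens  # sketch assumes the encoder honours this
        self.embeds[key] = out
        self.cached[key] = {request_id}
        self.used += num_tokens
        return out

    def free_request(self, request_id):
        """Drop a finished request's references; orphaned entries become freeable."""
        for key, refs in self.cached.items():
            if request_id in refs:
                refs.discard(request_id)
                if not refs:
                    self.freeable[key] = self.embeds[key].shape[0]

    def drain_freed(self):
        """Report hashes evicted since the last call, then reset the list."""
        freed, self.freed = self.freed, []
        return freed


# Usage: capacity for exactly one single-row image.
calls = []

def toy_encode(image):  # hypothetical stand-in for the counted toy encoder
    calls.append(image)
    return np.zeros((len(image) // 64 + 1, 4), dtype=np.float32)

cache = EncoderCacheManager(capacity=1)
img_a, img_b = b"a" * 8, b"b" * 8  # same length, different content
cache.get_or_encode("req-A", img_a, 1, toy_encode)
cache.free_request("req-A")        # A's entry is now freeable, not yet freed
admitted = cache.get_or_encode("req-B", img_b, 1, toy_encode)
assert admitted is not None        # B is admitted by evicting A's entry
assert cache.drain_freed() == [hashlib.sha256(img_a).hexdigest()]
```

Entries stay in the cache after their last reference drops and are only evicted when space is needed, so a later request carrying the same bytes can still hit.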
Map back to the real engine
| Yours | Upstream |
|---|---|
| marker expansion + range | _get_prompt_updates (llava.py:264) + PlaceholderRange (inputs.py:119) |
| sha256(image_bytes) | MultiModalHasher.hash_kwargs (hasher.py:154) |
| cache dict + refs + LRU | EncoderCacheManager (encoder_cache_manager.py:17) |
| splice + count assert | _merge_multimodal_embeddings (models/utils.py:456) |
| encoder budget rule (optional) | _try_schedule_encoder_inputs (scheduler.py:1096) |