Phase 13 — Mini-Build: a fake-image pipeline for mini_vllm


Your task

Teach mini_vllm to serve a request that carries a fake "image": expand a placeholder into N synthetic image tokens, run a toy encoder (deterministic function of the image bytes), splice the embeddings, and cache encoder outputs by content hash so the same image is never encoded twice.

Why build it (and not just read it)

Reading the real feature tells you what production does. Re-implementing a tiny version tells you why every decision was made — which is the understanding that survives into an interview or a 2 a.m. incident. Keep it small; keep it tested.

The spec

  1. Request extension: a Request may carry images: list[bytes] and a prompt containing the marker token <IMG>. A processing step expands each marker into num_image_tokens(image) placeholder token IDs and records a PlaceholderRange(offset, length), your own tiny dataclass. Define num_image_tokens(image) = len(image) // 64 + 1 so that "resolution" varies with image size (the dynamic-resolution lesson in one line): a 10-byte image expands to 1 token, a 130-byte image to 3.
  2. Toy encoder: encode(image_bytes) -> np.ndarray[length, d], where length = num_image_tokens(image_bytes). It must be deterministic (seed an RNG from the content hash). Pretend it's expensive: count invocations.
  3. Encoder cache: a dict keyed by sha256(image_bytes), with per-request reference sets and an LRU freeable list, with capacity measured in embeddings (one per image token). Aim for a 40-line EncoderCacheManager mirroring upstream's cached/freeable/freed trio.
  4. The splice: in the (fake) forward, inputs_embeds[is_image_position] = cached_embeddings — assert the count contract and raise the upstream-style "X multimodal tokens to Y placeholders" error on mismatch.
  5. Scheduler touch: image tokens must pass through your Phase-3 scheduler as ordinary tokens (KV blocks allocated, token budget consumed). If you did lab-03, optionally bolt on the per-step encoder budget + truncate-at-the-doorstep rule.
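
Spec items 1 and 2 can be sketched as follows. This is a minimal sketch, not a reference solution: the token IDs IMG_MARKER and IMG_PLACEHOLDER, the width D_MODEL, and the names expand_prompt and ToyEncoder are all hypothetical choices made for illustration, and the prompt is assumed to be already tokenized into a list of ints.

```python
import hashlib
from dataclasses import dataclass

import numpy as np

IMG_MARKER = -1       # hypothetical token ID standing in for <IMG>
IMG_PLACEHOLDER = -2  # hypothetical ID written at every expanded image position
D_MODEL = 8           # hypothetical embedding width


@dataclass(frozen=True)
class PlaceholderRange:
    offset: int  # index of the first placeholder in the expanded prompt
    length: int  # number of placeholder tokens for this image


def num_image_tokens(image_bytes: bytes) -> int:
    # Token count grows with image size, so "resolution" varies.
    return len(image_bytes) // 64 + 1


def expand_prompt(token_ids, images):
    """Replace each marker with placeholder IDs; return (tokens, ranges)."""
    out, ranges = [], []
    remaining = iter(images)
    for tok in token_ids:
        if tok != IMG_MARKER:
            out.append(tok)
            continue
        image = next(remaining, None)
        if image is None:
            raise ValueError("more <IMG> markers than images")
        n = num_image_tokens(image)
        ranges.append(PlaceholderRange(offset=len(out), length=n))
        out.extend([IMG_PLACEHOLDER] * n)
    if next(remaining, None) is not None:
        raise ValueError("more images than <IMG> markers")
    return out, ranges


class ToyEncoder:
    """Deterministic stand-in for a vision encoder; counts its invocations."""

    def __init__(self, d_model: int = D_MODEL):
        self.d_model = d_model
        self.calls = 0

    def encode(self, image_bytes: bytes) -> np.ndarray:
        self.calls += 1
        # Seed the RNG from the content hash: same bytes, same embeddings.
        digest = hashlib.sha256(image_bytes).digest()
        rng = np.random.default_rng(int.from_bytes(digest[:8], "little"))
        shape = (num_image_tokens(image_bytes), self.d_model)
        return rng.standard_normal(shape).astype(np.float32)
```

With these definitions, expand_prompt([1, IMG_MARKER, 2, IMG_MARKER], [b"a" * 10, b"b" * 130]) returns a 6-token prompt and the ranges PlaceholderRange(1, 1) and PlaceholderRange(3, 3), which is the arithmetic the expansion test needs to pin down.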

Method

  1. Re-read encoder_cache_manager.py:17 (its docstring is the design doc) and models/utils.py:456 (the splice).
  2. Build processor → encoder → cache → splice in that order; test each before the next.
  3. Run pytest mini_vllm -q and keep it green.
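
For the cache stage of that build order, one possible shape is sketched below. It is an illustration under stated assumptions, not upstream's API: the method names get_or_encode, release, and take_freed are hypothetical, capacity is counted in embedding rows, and the caller is assumed to know the row count (num_image_tokens) before encoding so that room can be reserved before the "expensive" call.

```python
import hashlib
from collections import OrderedDict

import numpy as np


class EncoderCacheManager:
    def __init__(self, capacity: int):
        self.capacity = capacity  # budget, counted in embedding rows
        self.used = 0             # rows currently stored
        self.store = {}           # content hash -> cached embeddings
        self.refs = {}            # content hash -> set of request IDs using it
        self.freeable = OrderedDict()  # unreferenced hash -> rows, oldest first
        self.freed = []           # hashes evicted since the last take_freed()

    @staticmethod
    def key(image_bytes: bytes) -> str:
        return hashlib.sha256(image_bytes).hexdigest()

    def get_or_encode(self, request_id, image_bytes, num_rows, encode_fn):
        h = self.key(image_bytes)
        if h in self.store:
            self.freeable.pop(h, None)  # entry is in use again
            self.refs[h].add(request_id)
            return self.store[h]
        self._make_room(num_rows)       # reserve before paying for the encode
        emb = encode_fn(image_bytes)
        assert emb.shape[0] == num_rows, "encoder output disagrees with num_rows"
        self.store[h] = emb
        self.refs[h] = {request_id}
        self.used += num_rows
        return emb

    def _make_room(self, n: int) -> None:
        reclaimable = sum(self.freeable.values())
        if self.used - reclaimable + n > self.capacity:
            raise RuntimeError("encoder cache full: remaining entries are still referenced")
        while self.used + n > self.capacity:
            victim, size = self.freeable.popitem(last=False)  # least recently freed
            del self.store[victim]
            del self.refs[victim]
            self.used -= size
            self.freed.append(victim)

    def release(self, request_id, image_bytes) -> None:
        """Drop one request's reference; unreferenced entries become evictable."""
        h = self.key(image_bytes)
        holders = self.refs.get(h)
        if holders is None:
            return
        holders.discard(request_id)
        if not holders and h not in self.freeable:
            self.freeable[h] = self.store[h].shape[0]

    def take_freed(self):
        """Return and clear the list of hashes evicted so far."""
        freed, self.freed = self.freed, []
        return freed
```

Note that releasing the last reference does not delete the entry; it only makes it evictable. That lazy eviction is what lets a later request with the same image still hit the cache, and it is also what the eviction test exercises: the entry disappears only when a new image needs the room.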

Definition of done

  • CPU only, numpy only.
  • A test proves expansion arithmetic: a prompt with 2 images of different sizes yields correct total length and two correct PlaceholderRanges.
  • A test proves cache sharing: two requests, same image bytes → encoder invoked once; different bytes (same length!) → invoked twice. This is the content-hash lesson.
  • A test proves the contract: corrupt the expansion count and assert the mismatch error fires.
  • A test proves eviction: capacity for one image's embeddings; finish request A, admit B with a new image → A's entry is evicted (and its hash reported as freed) rather than B being rejected.
  • You can say out loud where yours simplifies: no real ViT, no projector dim-matching, no is_embed masks, no chunked-prefill interaction unless you added it.
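
The contract test from the list above might look roughly like this. It is a hedged sketch: splice and IMG_PLACEHOLDER are hypothetical names, the error wording is only modeled on the "X multimodal tokens to Y placeholders" phrasing quoted in the spec, and the test uses plain try/except so it runs with or without pytest.

```python
import numpy as np

IMG_PLACEHOLDER = -2  # hypothetical placeholder token ID


def splice(token_ids, text_embeds, image_embeds):
    """Return a copy of text_embeds with placeholder rows replaced by encoder outputs."""
    is_image = np.asarray(token_ids) == IMG_PLACEHOLDER
    if image_embeds:
        flat = np.concatenate(image_embeds, axis=0)
    else:
        flat = np.empty((0, text_embeds.shape[1]), dtype=text_embeds.dtype)
    n_placeholders = int(is_image.sum())
    # The count contract: one embedding row per placeholder position.
    if flat.shape[0] != n_placeholders:
        raise ValueError(
            f"Attempted to assign {flat.shape[0]} multimodal tokens "
            f"to {n_placeholders} placeholders"
        )
    out = text_embeds.copy()
    out[is_image] = flat  # boolean-mask assignment, in prompt order
    return out


def test_count_contract():
    tokens = [5, IMG_PLACEHOLDER, IMG_PLACEHOLDER, 7]
    text = np.zeros((4, 3), dtype=np.float32)

    # Matching counts: the two placeholder rows are overwritten.
    out = splice(tokens, text, [np.ones((2, 3), dtype=np.float32)])
    assert out[1:3].sum() == 6.0
    assert out[0].sum() == 0.0 and out[3].sum() == 0.0

    # Corrupted count: three rows for two placeholders must fail loudly.
    try:
        splice(tokens, text, [np.ones((3, 3), dtype=np.float32)])
    except ValueError as exc:
        assert "3 multimodal tokens to 2 placeholders" in str(exc)
    else:
        raise AssertionError("count mismatch went undetected")
```

Checking the count before the assignment matters because the mismatch would otherwise surface as a generic numpy shape error, far from the processing step that actually produced the wrong expansion.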

Map back to the real engine

Yours                            Upstream
marker expansion + range         _get_prompt_updates (llava.py:264) + PlaceholderRange (inputs.py:119)
sha256(image_bytes)              MultiModalHasher.hash_kwargs (hasher.py:154)
cache dict + refs + LRU          EncoderCacheManager (encoder_cache_manager.py:17)
splice + count assert            _merge_multimodal_embeddings (models/utils.py:456)
encoder budget rule (optional)   _try_schedule_encoder_inputs (scheduler.py:1096)