Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 12 — Mini-Build: a grammar mask for mini_vllm

Contents


Your task

Build mini_vllm/grammar.py: a regex-FSM grammar that produces a per-step allowed-token mask and plugs into the mini engine as a logits processor — so a generation literally cannot emit a string that violates the regex.

A reference implementation ships in mini_vllm/grammar.py with tests in mini_vllm/test_grammar.py. Build yours first; compare after.

Why build it (and not just read it)

Reading the real feature tells you what production does. Re-implementing a tiny version tells you why every decision was made — which is the understanding that survives into an interview or a 2 a.m. incident. Keep it small; keep it tested.

The spec

Mirror the upstream contract from backend_types.py, shrunk to its essence:

class RegexGrammar:
    """Compile once; one instance of state per request."""
    def __init__(self, pattern: str, vocab: dict[int, str]): ...
    def allowed_token_mask(self) -> "np.ndarray":  # bool[vocab_size]
        """True where emitting the token keeps a path to a match alive."""
    def accept_token(self, token_id: int) -> bool:   # advance; False if illegal
    def rollback(self, n: int) -> None:              # un-advance n tokens (spec decode!)
    def is_terminated(self) -> bool:                 # matched a full accepting state

Constraints that make it honest:

  1. Compile the regex to an explicit FSM yourself (subset is fine: literals, [...] classes, |, *, +, ?, digits — no need for full PCRE). re may be used only as a test oracle, never inside the mask.
  2. A token (multi-char string) is allowed iff feeding its chars through the FSM from the current state stays alive. Cache per-state token masks after first computation — that's xgrammar's compile-time trick in miniature.
  3. rollback must restore the exact state — keep a state-stack history.
  4. Wire it into mini_vllm/engine.py's sampling path as an optional logits_mask_fn(request) -> mask hook, applied as logits[~mask] = -inf pre-softmax.

Method

  1. Re-read the matching real code: backend_types.py (the contract), backend_xgrammar.py:132 (XgrammarGrammar — yours is this class with the library replaced by your FSM).
  2. Write the FSM compiler first; property-test it against re.fullmatch on random strings.
  3. Add the token lifting + mask caching; then the engine hook.
  4. pytest mini_vllm -q and keep it green.

Definition of done

  • CPU only, numpy only.
  • A test proves the property: an adversarial sampler (always picks the worst allowed token) still produces a string with re.fullmatch(pattern, out) ≠ None, for ≥ 3 patterns.
  • A test proves renormalization: with all tokens masked but one, that one is sampled with probability 1.
  • A test proves rollback: advance k tokens, rollback k, masks are bit-identical.
  • You can say out loud where yours simplifies: chars = bytes, no pushdown stack (lab-03 adds it), no compile-time context-independent/dependent token split (you cache lazily instead).

Map back to the real engine

YoursUpstream
RegexGrammarStructuredOutputGrammar impl (backend_xgrammar.py:132)
allowed_token_mask()fill_bitmask() row (bool array vs packed int32 bits)
per-state mask cachexgrammar compile-time token classification
logits[~mask] = -inf hookapply_grammar_bitmask (structured_output/utils.py:44)
rollback state stackxgr.GrammarMatcher(max_rollback_tokens=…)