Phase 12 — Mini-Build: a grammar mask for `mini_vllm`

Your task
Why build it (and not just read it)
The spec
Method
Definition of done
Map back to the real engine

Your task

Build mini_vllm/grammar.py: a regex-FSM grammar that produces a per-step allowed-token mask and plugs into the mini engine as a logits processor — so a generation literally cannot emit a string that violates the regex.

A reference implementation ships in mini_vllm/grammar.py with tests in mini_vllm/test_grammar.py. Build yours first; compare after.

Why build it (and not just read it)

Reading the real feature tells you what production does. Re-implementing a tiny version tells you why every decision was made — which is the understanding that survives into an interview or a 2 a.m. incident. Keep it small; keep it tested.

The spec

Mirror the upstream contract from backend_types.py, shrunk to its essence:

class RegexGrammar:
    """Compile once; one instance of state per request."""
    def __init__(self, pattern: str, vocab: dict[int, str]): ...
    def allowed_token_mask(self) -> "np.ndarray":  # bool[vocab_size]
        """True where emitting the token keeps a path to a match alive."""
    def accept_token(self, token_id: int) -> bool:   # advance; False if illegal
    def rollback(self, n: int) -> None:              # un-advance n tokens (spec decode!)
    def is_terminated(self) -> bool:                 # matched a full accepting state

Constraints that make it honest:

Compile the regex to an explicit FSM yourself (subset is fine: literals, [...] classes, |, *, +, ?, digits — no need for full PCRE). re may be used only as a test oracle, never inside the mask.
A token (multi-char string) is allowed iff feeding its chars through the FSM from the current state stays alive. Cache per-state token masks after first computation — that's xgrammar's compile-time trick in miniature.
rollback must restore the exact state — keep a state-stack history.
Wire it into mini_vllm/engine.py's sampling path as an optional logits_mask_fn(request) -> mask hook, applied as logits[~mask] = -inf pre-softmax.

Method

Re-read the matching real code: backend_types.py (the contract), backend_xgrammar.py:132 (XgrammarGrammar — yours is this class with the library replaced by your FSM).
Write the FSM compiler first; property-test it against re.fullmatch on random strings.
Add the token lifting + mask caching; then the engine hook.
pytest mini_vllm -q and keep it green.

Definition of done

CPU only, numpy only.
A test proves the property: an adversarial sampler (always picks the worst allowed token) still produces a string with re.fullmatch(pattern, out) ≠ None, for ≥ 3 patterns.
A test proves renormalization: with all tokens masked but one, that one is sampled with probability 1.
A test proves rollback: advance k tokens, rollback k, masks are bit-identical.
You can say out loud where yours simplifies: chars = bytes, no pushdown stack (lab-03 adds it), no compile-time context-independent/dependent token split (you cache lazily instead).

Map back to the real engine

Yours	Upstream
`RegexGrammar`	`StructuredOutputGrammar` impl (`backend_xgrammar.py:132`)
`allowed_token_mask()`	`fill_bitmask()` row (bool array vs packed int32 bits)
per-state mask cache	xgrammar compile-time token classification
`logits[~mask] = -inf` hook	`apply_grammar_bitmask` (`structured_output/utils.py:44`)
`rollback` state stack	`xgr.GrammarMatcher(max_rollback_tokens=…)`

vLLM Mastery — From Zero to Maintainer