Phase 12 — Mini-Build: a grammar mask for mini_vllm
Contents
- Your task
- Why build it (and not just read it)
- The spec
- Method
- Definition of done
- Map back to the real engine
Your task
Build mini_vllm/grammar.py: a regex-FSM grammar that produces a per-step allowed-token
mask and plugs into the mini engine as a logits processor — so a generation literally cannot
emit a string that violates the regex.
A reference implementation ships in mini_vllm/grammar.py
with tests in mini_vllm/test_grammar.py. Build yours
first; compare after.
Why build it (and not just read it)
Reading the real feature tells you what production does. Re-implementing a tiny version tells you why every decision was made — which is the understanding that survives into an interview or a 2 a.m. incident. Keep it small; keep it tested.
The spec
Mirror the upstream contract from backend_types.py, shrunk to its essence:
class RegexGrammar:
"""Compile once; one instance of state per request."""
def __init__(self, pattern: str, vocab: dict[int, str]): ...
def allowed_token_mask(self) -> "np.ndarray": # bool[vocab_size]
"""True where emitting the token keeps a path to a match alive."""
def accept_token(self, token_id: int) -> bool: # advance; False if illegal
def rollback(self, n: int) -> None: # un-advance n tokens (spec decode!)
def is_terminated(self) -> bool: # matched a full accepting state
Constraints that make it honest:
- Compile the regex to an explicit FSM yourself (subset is fine: literals,
[...]classes,|,*,+,?, digits — no need for full PCRE).remay be used only as a test oracle, never inside the mask. - A token (multi-char string) is allowed iff feeding its chars through the FSM from the current state stays alive. Cache per-state token masks after first computation — that's xgrammar's compile-time trick in miniature.
rollbackmust restore the exact state — keep a state-stack history.- Wire it into
mini_vllm/engine.py's sampling path as an optionallogits_mask_fn(request) -> maskhook, applied aslogits[~mask] = -infpre-softmax.
Method
- Re-read the matching real code:
backend_types.py(the contract),backend_xgrammar.py:132(XgrammarGrammar— yours is this class with the library replaced by your FSM). - Write the FSM compiler first; property-test it against
re.fullmatchon random strings. - Add the token lifting + mask caching; then the engine hook.
pytest mini_vllm -qand keep it green.
Definition of done
- CPU only, numpy only.
- A test proves the property: an adversarial sampler (always picks the worst allowed
token) still produces a string with
re.fullmatch(pattern, out)≠ None, for ≥ 3 patterns. - A test proves renormalization: with all tokens masked but one, that one is sampled with probability 1.
- A test proves rollback: advance k tokens, rollback k, masks are bit-identical.
- You can say out loud where yours simplifies: chars = bytes, no pushdown stack (lab-03 adds it), no compile-time context-independent/dependent token split (you cache lazily instead).
Map back to the real engine
| Yours | Upstream |
|---|---|
RegexGrammar | StructuredOutputGrammar impl (backend_xgrammar.py:132) |
allowed_token_mask() | fill_bitmask() row (bool array vs packed int32 bits) |
| per-state mask cache | xgrammar compile-time token classification |
logits[~mask] = -inf hook | apply_grammar_bitmask (structured_output/utils.py:44) |
rollback state stack | xgr.GrammarMatcher(max_rollback_tokens=…) |