Lab 12-01 — Regex → FSM → Token Masks [CPU-OK]
Prompting a model to "respond with a hex number" gets you a hex number most of the time — and "most of the time" is a production outage with a delay. Constrained decoding replaces the request with a guarantee: compile the pattern to a finite-state machine, and each step mask the logits so only tokens that keep the machine alive are sampleable (Phase 9 lab-01's processor hook, finally meeting its most important client). The model cannot emit an invalid output — not "is unlikely to": cannot — and you'll prove it with an adversarial fake model that prefers garbage and emits valid hex anyway. Plus the two ideas that separate toy versions from real ones: the char-to-token lifting (the FSM walks characters; the model emits multi-character tokens) and the truncation caveat (the mask guarantees prefix validity, not completion — a test demonstrates the failure honestly).
Contents
- Why this lab exists
- Background: the three moves
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
Structured output is the feature that turned LLMs from chatbots into components — nothing downstream can consume "mostly JSON" — and it's also the feature whose implementation most people get wrong on the first guess (validate-and-retry? post-hoc repair? few-shot harder?). The correct answer is masks, and it's correct for a deep reason worth internalizing: it moves enforcement from after sampling (reject, retry, pray) to before (the invalid token's probability is −∞; renormalization spreads its mass over valid continuations). Zero retries, zero latency tax beyond the mask computation, and the guarantee is structural rather than statistical.
Building it small teaches you the production system's actual anatomy. Outlines'
famous contribution was precisely your allowed_tokens: precompute, for every FSM
state, which tokens (not characters) survive — turning the per-step cost from
"simulate the vocab" into a dict lookup. When you read vLLM's structured-output
manager (upstream/vllm/v1/structured_output/), you'll find your three functions
with caching and bitmask plumbing around them.
Background: the three moves
- Compile the pattern to a char-level FSM. The lab's pattern subset (atom
sequences: literals and
[...]classes, optional+) compiles to a beautifully simple machine — state i = "atoms 0..i−1 matched",+adds a self-loop. Real engines compile full regex via standard NFA→DFA machinery (interegular/outlines) — bigger automata, same interface:transitions,accepting. - Lift chars to tokens: token t is allowed in state s iff feeding t's
characters from s never hits a missing transition. This is where the tokenizer's
weirdness lives — a single token
"0x"crosses two atoms in one step, and your mask test pins exactly that. The lifted table is per-(state, vocab), computed once per pattern: the compile-time/runtime split that makes masking affordable at 128k vocab. - Mask per step: allowed tokens keep their logits, the rest get −∞, EOS is legal
iff the state is accepting (forcing the model through the pattern — the
test_eos_only_when_acceptingbehavior: a model that wants to stop immediately must emit a digit first).
Files
starter.py—parse_atoms,compile_pattern,advance,allowed_tokens,constrained_generate. Your work.solution.py— reference.test_lab.py— parsing, survival/death of multi-char tokens, mask exactness, the adversarial model, a 50-permutation fuzz, EOS gating, and the truncation caveat.
Run
LAB_IMPL=starter pytest phase-12-structured-outputs/labs/lab-01-regex-fsm-mask -q
pytest phase-12-structured-outputs/labs/lab-01-regex-fsm-mask -q # reference
What the tests prove
| Test | What it pins |
|---|---|
test_parse_atoms / test_advance_and_death | The compiler and the walker, including a token dying mid-token ("1x") — partial consumption must not corrupt state |
test_mask_is_exactly_the_survivors | The lifting: from start, only "0" and "0x" survive 0x[0-9a-f]+ — note the multi-char token legally crossing two atoms in one step |
test_constrained_output_always_matches | The adversarial guarantee: a model preferring q, @, zz emits valid hex anyway. The mask doesn't persuade; it removes the alternatives |
test_fuzzed_preferences_never_violate | 50 random models, zero violations — on a pattern chosen so truncation is harmless (see next row for why that choice was necessary) |
test_truncation_caveat_is_real | The honest failure: a digit-loving model loops in [0-9]+ and never reaches the .; at max_tokens the output is a valid prefix but not a valid match. Constrained decoding + token caps = possibly-incomplete output — a real production gotcha (validate downstream anyway!), demonstrated rather than footnoted |
test_eos_only_when_accepting | The stop token is part of the grammar: stopping is only legal where the pattern says so |
Hitchhiker's notes
- The compile-time/runtime split is the whole performance story. Compiling a
complex schema's automaton and lifting it over a 128k vocab takes real time
(xgrammar's headline is doing this fast + caching it); the per-step cost is then a
bitmask apply. vLLM compiles grammars asynchronously — a request can sit in
WAITING while its grammar compiles (a new reason to wait that Phase 3's scheduler
gates on; search
grammarin the V1 scheduler). First-request latency on a new schema vs steady-state is the operational signature of this split. - The mask must reach the GPU. Your set-of-ints becomes a
[batch, vocab]bitmask tensor applied inside the sampler (Phase 9 lab-01's pipeline, stage one). At 128k vocab × 256 batch that's real bytes per step — why the format is bitmask and why xgrammar emits them natively. - Tokenizer dependence is total: the lifted table is per-(pattern, tokenizer). Same pattern, different model → recompile. And exotic vocab corners (bytes, partial UTF-8 tokens) are exactly where naive lifters break — one more reason the production engines are libraries, not weekend scripts.
- The truncation caveat generalizes: any constrained system that guarantees
step-wise validity (prefix-closed) but not termination has this hole. Lab-03
shows its pushdown version (unclosed braces forever); real APIs return
finish_reason: "length"(Phase 1 lab-05!) on exactly these — your downstream parser must treat"length"+ structured output as suspect. The three labs of this course that compose here (1-05, 9-01, 12-01) are the whole story.
Going further
- Add
*(zero-or-more) to the pattern subset — note how it changes which states are skippable and that your state-numbering scheme needs epsilon-collapsing. You're one feature away from needing the real NFA→DFA pipeline; feel the cliff. - Precompute the full
state → allowed token-id listtable and benchmarkconstrained_generateagainst the on-the-fly version at vocab 50k (build a random vocab) — the outlines speedup, measured. - Wire the mask into Phase 9 lab-01's
Pipelineas a logits processor overmini_vllm'sByteTokenizervocab and constrain the toy engine end-to-end: structured output in your own engine, ~20 glue lines.
References
- Willard & Louf, Efficient Guided Generation for Large Language Models (2023) —
the outlines paper; your
allowed_tokensis its §3: https://arxiv.org/abs/2307.09702 upstream/vllm/v1/structured_output/— the manager, backends, and the async compile path.- Dong et al., XGrammar (2024) — the compile-time/runtime split industrialized: https://arxiv.org/abs/2411.15100
- Phase 9 lab-01 — the hook this mask rides; lab-03 — why JSON needs a stack on top of everything here.