Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Lab 12-01 — Regex → FSM → Token Masks [CPU-OK]

Prompting a model to "respond with a hex number" gets you a hex number most of the time — and "most of the time" is a production outage with a delay. Constrained decoding replaces the request with a guarantee: compile the pattern to a finite-state machine, and each step mask the logits so only tokens that keep the machine alive are sampleable (Phase 9 lab-01's processor hook, finally meeting its most important client). The model cannot emit an invalid output — not "is unlikely to": cannot — and you'll prove it with an adversarial fake model that prefers garbage and emits valid hex anyway. Plus the two ideas that separate toy versions from real ones: the char-to-token lifting (the FSM walks characters; the model emits multi-character tokens) and the truncation caveat (the mask guarantees prefix validity, not completion — a test demonstrates the failure honestly).

Contents


Why this lab exists

Structured output is the feature that turned LLMs from chatbots into components — nothing downstream can consume "mostly JSON" — and it's also the feature whose implementation most people get wrong on the first guess (validate-and-retry? post-hoc repair? few-shot harder?). The correct answer is masks, and it's correct for a deep reason worth internalizing: it moves enforcement from after sampling (reject, retry, pray) to before (the invalid token's probability is −∞; renormalization spreads its mass over valid continuations). Zero retries, zero latency tax beyond the mask computation, and the guarantee is structural rather than statistical.

Building it small teaches you the production system's actual anatomy. Outlines' famous contribution was precisely your allowed_tokens: precompute, for every FSM state, which tokens (not characters) survive — turning the per-step cost from "simulate the vocab" into a dict lookup. When you read vLLM's structured-output manager (upstream/vllm/v1/structured_output/), you'll find your three functions with caching and bitmask plumbing around them.

Background: the three moves

  1. Compile the pattern to a char-level FSM. The lab's pattern subset (atom sequences: literals and [...] classes, optional +) compiles to a beautifully simple machine — state i = "atoms 0..i−1 matched", + adds a self-loop. Real engines compile full regex via standard NFA→DFA machinery (interegular/outlines) — bigger automata, same interface: transitions, accepting.
  2. Lift chars to tokens: token t is allowed in state s iff feeding t's characters from s never hits a missing transition. This is where the tokenizer's weirdness lives — a single token "0x" crosses two atoms in one step, and your mask test pins exactly that. The lifted table is per-(state, vocab), computed once per pattern: the compile-time/runtime split that makes masking affordable at 128k vocab.
  3. Mask per step: allowed tokens keep their logits, the rest get −∞, EOS is legal iff the state is accepting (forcing the model through the pattern — the test_eos_only_when_accepting behavior: a model that wants to stop immediately must emit a digit first).

Files

  • starter.pyparse_atoms, compile_pattern, advance, allowed_tokens, constrained_generate. Your work.
  • solution.py — reference.
  • test_lab.py — parsing, survival/death of multi-char tokens, mask exactness, the adversarial model, a 50-permutation fuzz, EOS gating, and the truncation caveat.

Run

LAB_IMPL=starter pytest phase-12-structured-outputs/labs/lab-01-regex-fsm-mask -q
pytest phase-12-structured-outputs/labs/lab-01-regex-fsm-mask -q   # reference

What the tests prove

TestWhat it pins
test_parse_atoms / test_advance_and_deathThe compiler and the walker, including a token dying mid-token ("1x") — partial consumption must not corrupt state
test_mask_is_exactly_the_survivorsThe lifting: from start, only "0" and "0x" survive 0x[0-9a-f]+ — note the multi-char token legally crossing two atoms in one step
test_constrained_output_always_matchesThe adversarial guarantee: a model preferring q, @, zz emits valid hex anyway. The mask doesn't persuade; it removes the alternatives
test_fuzzed_preferences_never_violate50 random models, zero violations — on a pattern chosen so truncation is harmless (see next row for why that choice was necessary)
test_truncation_caveat_is_realThe honest failure: a digit-loving model loops in [0-9]+ and never reaches the .; at max_tokens the output is a valid prefix but not a valid match. Constrained decoding + token caps = possibly-incomplete output — a real production gotcha (validate downstream anyway!), demonstrated rather than footnoted
test_eos_only_when_acceptingThe stop token is part of the grammar: stopping is only legal where the pattern says so

Hitchhiker's notes

  • The compile-time/runtime split is the whole performance story. Compiling a complex schema's automaton and lifting it over a 128k vocab takes real time (xgrammar's headline is doing this fast + caching it); the per-step cost is then a bitmask apply. vLLM compiles grammars asynchronously — a request can sit in WAITING while its grammar compiles (a new reason to wait that Phase 3's scheduler gates on; search grammar in the V1 scheduler). First-request latency on a new schema vs steady-state is the operational signature of this split.
  • The mask must reach the GPU. Your set-of-ints becomes a [batch, vocab] bitmask tensor applied inside the sampler (Phase 9 lab-01's pipeline, stage one). At 128k vocab × 256 batch that's real bytes per step — why the format is bitmask and why xgrammar emits them natively.
  • Tokenizer dependence is total: the lifted table is per-(pattern, tokenizer). Same pattern, different model → recompile. And exotic vocab corners (bytes, partial UTF-8 tokens) are exactly where naive lifters break — one more reason the production engines are libraries, not weekend scripts.
  • The truncation caveat generalizes: any constrained system that guarantees step-wise validity (prefix-closed) but not termination has this hole. Lab-03 shows its pushdown version (unclosed braces forever); real APIs return finish_reason: "length" (Phase 1 lab-05!) on exactly these — your downstream parser must treat "length" + structured output as suspect. The three labs of this course that compose here (1-05, 9-01, 12-01) are the whole story.

Going further

  • Add * (zero-or-more) to the pattern subset — note how it changes which states are skippable and that your state-numbering scheme needs epsilon-collapsing. You're one feature away from needing the real NFA→DFA pipeline; feel the cliff.
  • Precompute the full state → allowed token-id list table and benchmark constrained_generate against the on-the-fly version at vocab 50k (build a random vocab) — the outlines speedup, measured.
  • Wire the mask into Phase 9 lab-01's Pipeline as a logits processor over mini_vllm's ByteTokenizer vocab and constrain the toy engine end-to-end: structured output in your own engine, ~20 glue lines.

References

  • Willard & Louf, Efficient Guided Generation for Large Language Models (2023) — the outlines paper; your allowed_tokens is its §3: https://arxiv.org/abs/2307.09702
  • upstream/vllm/v1/structured_output/ — the manager, backends, and the async compile path.
  • Dong et al., XGrammar (2024) — the compile-time/runtime split industrialized: https://arxiv.org/abs/2411.15100
  • Phase 9 lab-01 — the hook this mask rides; lab-03 — why JSON needs a stack on top of everything here.