Lab 12-03 — JSON Needs a Stack: the Pushdown Mask [CPU-OK]
Try to write lab-01's FSM for "balanced nested braces" and you'll hit a wall that's
been a theorem since 1956: a finite-state machine cannot count unbounded nesting —
matching { to } at arbitrary depth requires memory that grows, i.e. a stack.
JSON nests. Therefore JSON is not a regular language, regex-based masking cannot
enforce it, and the production answer (xgrammar) compiles context-free grammars to
pushdown automata. In this lab you build the pushdown machine for a JSON subset —
modes plus a depth counter that is the stack — and prove the strongest property in
the phase: a model that hates closing braces (they're its least-preferred character)
still emits output that json.loads accepts. Along the way the fuzz oracle will catch
a grammar corner most humans forget (this lab's reference implementation forgot it
too, the first time): JSON forbids leading zeros.
Contents
- Why this lab exists
- Background: modes + depth = pushdown
- Files
- Run
- What the tests prove
- Hitchhiker's notes
- Going further
- References
Why this lab exists
The regular/context-free boundary is the single most practical piece of formal-
language theory in modern serving, because it's a product boundary: "constrain to a
regex" and "constrain to a JSON schema" are different features with different engines,
different compile costs, and different failure modes — and engineers who don't know
why ship the wrong one. After this lab the boundary is physical: you'll have built
both machines and the difference is one integer (depth) that lab-01's FSM has
nowhere to put.
The second lesson is oracle-driven grammar debugging, courtesy of an honest war
story: the fuzz test here validates completed generations with json.loads — an
independent, stricter implementation of the spec — and on this lab's first draft it
rejected "0123", which the hand-built machine happily accepted. The machine's
grammar was wrong (JSON ints are '0' | [1-9][0-9]*), and no amount of testing the
machine against itself would have found it. Constrained decoding is only as correct
as its grammar; always fuzz against an independent parser. That habit is worth more
than the lab.
Background: modes + depth = pushdown
The machine is lab-01's FSM plus one number. Modes track position inside the
local grammar production (value, int, int_zero, obj_first, key, key_body,
colon, comma_or_close, obj_key, done); depth counts open objects — and
because this grammar's only recursive construct is the object, depth is the entire
stack (a general CFG pushes symbols; we push indistinguishable ones, so a counter
suffices — a nice special case to notice). The two rules that make it click:
'{'invaluemode pushes (depth+1) and entersobj_first;'}'pops and then asks: stack empty? →done(only EOS legal); else →comma_or_close(the enclosing object resumes). That resume-the-parent move is what no FSM can express — the parent's state was remembered by the stack.- Ending a value is context-dependent: a top-level int may end at EOS; the same int
inside an object must be followed by
,or}. Henceallowed()consultsdepth— the mask itself is stack-aware, which is the whole point.
One implementation subtlety worth savoring: feeding , while in int mode must
first end the int, then re-dispatch the comma in the new mode (feed calls itself
once). Tokens that terminate one production and belong to the next are the
bread-and-butter of incremental parsing — xgrammar's machinery handles exactly this,
industrialized.
Files
starter.py—JsonMachine(allowed / feed / accepting, modes documented) andconstrained_generate. Your work.solution.py— reference.test_lab.py— recognition both ways, illegal-feed errors, the depth-8 stack test, the brace-hating model, the json.loads fuzz oracle, and the truncation caveat (pushdown edition: depth that only grows).
Run
LAB_IMPL=starter pytest phase-12-structured-outputs/labs/lab-03-json-pushdown -q
pytest phase-12-structured-outputs/labs/lab-03-json-pushdown -q # reference
What the tests prove
| Test | What it pins |
|---|---|
test_recognizes_valid_json / test_rejects_invalid_json | The grammar, both directions — including {}}, {"a":}, and the trailing-comma classic |
test_illegal_feed_raises | The machine is also a validator; feeding it garbage is loud, not corrupting |
test_depth_is_the_stack | 8 levels opened, tracked, closed — the non-regular behavior, exercised. (Any fixed-state FSM fails some depth; the counter never does) |
test_brace_hating_model_still_emits_valid_json | The adversarial guarantee, CFG edition: } ranked dead last, output parses anyway — comma_or_close mode eventually offers nothing but structure-respecting choices |
test_fuzzed_preferences_always_parse_or_truncate_live | 50 random models against the independent oracle: every completed output parses, every truncated one is a live prefix. This is the test that caught the leading-zero bug — the lab's best argument made by its own history |
test_truncation_caveat_brace_lover_never_closes | Lab-01's caveat, worse: a {-loving model nests forever (each value slot opens another object), hits the cap with depth > 0, output unparseable. Prefix-valid ≠ complete, and recursion gives the failure infinite room |
Hitchhiker's notes
- From this machine to xgrammar: a JSON Schema adds constraints your subset
doesn't have (specific keys, types per key, string escapes, floats) — the grammar
grows, the principle doesn't: compile schema → CFG → pushdown automaton → per-state
token bitmasks (lab-01's lifting, over pushdown configurations). XGrammar's
research contribution is making that lifting fast despite the stack: most masking
decisions turn out to be context-independent (decidable from the mode alone) and
get precomputed; only the genuinely stack-dependent ones (your depth-consulting
closers) are evaluated at runtime. Your machine cleanly displays which decisions
are which — look at
allowed(): only two branches readdepth. - The speculative-decoding interaction (Phase 8): verifying k drafted tokens under a grammar means advancing the pushdown machine k times and rolling back on rejection — so the machine must be checkpointable (your machine: copy mode + depth; a full PDA: copy the stack). Feature composition is where structured-output engines earn their complexity; vLLM gates some combinations for exactly this reason.
- Per-request state, again: each request carries its own machine, advanced as
tokens commit (Phase 9 lab-04's isolation discipline — grammar state is one more
thing that must never leak between batch rows). In vLLM it lives alongside the
request in the structured-output manager, advanced in
update_from_output's neighborhood (Phase 1's loop, hosting yet another tenant). - Why not just retry until valid? Compute: invalid generations burn full generation cost each attempt, and complex schemas can have high rejection rates. Masking's cost is per-step and tiny. The mask also helps the model — at every step its probability mass is renormalized over only-valid continuations, so the model is never "off the rails" trying to recover from its own syntax error. Constrained models often produce better-content JSON too, for this reason.
Going further
- Add arrays (
[ value (',' value)* ]) — now two bracket types share the stack and a counter no longer suffices: you need an actual stack of{-vs-[symbols. You'll have crossed from counter automaton to true PDA, and the diff is ~15 lines that teach the distinction better than any textbook. - Add strings as values with escape handling (
"a\"b") — the mode machinery for escapes is exactly why real JSON grammars are bigger than people expect. - Lift to tokens: reuse lab-01's
allowed_tokensover this machine (a token is allowed iff feeding its chars never raises) with a multi-char vocab, and re-run the brace-hater test at the token level. You've now built, end to end, a miniature xgrammar.
References
- Dong et al., XGrammar: Flexible and Efficient Structured Generation (2024) — the context-independent/dependent split and the PDA machinery: https://arxiv.org/abs/2411.15100
upstream/vllm/v1/structured_output/backend_xgrammar.py— the integration; find the per-request grammar state and the bitmask path.- Chomsky, Three Models for the Description of Language (1956) — where "regex can't count braces" was proven, sixty-nine years before your fuzz test rediscovered its consequences: https://doi.org/10.1109/TIT.1956.1056813
- Lab-01 — the FSM floor this lab builds on; Phase 9 lab-01 — the hook both ride.