Lab 12-03 — JSON Needs a Stack: the Pushdown Mask `[CPU-OK]`

Try to write lab-01's FSM for "balanced nested braces" and you'll hit a wall that's been a theorem since 1956: a finite-state machine cannot count unbounded nesting — matching { to } at arbitrary depth requires memory that grows, i.e. a stack. JSON nests. Therefore JSON is not a regular language, regex-based masking cannot enforce it, and the production answer (xgrammar) compiles context-free grammars to pushdown automata. In this lab you build the pushdown machine for a JSON subset — modes plus a depth counter that is the stack — and prove the strongest property in the phase: a model that hates closing braces (they're its least-preferred character) still emits output that json.loads accepts. Along the way the fuzz oracle will catch a grammar corner most humans forget (this lab's reference implementation forgot it too, the first time): JSON forbids leading zeros.

Why this lab exists
Background: modes + depth = pushdown
Files
Run
What the tests prove
Hitchhiker's notes
Going further
References

Why this lab exists

The regular/context-free boundary is the single most practical piece of formal- language theory in modern serving, because it's a product boundary: "constrain to a regex" and "constrain to a JSON schema" are different features with different engines, different compile costs, and different failure modes — and engineers who don't know why ship the wrong one. After this lab the boundary is physical: you'll have built both machines and the difference is one integer (depth) that lab-01's FSM has nowhere to put.

The second lesson is oracle-driven grammar debugging, courtesy of an honest war story: the fuzz test here validates completed generations with json.loads — an independent, stricter implementation of the spec — and on this lab's first draft it rejected "0123", which the hand-built machine happily accepted. The machine's grammar was wrong (JSON ints are '0' | [1-9][0-9]*), and no amount of testing the machine against itself would have found it. Constrained decoding is only as correct as its grammar; always fuzz against an independent parser. That habit is worth more than the lab.

Background: modes + depth = pushdown

The machine is lab-01's FSM plus one number. Modes track position inside the local grammar production (value, int, int_zero, obj_first, key, key_body, colon, comma_or_close, obj_key, done); depth counts open objects — and because this grammar's only recursive construct is the object, depth is the entire stack (a general CFG pushes symbols; we push indistinguishable ones, so a counter suffices — a nice special case to notice). The two rules that make it click:

'{' in value mode pushes (depth+1) and enters obj_first; '}' pops and then asks: stack empty? → done (only EOS legal); else → comma_or_close (the enclosing object resumes). That resume-the-parent move is what no FSM can express — the parent's state was remembered by the stack.
Ending a value is context-dependent: a top-level int may end at EOS; the same int inside an object must be followed by , or }. Hence allowed() consults depth — the mask itself is stack-aware, which is the whole point.

One implementation subtlety worth savoring: feeding , while in int mode must first end the int, then re-dispatch the comma in the new mode (feed calls itself once). Tokens that terminate one production and belong to the next are the bread-and-butter of incremental parsing — xgrammar's machinery handles exactly this, industrialized.

Files

starter.py — JsonMachine (allowed / feed / accepting, modes documented) and constrained_generate. Your work.
solution.py — reference.
test_lab.py — recognition both ways, illegal-feed errors, the depth-8 stack test, the brace-hating model, the json.loads fuzz oracle, and the truncation caveat (pushdown edition: depth that only grows).

Run

LAB_IMPL=starter pytest phase-12-structured-outputs/labs/lab-03-json-pushdown -q
pytest phase-12-structured-outputs/labs/lab-03-json-pushdown -q   # reference

What the tests prove

Test	What it pins
`test_recognizes_valid_json` / `test_rejects_invalid_json`	The grammar, both directions — including `{}}`, `{"a":}`, and the trailing-comma classic
`test_illegal_feed_raises`	The machine is also a validator; feeding it garbage is loud, not corrupting
`test_depth_is_the_stack`	8 levels opened, tracked, closed — the non-regular behavior, exercised. (Any fixed-state FSM fails some depth; the counter never does)
`test_brace_hating_model_still_emits_valid_json`	The adversarial guarantee, CFG edition: `}` ranked dead last, output parses anyway — `comma_or_close` mode eventually offers nothing but structure-respecting choices
`test_fuzzed_preferences_always_parse_or_truncate_live`	50 random models against the independent oracle: every completed output parses, every truncated one is a live prefix. This is the test that caught the leading-zero bug — the lab's best argument made by its own history
`test_truncation_caveat_brace_lover_never_closes`	Lab-01's caveat, worse: a `{`-loving model nests forever (each `value` slot opens another object), hits the cap with depth > 0, output unparseable. Prefix-valid ≠ complete, and recursion gives the failure infinite room

Hitchhiker's notes

From this machine to xgrammar: a JSON Schema adds constraints your subset doesn't have (specific keys, types per key, string escapes, floats) — the grammar grows, the principle doesn't: compile schema → CFG → pushdown automaton → per-state token bitmasks (lab-01's lifting, over pushdown configurations). XGrammar's research contribution is making that lifting fast despite the stack: most masking decisions turn out to be context-independent (decidable from the mode alone) and get precomputed; only the genuinely stack-dependent ones (your depth-consulting closers) are evaluated at runtime. Your machine cleanly displays which decisions are which — look at allowed(): only two branches read depth.
The speculative-decoding interaction (Phase 8): verifying k drafted tokens under a grammar means advancing the pushdown machine k times and rolling back on rejection — so the machine must be checkpointable (your machine: copy mode + depth; a full PDA: copy the stack). Feature composition is where structured-output engines earn their complexity; vLLM gates some combinations for exactly this reason.
Per-request state, again: each request carries its own machine, advanced as tokens commit (Phase 9 lab-04's isolation discipline — grammar state is one more thing that must never leak between batch rows). In vLLM it lives alongside the request in the structured-output manager, advanced in update_from_output's neighborhood (Phase 1's loop, hosting yet another tenant).
Why not just retry until valid? Compute: invalid generations burn full generation cost each attempt, and complex schemas can have high rejection rates. Masking's cost is per-step and tiny. The mask also helps the model — at every step its probability mass is renormalized over only-valid continuations, so the model is never "off the rails" trying to recover from its own syntax error. Constrained models often produce better-content JSON too, for this reason.

Going further

Add arrays ([ value (',' value)* ]) — now two bracket types share the stack and a counter no longer suffices: you need an actual stack of {-vs-[ symbols. You'll have crossed from counter automaton to true PDA, and the diff is ~15 lines that teach the distinction better than any textbook.
Add strings as values with escape handling ("a\"b") — the mode machinery for escapes is exactly why real JSON grammars are bigger than people expect.
Lift to tokens: reuse lab-01's allowed_tokens over this machine (a token is allowed iff feeding its chars never raises) with a multi-char vocab, and re-run the brace-hater test at the token level. You've now built, end to end, a miniature xgrammar.

References

Dong et al., XGrammar: Flexible and Efficient Structured Generation (2024) — the context-independent/dependent split and the PDA machinery: https://arxiv.org/abs/2411.15100
upstream/vllm/v1/structured_output/backend_xgrammar.py — the integration; find the per-request grammar state and the bitmask path.
Chomsky, Three Models for the Description of Language (1956) — where "regex can't count braces" was proven, sixty-nine years before your fuzz test rediscovered its consequences: https://doi.org/10.1109/TIT.1956.1056813
Lab-01 — the FSM floor this lab builds on; Phase 9 lab-01 — the hook both ride.

vLLM Mastery — From Zero to Maintainer