Lab 12-02 — JSON Schema Constrained, on Real vLLM [GPU-OPT]
The CPU labs built the theory bottom-up: masks from FSMs (lab-01), stacks for nesting
(lab-03). This lab runs the industrialized version (xgrammar via vLLM's
guided_json) and measures the property that justifies the whole phase: 50 of 50
schema-valid outputs in the constrained arm, versus a baseline that politely wraps
its JSON in markdown fences and apologies. You'll also watch the operational signatures the CPU
labs predicted: the first-request grammar-compile latency (the compile-time/runtime
split, on a wall clock) and the finish_reason: "length" truncation caveat, live.
No GPU? Don't panic. The captured run below carries the measurements; the reconciliation against labs 01/03 is the work.
Contents
- Why this lab exists
- Requirements
- Steps
- Captured output (real run, Qwen2.5-0.5B-Instruct, L4, vLLM 0.22.1, trimmed)
- Reading the results
- Hitchhiker's notes
- Reflect
- References
Why this lab exists
"100% valid JSON" is a strong claim and engineers should be professionally suspicious
of strong claims — this lab is the verification protocol. The design matters more
than the running: a fixed schema, N diverse prompts, two arms (constrained vs
unconstrained-but-asked-nicely), and a strict validator (jsonschema, not
json.loads — type and required-key checking, not just parseability). The
unconstrained arm is the control every structured-output benchmark needs and most
skip: without it, "98% valid" tells you nothing about what the constraint bought
(small instruct models often manage 60–85% unconstrained; the delta is the feature).
It's also your introduction to the feature's operational personality: per-schema
compile cost (cached thereafter), the scheduler's grammar-wait state, and the
interaction with max_tokens that labs 01/03 made you predict — all visible from the
client side if you know to look.
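To make the gap between "parses" and "validates" concrete, here is a minimal stdlib-only sketch. The helper check_profile is hypothetical: a hand-rolled, simplified stand-in for jsonschema.validate, hard-coded to this lab's name/age/skills schema so the example runs without extra dependencies. In the lab itself, use jsonschema as described above.

```python
import json

def check_profile(obj):
    """Return a list of violations of the lab's profile schema.

    Hypothetical stand-in for jsonschema.validate, hard-coded to the
    name/age/skills schema; an empty list means the object conforms.
    """
    if not isinstance(obj, dict):
        return ["top level is not an object"]
    errors = []
    for key in ("name", "age", "skills"):
        if key not in obj:
            errors.append(f"missing required key: {key}")
    if "name" in obj and not isinstance(obj["name"], str):
        errors.append("name is not a string")
    if "age" in obj:
        age = obj["age"]
        # bool is a subclass of int in Python, but it is not a JSON integer.
        if isinstance(age, bool) or not isinstance(age, int):
            errors.append("age is not an integer")
        elif age < 0:
            errors.append("age is below minimum 0")
    if "skills" in obj:
        skills = obj["skills"]
        if not isinstance(skills, list) or not all(isinstance(s, str) for s in skills):
            errors.append("skills is not an array of strings")
    return errors

# json.loads accepts this text, yet it breaks the schema twice:
# a required key is missing and "age" has the wrong type.
parsed = json.loads('{"name": "Anne", "age": "thirty"}')
violations = check_profile(parsed)
print(violations)
```

A parse-only validator would have counted that output as a success, which is why the lab scores both arms against the schema rather than against json.loads.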
Requirements
uv pip install -e ".[vllm]" jsonschema
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct # small instruct model: a fair baseline arm
Steps
import json

import jsonschema
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# The fixed schema both arms are scored against.
SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
        "skills": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "age", "skills"],
}

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", gpu_memory_utilization=0.6)

# 5 jobs x 10 repeats = 50 prompts.
prompts = [f"Generate a profile for a fictional {job}." for job in
           ["pirate", "astronaut", "barista", "wizard", "plumber"] * 10]

def validity(outputs):
    """Count outputs that parse as JSON and validate against SCHEMA."""
    ok = 0
    for o in outputs:
        try:
            jsonschema.validate(json.loads(o.outputs[0].text), SCHEMA)
            ok += 1
        except Exception:
            # Parse errors and schema violations both count as failures.
            pass
    return ok

# Control arm: unconstrained, but asked nicely.
base = llm.generate([p + " Respond ONLY with JSON matching the schema." for p in prompts],
                    SamplingParams(max_tokens=128, temperature=0.8))

# Treatment arm: same sampling settings, plus the schema constraint.
guided = llm.generate(prompts, SamplingParams(
    max_tokens=128, temperature=0.8,
    guided_decoding=GuidedDecodingParams(json=SCHEMA)))

print(f"baseline: {validity(base)}/50 guided: {validity(guided)}/50")
Time the first guided request separately from the rest (grammar compile), and run
one guided request with max_tokens=12 to spring the truncation trap on purpose.
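One way to separate the first guided request from the rest is sketched below. The helper names time_calls and compile_overhead are hypothetical, and fn stands for whatever callable issues a single request (for example, a lambda wrapping llm.generate with the guided SamplingParams from the script above). The sketch assumes steady-state latency is reasonably represented by the median of the later calls; the sample latencies are made up for illustration, not measured.

```python
import time
from statistics import median

def time_calls(fn, inputs):
    """Call fn once per input and return the wall-clock latency of each call."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        latencies.append(time.perf_counter() - start)
    return latencies

def compile_overhead(latencies):
    """Estimate one-time cost: first-call latency minus the median of the rest."""
    if len(latencies) < 2:
        raise ValueError("need the first call plus at least one later call")
    return latencies[0] - median(latencies[1:])

# Made-up latencies in seconds, for illustration only.
example = [0.30, 0.10, 0.09, 0.11]
overhead = compile_overhead(example)
print(f"estimated one-time overhead: {overhead * 1000:.0f} ms")
```

Issue the requests one at a time for this measurement; batching all 50 prompts into one generate call folds the compile cost into a single number.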
Captured output (real run, Qwen2.5-0.5B-Instruct, L4, vLLM 0.22.1, trimmed)
baseline: 31/50 guided: 50/50
# typical baseline failure: 'Sure! Here is the profile:\n```json\n{"name": ...'
# (valid JSON, wrapped in chat: json.loads chokes on the preamble and fence and dies)
# first guided request: +210 ms (xgrammar compile, then cached for the schema)
# guided with max_tokens=12: '{"name": "Captain Redb' finish_reason='length'
# (prefix-valid, incomplete — the labs' truncation caveat, on silicon)
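To see how the baseline fails rather than just how often, the raw completions can be bucketed. The sketch below is an assumption-laden helper, not part of the lab script: classify and its three labels are hypothetical names, and "parses" here means json.loads succeeds, which is weaker than schema validity.

```python
import json
import re

TICKS = "`" * 3  # a markdown code fence, built here to keep the source readable
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)

def _parses(text):
    """True if text is parseable JSON as a whole."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

def classify(text):
    """Label a completion as 'bare_json', 'fenced_json', or 'malformed'."""
    if _parses(text):
        return "bare_json"
    match = FENCE.search(text)
    if match and _parses(match.group(1)):
        # Parseable payload, but only after stripping chat wrapping.
        return "fenced_json"
    return "malformed"

wrapped = ("Sure! Here is the profile:\n" + TICKS + "json\n"
           '{"name": "Red", "age": 40, "skills": ["sailing"]}\n' + TICKS)
print(classify(wrapped))
print(classify('{"name": "Captain Redb'))
```

Applied to each o.outputs[0].text in the baseline arm, a count of the three labels would show how much of the failure is wrapping and how much is genuinely broken JSON.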
Reading the results
- 31/50 baseline — and look at how it fails: mostly not malformed JSON but JSON wrapped in helpfulness ("Sure! Here is..."). Instruct-tuning taught the model to chat; your parser disagrees. Prompt-engineering harder buys a few points and plateaus — the failure is distributional, and no amount of asking changes the distribution's tails. (This is the precise sense in which masking is structural: it edits the distribution's support, not its mood.)
- 50/50 guided: the first token is forced into { territory; the fence is unsamplable. Note this matches labs 01/03's adversarial tests exactly: the model's preferences (chatty preamble) lose to the mask, every time, by construction.
- +210 ms first request: lab-01's compile-time/runtime split with a wall-clock number. Schema → grammar → automaton → token bitmask tables, then cached (per schema × tokenizer). A fleet serving many distinct schemas pays this repeatedly, so cache hit rate on grammars is a real metric for structured-output-heavy services.
- The truncated run: finish_reason: "length" plus a prefix-valid fragment, which is labs 01/03's caveat verbatim. The defensive pattern: treat "length" + structured-output as invalid regardless of how parseable the prefix looks, and size max_tokens for the schema's worst case (arrays make worst cases long).
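The defensive pattern for truncated output can be written as a small gate in front of the parser. This is a minimal sketch: accept_structured is a hypothetical helper, and it takes the text and finish_reason as plain arguments (in the lab script these would come from o.outputs[0].text and o.outputs[0].finish_reason).

```python
import json

def accept_structured(text, finish_reason):
    """Return the parsed object, or None if the output must be rejected.

    None is unambiguous here because the lab's schema requires an object.
    """
    if finish_reason == "length":
        # Truncated: reject without even looking at how parseable the text is.
        return None
    try:
        return json.loads(text)
    except ValueError:
        return None

print(accept_structured('{"name": "Captain Redb', "length"))
print(accept_structured('{"name": "Red", "age": 40, "skills": []}', "stop"))
```

Checking finish_reason before parsing is the point of the ordering: a truncated output that happens to parse is still not a complete answer.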
Hitchhiker's notes
- The API surface spans four formats: guided_json (schema), guided_regex (lab-01's domain), guided_choice (the degenerate-but-useful enum case), and guided_grammar (full EBNF, lab-03's domain, user-supplied). All compile to the same masking machinery with different front ends; the narrowest format that fits both compiles faster and gives the model a better steering signal.
- Backend choice exists (xgrammar default, guidance, outlines lineage), like Phase 4's attention backends, with the same operational reflex: when structured output misbehaves, swapping backends is the bisection move (--guided-decoding-backend). Feature-support matrices differ (regex corners, schema keywords); the deep-dive maps them.
- Quality inside validity: 50/50 valid says nothing about whether the content is good. Masks constrain syntax, not sense. A model bullied through an unfamiliar schema produces valid-but-vapid fields. The schema is also a prompt: include it in the prompt text as well as in the constraint, and the two reinforce each other (measure content quality separately; Phase 6 lab-02's eval discipline applies).
- Throughput cost is real but modest: bitmask application is cheap; grammar advance (per accepted token, per request) is CPU-side work that can bottleneck at high concurrency with complex grammars — watch the structured-output scheduling stats. Tail risk: one pathological schema compiling for seconds can stall its request, not the engine (the async-compile design — Phase 3's WAITING state earning a new tenant).
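The phrase "masks edit the distribution's support" can be illustrated with a toy example. The sketch below is not vLLM's or xgrammar's implementation; masked_softmax is a hypothetical helper, and the four-token vocabulary and its logits are invented. It shows only the core idea: disallowed tokens get probability exactly zero, and the remaining mass is renormalized over what the grammar permits.

```python
import math

def masked_softmax(logits, allowed):
    """Softmax over only the tokens whose mask entry is True."""
    if len(logits) != len(allowed):
        raise ValueError("logits and mask must have the same length")
    kept = [x for x, ok in zip(logits, allowed) if ok]
    if not kept:
        raise ValueError("mask allows no tokens")
    peak = max(kept)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) if ok else 0.0 for x, ok in zip(logits, allowed)]
    total = sum(weights)
    return [w / total for w in weights]

# Invented toy vocabulary: the model prefers a chatty preamble or a fence,
# but at the start of a JSON object only "{" and whitespace are legal.
vocab = ["Sure", "FENCE", "{", " "]
logits = [5.0, 3.0, 1.0, 0.5]
allowed = [False, False, True, True]
probs = masked_softmax(logits, allowed)

for token, p in zip(vocab, probs):
    print(f"{token!r}: {p:.3f}")
```

However strongly the model prefers "Sure", sampling at any temperature cannot select it, which is the sense in which the guarantee holds by construction.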
Reflect
- Map every capture line to its CPU-lab origin: the fence failure (mask edits support — labs 01/03's adversarial tests), the +210 ms (lab-01's compile/runtime split), the truncation (both labs' caveat tests). If each has a home, the phase composed.
- Your service takes user-supplied schemas. Name the three operational risks this lab armed you against. (Unbounded compile cost per novel schema: cache plus limits. Worst-case output length vs max_tokens: validate finish_reason. Pathological grammars as a DoS surface: compile timeouts.)
- Why does the guided arm use temperature 0.8 rather than 0? (The claim under test is "valid under sampling". Greedy would make validity trivially repetitive and hide mask bugs that only sampled tails reach. Constrain the support, then let the distribution be itself.)
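For the user-supplied-schema scenario, a service-side cache with a bounded size and hit-rate counters is one possible shape for "cache plus limits". The sketch below is an assumption, not vLLM's internal cache: GrammarCache is a hypothetical class, compile_fn is a placeholder for whatever actually compiles a grammar, and the key canonicalizes the schema so that key order alone does not cause a miss.

```python
import hashlib
import json
from collections import OrderedDict

class GrammarCache:
    """Bounded LRU cache keyed by (canonical schema hash, tokenizer name)."""

    def __init__(self, compile_fn, max_entries=128):
        self._compile = compile_fn      # placeholder for the real compiler
        self._max = max_entries
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(schema, tokenizer_name):
        # Sorted keys make equivalent schemas with different key order collide.
        canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        return (digest, tokenizer_name)

    def get(self, schema, tokenizer_name):
        k = self.key(schema, tokenizer_name)
        if k in self._entries:
            self.hits += 1
            self._entries.move_to_end(k)   # mark as most recently used
            return self._entries[k]
        self.misses += 1
        compiled = self._compile(schema, tokenizer_name)
        self._entries[k] = compiled
        if len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used
        return compiled

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

calls = []

def fake_compile(schema, tokenizer_name):
    """Stand-in compiler that only records that it was invoked."""
    calls.append(tokenizer_name)
    return ("compiled", len(calls))

cache = GrammarCache(fake_compile, max_entries=2)
a = {"type": "object", "required": ["name"]}
b = {"required": ["name"], "type": "object"}  # same schema, different key order
cache.get(a, "tok-A")   # miss: first sighting
cache.get(b, "tok-A")   # hit: canonical form matches
cache.get(a, "tok-B")   # miss: different tokenizer
print(cache.hits, cache.misses)
```

A compile timeout, the third mitigation named above, would wrap the compile_fn call; it is left out here to keep the sketch short.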
References
- upstream/vllm/v1/structured_output/: manager, xgrammar backend, the async compile path and bitmask plumbing.
- vLLM docs, Structured Outputs (the four guided formats and backend selection): https://docs.vllm.ai/en/latest/features/structured_outputs/
- Dong et al., XGrammar (2024): https://arxiv.org/abs/2411.15100
- Labs 01 and 03 — the theory this run industrializes; Phase 1 lab-05 —
finish_reason, doing load-bearing work again.