Lab 12-02 — JSON Schema Constrained, on Real vLLM [GPU-OPT]

The CPU labs built the theory bottom-up: masks from FSMs (lab-01), stacks for nesting (lab-03). This lab runs the industrialized version — xgrammar via vLLM's guided_json — and measures the property that justifies the whole phase: 50 of 50 schema-valid outputs constrained, versus a baseline that politely wraps its JSON in markdown fences and apologies. You'll also watch the operational signatures the CPU labs predicted: the first-request grammar-compile latency (the compile-time/runtime split, on a wall clock) and the finish_reason: "length" truncation caveat, live.

No GPU? Don't panic. The captured run below carries the measurements; the reconciliation against labs 01/03 is the work.

Why this lab exists

"100% valid JSON" is a strong claim and engineers should be professionally suspicious of strong claims — this lab is the verification protocol. The design matters more than the running: a fixed schema, N diverse prompts, two arms (constrained vs unconstrained-but-asked-nicely), and a strict validator (jsonschema, not json.loads — type and required-key checking, not just parseability). The unconstrained arm is the control every structured-output benchmark needs and most skip: without it, "98% valid" tells you nothing about what the constraint bought (small instruct models often manage 60–85% unconstrained; the delta is the feature).

It's also your introduction to the feature's operational personality: per-schema compile cost (cached thereafter), the scheduler's grammar-wait state, and the interaction with max_tokens that labs 01/03 made you predict — all visible from the client side if you know to look.
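To make the parseable-versus-valid distinction concrete, here is a minimal stdlib-only sketch. The helper `matches_profile_schema` is hypothetical: a hand-rolled stand-in for `jsonschema.validate` that covers only the three fields of the profile schema used in Steps, so the example runs without extra packages. The lab itself should keep using jsonschema.

```python
import json

def matches_profile_schema(obj):
    """Hand-rolled stand-in for jsonschema.validate on the lab's profile schema."""
    def is_int(value):
        # JSON Schema "integer" excludes booleans, but bool subclasses int in Python.
        # Simplification: floats such as 30.0 are rejected here.
        return isinstance(value, int) and not isinstance(value, bool)

    if not isinstance(obj, dict):
        return False
    if not all(key in obj for key in ("name", "age", "skills")):
        return False
    if not isinstance(obj["name"], str):
        return False
    if not is_int(obj["age"]) or obj["age"] < 0:
        return False
    skills = obj["skills"]
    return isinstance(skills, list) and all(isinstance(s, str) for s in skills)

candidates = [
    '{"name": "Ann", "age": 30, "skills": ["rum"]}',    # parses, schema-valid
    '{"name": "Ann", "age": "30", "skills": ["rum"]}',  # parses, but age is a string
    '{"name": "Ann", "age": 30}',                       # parses, but "skills" is missing
]
parsed = [json.loads(c) for c in candidates]            # all three succeed
strict = [matches_profile_schema(p) for p in parsed]
print(f"parseable: {len(parsed)}/3   schema-valid: {sum(strict)}/3")
```

A benchmark that stopped at `json.loads` would score all three candidates as successes; the strict check keeps only the first.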

Requirements

uv pip install -e ".[vllm]" jsonschema
huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct   # small instruct model: a fair baseline arm

Steps

import json, jsonschema
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
        "skills": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "age", "skills"],
}

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", gpu_memory_utilization=0.6)
prompts = [f"Generate a profile for a fictional {job}." for job in
           ["pirate", "astronaut", "barista", "wizard", "plumber"] * 10]

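def validity(outputs):
    """Count outputs that parse as JSON and satisfy SCHEMA."""
    ok = 0
    for o in outputs:
        try:
            # Strict check: types and required keys, not just parseability.
            jsonschema.validate(json.loads(o.outputs[0].text), SCHEMA)
            ok += 1
        except (json.JSONDecodeError, jsonschema.ValidationError):
            pass  # unparseable or schema-invalid: counts as a failure
    return ok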

# Control arm: unconstrained, but asked nicely.
base = llm.generate([p + " Respond ONLY with JSON matching the schema." for p in prompts],
                    SamplingParams(max_tokens=128, temperature=0.8))
# Treatment arm: same prompts, same sampling, decoding masked by the schema.
guided = llm.generate(prompts, SamplingParams(
    max_tokens=128, temperature=0.8,
    guided_decoding=GuidedDecodingParams(json=SCHEMA)))
print(f"baseline: {validity(base)}/50   guided: {validity(guided)}/50")

Time the first guided request separately from the rest (grammar compile), and run one guided request with max_tokens=12 to spring the truncation trap on purpose.

Captured output (real run, Qwen2.5-0.5B-Instruct, L4, vLLM 0.22.1, trimmed)

baseline: 31/50   guided: 50/50
# typical baseline failure: 'Sure! Here is the profile:\n```json\n{"name": ...'
#   (valid JSON, wrapped in chat; json.loads dies on the preamble before it even reaches the fence)
# first guided request: +210 ms (xgrammar compile, then cached for the schema)
# guided with max_tokens=12: '{"name": "Captain Redb'  finish_reason='length'
#   (prefix-valid, incomplete — the labs' truncation caveat, on silicon)
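To check on your own run whether baseline failures look like the typical one in the capture above, a small classifier helps. This is a sketch, not part of the captured run: `classify` and its three labels are hypothetical, the regex assumes a Markdown fence optionally tagged `json`, and it tests parseability only (schema validation would follow as a separate step).

```python
import json
import re

# Three backticks, built programmatically so this example contains no literal fence.
TICKS = "`" * 3
# A Markdown code fence, optionally tagged "json", capturing its body.
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)

def classify(text):
    """Return 'clean', 'wrapped', or 'malformed' for one model output."""
    try:
        json.loads(text)
        return "clean"            # the whole output is JSON
    except json.JSONDecodeError:
        pass
    match = FENCE.search(text)
    if match:
        try:
            json.loads(match.group(1))
            return "wrapped"      # parseable JSON buried in chat and fences
        except json.JSONDecodeError:
            pass
    return "malformed"

samples = [
    '{"name": "Redbeard", "age": 41, "skills": ["sailing"]}',
    "Sure! Here is the profile:\n" + TICKS
    + 'json\n{"name": "Redbeard", "age": 41, "skills": []}\n' + TICKS,
    '{"name": "Redbeard", "age": ',
]
print([classify(s) for s in samples])   # ['clean', 'wrapped', 'malformed']
```

Counting the three labels over the baseline arm shows how much of the failure rate is packaging rather than broken JSON.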

Reading the results

  • 31/50 baseline — and look at how it fails: mostly not malformed JSON but JSON wrapped in helpfulness ("Sure! Here is..."). Instruct-tuning taught the model to chat; your parser disagrees. Prompt-engineering harder buys a few points and plateaus — the failure is distributional, and no amount of asking changes the distribution's tails. (This is the precise sense in which masking is structural: it edits the distribution's support, not its mood.)
  • 50/50 guided — the first token is forced into { territory; the fence is unsamplable. Note this matches labs 01/03's adversarial tests exactly: the model's preferences (chatty preamble) lose to the mask, every time, by construction.
  • +210 ms first request — lab-01's compile-time/runtime split with a wall-clock number: schema → grammar → automaton → token bitmask tables, then cached (per schema × tokenizer). A fleet serving many distinct schemas pays this repeatedly — cache hit rate on grammars is a real metric for structured-output-heavy services.
  • The truncated run, finish_reason: "length" plus a prefix-valid fragment, is labs 01/03's caveat verbatim. The defensive pattern: treat "length" + structured-output as invalid regardless of how parseable the prefix looks, and size max_tokens for the schema's worst case (arrays make worst cases long).
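The defensive pattern from the last bullet might look like the sketch below. `accept_structured` is a hypothetical helper; `validate` is any callable that raises on a schema violation (in this lab, something like `lambda obj: jsonschema.validate(obj, SCHEMA)`), and `finish_reason` is the string the engine reports for the completion.

```python
import json

def accept_structured(text, finish_reason, validate):
    """Return the parsed object, or None if the output must be rejected."""
    # Reject length-truncated completions before parsing: the text is a
    # prefix of what the model would have produced, however tidy it looks.
    if finish_reason == "length":
        return None
    try:
        obj = json.loads(text)
        validate(obj)
    except Exception:
        # Broad on purpose: the validator is caller-supplied, so any error
        # it raises is treated as a rejection.
        return None
    return obj

def require_name(obj):
    # Toy validator for the demo; stands in for a real schema check.
    if "name" not in obj:
        raise ValueError("missing name")

truncated = accept_structured('{"name": "Captain Redb', "length", require_name)
complete = accept_structured('{"name": "Captain Redbeard"}', "stop", require_name)
lucky = accept_structured('{"name": "Captain Redbeard"}', "length", require_name)
print(truncated, complete, lucky)
```

The third call is rejected even though its text happens to parse. That is the conservative reading of "regardless of how parseable the prefix looks": a completion that ran out of budget is not trusted to be finished.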

Hitchhiker's notes

  • The API surface spans four formats: guided_json (schema), guided_regex (lab-01's domain), guided_choice (the degenerate-but-useful enum case), and guided_grammar (full EBNF — lab-03's domain, user-supplied). All compile to the same masking machinery with different front ends; choosing the narrowest format that fits is both faster to compile and a better model-steering signal.
  • Backend choice exists (xgrammar default, guidance, outlines lineage) — like Phase 4's attention backends, with the same operational reflex: when structured output misbehaves, swapping backends is the bisection move (--guided-decoding-backend). Feature-support matrices differ (regex corners, schema keywords); the deep-dive maps them.
  • Quality inside validity: 50/50 valid says nothing about whether the content is good — masks constrain syntax, not sense. A model bullied through an unfamiliar schema produces valid-but-vapid fields. The schema is also a prompt: include it in the text and the constraint, and the two reinforce (measure content quality separately — Phase 6 lab-02's eval discipline applies).
  • Throughput cost is real but modest: bitmask application is cheap; grammar advance (per accepted token, per request) is CPU-side work that can bottleneck at high concurrency with complex grammars — watch the structured-output scheduling stats. Tail risk: one pathological schema compiling for seconds can stall its request, not the engine (the async-compile design — Phase 3's WAITING state earning a new tenant).

Reflect

  • Map every capture line to its CPU-lab origin: the fence failure (mask edits support — labs 01/03's adversarial tests), the +210 ms (lab-01's compile/runtime split), the truncation (both labs' caveat tests). If each has a home, the phase composed.
  • Your service takes user-supplied schemas. Name the three operational risks this lab armed you against. (Unbounded compile cost per novel schema — cache + limits; worst-case output length vs max_tokens — validate finish_reason; pathological grammars as a DoS surface — compile timeouts.)
  • Why does the guided arm use temperature 0.8 rather than 0? (The claim under test is "valid under sampling" — greedy would make validity trivially repetitive and hide mask bugs that only sampled tails reach. Constrain the support, then let the distribution be itself.)
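For the first risk in the user-supplied-schemas question, the cache half of "cache + limits" can be sketched as a small LRU keyed by schema and tokenizer, with a hit-rate counter to expose as a metric. Everything here is illustrative: `GrammarCache`, the `compile_fn` callable, and the tokenizer names are hypothetical, and the engine already caches compiled grammars on its own (as the capture shows); the sketch only demonstrates the bookkeeping.

```python
import json
from collections import OrderedDict

class GrammarCache:
    """LRU cache of compiled grammars keyed by (canonical schema, tokenizer)."""

    def __init__(self, compile_fn, max_entries=128):
        self._compile = compile_fn    # hypothetical: (schema, tokenizer) -> compiled grammar
        self._max = max_entries       # bound on distinct grammars kept in memory
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, schema, tokenizer):
        # sort_keys makes the key independent of dict key order.
        key = (json.dumps(schema, sort_keys=True), tokenizer)
        if key in self._entries:
            self.hits += 1
            self._entries.move_to_end(key)       # mark as most recently used
            return self._entries[key]
        self.misses += 1
        compiled = self._compile(schema, tokenizer)
        self._entries[key] = compiled
        if len(self._entries) > self._max:
            self._entries.popitem(last=False)    # evict least recently used
        return compiled

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

cache = GrammarCache(lambda schema, tok: ("compiled", len(schema)), max_entries=2)
a = {"type": "object", "required": ["name"]}
b = {"required": ["name"], "type": "object"}    # same schema, different key order
cache.get(a, "tok-x")   # miss: first sight of this schema
cache.get(b, "tok-x")   # hit: canonicalized to the same key
cache.get(a, "tok-y")   # miss: same schema, different tokenizer
print(f"hits={cache.hits} misses={cache.misses} hit_rate={cache.hit_rate:.2f}")
```

Keying on the canonical JSON text rather than on object identity is what lets two clients sending the same schema share one compile; the `max_entries` bound is the simplest form of the "limits" half.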

References