Lab 16-03 — The Streaming Detokenizer: Never Emit Broken UTF-8 `[CPU-OK]`

Streaming sends text the instant tokens arrive, but token boundaries and character boundaries don't align. With mini_vllm's ByteTokenizer the problem is stark: 🚀 is four byte-tokens, and decoding each token on its own emits replacement-character garbage (�) four times and never a rocket. Real BPE tokenizers have the same problem wherever a multibyte character spans tokens (CJK text, emoji, accents: most of the world's traffic). This lab builds the fix every serving stack carries: an incremental detokenizer that only ever emits complete characters, holding incomplete byte sequences until they finish. A control test proves the naive approach really does produce four pieces of garbage where yours produces three silences and a rocket.

Why this lab exists
Background: UTF-8 tells you how long to wait
Files
Run
What the tests prove
Hitchhiker's notes
Going further
References

Why this lab exists

This bug ships constantly. Every few months a chat product somewhere streams � mid-emoji or garbles Chinese text, because someone decoded per-token and tested only in English. ASCII is the one alphabet where token and character boundaries happen to agree, which makes English-only testing a perfect blind spot. The fix is small but must live in the streaming path (vLLM's IncrementalDetokenizer holds back exactly these bytes), and implementing it once inoculates you: afterward, "stream text" and "stream complete characters" register as different operations, the way Phase 9 taught you to separate "random" from "reproducibly random".

It's also the purest specimen of the phase's recurring discipline — lab-01 held back possible tag prefixes, stop-string handling (Phase 1 lab-05) holds back possible stop matches, and this lab holds back incomplete characters. One pattern, three layers: emit eagerly, but never emit what might still change meaning.

Background: UTF-8 tells you how long to wait

UTF-8's self-describing first byte is what makes the fix clean: 0xxxxxxx = 1-byte char, 110xxxxx = 2, 1110xxxx = 3, 11110xxx = 4 — your utf8_expected_len table. The detokenizer keeps a byte buffer; after each token it computes the longest prefix that is a whole number of complete sequences, decodes and emits that, and keeps the tail. The lead byte announces the wait; no guessing, no decode-and-check. flush() handles the honest edge: a stream truncated mid-character (max_tokens landing inside an emoji — Phase 1 lab-05's cap, byte edition) decodes the remnant with errors='replace', because at end-of-stream the garbage is real and hiding it would be lying.
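The buffer-and-emit loop described above can be sketched as follows. This is a minimal illustration, not the lab's reference solution: it assumes ByteTokenizer ids 0..255 are the byte values themselves, that any other id (such as EOS) carries no text, and that `utf8_expected_len` returns 0 for a byte that cannot start a sequence. The starter's exact signatures may differ.

```python
def utf8_expected_len(first_byte: int) -> int:
    """Total sequence length announced by a UTF-8 lead byte.

    Returns 0 for a byte that cannot start a sequence (a 10xxxxxx
    continuation byte or an invalid lead). That return value is an
    assumption of this sketch, not something the lab specifies.
    """
    if first_byte < 0x80:            # 0xxxxxxx
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx
        return 4
    return 0


class StreamingDetokenizer:
    """Emit only complete characters; hold incomplete sequences in a buffer."""

    def __init__(self) -> None:
        self._buf = bytearray()

    def _complete_prefix_len(self) -> int:
        # Walk lead byte to lead byte; stop at the first sequence
        # whose announced length runs past the end of the buffer.
        i, n = 0, len(self._buf)
        while i < n:
            need = utf8_expected_len(self._buf[i]) or 1  # stray byte: 1-byte unit
            if i + need > n:
                break
            i += need
        return i

    def feed(self, token_id: int) -> str:
        if not 0 <= token_id < 256:
            return ""  # assumption: non-byte ids (e.g. EOS) emit nothing
        self._buf.append(token_id)
        k = self._complete_prefix_len()
        out = bytes(self._buf[:k]).decode("utf-8", errors="replace")
        del self._buf[:k]  # keep only the incomplete tail
        return out

    def flush(self) -> str:
        # End of stream: whatever is left is decoded honestly, � included.
        out = bytes(self._buf).decode("utf-8", errors="replace")
        self._buf.clear()
        return out


detok = StreamingDetokenizer()
print([detok.feed(b) for b in "🚀".encode("utf-8")])  # ['', '', '', '🚀']
```

Treating a stray continuation byte as a one-byte unit is a design choice of this sketch: it keeps the scan moving on malformed input and lets `errors="replace"` report the damage instead of stalling the stream.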

Files

starter.py — utf8_expected_len and StreamingDetokenizer (feed/flush). Your work.
solution.py — reference.
test_lab.py — the length table, ASCII eagerness, emoji holding, the no-garbage-ever invariant on mixed multilingual text, the naive-approach control, truncation honesty, and EOS handling.

Run

LAB_IMPL=starter pytest phase-16-serving-apis-and-parsers/labs/lab-03-streaming-detokenizer -q
pytest phase-16-serving-apis-and-parsers/labs/lab-03-streaming-detokenizer -q   # reference

What the tests prove

Test	What it pins
`test_ascii_streams_one_char_per_token`	Eagerness: nothing is held that could be shown; latency is sacrificed only when correctness demands it
`test_emoji_is_held_until_complete`	`["", "", "", "🚀"]` — three silences, one rocket: the wait is exactly the character's length, no more
`test_never_emits_replacement_chars_for_valid_text`	The invariant, on `naïve café — 你好 🚀🇫🇷`: no `�` ever, and concatenation loses nothing (both halves matter: no garbage AND no swallowing)
`test_naive_approach_really_is_broken`	The control (Phase 9 lab-04's pattern): per-token decode of 🚀 yields four garbage strings — the bug demonstrated, not described
`test_flush_handles_truncated_sequence`	Stream cut mid-emoji: flush emits honest `�` rather than raising or hiding — truncation is the caller's fact to handle
`test_eos_is_ignored`	Non-byte ids such as EOS are skipped silently and emit no text: the sentinel discipline again
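The control and the truncation case can be reproduced with nothing but Python's built-in decoding. The sketch below is illustrative only: `naive_stream` is a hypothetical helper, not a name from test_lab.py, and it assumes byte-tokens are plain byte values.

```python
def naive_stream(token_ids):
    """The broken approach: decode every byte-token on its own."""
    return [bytes([t]).decode("utf-8", errors="replace") for t in token_ids]


rocket = list("🚀".encode("utf-8"))  # four byte-tokens: F0 9F 9A 80
naive_pieces = naive_stream(rocket)
print(naive_pieces)                  # four U+FFFD strings, no rocket

# Truncation: the stream ends after two of the four bytes.
truncated = bytes(rocket[:2])
honest = truncated.decode("utf-8", errors="replace")  # contains U+FFFD

# A strict decode would raise instead, which is why flush() uses 'replace'.
try:
    truncated.decode("utf-8")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
print(repr(honest), strict_raised)
```

No single byte of a four-byte sequence is valid UTF-8 on its own, so the naive path cannot recover even once all four tokens have arrived.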

Hitchhiker's notes

The real version sits one level up: BPE tokens map to byte sequences (via the tokenizer's byte-level encoding), so vLLM's incremental detokenizer (upstream/vllm/v1/engine/detokenizer.py, backed by the tokenizers library's incremental decode) buffers token-ids and re-decodes a sliding window — same hold-back logic with tokenizer-specific machinery for "which prefix is stable." Your byte-level version is that algorithm with the cleanest possible alphabet.
The flag emoji in the test is a deliberate landmine that doesn't explode: 🇫🇷 is two complete 4-byte codepoints (regional indicators) that render as one flag. Your detokenizer may legally emit them separately — character completeness is the engine's contract; grapheme clustering is the terminal's problem. Knowing where your responsibility ends is part of the spec (and why the test checks for �, not for atomic flags).
This buffering interacts with everything downstream: stop strings are matched on detokenized text (so they inherit this buffer's timing), and lab-01's tag parser consumes this lab's output. The serving text pipeline is a stack of hold-back buffers, each with its own "might still change" criterion — when streamed output seems to lag by a character or two, you now know all three suspects.
Performance note: production detokenizers avoid re-decoding from scratch per token (your _complete_prefix_len scan is O(buffer), fine; re-decoding the whole output per token, the other naive approach, is O(n²) over a generation and has caused real regressions). Incrementality is a performance property here, not just a correctness one.
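The flag note above is easy to verify directly. The sketch below uses Python's standard-library incremental UTF-8 decoder as a stand-in for the lab's StreamingDetokenizer (an assumption made only to keep the example self-contained) and shows that the flag arrives as two separately emitted codepoints.

```python
import codecs

# The French flag: regional indicators F and R, written as escapes
# so the two codepoints are explicit.
flag = "\U0001F1EB\U0001F1F7"

codepoints = [f"U+{ord(c):04X}" for c in flag]
byte_lens = [len(c.encode("utf-8")) for c in flag]
print(codepoints, byte_lens)  # ['U+1F1EB', 'U+1F1F7'] [4, 4]

# Stand-in for the lab's detokenizer: the stdlib incremental decoder
# also holds back incomplete sequences between calls.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
pieces = [decoder.decode(bytes([b])) for b in flag.encode("utf-8")]
print(pieces)  # three silences, first indicator, three silences, second
```

Both emitted pieces are complete characters, so the no-� invariant holds even though a terminal may briefly render a lone regional indicator before the pair combines into a flag.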

Going further

Build the full pipeline: ByteTokenizer → StreamingDetokenizer → lab-01's StreamingToolParser, fed token-by-token; assert end-to-end that a tool call with an emoji in its arguments survives both buffers. Two hold-backs composed — the actual server path.
Add stop-string support on top (Phase 1 lab-05's going-further, now with the right substrate): match on the emitted text, hold back any suffix that prefixes a stop string. Three buffers. Notice they compose without coordinating — each one's output is the next one's honest input.
Measure the worst-case display latency your buffer adds for a pathological all-emoji stream — then check the real detokenizer's equivalent bound. (Four tokens. The wait is bounded by UTF-8's max sequence length; this is why the design needs no timeout.)
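One possible way to start the latency measurement is to count the longest run of consecutive silent feeds. In the sketch below `max_silent_run` is a hypothetical helper, and the standard-library incremental decoder stands in for your StreamingDetokenizer; swap in your own `feed` to measure the real thing.

```python
import codecs


def max_silent_run(text: str) -> int:
    """Longest run of byte-tokens that produce no visible output."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    longest = current = 0
    for b in text.encode("utf-8"):
        if decoder.decode(bytes([b])) == "":
            current += 1                    # still waiting on a character
            longest = max(longest, current)
        else:
            current = 0                     # a character was emitted
    return longest


print(max_silent_run("🚀" * 50))  # 3: each rocket appears on its fourth token
print(max_silent_run("你好"))      # 2: three-byte characters
print(max_silent_run("abc"))      # 0: ASCII is never held
```

The result does not grow with stream length: for valid text the silence is capped at one less than the longest UTF-8 sequence.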

References

upstream/vllm/v1/engine/detokenizer.py — IncrementalDetokenizer: this lab at the BPE level.
The Unicode Standard, ch. 3 (UTF-8) — the lead-byte table you implemented: https://www.unicode.org/versions/latest/
Phase 1 lab-05 — stop strings, the neighboring hold-back; lab-01 — the tag hold-back this lab feeds.

vLLM Mastery — From Zero to Maintainer