Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Lab 03-05 — Decode-Latency Spikes, and How Chunking Kills Them [CPU-OK]

Labs 01 and 02 built chunked prefill's mechanism and proved it safe. This lab supplies the missing piece: the motive. You'll put a short request mid-decode — a user happily watching tokens stream — and then slam a 256-token prompt into the engine. Without chunking, the decode stream takes one step that costs 257 tokens of work instead of 1: a ~250× inter-token latency spike, the infamous "my chat froze for a second" of early serving engines. With threshold=32, the same experiment caps every step at 33. You will produce both latency profiles yourself, as exact integer sequences, on a laptop.

Contents


Why this lab exists

Tail latency is where serving engineers earn their pay. Means are easy — any engine has a fine average inter-token latency; the product experience is set by the p99, and the p99 is set by exactly the event you're about to stage: someone else's prefill landing in your decode step. This interference is invisible in throughput numbers (the work all gets done!) and invisible in single-request benchmarks (no one to interfere with). You only see it by looking at per-step cost from the perspective of one victim stream — which is precisely the measurement you'll build, using the schedule-probe from Phase 1 lab-04 as your instrument.

This lab is also the Sarathi-Serve paper in a bottle. Their contribution — "stall-free scheduling" via chunked prefills piggybacked on decode batches — reduces, on this workload, to the difference between your two measured profiles. Papers compress well when you've run their experiment.

Background: step cost is the latency

In mini_vllm, steps are instant; on a GPU, a step's wall-clock time grows roughly with the tokens scheduled in it (they all go through the same forward pass — more tokens, more FLOPs, longer step). So for a decoding request, the time between its token k and token k+1 is the duration of the step that computes k+1including everyone else's work in that step. That's why this lab's metric is:

for each step in which the victim advances, the total scheduled tokens of that step.

It's a proxy with the right shape: a step of [A:1, B:256] is ~257 token-units long, and A's user waits all of it for one token. The proxy ignores second-order GPU effects (attention's memory traffic, kernel launch overheads — Phase 18 refines), but the first-order picture it gives is the one that drives the tuning decision.

Files

  • starter.py — implement decode_step_costs(...): stage the collision, probe the schedule, extract the victim's per-step costs. Recipe in the docstring. Your work.
  • solution.py — reference.
  • test_lab.py — the spike exists unchunked; the cap holds chunked; the work conserves; outputs are schedule-invariant; the victim is never starved.

Run

LAB_IMPL=starter pytest phase-03-continuous-batching-scheduler/labs/lab-05-decode-latency-spikes -q
pytest phase-03-continuous-batching-scheduler/labs/lab-05-decode-latency-spikes -q   # reference

What you should see — the two profiles

Real output of the solution (long_prompt_len=256, max_tokens=16):

threshold=0 :  [257, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
threshold=32:  [33, 33, 33, 33, 33, 33, 33, 33, 2, 1, 1, 1, 1, 1, 1]

Read them like latency traces, because that's what they are:

  • Unchunked: one monstrous step — B's entire 256-token prefill rides the same step as A's decode (256 + 1 = 257) — then calm. A's user experiences fifteen smooth tokens and one ~250× hiccup. This is a p99 disaster hiding in a perfect mean: the average cost is ~18, the median is 1. If you only monitor averages, this profile looks healthy. It isn't.
  • Chunked at 32: eight consecutive steps of exactly 33 (one 32-token chunk of B + A's 1 token — your lab-02 formula: ⌈256/32⌉ = 8 steps), then a 2 (B's first decode rides along), then 1s. The spike didn't shrink; it spread: the same 256 tokens of prefill work, conserved to the token, paid in eight 33× installments instead of one 257× balloon. Worst-case ITL for A drops ~8×; B's time-to-first-token rises (8 steps instead of 1). Nothing is free — chunking is a redistribution of latency from everyone's p99 to the long prompt's TTFT. That redistribution is almost always the right trade (decode stalls are user-visible jitter; prefill latency is a single wait users expect), but say it as a trade, not a win.
  • The 2s are worth a glance: B's own decode co-scheduled with A's. Mixed batches everywhere once you know to look (Phase 1 lab-04's "money step").

What the tests prove

TestWhat it pins
test_unchunked_prefill_spikes_one_decode_stepThe spike is real (≥ 257) and singular (second-worst step ≤ 3) — exactly one balloon payment
test_chunked_prefill_caps_the_spikeThe cap holds (≤ threshold + 1 for every step the victim shares) and the work conserved (≥ 8 elevated steps — the prefill didn't vanish, it spread)
test_chunking_does_not_change_the_decode_streams_outputLab-02's invariant under interference: identical token ids for the victim across both schedules
test_decode_stream_is_never_starvedThe victim advances every single step it's alive, chunked or not — running-first (lab-01) means interference delays decodes but never skips them

Together these four are the full contract of chunked prefill: bounded interference, conserved work, untouched outputs, guaranteed progress.

Hitchhiker's notes

  • Where's the threshold's floor? Push it down: at threshold=1, B's prefill takes 256 steps — A's latency is pristine and B's TTFT is catastrophic; and on real hardware, 256 tiny steps pay 256× the fixed per-step overhead (scheduler, launches, sampler), so total throughput sags too. The optimum is workload-dependent and that's the point: upstream exposes long_prefill_token_threshold (and the budget) rather than hardcoding an answer. Sarathi-Serve's evaluation is essentially this sweep with wall-clocks.
  • The budget is the other half of the dial. Chunks are bounded by min(threshold, remaining budget) — lab-01's clamp. A small max_num_batched_tokens caps interference globally (every step is small) at the cost of slower prefills for everyone. Production tuning usually sets budget for the worst acceptable step time, then threshold for fairness within it.
  • Why measure from the victim's seat? Because aggregate metrics hide exactly this. Mean step cost barely moves between the two profiles (the work is identical!); only per-victim-step cost shows the 257 vs 33. The general lesson for benchmarking serving systems: pick a request and follow it — fleet-wide averages are where tail pain goes to hide. This is why serious LLM benchmarks report TTFT and ITL distributions, never just tokens/sec (Phase 18).
  • Real-engine correspondence: in vLLM, run two clients — one streaming a long generation, one submitting a huge prompt — and watch the streamer's inter-chunk gaps with chunked prefill toggled (it's default-on in V1; you can throttle it via long_prefill_token_threshold). The wall-clock version of your integer profiles, jitter included.

Going further

  • Compute p50/p99 of A's step costs for thresholds {0, 16, 32, 64, 128, 256} and plot both against B's prefill-step count. The p99 curve falls as the TTFT curve rises; where they cross for your tolerance is the tuning answer. You've reproduced Sarathi-Serve Figure-1-style analysis with a 30-line probe.
  • Make it a storm: five long prompts arriving on consecutive steps while two victims decode. Does the cap still hold per step? (It must — the budget binds the sum.) What happens to admission order? (Lab-01's FCFS + head-of-line rules, now visible in data.)
  • Add wall-clock: time each eng.step() (even toy steps have measurable cost) and check the correlation between your token proxy and real microseconds. Weak on a toy model, strong on a GPU — knowing when a proxy is valid is half of performance engineering.

References