Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Phase 00 — The Hitchhiker's Guide to How an LLM Actually Runs

Course home · Phase 01

This is Chapter 0 of the book. It assumes you know nothing — not what a token is, not what a matrix multiply is — and it ends with you able to compute, on a napkin, how many users a given GPU can serve and why. Everything else in the course stands on this chapter, so we go slowly and build each idea from the ground up.

How to read this chapter. Most of it is for everyone. Paragraphs marked

🔬 Going deeper — optional rigor and real numbers for the expert track.

can be skimmed on a first pass and devoured on the second. By the end you should be comfortable at both levels: the intuition and the arithmetic.


Contents


0.1 Don't Panic — the whole thing in one sentence

A large language model is a function that reads a list of words and guesses the next word. To write text, it guesses a word, sticks it on the end, and guesses again — hundreds of times.

That is genuinely all an LLM does at runtime. ChatGPT writing you an essay is this loop running a few hundred times. Everything difficult — and everything this course teaches — comes not from the guessing, but from doing the guessing fast, for thousands of people at once, on hardware that costs more than a house. vLLM is the software that does that well. To make it faster (your future job), you must first feel why it is slow. That feeling is what this chapter installs.

We'll build up in this order: a tiny bit of math → words become numbers → what one "guess" involves → the loop → why it's slow → where the memory goes. Take your time.


0.2 The only math you need (a 5-minute primer)

Two objects and one operation. That's it.

A vector is just a list of numbers: [0.2, -1.1, 0.5]. Picture it as an arrow, or as coordinates of a point. A length-3 vector is a point in 3D space; LLMs use points in thousands of dimensions (you can't picture that, and you don't need to — the arithmetic is the same).

A matrix is a grid of numbers — a stack of vectors. A 2×3 matrix has 2 rows, 3 columns.

The one operation that matters is matrix multiplication (everyone calls it "matmul" or GEMM — General Matrix Multiply). To multiply a vector by a matrix, you take dot products. A dot product of two equal-length vectors multiplies them element-wise and sums:

[1, 2, 3] · [4, 5, 6]  =  1·4 + 2·5 + 3·6  =  4 + 10 + 18  =  32

A dot product is a similarity score: it's large and positive when two vectors point the same way, near zero when they're unrelated, negative when opposed. Remember this — attention (the heart of the model) is built entirely out of dot products measuring "how related are these two tokens."

Multiplying a vector x (length 3) by a matrix W (3 columns, 2 rows) gives a new vector (length 2), one dot product per row of W:

x = [1, 2, 3]            W = [ [1, 0, 1],     →   y[0] = [1,2,3]·[1,0,1] = 4
                               [0, 1, 1] ]        y[1] = [1,2,3]·[0,1,1] = 5
                                                  y = [4, 5]

🔬 Going deeper. A neural network "layer" is exactly this: y = x·Wᵀ (plus a bias, plus a nonlinearity). The matrix W is the weights — the billions of numbers that are the trained model. "Llama-3-8B" means ~8 billion such numbers. A forward pass is a long chain of these multiplies. So "running a model" = "doing a lot of matmuls with the weight matrices." Hold that: it explains both the compute cost (FLOPs) and the memory cost (reading the weights), which is the whole performance story in §0.10.

That's the entire math prerequisite. Onward.


0.3 Words become numbers, part 1: tokenization

A computer can't multiply the word "Paris". So step one is always: chop the text into small pieces called tokens and replace each with an integer ID.

Why not just use whole words, or individual letters? Whole words give a gigantic, brittle vocabulary (every plural, typo, and rare name is a new word). Single letters make sequences painfully long. The sweet spot is subwords — common words stay whole, rare words split into pieces. The dominant algorithm is Byte-Pair Encoding (BPE):

🔬 How BPE is built (Going deeper). Start with every character as its own token. Then repeatedly find the most frequent adjacent pair of tokens in your training text and merge it into a new token. Do this tens of thousands of times. Common sequences like "ing", " the", "vLLM" get merged into single tokens; rare strings stay split into smaller bits. The result is a fixed list of merges (the vocabulary) that balances vocabulary size against sequence length.

A worked tokenization (Llama-3-style, ~128k vocab):

text:    "vLLM is fast"
tokens:  [ "v",  "LLM",  " is",  " fast" ]      ← note the leading spaces are part of tokens
IDs:     [  85,   4178,    382,    2347   ]      ← example numbers from the vocab table

Two facts to carry forward:

  • A token is roughly ¾ of a word on average (so 1,000 tokens ≈ 750 words).
  • The component that does this is the tokenizer; reversing it (IDs → text) at the end is detokenization. The full list of tokens it knows is the vocabulary (Llama-3: ~128,256).

In mini_vllm/tokenizer.py we use the simplest possible tokenizer — one byte = one token, vocab of 257 — so the course needs zero downloaded files. Open it; it's ten lines, and it has the same encode/decode interface a real tokenizer does.

🆕 New words: token (a subword chunk), token ID (its integer), tokenizer (the chopper), vocabulary (all known tokens), BPE (the merge algorithm), detokenization (IDs→text).


0.4 Words become numbers, part 2: embeddings (meaning as coordinates)

A token ID like 4178 is just a name — it carries no meaning by itself (token 4178 isn't "more" than token 382). So the model's first move is to look up each ID in a big table and replace it with a vector — a list of numbers — called an embedding.

Think of the embedding as coordinates in a space of meaning. Just as a city has a (latitude, longitude), a token has a few thousand coordinates (Llama-3-8B: 4096 of them). The training process arranges this space so that tokens used in similar ways land near each other, and — famously — directions in the space carry meaning:

embedding("king") - embedding("man") + embedding("woman")  ≈  embedding("queen")

The lookup itself is trivial — it's just "go to row 4178 of the embedding table" — but it's the bridge from symbols to math. After this step, the prompt is no longer text; it's a stack of vectors (one per token), and from here on the model only does matmuls on those vectors.

"vLLM is fast"
  → IDs [85, 4178, 382, 2347]
  → embeddings, a 4 × 4096 matrix:
      [ [ 0.02, -0.7, ... , 0.1 ],     ← "v"
        [-0.5,   0.3, ... , 0.9 ],     ← "LLM"
        [ 0.1,  -0.2, ... , 0.0 ],     ← " is"
        [ 0.8,   0.4, ... ,-0.3 ] ]    ← " fast"

🔬 Going deeper. The width of these vectors is the hidden size d_model (4096 for Llama-3-8B). Bigger d_model = more capacity but more compute and memory everywhere. The embedding table has vocab_size × d_model numbers (128256 × 4096 ≈ 525M just for embeddings). Many models tie the input embedding and the output projection (the "LM head") to save that memory.

🆕 New words: embedding (a token's meaning-vector), hidden size / d_model (the vector width), LM head (the final layer turning vectors back into vocabulary scores).


0.5 The shape of a model: layers and the residual stream

The model is a tall stack of identical layers (Llama-3-8B has 32). The stack of token vectors flows up through them; this flowing stack is often called the residual stream. Each layer reads the stream, computes an update, and adds it back (that "add it back" is the residual connection — it's why training deep stacks works, but you can treat it as plumbing).

Each layer does exactly two things to the stream:

  1. Attention — lets each token look at the other tokens and pull in relevant information. This is the only place where information moves between positions.
  2. MLP (feed-forward) — transforms each token's vector independently, adding "thinking capacity."

After all 32 layers, the final vector at the last position is multiplied by the LM head to produce one score for every token in the vocabulary. Those scores are the logits (§0.8).

embeddings ─► [ layer 1 ] ─► [ layer 2 ] ─► ... ─► [ layer 32 ] ─► LM head ─► logits (vocab,)
                 │ attention + MLP, each with a residual add

Keep this split in mind — attention mixes across tokens, the MLP works per token — because it explains exactly where the KV cache lives (in attention, the only cross-token part) and where the giant matmuls are (the MLP, ~⅔ of the weights; Phase 7).


0.6 Attention from the ground up (the heart of the machine)

This is the one piece worth understanding in detail, because it dictates all of memory management. We'll build it from the problem, then do a worked numeric example.

The problem attention solves

Consider generating the next word of: "The river bank was muddy, so the fisherman ...". To continue sensibly the model must know "bank" means a riverbank (not a money bank) — and the clue is the word "river" earlier. Each token needs to gather context from the other tokens. Attention is the mechanism for that gathering.

Q, K, V — the search analogy

For every token, the model computes three vectors (each by multiplying the token's embedding by a learned weight matrix — yes, more matmuls):

  • Query (Q)"here is what I'm looking for." (your search box)
  • Key (K)"here is what I am about." (a document's title/tags)
  • Value (V)"here is the information I actually carry." (the document's contents)

To update a token, you compare its Query against every earlier token's Key (dot products — remember, dot product = similarity!), turn those similarities into weights that sum to 1, and take the weighted blend of those tokens' Values. It is exactly a soft search:

Type a query → it's scored against the keys of all documents → you get back a blend of the best-matching documents' values.

A worked example (do this once by hand — it demystifies everything)

Three tokens, and to keep it readable let each Q/K/V be just 2 numbers (a real model uses 128). Suppose we're computing the new vector for token 3, whose query is q₃ = [1, 0]. The three tokens' keys and values:

token   key K        value V
  1     [1, 0]       [10, 0]
  2     [0, 1]       [0, 10]
  3     [1, 0]       [5, 5]

Step 1 — similarity scores (dot product of q₃ with each key):

score₁ = [1,0]·[1,0] = 1      score₂ = [1,0]·[0,1] = 0      score₃ = [1,0]·[1,0] = 1

Token 3's query points the same way as tokens 1 and 3's keys (score 1) and is orthogonal to token 2's (score 0).

Step 2 — scale by 1/√(head_dim) = 1/√2 ≈ 0.707 (this keeps numbers from blowing up as the vectors get wider — a numerical-stability trick):

scaled = [0.707, 0, 0.707]

Step 3 — softmax turns scores into weights that are all positive and sum to 1. Softmax of a list is exp(each) / sum(exp):

exp(0.707)=2.03,  exp(0)=1.00,  exp(0.707)=2.03    sum = 5.06
weights = [2.03/5.06, 1.00/5.06, 2.03/5.06] = [0.40, 0.20, 0.40]

So token 3 will pay 40% attention to token 1, 20% to token 2, 40% to itself.

Step 4 — weighted blend of the Values:

out = 0.40·[10,0] + 0.20·[0,10] + 0.40·[5,5]
    = [4,0]       + [0,2]       + [2,2]
    = [6, 4]

That [6, 4] is token 3's attention output — a context-aware mix dominated by tokens 1 and 3. That is the entire attention operation. Everything fancy later (FlashAttention, PagedAttention) is about computing this exact thing faster and with less memory, never something different.

Causal masking — you can't read the future

When generating, token 3 may only attend to tokens 1, 2, 3 — not tokens that come after it (they don't exist yet). This "only look backward" rule is causal masking: before the softmax, scores for future positions are set to -∞ (so their softmax weight is 0). Picture the allowed attention as a lower-triangular matrix:

       attends to →   t1   t2   t3   t4
   query t1            ✓    ✗    ✗    ✗
   query t2            ✓    ✓    ✗    ✗
   query t3            ✓    ✓    ✓    ✗
   query t4            ✓    ✓    ✓    ✓

This triangle is why token i needs the Keys and Values of all tokens ≤ i — the single most important sentence for understanding the KV cache (§0.9).

Multiple heads

Real attention runs several of these in parallel — heads — each with its own Q/K/V projections, so different heads can specialize (one tracks syntax, another long-range references). Their outputs are concatenated. Llama-3-8B has 32 query heads, each of dimension 128 (32 × 128 = 4096 = d_model).

🔬 Going deeper — three things the experts know.

  • head_dim and the √d scale. Dot products of d-dimensional vectors grow like √d, which would push softmax into tiny-gradient saturation. Dividing by √head_dim keeps the variance ~1. (That's the 0.707 above.)
  • RoPE (positional info). Attention as described is order-blind — it'd treat "dog bites man" like "man bites dog." Models inject position by rotating Q and K by an angle proportional to their position (Rotary Position Embedding). Two tokens' score then depends on their relative distance. You'll see rotary_emb(positions, q, k) in llama.py (Phase 0 deep-dive).
  • GQA/MQA — fewer KV heads. The KV cache (next section) is sized by the number of KV heads. So modern models use Grouped-Query Attention: many query heads share a smaller number of KV heads. Llama-3-8B has 32 query heads but only 8 KV heads — a 4× KV-cache saving baked into the architecture. (MQA is the extreme: 1 KV head.) This single design choice changes your serving capacity by 4×; remember it for §0.11.

🆕 New words: Query/Key/Value (Q/K/V), attention score (a Q·K dot product), softmax (scores → weights summing to 1), causal mask (can't attend to the future), head (one parallel attention), head_dim (a head's width), RoPE (rotary positions), GQA/MQA (shared KV heads).


0.7 The MLP (the per-token "thinking" block)

After attention mixes context in, the MLP processes each token's vector on its own through two big matmuls with a nonlinearity between:

hidden = activation(x · W_upᵀ)        # expand: 4096 → ~14336  (Llama-3-8B)
out    = hidden · W_downᵀ             # project back: 14336 → 4096

The middle width (the "intermediate size") is several × d_model, which is why the MLP holds the majority of the model's weights (~⅔). When you hear "the model is mostly GEMMs," it's largely these two matrices per layer. (Modern Llamas use a gated variant, SwiGLU, with three matrices — a detail for Phase 14; the shape story is the same.)

The takeaway for performance: attention is where memory (the KV cache) concentrates; the MLP is where compute and weight bytes concentrate. Different bottlenecks, different optimizations.


0.8 From logits to a token: sampling

After the last layer, the LM head turns the final vector into logits — one raw score per vocab token (~128k numbers). To pick a token you first turn logits into probabilities with softmax, then choose:

"The capital of France is"  → logits → softmax → probabilities:
   " Paris": 0.87    " Lyon": 0.06    " a": 0.01    " banana": 0.000003   ...
  • Greedy: take the highest-probability token (" Paris"). Deterministic.
  • Temperature / top-k / top-p: deliberately introduce randomness for variety.

This whole topic — the decoding algorithms and how they run vectorized across a whole batch — is Phase 9. For now: logits → softmax → pick one. mini_vllm/sampler.py implements greedy plus temperature/top-k/top-p in ~40 readable lines.

🆕 New words: logits (raw next-token scores), probability distribution (softmaxed logits), greedy decoding (argmax), sampling (random pick).


0.9 The generation loop, and the redundancy that births the KV cache

The model only ever predicts one next token. To write a sentence we loop — feed the output back in (this is autoregressive generation):

Step 1:  "The capital of France is"            → " Paris"
Step 2:  "The capital of France is Paris"       → "."
Step 3:  "The capital of France is Paris."      → <end>

Now look closely at what the naive loop computes. Each step runs the whole model over the whole text-so-far. Recall from §0.6 that attention at each position needs the Keys and Values of all earlier positions. So:

Step 1 processes tokens [1..5]      → computes K,V for positions 1..5
Step 2 processes tokens [1..6]      → computes K,V for positions 1..6   (1..5 AGAIN)
Step 3 processes tokens [1..7]      → computes K,V for positions 1..7   (1..6 AGAIN)

We keep recomputing the K and V of tokens we already processed. Here is the key insight:

A token's Key and Value never change once computed. Token 5's K and V are identical in step 2 and in step 500. So compute them once and store them.

That stored table of every past token's Keys and Values is the KV cache. With it, each new step computes K,V for only the one new token and reads the rest from the cache:

work WITHOUT a cache:  1 + 2 + 3 + ... + N  =  N(N+1)/2  ≈  N²/2     (quadratic)
work WITH    a cache:  1 + 1 + 1 + ... + 1  =  N                     (linear)

For a 1,000-token answer that's ~500× less of this work. You will measure exactly this in lab-01 — a 20-line experiment that is the single justification for the entire course. The KV cache is not an optimization you can skip; it's what makes generation tractable.

The catch — and the reason Phases 2–3 exist — is that the KV cache is enormous and grows with every token, and it lives in scarce GPU memory. Managing it well is most of vLLM.

🆕 New words: autoregressive generation (predict → append → repeat), KV cache (stored Keys/Values of all prior tokens), EOS token (the "stop" token).


0.10 Prefill vs decode, and why decode is memory-bound (the chapter's crux)

With a KV cache, generation splits into two phases with opposite performance personalities. This is the most-probed idea in LLM-inference interviews — we'll do it with real numbers.

  • Prefill — the first run: process the entire prompt at once to fill its KV cache. Many tokens, one run.
  • Decode — every run after: generate one token, append, repeat. One token, one run, many times.
PrefillDecode
tokens per runmany (whole prompt)one
limited bycompute (math throughput)memory bandwidth (reading from HBM)
sets the metricTTFT (time to first token)ITL / TPOT (time per output token)

Why decode is bottlenecked by memory speed, not math

To produce one decode token, the GPU must read every weight in the model out of its main memory (HBM) — plus the whole KV cache — and then does only one token's worth of math with all of it. It's like driving to a vast warehouse, loading every crate onto the truck, to deliver one postcard. The bottleneck is the loading (memory reads), not the delivering (math).

🔬 The arithmetic that proves it (Going deeper — this is the money slide). Take Llama-3-8B in bf16 (2 bytes/param), on an A100 (HBM bandwidth ≈ 2 TB/s, compute ≈ 312 TFLOP/s bf16).

  • Memory per decode step (batch = 1): read all weights = 8e9 params × 2 bytes = 16 GB. Time to read at 2 TB/s = 16e9 / 2e12 = 8 ms. That alone caps you at 1/0.008 ≈ 125 tokens/secno matter how fast the math is.
  • Compute per decode step: a forward pass costs ≈ 2 × params FLOPs per token = 2 × 8e9 = 16 GFLOP. At 312 TFLOP/s that's 16e9 / 312e12 ≈ 0.05 ms.
  • Verdict: memory (8 ms) dwarfs compute (0.05 ms) by ~160×. Decode is utterly memory-bound at batch 1. The expensive math units sit ~99% idle, waiting for weights to arrive.

Arithmetic intensity makes this crisp: it's FLOPs ÷ bytes-read. Decode at batch 1 ≈ 16 GFLOP / 16 GB = 1 FLOP/byte. The A100's "ridge point" (where it flips from memory- to compute-bound) is 312e12 / 2e12 ≈ 156 FLOP/byte. Since 1 ≪ 156, we're deep in memory-bound territory. This is the roofline model in one number.

The escape: batching. If you decode B sequences together, you still read the weights only once but do B× the math → intensity ≈ B FLOP/byte. To reach the ridge (≈156) and use the GPU fully, you need batch ~150. That's why throughput serving is all about big batches — and why the scheduler (Phase 3) exists. Prefill already has high intensity (many tokens × one weight read) → it's compute-bound from the start, which is why a long prompt can hog the GPU and must be chunked (Phase 3).

This one section explains nearly every optimization ahead:

  • Batching (Phase 3): amortize the weight read over many sequences → throughput.
  • Quantization (Phase 6): make the weights fewer bytes → less to read → faster decode.
  • CUDA graphs (Phase 5): when per-step math is tiny, even the CPU overhead of launching the work dominates → remove it.
  • Speculative decoding (Phase 8): do useful work for several tokens per weight-read.

🆕 New words: prefill / decode, TTFT / ITL(TPOT), HBM (the GPU's main memory), compute-bound / memory-bandwidth-bound, arithmetic intensity (FLOPs/byte), roofline (the model that says which bound you're under), ridge point (the FLOP/byte where it flips).

🔬 The GPU memory hierarchy (expert aside). A GPU has tiers: tiny ultra-fast registers and SRAM/shared memory (KB–MB, ~TB/s within a core) on-chip, and big slow HBM (tens of GB, ~1–3 TB/s) off-chip. "Memory-bound" means bound by HBM. FlashAttention (Phase 4) is fast precisely because it keeps attention's intermediates in SRAM and avoids round-tripping the giant score matrix to HBM. Keep this hierarchy in mind whenever a kernel is "memory-bound" — it's usually HBM traffic.


0.11 How big is the KV cache? (the wall that caps your users)

Decode is memory-bound, and the KV cache is the other big thing in that memory. Let's size it, one line at a time.

For every token, in every layer, we store a Key vector and a Value vector. So:

bytes_per_token = 2                  (one K + one V)
                × num_layers          (32)
                × num_kv_heads        (8   ← GQA! not 32 — see §0.6)
                × head_dim            (128)
                × bytes_per_number    (2 for bf16)

For Llama-3-8B:

2 × 32 × 8 × 128 × 2  =  131,072 bytes  ≈  128 KB   per token

So a 2,000-token conversation = 2000 × 128 KB ≈ 256 MB of GPU memory — for one user. The punchline: a 24 GB GPU, after ~16 GB for the weights, has ~8 GB for KV. At 256 MB/user that's about 30 conversations at once before you run out of memory.

This is the headline of the entire field. What caps how many people you can serve is usually memory, not compute. The KV cache fills the GPU long before the math units are busy. So the serving game is fitting more KV cache: by not wasting any (PagedAttention, Phase 2), by sharing it across requests (prefix caching, Phase 3), and by shrinking it (FP8 KV cache, Phase 6 — halving bytes_per_number doubles your users).

🔬 Going deeper — scale it to 70B and feel the squeeze. Llama-3-70B: 80 layers, 8 KV heads, head_dim 128 → 2×80×8×128×2 = 327,680 B ≈ 320 KB/token. At 8k context that's 2.6 GB per sequence. On an 80 GB A100, after ~140 GB of weights (wait — 70B in bf16 is 140 GB, so it doesn't even fit on one 80 GB GPU!). This is why 70B requires tensor parallelism across multiple GPUs (Phase 10) and why people quantize (Phase 6): both the weights and the KV cache are fighting for memory. You'll compute these numbers yourself in lab-02.

When you later see vLLM log Maximum concurrency for 2048 tokens: 68.65x, you'll know it's this exact division: free-HBM ÷ per-sequence-KV. That number is your serving capacity.


0.12 Throughput vs latency, and Little's Law

Two metrics, in tension, that you'll trade off for the rest of your career:

  • Latency — how fast one request feels (TTFT, ITL). What an individual user cares about.
  • Throughput — total tokens/sec across everyone. What sets your cost per token — the number a business lives or dies on.

They fight: bigger batches raise throughput (amortized weight reads, §0.10) but slow each individual request (more work per step). The scheduler (Phase 3) steers this; Phase 18 is the art of tuning it.

🔬 Little's Law (Going deeper). For any stable serving system: concurrency = throughput × latency. If each request stays in the system for L seconds and you sustain X requests/sec, then on average N = X·L requests are in flight. Rearranged: to hit a target throughput at a given latency, you need a certain concurrency — and that concurrency must fit in KV memory (§0.11). This little equation ties together the whole stack: memory limits concurrency, concurrency (via Little's Law) limits the throughput you can reach at your latency SLA. You'll use it to size real deployments in Phase 18.

🆕 New words: latency, throughput, cost per token, Little's Law (N = X·L), SLA (the latency you've promised customers).


0.13 The one picture to carry into every later phase

Strip away the words and the engine reduces to this: a request is a list of tokens with two counters racing.

  ┌────────────────────────────────────────────────────────────────────┐
  │  A request = tokens + two counters:                                 │
  │     num_tokens          = how many tokens exist (prompt + generated) │
  │     num_computed_tokens = how many have been processed (KV cached)   │
  │                                                                     │
  │  PREFILL : computed is far behind  → catch up in one big run         │
  │  DECODE  : computed is one behind  → compute one more, append, repeat│
  │                                                                     │
  │  The engine's entire job: make `computed` catch up to `tokens`,     │
  │  as cheaply as possible, for thousands of requests at once.         │
  └────────────────────────────────────────────────────────────────────┘

This is literally how vLLM's Request object is built (vllm/v1/request.py) and how its scheduler reasons (Phase 3); mini_vllm/request.py mirrors it. If you remember one diagram from the whole course, make it this one — every later phase is "do one part of this loop better."


0.14 What you'll do in this phase

  • Read: 01-deep-dive.md — find every concept above (Q/K/V, the cache, the two counters) in a real model file and in vLLM's EngineCore.step.
  • Build / measure: 02-mini-build.md — understand mini_vllm's tokenizer, toy model, and sampler, and run the two experiments below.
  • Labs (see labs/README.md for the full guide to each):
    • lab-01-kv-cache-speedup [CPU-OK] — implement generation with and without a KV cache and measure the O(N²) → O(N) win. The motivating experiment of the course.
    • lab-02-kv-memory-calculator [CPU-OK] — write the memory formula and compute how many users fit on a real GPU (8B and 70B). See the memory wall for yourself.
    • lab-03-sampling-basics [CPU-OK] — build greedy/temperature/top-k/top-p from scratch and prove your sampler agrees token-for-token with mini_vllm's.
    • lab-04-prefill-vs-decode [CPU-OK] — the roofline arithmetic: the ridge point, the 0.6% compute utilization of single-stream decode, the 125 tok/s speed limit, the critical batch size.
  • Test yourself: EXERCISES.md, INTERVIEW.md, CHEATSHEET.md.

You're ready for the rest of the book when you can, from memory: walk the attention worked example; explain why decode is memory-bound using arithmetic intensity; derive the KV-cache size for a model; and estimate how many users a GPU can serve. Those four are the foundation everything else is built on.

Course home · Phase 01