Phase 18 — Interview Questions: Performance Engineering

Staff/principal-level questions on this topic. Cover the answer, attempt it OUT LOUD, then compare. (See CAREER.md for how to run a full mock loop.)

Q1. Throughput is low and GPU utilization is ~30% at batch size 1–2. What's happening?

Model answer

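Almost certainly CPU-launch-bound decode: each step launches many tiny kernels, and the CPU can't issue them fast enough to keep the GPU fed. Make sure CUDA graphs are enabled so a decode step replays with far fewer launches, increase batch size (raise max_num_seqs / accept more concurrency) so each kernel does more work, and check for Python overhead on the hot path. Confirm with a profile: a launch-bound timeline shows idle gaps between short kernels.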
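To make "gaps between kernels" concrete, here is a minimal sketch that turns a kernel timeline into an idle fraction. The function name `gpu_idle_fraction` and the trace values are hypothetical; a real profile would supply the (start, end) timestamps, and this is an illustration of the reasoning rather than any profiler's API.

```python
from typing import List, Tuple


def gpu_idle_fraction(kernels: List[Tuple[float, float]]) -> float:
    """Fraction of the traced window in which no kernel was running.

    `kernels` holds (start_us, end_us) pairs; overlapping kernels
    (e.g. from concurrent streams) are merged before measuring busy time.
    """
    if not kernels:
        raise ValueError("need at least one kernel interval")
    ordered = sorted(kernels)
    window_start = ordered[0][0]
    window_end = max(end for _, end in ordered)
    if window_end == window_start:
        return 0.0  # degenerate trace: nothing to measure

    busy = 0.0
    cur_start, cur_end = ordered[0]
    for start, end in ordered[1:]:
        if start <= cur_end:
            # Overlaps the current busy span: extend it.
            cur_end = max(cur_end, end)
        else:
            # Gap found: close the current span and start a new one.
            busy += cur_end - cur_start
            cur_start, cur_end = start, end
    busy += cur_end - cur_start
    return 1.0 - busy / (window_end - window_start)


# Hypothetical launch-bound trace: 5 us kernels, one launched every 15 us.
trace = [(i * 15.0, i * 15.0 + 5.0) for i in range(100)]
idle = gpu_idle_fraction(trace)
print(f"GPU idle for {idle:.0%} of the window")
```

In this made-up trace the GPU sits idle for roughly two thirds of the window even though every kernel is fast, which is the signature of being launch-bound rather than compute-bound.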

Q2. How do you decide max_num_batched_tokens and gpu_memory_utilization?

Model answer

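max_num_batched_tokens caps the tokens processed per scheduler step, so it trades prefill chunk size against decode latency: bigger = better prefill throughput, but long prefill chunks can stall in-flight decodes; tune it to your prompt/output mix. gpu_memory_utilization sets the fraction of GPU memory the engine may claim; what remains after weights and peak activations becomes KV cache. Raise it to fit more concurrent sequences, but leave headroom for activations/CUDA-graph buffers to avoid OOM.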
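The memory side of this answer is easier to defend with arithmetic. The sketch below assumes a simplified budget (claimed memory minus weights minus an activation reserve equals KV cache) and a standard per-token KV size of 2 x layers x KV heads x head dim x bytes per element. All model and GPU numbers are hypothetical, and the helper names are illustrative, not part of any library.

```python
GIB = 2**30


def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies (K and V, across all layers)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_token_capacity(total_hbm_bytes: int, gpu_memory_utilization: float,
                      weight_bytes: int, activation_reserve_bytes: int,
                      bytes_per_token: int) -> int:
    """Tokens of KV cache that fit under the simplified budget above."""
    claimed = round(total_hbm_bytes * gpu_memory_utilization)
    kv_budget = claimed - weight_bytes - activation_reserve_bytes
    if kv_budget <= 0:
        return 0  # weights and activations already exceed the claim
    return kv_budget // bytes_per_token


# Hypothetical setup: 80 GiB GPU, 16 GiB of weights, 4 GiB activation reserve,
# 32 layers, 8 KV heads, head_dim 128, 2-byte (fp16/bf16) cache entries.
per_token = kv_bytes_per_token(32, 8, 128, 2)
tokens_at_090 = kv_token_capacity(80 * GIB, 0.90, 16 * GIB, 4 * GIB, per_token)
tokens_at_095 = kv_token_capacity(80 * GIB, 0.95, 16 * GIB, 4 * GIB, per_token)

print(f"{per_token // 1024} KiB per token")
print(f"utilization 0.90 -> {tokens_at_090} tokens")
print(f"utilization 0.95 -> {tokens_at_095} tokens")
```

Dividing the token capacity by a typical sequence length gives a rough ceiling on concurrent sequences, which is the number to sanity-check max_num_seqs against.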

Going deeper

The flagship phases (02, 03) show the depth and number of questions to expect for a topic you claim as your specialty.