Why Your First Token Is Always Late

Anirudh Sathiya

[ Readtime: 11 mins ]

If you've ever sat in a systems software engineer interview, you've probably been asked "Walk me through what happens when you type google.com and hit enter."

You answer: DNS resolution, TCP handshake, HTTP request, load balancer, index lookup, ranked results. You could probably draw it on a whiteboard in under a minute with a coffee in your other hand.

Now try this interview question: walk me through what happens between you asking ChatGPT "explain quantum computing like I'm five" and it responding with "great question!"


Most engineers I've talked to jump straight to model weights and matrix multiplications. Which is fair, but a lot about the inference pipeline goes unspoken. Why does generating a token cost 2-3x more than reading one? Why does doubling your context window more than double the damage to your bill? Why is the first token always the slowest?

I built a full inference server, BPE tokenizer and transformer from scratch in C++ to answer these. This blog is part 2 of a 3-part series. Part 1 covers tokenization if you want the buildup, but you don't need it. By now, your prompt is a list of token IDs. Let's talk about what happens next.

Forward Pass at a Glance

Before the fun stuff, here's the full forward pass pipeline for reference. I have illustrated each stage with a summary and an example.


If this is new to you, 3Blue1Brown and Jay Alammar's walkthroughs are the best place to start. This blog assumes you get the gist and focuses on the inference aspect of LLMs.

Why is the first token so slow to generate, but not the rest?

Have you noticed that when you prompt ChatGPT, there's a short pause and suddenly the tokens start streaming? The reason why takes us to the heart of inference.

Prefill Phase


Context changes meaning. "Bananas" in "I love bananas" means something very different from what it means in our input prompt. Before predicting the next token, the model needs to resolve this. So every token attends to every other token, weighing how much each one matters. That's where the term "attention" comes from.

Attention is computed from three learned projections of each token: a Query (what am I looking for?), a Key (what do I contain?), and a Value (what do I output?).

This is the famous equation from "Attention is all you need".

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$ where $d_k$ is the dimension of the key vectors.

This is $O(N^2)$, since $QK^T$ requires computing $N^2$ token-pair interactions.
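Here's that equation as naive single-head C++: a toy double-precision sketch for illustration, nothing like a real fused kernel. The nested loops over $N$ are exactly where the $O(N^2)$ comes from.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Scaled dot-product attention for one head: softmax(QK^T / sqrt(d_k)) V.
// Q, K, V are N x d matrices; the doubly nested loop over N is the O(N^2).
Mat attention(const Mat& Q, const Mat& K, const Mat& V) {
    const size_t N = Q.size(), d = Q[0].size();
    Mat out(N, std::vector<double>(V[0].size(), 0.0));
    for (size_t i = 0; i < N; ++i) {
        // Row i of QK^T, scaled by 1/sqrt(d_k)
        std::vector<double> scores(N);
        for (size_t j = 0; j < N; ++j) {
            double dot = 0.0;
            for (size_t k = 0; k < d; ++k) dot += Q[i][k] * K[j][k];
            scores[j] = dot / std::sqrt(static_cast<double>(d));
        }
        // Softmax over the row (max-subtracted for numerical stability)
        const double mx = *std::max_element(scores.begin(), scores.end());
        double sum = 0.0;
        for (double& s : scores) { s = std::exp(s - mx); sum += s; }
        for (double& s : scores) s /= sum;
        // Weighted sum of Value vectors
        for (size_t j = 0; j < N; ++j)
            for (size_t k = 0; k < V[0].size(); ++k)
                out[i][k] += scores[j] * V[j][k];
    }
    return out;
}
```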

Hence, prefill is the more compute-intensive phase.

That's why you need to wait so long for the first token!

KV Cache

To avoid recomputing the attention score matrix for every new token, we cache the computed $K$ and $V$ tensors in the KV Cache.

Decode Phase

By the end of prefill, the last token isn't just "bananas": it encodes the context of the entire prompt. Now, the model enters an autoregressive loop called the decode phase. This stage generates one new token at a time.

To predict the next token, we score the current token's Query against the cached Keys, then take a weighted sum of the cached Values. Since those KVs are already computed, each decode step is $O(N)$ instead of $O(N^2)$.

You'd think that would make it N times faster than the prefill phase. But the catch is that for every single token, the GPU has to read through the entire KV Cache from memory. For frontier models, this could reach up to 100GB! This makes decode memory bandwidth bound.
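A toy single-head decode step makes this concrete (the `KVCache` and `decode_step` names are mine, not from any real server): each new token appends one K,V pair and does O(N) dot products against the cache.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// One decode step for a single head of dimension d. The new token's query
// attends over all cached keys/values: O(N) work in cached length N, plus
// the full cache read that makes decode memory-bandwidth bound.
struct KVCache {
    std::vector<std::vector<double>> keys, values;  // one entry per past token
};

std::vector<double> decode_step(const std::vector<double>& q, KVCache& cache,
                                const std::vector<double>& k_new,
                                const std::vector<double>& v_new) {
    cache.keys.push_back(k_new);    // append this token's K,V once;
    cache.values.push_back(v_new);  // earlier ones are never recomputed
    const size_t N = cache.keys.size(), d = q.size();
    std::vector<double> scores(N);
    for (size_t j = 0; j < N; ++j) {  // O(N): one dot product per cached key
        double dot = 0.0;
        for (size_t k = 0; k < d; ++k) dot += q[k] * cache.keys[j][k];
        scores[j] = dot / std::sqrt(static_cast<double>(d));
    }
    const double mx = *std::max_element(scores.begin(), scores.end());
    double sum = 0.0;
    for (double& s : scores) { s = std::exp(s - mx); sum += s; }
    for (double& s : scores) s /= sum;
    std::vector<double> out(cache.values[0].size(), 0.0);
    for (size_t j = 0; j < N; ++j)
        for (size_t k = 0; k < out.size(); ++k)
            out[k] += scores[j] * cache.values[j][k];
    return out;
}
```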


Why does context length murder your bills?

When you paste a long document into ChatGPT and ask it to summarize, the model runs attention across every pair of tokens. So double the document, quadruple the compute. But compute isn't the only cost that explodes. A 1M context window means the KV Cache has to store up to one million token positions.

Rough napkin math for a Llama 3.1 model with 128 layers, a hidden dimension of 16384, a max context of 128K tokens, and bf16 values (2 bytes each).

KV Cache for:

  • one token, one layer: 2 bytes * 16384 hidden dimensions * 2 (for Key + Value vectors) = 64 KB
  • all 128 layers: 64 KB * 128 ≈ 8 MB per token (assuming hypothetical MHA)
  • max context of 128K tokens: 8 MB * 128000 ≈ 1 TB per request
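The same napkin math as compile-time C++, using the blog's assumed numbers (hypothetical MHA, bf16):

```cpp
#include <cstdint>

// KV cache sizing: bytes/value * hidden dim * 2 (K+V) * layers * context.
constexpr uint64_t kBytesPerValue = 2;       // bf16
constexpr uint64_t kHiddenDim     = 16384;
constexpr uint64_t kLayers        = 128;
constexpr uint64_t kContextTokens = 128000;

// Key + Value per token per layer: 2 * 16384 * 2 = 64 KB
constexpr uint64_t kPerTokenPerLayer = kBytesPerValue * kHiddenDim * 2;
// All layers: 64 KB * 128 = 8 MB per token
constexpr uint64_t kPerToken = kPerTokenPerLayer * kLayers;
// Full context: 8 MB * 128000 ≈ 1 TB per request
constexpr uint64_t kPerRequest = kPerToken * kContextTokens;

static_assert(kPerTokenPerLayer == 64 * 1024, "64 KB per token per layer");
```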

Holy sheep.

Let's look at the stock price of the largest High Bandwidth Memory manufacturer.


4X over the last year? Checks out :')

Group Query Attention

That 1TB KV Cache size number assumes every attention head gets its own Key and Value vectors. In practice, Llama 3.1 uses Grouped-Query Attention; instead of 128 separate KV heads, groups of 16 query heads share the same KV pair. That brings our 1TB down to ~64GB per request. Still enormous, but now it actually fits on hardware that exists.
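In code, GQA just means the cache scales with KV heads rather than query heads. The 128 query heads / 8 KV heads split below is an assumption consistent with the group size of 16:

```cpp
#include <cstdint>

// GQA shrinks the KV cache by (query heads / KV heads). Head counts are
// assumptions matching a group size of 16.
constexpr uint64_t kQueryHeads    = 128;
constexpr uint64_t kKVHeads       = 8;   // 128 / 8 = 16 query heads per KV head
constexpr uint64_t kMHACacheBytes = 1073741824000ULL;  // the ~1 TB napkin figure
constexpr uint64_t kGQACacheBytes = kMHACacheBytes * kKVHeads / kQueryHeads;

static_assert(kGQACacheBytes == kMHACacheBytes / 16, "16x smaller cache");
```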

Okay, how do we scale from one user to a million?

Continuous Batching

The Orca paper (OSDI 2022) introduced this optimization to better streamline inference. Production servers batch multiple users' requests into a single forward pass. This is done to maximize GPU utilization.


After tokenization, the tokens are concatenated into one flat vector. This is combined with position IDs to tell the system which tokens belong together.


Since the transformer would otherwise "attend" across all these tokens as if they came from one request, we isolate them with an attention mask that blocks attention between different users' tokens.


For you acute (yes, you're cute too) readers out there: you might notice we're still mixing fast decode requests with slower prefill requests. Here, User A's decode might take longer to complete because it's batched with User B's long prefill. That brings us to our next optimization.

Chunked Prefill

Chunked prefill breaks a long prefill into smaller chunks, so decode steps from other requests can be interleaved between them.


This further improves GPU utilization and decode throughput when combined with continuous batching. For example, chunked prefill is why a 200-page document summarization prompt doesn't bottleneck the whole inference service.
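The chunking itself is simple; here's a minimal sketch with an illustrative chunk size (real servers tune it to balance prefill and decode work):

```cpp
#include <algorithm>
#include <vector>

// Split a prompt's token IDs into fixed-size prefill chunks. Each chunk is
// processed in a separate forward pass, letting decode steps run in between.
std::vector<std::vector<int>> chunk_prompt(const std::vector<int>& tokens,
                                           size_t chunk_size) {
    std::vector<std::vector<int>> chunks;
    for (size_t i = 0; i < tokens.size(); i += chunk_size) {
        const size_t end = std::min(i + chunk_size, tokens.size());
        chunks.emplace_back(tokens.begin() + i, tokens.begin() + end);
    }
    return chunks;
}
```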

Speculative decoding

The problem is that decoding is sequential. Each token depends on the last and takes one forward pass to execute. That's expensive.

Speculative decoding cheats by using a small, fast "draft" model to guess the next n tokens.

The big model then verifies all n tokens in one forward pass. Verification works like prefill: the big model takes the n tokens as input, computes the logits at each position, and checks whether the draft model's picks match. Everything up to the first mismatch gets accepted.


The goldilocks challenge here is the draft model's size. Too dumb, and most tokens get rejected and you've wasted compute. Too smart, and you might as well use it directly.

Google uses this in production for Gemini; the draft model is typically 10-20x smaller than the main model. They claim a 2-3x improvement in throughput!

Did you know your iPhone has speculative decoding built in? Apple's on-device 3B-parameter model uses it to make parts of Siri run fast locally.

Paged attention

vLLM went from a Berkeley PhD project to the industry-wide serving stack in under a year because of paged attention.

Paged attention, much like virtual memory paging (did you pay attention in your OS class?), allocates KV cache in small pages on demand. This not only saves memory, but also allows more requests to fit into the GPU at the same time.
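A toy page-table sketch of the idea (the names and the 16-token page size are illustrative, not vLLM's actual API):

```cpp
#include <unordered_map>
#include <vector>

// Toy paged KV allocator: each request gets fixed-size pages on demand
// instead of one giant contiguous reservation sized for max context.
constexpr int kPageSize = 16;  // tokens per page (illustrative)

struct PagedKV {
    int next_free_page = 0;                                // next physical page id
    std::unordered_map<int, std::vector<int>> page_table;  // request -> its pages

    // Called as a request grows by one token; allocates only on a boundary.
    void append_token(int request_id, int token_pos) {
        if (token_pos % kPageSize == 0)
            page_table[request_id].push_back(next_free_page++);
    }
};
```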


Quantization

Meta released a model that takes 1.6 TB at full precision. That would require 20 Nvidia H100s just to store the weights.

In practice, models are "quantized" to fit in the hardware.

This is the same kind of optimization video game programmers make:

float bodyWeight = 67.5f; // who needs 32 bits to store a weight??

int8_t bodyWeight = 67; // less accurate, but only takes 8 bits

For model weights, int8 is nearly lossless. Int4 is still used heavily for local inference, but it comes with a quality tradeoff.
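Here's a minimal symmetric per-tensor int8 quantizer in the same spirit: store the weights as int8 plus one float scale, and dequantize on the fly. A sketch only; production schemes use per-channel or per-group scales.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric int8 quantization: map [-max_abs, max_abs] onto [-127, 127]
// with a single per-tensor scale factor.
struct QuantizedTensor {
    std::vector<int8_t> q;
    float scale;
};

QuantizedTensor quantize(const std::vector<float>& w) {
    float max_abs = 0.0f;
    for (float x : w) max_abs = std::max(max_abs, std::fabs(x));
    const float scale = (max_abs == 0.0f) ? 1.0f : max_abs / 127.0f;
    QuantizedTensor t{{}, scale};
    for (float x : w)
        t.q.push_back(static_cast<int8_t>(std::lround(x / scale)));
    return t;
}

float dequantize(const QuantizedTensor& t, size_t i) {
    return static_cast<float>(t.q[i]) * t.scale;
}
```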

KV Cache Quantization

The inference frontier is still moving really fast. Google published TurboQuant (ICLR 2026), which quantizes the KV cache itself down to 3 bits. Going from 16-bit to 3-bit values shrinks the KV Cache by roughly 5x.

Final thoughts

Next time ChatGPT streams a response to you, you'll know what's happening. Tokens. Attention. KV cache growing one row at a time. Speculative decoding, chunked prefill and other inference optimizations.


Appendix:

Tell me more!

We covered what happens between your prompt and the model's first token. But we skipped the last mile: how does the model actually pick that token? Why doesn't temperature=0 mean deterministic? That's Part 3 of this series.

Show me the code!

If you want to peel back the layers yourself, start with WhiteLotus: an inference server that I wrote alongside this blog. It's bare-bones, so it's very easy to tinker with and understand what a production-level inference server like vLLM or llama.cpp is doing under the hood.

If you liked this human-written article, consider subscribing to my blog!

Get an email when the next blog is published! No spam.