KV Cache and Context Length: Why a Bigger Context Window Eats Your VRAM
You sized the GPU for the weights. The model loaded with room to spare. Then you turned on real traffic, raised the context limit, and the server started rejecting sequences or OOMing under load. The weights never moved. The KV cache did.
This is the second VRAM axis that weight-sizing guides skip. A model's parameters are a fixed cost paid once at load. The KV cache is a variable cost paid per token, per concurrent request, every time someone sends a long prompt. On a busy server it routinely outgrows the weights, and it is usually the thing that decides how many users you can serve at once.
If your question is "will this model fit on my card for personal use," the local-fit side (GGUF quants and single-user math) is covered in How much VRAM do you need to run Llama 3 or Gemma locally. This guide is the other side of that coin: the cache itself, the grouped-query-attention math that sets its size, and what happens when you multiply it by concurrency. The KV Cache Calculator computes every number below in your browser.
The cache is per token, the weights are not
Every time a transformer reads a token, each attention layer produces a key vector and a value vector for it. Future tokens attend back to those, so the model stores them instead of recomputing the whole history on every step. That store is the KV cache.
Weights and cache scale on different axes. Weights track parameter count and are allocated once. The cache tracks how much text you are holding and how many requests you are holding it for. Crucially, the cache does not scale with parameter count directly: it depends only on layers, KV heads, and head dimension, so two models with the same values for those three would carry the same cache regardless of total parameters. Total memory is two independent budgets stacked together:
total VRAM = weights (fixed) + KV cache (grows with context x concurrency) + overhead
A weight-only estimate tells you whether the model loads. It tells you nothing about whether it survives a 100K-token prompt or thirty simultaneous users. Those are KV cache questions, and they have their own formula.
The formula, written so GQA is visible
Most write-ups fold the per-token cost into one kv_dim term, which hides the architectural lever that matters most. The calculator and this guide keep it expanded:
KV bytes = 2 x layers x kv_heads x head_dim x context x batch x bytes_per_element
Each factor:
- 2: one slot for keys, one for values.
- layers: every decoder layer keeps its own cache. From num_hidden_layers in the model's config.json.
- kv_heads: the number of key/value heads, num_key_value_heads. This is not the query-head count, and the gap between them is where grouped-query attention does its work.
- head_dim: size of each head, usually hidden_size / num_attention_heads.
- context: tokens currently held. Linear: double the prompt, double the cache.
- batch: concurrent sequences. Also linear, and the one people forget.
- bytes_per_element: 2 for FP16/BF16, 1 for an 8-bit (FP8 or INT8) cache.
Notice what is absent: parameter count. The cache reaches a model's size only through the layer, head, and dimension numbers that model happens to have.
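The formula translates directly into a few lines of Python. This is a minimal sketch: the function name kv_cache_bytes and its argument names are illustrative, not taken from the calculator, and the example shape (32 layers, 8 KV heads, head dimension 128) is assumed for the demonstration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context, batch=1, bytes_per_element=2):
    """KV cache size in bytes.

    The leading 2 covers keys and values; bytes_per_element is 2 for
    FP16/BF16 and 1 for an 8-bit cache.
    """
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per_element


# Assumed example shape: 32 layers, 8 KV heads, head_dim 128, one 8K request.
one_request = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context=8192)
print(one_request / 2**30, "GiB")  # 1.0 GiB
```

Parameter count never appears in the signature, which is the point of the paragraph above.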
The per-token number is the one to plan around
Strip out context and batch and you get a constant for each model: bytes of cache per token. Multiply by whatever you are actually holding. For Llama 3.1 8B at FP16, using the calculator's preset of 32 layers, 8 KV heads, 128 head dim:
2 x 32 x 8 x 128 x 2 = 131,072 bytes = 128 KiB per token
That 128 KiB/token is the planning unit. A single 8K request holds 128 KiB x 8192 = exactly 1 GiB of cache. Hold the same per-token figure against your real numbers:
| Model (calculator preset) | layers / kv_heads / head_dim | FP16 per token | One 8K request |
|---|---|---|---|
| Llama 3.1 8B | 32 / 8 / 128 | 128 KiB | ~1.0 GiB |
| Qwen 2.5 7B | 28 / 4 / 128 | 56 KiB | ~448 MiB |
| Llama 3.1 70B | 80 / 8 / 128 | 320 KiB | ~2.5 GiB |
Qwen 2.5 7B carries less than half of Llama 3.1 8B's per-token cache despite being in roughly the same weight class, mainly because it uses 4 KV heads instead of 8 (its 28 layers versus 32 trim a little more). That is an architecture decision showing up directly in your serving budget, invisible from the parameter count alone. The 70B's larger per-token figure, in turn, comes from its 80 layers, not its parameter total.
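The table can be reproduced from the three shape numbers alone. The sketch below assumes the preset values listed in the table; the names PRESETS and per_token_bytes are illustrative.

```python
# Preset shapes as listed in the table above: (layers, kv_heads, head_dim).
PRESETS = {
    "Llama 3.1 8B": (32, 8, 128),
    "Qwen 2.5 7B": (28, 4, 128),
    "Llama 3.1 70B": (80, 8, 128),
}


def per_token_bytes(layers, kv_heads, head_dim, bytes_per_element=2):
    """Bytes of KV cache added by one token of one sequence."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element


rows = {}
for name, shape in PRESETS.items():
    per_token = per_token_bytes(*shape)
    # (KiB per token, MiB held by one 8192-token request)
    rows[name] = (per_token // 1024, per_token * 8192 / 2**20)

for name, (kib, mib) in rows.items():
    print(f"{name}: {kib} KiB per token, {mib:.0f} MiB per 8K request")
```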
Concurrency is the axis weight guides skip
Context length gets the attention because it is the dramatic number. But on a production server the multiplier that actually bites is batch: every concurrent sequence keeps its own private cache.
Take Llama 3.1 8B again, 1 GiB of cache per 8K request. Now serve thirty-two users, each with an 8K context:
1 GiB x 32 = 32 GiB of KV cache
The weights at FP16 are about 16 GB. The cache for that traffic is roughly twice the weights, and you have not touched context length; you only raised concurrency. Set batch to 32 in the KV Cache Calculator and watch the total cross your card's capacity. This is why high-throughput serving is almost always KV-bound, not weight-bound: the weights are a one-time deposit, the cache is rent you pay per active conversation.
It also reframes the throughput ceiling. Your maximum number of concurrent sequences is roughly (VRAM - weights - overhead) / per-request cache. Want more users? You are trading against context length and KV precision, not buying a bigger model.
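The ceiling is a single integer division once the three budgets are in bytes. A minimal sketch follows; the function name is illustrative, and the 80 GiB card, 16 GB of weights, and 8 GiB of overhead are assumed example figures, not measurements.

```python
GIB = 2**30


def max_concurrent_sequences(vram_bytes, weight_bytes, overhead_bytes, per_request_bytes):
    """Whole sequences that fit in the VRAM left after weights and overhead."""
    kv_budget = vram_bytes - weight_bytes - overhead_bytes
    if kv_budget <= 0:
        return 0  # the model barely loads; there is no room for any cache
    return kv_budget // per_request_bytes


# Assumed figures: an 80 GiB card, ~16 GB of FP16 weights, 8 GiB of overhead,
# and 1 GiB of cache per 8K request (128 KiB per token x 8192 tokens).
ceiling = max_concurrent_sequences(
    vram_bytes=80 * GIB,
    weight_bytes=16 * 10**9,
    overhead_bytes=8 * GIB,
    per_request_bytes=1 * GIB,
)
print(ceiling)  # 57
```

Halving per_request_bytes, whether through a shorter context or an 8-bit cache, roughly doubles the result, which is the trade the paragraph describes.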
GQA, MQA, and why modern models serve long context cheaply
The kv_heads term is the lever, and it explains why two models in the same weight class can have wildly different serving costs.
Multi-head attention (MHA), the original design, gives every query head its own key/value head. Grouped-query attention (GQA) lets many query heads share a small pool of KV heads. Multi-query attention (MQA) takes it to one shared KV head. Fewer KV heads, smaller cache, same formula.
Make the saving concrete with the calculator. Llama 3.1 8B uses 8 KV heads. Set kv_heads to 32 to simulate what full MHA would cost at the same shape:
GQA (8 KV heads): 128 KiB per token -> 1 GiB at 8K
MHA (32 KV heads): 512 KiB per token -> 4 GiB at 8K
Four times the cache for the same model, the same context, the same quality of weights. Or load the calculator's GPT-3 175B (MHA) preset, 96 layers and 96 KV heads, to see a pre-GQA architecture's cache balloon next to a modern one. Multi-head latent attention compresses the cache further still by storing a low-rank projection rather than full key/value tensors; the principle is the same, attack the per-token bytes.
When you are choosing a model to serve, put the KV-head count next to the benchmark scores. A model with fewer KV heads lets you hold more context or more users in the same VRAM.
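To see the three attention variants side by side, hold the shape fixed and vary only kv_heads. This sketch assumes the 32-layer, head-dimension-128 shape used in the comparison above; the MQA row is a hypothetical single-KV-head variant of that shape, not a released model.

```python
def per_token_kib(layers, kv_heads, head_dim, bytes_per_element=2):
    """KiB of KV cache per token for a given shape."""
    return 2 * layers * kv_heads * head_dim * bytes_per_element / 1024


# Same assumed shape (32 layers, head_dim 128), varying only kv_heads.
variants = {"MHA": 32, "GQA": 8, "MQA": 1}
costs = {name: per_token_kib(32, kv, 128) for name, kv in variants.items()}

for name, kib in costs.items():
    gib_at_8k = kib * 8192 / 2**20  # KiB per token x tokens -> GiB
    print(f"{name} ({variants[name]} KV heads): {kib:.0f} KiB per token, {gib_at_8k:.3f} GiB at 8K")
```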
Sizing a real deployment
Turn the per-token number into a serving plan in four steps.
- Get the per-token cost. Pick the model preset in the calculator or read num_hidden_layers, num_key_value_heads, and the head dimension from config.json.
- Decide your KV budget. Subtract weights and ~10% overhead from your card. On a 24 GB card running Llama 3.1 8B at FP16 (~16 GB weights), that leaves roughly 6 GB for cache after overhead.
- Divide. 6 GB / 1 GiB-per-8K-request is about six concurrent 8K sequences, or fewer at longer context. That is your concurrency ceiling.
- Tune the levers if it is tight. An FP8 KV cache roughly halves the bytes per token. A shorter served context length scales linearly. A model with fewer KV heads changes the constant itself.
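The four steps can be sketched end to end. The function names are illustrative, the config dictionary holds assumed values matching the 32 / 8 / 128 shape used throughout, and the 6 GiB budget is a round stand-in for the "roughly 6 GB" figure in step two.

```python
GIB = 2**30


def per_token_bytes(config, bytes_per_element=2):
    """Step 1: per-token cost from config.json-style fields."""
    # Fall back to hidden_size / num_attention_heads when head_dim is absent.
    head_dim = config.get("head_dim") or config["hidden_size"] // config["num_attention_heads"]
    return 2 * config["num_hidden_layers"] * config["num_key_value_heads"] * head_dim * bytes_per_element


def concurrency_ceiling(kv_budget_bytes, config, context, bytes_per_element=2):
    """Step 3: whole sequences of `context` tokens that fit in the budget."""
    per_request = per_token_bytes(config, bytes_per_element) * context
    return kv_budget_bytes // per_request


# Assumed config values consistent with 32 layers, 8 KV heads, head_dim 128.
config = {
    "num_hidden_layers": 32,
    "num_key_value_heads": 8,
    "hidden_size": 4096,
    "num_attention_heads": 32,
}

kv_budget = 6 * GIB  # Step 2: assumed budget after weights and overhead

# Step 4: compare the levers against the FP16, 8K baseline.
baseline = concurrency_ceiling(kv_budget, config, context=8192)
fp8 = concurrency_ceiling(kv_budget, config, context=8192, bytes_per_element=1)
short = concurrency_ceiling(kv_budget, config, context=4096)
print(baseline, fp8, short)  # 6 12 12
```

An 8-bit cache and a halved context buy the same headroom here because both enter the formula as plain linear factors.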
If you run vLLM, this is what --gpu-memory-utilization and --max-model-len govern. The first sets the fraction of the card's VRAM that vLLM may use in total; the KV pool is what remains of that budget after weights and activation memory. The second sets the longest sequence that pool must accommodate. Paged attention lets that pool be shared in fixed-size blocks instead of pre-reserving a contiguous slab per sequence, which cuts the fragmentation waste, but it does not change the underlying bytes the formula gives you. The vLLM serve command generator wires these flags together, and the LLM VRAM Calculator adds weights and overhead on top for the full picture.
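As a rough illustration of how the flags combine, the sketch below only assembles a vllm serve command line as a list of arguments; it does not launch anything. The helper name vllm_serve_command is hypothetical, the model ID is a placeholder, and the values 0.90 and 8192 are assumed. Check vllm serve --help on your installed version before relying on any flag.

```python
import shlex


def vllm_serve_command(model, gpu_memory_utilization, max_model_len, kv_cache_dtype="auto"):
    """Build (but do not run) a `vllm serve` command line."""
    return [
        "vllm", "serve", model,
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
        "--kv-cache-dtype", kv_cache_dtype,
    ]


# Placeholder model ID and assumed settings: 90% of VRAM, 8K max length, FP8 cache.
cmd = vllm_serve_command("meta-llama/Llama-3.1-8B-Instruct", 0.90, 8192, kv_cache_dtype="fp8")
print(shlex.join(cmd))
```

Keeping the command as a list makes it straightforward to pass to subprocess.run later without worrying about shell quoting.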
The mental model to keep
- Weights are a fixed deposit. The KV cache is rent, paid per token and per concurrent request.
- The cache is linear in both context and batch, and independent of parameter count except through layers, KV heads, and head dimension.
- Fewer KV heads (GQA, MQA) is the architectural discount: 1 GiB versus 4 GiB for the same 8B model at 8K.
- Concurrency, not context, is usually what caps a serving deployment. Divide your spare VRAM by the per-request cache to find your real user ceiling.
- The fast levers when memory is tight: FP8 KV cache (roughly half), shorter served context (linear), or a model with fewer KV heads.
Run your own numbers in the KV Cache Calculator: pick a preset, set the context and batch you actually expect, and toggle precision and KV heads to see each lever move the total. It all computes locally, so nothing you type leaves the page.
Frequently asked questions
Why does the model load fine but fail once traffic ramps up?
The weights are allocated once at load and never grow. The KV cache is allocated per token per concurrent request, so it can balloon under load even though nothing about the weights changed. A deployment that loads comfortably can still run out of memory at thirty simultaneous long-context sessions because each one carries its own cache.
How much does grouped-query attention actually save versus multi-head attention?
It scales with the ratio of query heads to KV heads. Llama 3.1 8B uses 8 KV heads where a full multi-head version of the same shape would use 32, so its cache is one-quarter the size at any context. Set kv_heads to 8 versus 32 in the KV Cache Calculator to see the exact bytes for your model.
Is context length or concurrency the bigger memory risk for a server?
Both are linear, but concurrency is the one people under-plan. A single 128K prompt is large, yet a steady stream of moderate 8K requests from many users can dwarf it because each request keeps a full cache. Compute per-request cache, then multiply by your expected concurrent sequences to find which dominates your workload.
How do I turn the cache number into a vLLM configuration?
Your KV pool is roughly the VRAM vLLM is allowed to use minus weights minus overhead; --gpu-memory-utilization sets that allowance as a fraction of the card. --max-model-len sets the longest sequence that pool must hold. Divide the pool by per-request cache at that length to estimate concurrent sequences. Paged attention reduces fragmentation in that pool but does not change the per-token bytes. Note that vLLM's quantized KV cache is FP8 (E4M3 or E5M2), not INT8.
Does an FP8 KV cache hurt output quality?
For most workloads the impact is small, on the order of one to two points on accuracy benchmarks, and it roughly halves cache memory because it stores one byte per element instead of two. It is independent of the precision used for the weights, so you can keep BF16 weights and an FP8 cache. Toggle precision in the calculator to see the saving before committing.
Why is the KV cache not simply proportional to parameter count?
The cache depends on layers, KV heads, and head dimension, not on the parameter total. Llama 3.1 70B has more layers than the 8B (80 versus 32), so its per-token cache is larger, but that gap comes from the layer count, not the parameter count. Two models with the same layer, KV-head, and head-dim numbers would carry the same cache regardless of total parameters.