KV Cache Size Calculator
Calculate how much memory a transformer's KV cache uses — the per-token store of keys and values that attention reuses so it never recomputes the past. The cache is the reason long context and high concurrency eat VRAM: it scales linearly with context length and batch size. Pick a model or enter the architecture, set your context, and see the exact memory. Computed in your browser.
How to use the KV Cache Size Calculator
Choose a model preset to fill in the layer count, KV-head count, and head dimension, or select Custom and read those three numbers from the model's config.json: num_hidden_layers for the layer count, num_key_value_heads for the KV-head count, and hidden_size / num_attention_heads for the head dimension (some configs list head_dim directly; use that value when it is present). Then set the context length, the number of concurrent requests, and the cache precision.
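As a rough illustration of the Custom path, the sketch below pulls the three numbers out of a parsed config.json. The helper name `kv_dims_from_config` is hypothetical, the example values are illustrative figures in the shape of a Llama-3.1-8B-style config, and treating a missing num_key_value_heads as plain multi-head attention is an assumption rather than a rule every model follows.

```python
import json


def kv_dims_from_config(config: dict) -> tuple[int, int, int]:
    """Return (layers, kv_heads, head_dim) from a parsed config.json dict."""
    layers = config["num_hidden_layers"]
    q_heads = config["num_attention_heads"]
    # Assumption: a config without num_key_value_heads uses classic
    # multi-head attention, i.e. one KV head per query head.
    kv_heads = config.get("num_key_value_heads") or q_heads
    # Prefer an explicit head_dim; otherwise derive it from hidden_size.
    head_dim = config.get("head_dim") or config["hidden_size"] // q_heads
    return layers, kv_heads, head_dim


# Illustrative values only, not read from a real file.
example = json.loads("""
{"num_hidden_layers": 32, "num_attention_heads": 32,
 "num_key_value_heads": 8, "hidden_size": 4096}
""")

print(kv_dims_from_config(example))  # (32, 8, 128)
```

In practice you would replace the inline JSON with `json.load(open(path))` on the model's own config.json.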
The result shows the cache cost per token, for a single full-length request, and the grand total across your batch. The per-token figure is the useful one for planning: multiply it by however many tokens you expect to hold in memory. If the total is large relative to your VRAM, you can halve it by moving from an FP16 to an FP8 KV cache, shorten the context window, or serve fewer requests at once.
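The per-token figure can also be used in reverse: given a memory budget, how many tokens fit? The sketch below assumes a hypothetical per-token cost of 128 KiB at FP16 and a hypothetical 20 GiB set aside for the cache; neither number comes from the calculator.

```python
GIB = 1024 ** 3


def max_cached_tokens(budget_bytes: int, bytes_per_token: int) -> int:
    """How many tokens of KV cache fit in a given memory budget."""
    return budget_bytes // bytes_per_token


per_token_fp16 = 128 * 1024          # hypothetical per-token cost at FP16
per_token_fp8 = per_token_fp16 // 2  # FP8 stores one byte per element instead of two
budget = 20 * GIB                    # hypothetical VRAM left over for the cache

print(max_cached_tokens(budget, per_token_fp16))  # 163840
print(max_cached_tokens(budget, per_token_fp8))   # 327680
```

The token count is shared across all concurrent requests, so 163,840 tokens could be one very long request or twenty 8K ones.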
What the KV cache is and why it grows
Each time a transformer processes a token, every attention layer computes a key and a value vector for it. Later tokens attend back to all earlier keys and values, so rather than recompute them at every step, the model stores them; that store is the KV cache. Its size in bytes is 2 × layers × kv_heads × head_dim × context × batch × bytes_per_element, where context is the number of tokens held per request, batch is the number of concurrent requests, and bytes_per_element is the cache precision (2 for FP16, 1 for FP8). The factor of two is for keys and values; everything else follows from the architecture and how much text you are holding.
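The formula above translates directly into a few lines of code. This is a minimal sketch, not the calculator's own implementation; the function name `kv_cache_bytes` and the example architecture (32 layers, 8 KV heads, head dimension 128) are chosen here for illustration.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, batch: int, bytes_per_elem: int) -> int:
    """KV cache size in bytes for standard (MHA/GQA/MQA) attention."""
    # The leading 2 accounts for storing both a key and a value vector.
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per_elem


# Per-token cost: one token, one request, FP16 (2 bytes per element).
per_token = kv_cache_bytes(32, 8, 128, context=1, batch=1, bytes_per_elem=2)
print(per_token)  # 131072 bytes, i.e. 128 KiB

# Four concurrent requests, each holding 8192 tokens.
total = kv_cache_bytes(32, 8, 128, context=8192, batch=4, bytes_per_elem=2)
print(total / 1024 ** 3)  # 4.0 (GiB)
```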
The crucial property is that the cache grows linearly with context length and batch size, while its per-token cost depends only on the layer count, KV-head count, head dimension, and precision, not directly on the model's parameter count. That decoupling is why a modest 8B model serving 128K-token contexts to many users at once can spend more VRAM on its cache than on its weights. It is also why throughput-oriented serving frameworks like vLLM put so much engineering into managing cache memory with techniques such as PagedAttention.
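To make the cache-versus-weights comparison concrete, the sketch below uses assumed figures for an 8B-class model: 32 layers, 8 KV heads, head dimension 128, an FP16 cache, a 131,072-token context, and roughly 8 billion parameters stored at 2 bytes each. These are round illustrative numbers, not measurements.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context, batch, bytes_per_elem):
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per_elem


# Assumption: about 8e9 parameters held in FP16 (2 bytes each).
WEIGHTS_BYTES = 8_000_000_000 * 2


def cache_to_weights_ratio(batch: int, context: int = 131_072) -> float:
    """Ratio of KV cache size to weight size for the assumed 8B-class model."""
    cache = kv_cache_bytes(32, 8, 128, context, batch, bytes_per_elem=2)
    return cache / WEIGHTS_BYTES


for batch in (1, 4, 16):
    print(f"{batch:>2} requests: cache is {cache_to_weights_ratio(batch):.2f}x the weights")
```

Under these assumptions a single full-length request already needs slightly more memory for its cache than the weights occupy, and the ratio scales linearly with the number of concurrent requests.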
Two architectural choices cut the cache dramatically. Grouped-query attention (GQA) lets many query heads share a few KV heads. Llama 3.1 8B, for example, has 32 query heads sharing 8 KV heads, which makes its cache 4 times smaller than it would be under classic multi-head attention, where every query head has its own KV head. Multi-query attention (MQA) takes it further with a single KV head. Newer designs like multi-head latent attention (MLA) shrink the cache in a different way, by storing a compressed latent vector per token instead of full per-head keys and values. When you compare an old multi-head model (try the GPT-3 preset) against a modern GQA model at the same context, the difference in cache size is stark.
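The effect of sharing KV heads can be seen by holding everything else fixed and varying only the KV-head count. The sketch assumes a model with 32 layers, 32 query heads, and head dimension 128 at FP16; the three variants are hypothetical configurations of that one model, and MLA is left out because it does not fit this per-head formula.

```python
def per_token_bytes(layers: int, kv_heads: int, head_dim: int,
                    bytes_per_elem: int = 2) -> int:
    """Per-token KV cache cost; keys and values account for the factor of 2."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


LAYERS, QUERY_HEADS, HEAD_DIM = 32, 32, 128

variants = {
    "MHA (32 KV heads)": QUERY_HEADS,  # one KV head per query head
    "GQA (8 KV heads)": 8,             # four query heads share each KV head
    "MQA (1 KV head)": 1,              # all query heads share one KV head
}

mha = per_token_bytes(LAYERS, QUERY_HEADS, HEAD_DIM)
for name, kv_heads in variants.items():
    cost = per_token_bytes(LAYERS, kv_heads, HEAD_DIM)
    print(f"{name}: {cost // 1024} KiB per token, {mha // cost}x smaller than MHA")
```

The reduction factor is simply the ratio of query heads to KV heads, since nothing else in the formula changes.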
Common use cases
- Budgeting VRAM for long context. See exactly how much memory a 32K or 128K window adds on top of the model weights.
- Sizing a serving deployment. Multiply per-token cost by concurrent requests to size the KV memory pool for vLLM or TGI.
- Evaluating FP8 KV cache. Compare FP16 against FP8 to decide whether quantizing the cache is worth it.
- Understanding GQA. Compare a grouped-query model against a multi-head one to see why modern models serve long context so much more cheaply.
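Several of these use cases come down to sweeping context length and precision for a fixed architecture. The sketch below does that for an assumed 32-layer, 8-KV-head, head-dimension-128 model serving a hypothetical 8 concurrent requests; swap in your own numbers.

```python
GIB = 1024 ** 3


def kv_cache_gib(layers, kv_heads, head_dim, context, batch, bytes_per_elem):
    """KV cache size in GiB for standard (MHA/GQA/MQA) attention."""
    total = 2 * layers * kv_heads * head_dim * context * batch * bytes_per_elem
    return total / GIB


arch = dict(layers=32, kv_heads=8, head_dim=128)  # assumed architecture
concurrent_requests = 8                           # hypothetical load

table = {}
for context in (32_768, 131_072):
    for name, width in (("FP16", 2), ("FP8", 1)):
        table[(context, name)] = kv_cache_gib(
            **arch, context=context, batch=concurrent_requests, bytes_per_elem=width
        )

for (context, name), gib in table.items():
    print(f"{context:>7} tokens  {name:<4}  {gib:6.1f} GiB")
```

These totals are for the cache alone; the model weights and activation memory come on top.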