KV Cache Size Calculator

Calculate how much memory a transformer's KV cache uses — the per-token store of keys and values that attention reuses so it never recomputes the past. The cache is the reason long context and high concurrency eat VRAM: it scales linearly with context length and batch size. Pick a model or enter the architecture, set your context, and see the exact memory. Computed in your browser.

How to use the KV Cache Size Calculator

Choose a model preset to fill in the layer count, KV-head count, and head dimension, or select Custom and read those three numbers from the model's config.json: num_hidden_layers for the layers, num_key_value_heads for the KV heads, and hidden_size / num_attention_heads for the head dimension (unless the config provides an explicit head_dim). Then set the context length, the number of concurrent requests, and the cache precision.
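
As a rough illustration of the Custom path, the sketch below pulls the three numbers out of a config.json-style dictionary. The helper name kv_shape_from_config and the sample values are assumptions made for this example, not part of the calculator; the fallback to one KV head per query head when num_key_value_heads is missing is also an assumption about older multi-head configs.

```python
import json


def kv_shape_from_config(config: dict) -> tuple[int, int, int]:
    """Return (layers, kv_heads, head_dim) from a config.json-style dict."""
    layers = config["num_hidden_layers"]
    query_heads = config["num_attention_heads"]
    # Assumption: configs without num_key_value_heads use classic multi-head
    # attention, so every query head has its own KV head.
    kv_heads = config.get("num_key_value_heads", query_heads)
    # Prefer an explicit head_dim; otherwise derive it from hidden_size.
    head_dim = config.get("head_dim") or config["hidden_size"] // query_heads
    return layers, kv_heads, head_dim


# Illustrative fragment with values assumed for an 8B-class GQA model.
example = json.loads("""
{
  "hidden_size": 4096,
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8
}
""")

print(kv_shape_from_config(example))  # (32, 8, 128)
```

To use it on a real model, load the downloaded file with json.load and pass the resulting dictionary in.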

The result shows the cache cost per token, the cost of a single full-length request, and the grand total across your batch. The per-token figure is the most useful one for planning: multiply it by however many tokens you expect to hold in memory. If the total is large relative to your VRAM, you can halve it by switching from an FP16 to an FP8 KV cache, reduce the context window, or serve fewer requests at once.

What the KV cache is and why it grows

Each time a transformer processes a token, every attention layer computes a key and a value vector for it. Later tokens attend back to all earlier keys and values, so rather than recompute them at every step, the model stores them; that store is the KV cache. Its size in bytes is 2 × layers × kv_heads × head_dim × context × batch × bytes_per_element. The factor of two covers keys and values, bytes_per_element is the cache precision (2 for FP16, 1 for FP8), and everything else follows from the architecture and how much text you are holding.
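
The formula translates directly into a few lines of Python. This is a minimal sketch, not the calculator's own code; the function name kv_cache_bytes and the example shape (32 layers, 8 KV heads, head dimension 128, FP16) are assumptions chosen for illustration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context, batch=1, bytes_per_element=2):
    """KV cache size in bytes: 2 x layers x kv_heads x head_dim x context x batch x bytes."""
    # The leading 2 accounts for storing both a key and a value per position.
    return 2 * layers * kv_heads * head_dim * context * batch * bytes_per_element


# Assumed 8B-class shape with an FP16 cache (2 bytes per element).
per_token = kv_cache_bytes(32, 8, 128, context=1)
one_request = kv_cache_bytes(32, 8, 128, context=8192)
total = kv_cache_bytes(32, 8, 128, context=8192, batch=16)

print(f"per token:   {per_token / 2**10:.1f} KiB")    # 128.0 KiB
print(f"one request: {one_request / 2**30:.1f} GiB")  # 1.0 GiB
print(f"total:       {total / 2**30:.1f} GiB")        # 16.0 GiB
```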

The crucial property is that the cache grows linearly with context length and batch size, while the cost per token is set by the layer count, KV-head count, and head dimension rather than directly by the model's parameter count. That decoupling is why a modest 8B model serving 128K-token contexts to many users at once can spend more VRAM on its cache than on its weights. It is also why throughput-oriented serving frameworks like vLLM put so much engineering into managing cache memory with techniques such as paged attention.
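
To put numbers on the cache-versus-weights comparison, the sketch below works through one assumed case: an 8B-class model with 32 layers, 8 KV heads, and head dimension 128, with both weights and cache in FP16. The round 8e9 parameter count and the shape are illustrative assumptions, not measurements of any particular model.

```python
# Assumed 8B-class shape: 32 layers, 8 KV heads, head dimension 128, FP16 cache.
LAYERS, KV_HEADS, HEAD_DIM, CACHE_BYTES = 32, 8, 128, 2

# Assumed weight footprint: 8e9 parameters stored in FP16 (2 bytes each).
PARAMS, WEIGHT_BYTES_PER_PARAM = 8_000_000_000, 2

per_token_bytes = 2 * LAYERS * KV_HEADS * HEAD_DIM * CACHE_BYTES
weight_bytes = PARAMS * WEIGHT_BYTES_PER_PARAM

# Cached tokens, summed over all requests, at which the cache matches the weights.
break_even_tokens = weight_bytes // per_token_bytes

# One request holding a full 128K (131,072-token) context.
full_request_bytes = per_token_bytes * 128 * 1024

print(f"weights:          {weight_bytes / 2**30:.1f} GiB")        # 14.9 GiB
print(f"one 128K request: {full_request_bytes / 2**30:.1f} GiB")  # 16.0 GiB
print(f"break-even:       {break_even_tokens:,} tokens")          # 122,070 tokens
```

Under these assumptions a single full-length request already outweighs the model, and every additional concurrent request adds the same amount again.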

Two architectural choices cut the cache dramatically. Grouped-query attention (GQA) lets many query heads share a few KV heads: Llama 3.1 has 8 KV heads serving a much larger number of query heads, which shrinks the cache severalfold versus classic multi-head attention, where every query head has its own KV head. Multi-query attention (MQA) takes it further with a single KV head. Newer designs like multi-head latent attention (MLA) take a different route and cache a compressed latent representation instead of full per-head keys and values. When you compare an old multi-head model (try the GPT-3 preset) against a modern GQA model at the same context, the difference in cache size is stark.
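
The effect of the KV-head count is easy to see by holding everything else fixed. The sketch below assumes a hypothetical model shape (80 layers, 64 query heads, head dimension 128, FP16 cache) and varies only the number of KV heads; the shape is illustrative and the helper per_token_bytes is made up for this example.

```python
# Hypothetical shape: 80 layers, 64 query heads, head dimension 128, FP16 cache.
LAYERS, QUERY_HEADS, HEAD_DIM, BYTES_PER_ELEMENT = 80, 64, 128, 2


def per_token_bytes(kv_heads: int) -> int:
    """Per-token KV cache cost for a given number of KV heads."""
    return 2 * LAYERS * kv_heads * HEAD_DIM * BYTES_PER_ELEMENT


variants = {
    "MHA (64 KV heads)": per_token_bytes(QUERY_HEADS),  # one KV head per query head
    "GQA (8 KV heads)": per_token_bytes(8),             # 8 query heads share each KV head
    "MQA (1 KV head)": per_token_bytes(1),              # all query heads share one KV head
}

baseline = variants["MHA (64 KV heads)"]
for name, size in variants.items():
    print(f"{name}: {size / 2**10:.0f} KiB per token, {baseline // size}x smaller than MHA")
```

With this shape the per-token cost drops from 2560 KiB under multi-head attention to 320 KiB with 8 KV heads and 40 KiB with a single KV head.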

Common use cases

  • Budgeting VRAM for long context. See exactly how much memory a 32K or 128K window adds on top of the model weights.
  • Sizing a serving deployment. Multiply per-token cost by context length and concurrent requests to size the KV memory pool for vLLM or TGI.
  • Evaluating FP8 KV cache. Compare FP16 against FP8 to decide whether quantizing the cache is worth it.
  • Understanding GQA. Compare a grouped-query model against a multi-head one to see why modern models serve long context so much more cheaply.
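
For the deployment-sizing case, the same arithmetic can be run in reverse: start from the memory left over for the cache and ask how many full-length requests fit. This is a worst-case sketch under assumed numbers (a 40 GiB cache budget, a 128 KiB per-token cost, 32K contexts); the function name is hypothetical, and runtimes that allocate cache memory on demand may fit more requests when most of them are shorter than the full window.

```python
def max_concurrent_requests(budget_bytes: int, per_token_bytes: int, context: int) -> int:
    """Worst-case count of full-length requests that fit in a KV cache budget."""
    # Total tokens the budget can hold, regardless of which request owns them.
    max_tokens = budget_bytes // per_token_bytes
    # Assume every request fills its whole context window.
    return max_tokens // context


budget = 40 * 2**30             # assumed: 40 GiB left for the cache after weights
per_token = 2 * 32 * 8 * 128 * 2  # assumed shape: 32 layers, 8 KV heads, dim 128, FP16

print(max_concurrent_requests(budget, per_token, context=32 * 1024))  # 10
```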

Frequently asked questions

Where do I find KV heads and head dimension?

In the model's config.json on Hugging Face. KV heads is num_key_value_heads; the head dimension is hidden_size divided by num_attention_heads (or an explicit head_dim field). The layer count is num_hidden_layers. The presets here fill these in for popular models.

Why is the KV cache separate from the model weights?

Weights are fixed once the model loads; the KV cache is allocated per token of context per request. So total VRAM is weights plus KV cache plus overhead. This calculator covers only the cache — use it alongside a VRAM calculator for the full picture.

How does grouped-query attention reduce the cache?

In GQA many query heads share a small number of KV heads, so only those few KV heads are cached. A model with 8 KV heads instead of 64 has one-eighth the cache of the multi-head equivalent, which is why most recent models use GQA for long-context efficiency.

Can I quantize the KV cache?

Yes. Many runtimes support an FP8 or INT8 KV cache, halving its memory versus FP16 with little quality loss for most workloads. Select the lower precision here to see the saving. It is independent of the precision used for the model weights.
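
The saving is a straight ratio of bytes per element. The sketch below compares the two precisions for one assumed workload (32 layers, 8 KV heads, head dimension 128, 32K context, 8 concurrent requests); the names BYTES_PER_ELEMENT and cache_gib are made up for this example and do not correspond to any runtime's settings.

```python
# Bytes per cached element for each cache precision.
BYTES_PER_ELEMENT = {"fp16": 2, "fp8": 1}

# Assumed workload: 32 layers, 8 KV heads, head dimension 128, 32K context, 8 requests.
LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, BATCH = 32, 8, 128, 32 * 1024, 8


def cache_gib(precision: str) -> float:
    """Total KV cache in GiB for the assumed workload at a given cache precision."""
    elements = 2 * LAYERS * KV_HEADS * HEAD_DIM * CONTEXT * BATCH
    return elements * BYTES_PER_ELEMENT[precision] / 2**30


for precision in BYTES_PER_ELEMENT:
    print(f"{precision}: {cache_gib(precision):.1f} GiB")
# fp16: 32.0 GiB
# fp8: 16.0 GiB
```

The weight precision does not appear anywhere in the calculation, which reflects the point that the two settings are independent.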

Does batch size really multiply the cache?

Yes. Each concurrent request keeps its own cache, so two requests at the same context use twice the memory. This is why high-concurrency serving is usually limited by KV cache memory rather than by the model weights.