LLM VRAM Calculator

Work out how much GPU memory you need to run a large language model locally. Pick a model (or enter parameters), choose a quantization, and set your context length — the calculator splits the total into model weights, KV cache, and runtime overhead, then tells you which common GPUs can hold it. Everything is computed in your browser from the same arithmetic llama.cpp and vLLM use; nothing is sent anywhere.

How to use the LLM VRAM Calculator

Choose a model from the preset list — it fills in the parameter count, layer count, and KV dimension for you — or pick Custom and type the numbers in. Then set the quantization you plan to load the weights at, the context length you want to support, and how many requests you will run in parallel (batch). The four-row table updates instantly.

The first row, model weights, is just parameters × bytes-per-weight. The second, KV cache, grows with context length and batch size — this is the number most people forget, and it is why a model that "fits" at 2K context spills out of VRAM at 32K. The overhead row is a flat allowance for the CUDA context, activations, and fragmentation. The GPU list at the bottom marks each card green if the total fits with headroom, amber if it is close, and red if it will not load.
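The colour coding can be sketched as a simple threshold check. The 10% headroom cut-off and the name fit_status are assumptions made for illustration; the thresholds the calculator actually uses are not stated on this page.

```python
def fit_status(total_gb: float, card_gb: float, headroom: float = 0.10) -> str:
    """Classify a card as 'green', 'amber', or 'red' for a given VRAM total.

    headroom is a hypothetical cut-off: the fraction of the card that must
    stay free for the result to count as a comfortable fit.
    """
    if total_gb > card_gb:
        return "red"      # total exceeds the card: will not load
    if total_gb > card_gb * (1 - headroom):
        return "amber"    # loads, but with little room to spare
    return "green"        # fits with headroom


# Example: an 11.5 GB total checked against three card sizes.
for card_gb in (8, 12, 24):
    print(f"{card_gb} GB card: {fit_status(11.5, card_gb)}")
```

With these assumed thresholds, an 11.5 GB total is red on an 8 GB card, amber on a 12 GB card, and green on a 24 GB card.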

All math runs locally. Use the advanced panel to override the layer count and KV dimension when you are modelling an architecture that is not in the preset list.

How LLM VRAM is calculated

The VRAM a model needs at inference time is the sum of three things. Weights dominate for short prompts: a model with P billion parameters loaded at b bytes per weight occupies P × b gigabytes. FP16 is 2 bytes, so an 8B model is ~16 GB; a 4-bit quant is roughly a quarter of that. The GGUF k-quants are not clean fractions of a byte: their real cost is the bits-per-weight figure llama.cpp reports (for example, Q4_K_M averages about 4.8 bits, or ~0.6 bytes), which is what this calculator uses.
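As a minimal sketch of the weights row, the function below converts a bits-per-weight figure into gigabytes. The name weight_gb is illustrative, and the sketch assumes decimal gigabytes (10^9 bytes), which matches the "8B at FP16 is ~16 GB" figure above.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight memory in decimal GB: parameters x bytes per weight."""
    bytes_per_weight = bits_per_weight / 8
    # (params_billion * 1e9 params) * bytes_per_weight / (1e9 bytes per GB)
    return params_billion * bytes_per_weight


fp16 = weight_gb(8, 16)    # 8B model at FP16
q4km = weight_gb(8, 4.8)   # 8B model at about 4.8 bits per weight (Q4_K_M)
print(f"FP16: {fp16:.1f} GB, Q4_K_M: {q4km:.1f} GB")
```

Working from bits per weight rather than a nominal "4-bit" label is what captures the extra cost of the k-quants: here the Q4_K_M load comes to about 4.8 GB rather than 4 GB.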

The KV cache is the per-token memory that attention reuses so it does not recompute past keys and values. Its size in bytes is 2 × layers × kv_dim × context × batch × bytes, where the leading 2 counts keys and values, kv_dim is the number of key/value heads times the head dimension, and bytes is the size of one cached element (2 for FP16). Models with grouped-query attention (most modern ones) have a small kv_dim and therefore a cheap cache; older multi-head models pay much more. Because the cache scales linearly with context length and batch size, long-context or high-concurrency serving is usually KV-bound, not weight-bound.
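The formula translates directly into code. This is a sketch rather than the calculator's source; the function name is illustrative, and the example uses Llama 3.1 8B with 32 layers and a kv_dim of 1024, assuming an FP16 cache (2 bytes per element).

```python
def kv_cache_bytes(layers: int, kv_dim: int, context: int,
                   batch: int = 1, bytes_per_elem: int = 2) -> int:
    """KV cache size in bytes: 2 (keys and values) x layers x kv_dim
    x context x batch x bytes per cached element."""
    return 2 * layers * kv_dim * context * batch * bytes_per_elem


# Llama 3.1 8B: 32 layers, kv_dim = 8 KV heads x 128 = 1024, FP16 cache.
for context in (2_048, 8_192, 32_768):
    size_gb = kv_cache_bytes(32, 1024, context) / 1e9
    print(f"{context:>6} tokens: {size_gb:.2f} GB")
```

Every factor is a plain multiplier, so going from 2K to 32K context multiplies the cache by 16, and a batch of 4 multiplies it by 4 again.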

Finally there is overhead: the CUDA/Metal runtime context, intermediate activations, and allocator fragmentation. A flat 10% allowance is a reasonable planning figure for single-stream inference; heavy batching and very long sequences push it higher. Add the three together and compare against your card. The estimate is deliberately conservative so that a model marked "fits" really loads — but driver versions, the serving framework, and flash-attention settings all move the real number by a few percent.
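Putting the three rows together might look like the sketch below. It assumes the 10% overhead is taken on the sum of weights and KV cache, which this page does not spell out, and it uses decimal gigabytes. The function and key names are illustrative.

```python
def vram_breakdown(params_billion: float, bits_per_weight: float,
                   layers: int, kv_dim: int, context: int,
                   batch: int = 1, kv_bytes: int = 2,
                   overhead: float = 0.10) -> dict:
    """Return the weights, KV cache, overhead, and total rows in decimal GB."""
    weights = params_billion * bits_per_weight / 8
    kv = 2 * layers * kv_dim * context * batch * kv_bytes / 1e9
    # Assumption: the flat allowance applies to weights + KV cache combined.
    extra = (weights + kv) * overhead
    return {"weights": weights, "kv_cache": kv,
            "overhead": extra, "total": weights + kv + extra}


# 8B model at about 4.8 bits per weight, 32 layers, kv_dim 1024, 8K context.
rows = vram_breakdown(8, 4.8, layers=32, kv_dim=1024, context=8192)
for name, gb in rows.items():
    print(f"{name:>9}: {gb:.2f} GB")
```

Under these assumptions the example lands at roughly 6.5 GB in total, of which the weights are about 4.8 GB.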

Common use cases

  • Buying or renting a GPU. Check whether a 24 GB card holds the model you want before you spend money, and see how much context that leaves you.
  • Choosing a quantization. Compare FP16 against Q4_K_M to see exactly how much VRAM dropping to 4-bit frees up for a longer context window.
  • Planning context length. Find the longest context your card can serve once the weights are loaded — the KV cache row shows the trade-off directly.
  • Sizing a serving box. Multiply by batch size to estimate the VRAM a concurrent-request workload needs, then pick single-GPU versus multi-GPU.
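For the context-planning case, the same arithmetic can be solved for context length instead of total VRAM. This is a sketch under stated assumptions: a flat 10% overhead applied to weights plus cache, decimal gigabytes, and an illustrative function name.

```python
import math


def max_context(card_gb: float, weights_gb: float, layers: int, kv_dim: int,
                batch: int = 1, kv_bytes: int = 2,
                overhead: float = 0.10) -> int:
    """Longest context (in tokens) whose total still fits on the card."""
    # Bytes left for the KV cache once overhead and weights are accounted for.
    budget = card_gb * 1e9 / (1 + overhead) - weights_gb * 1e9
    # Cache bytes consumed by one token position across the whole batch.
    per_token = 2 * layers * kv_dim * kv_bytes * batch
    return max(0, math.floor(budget / per_token))


# 24 GB card, 4.8 GB of weights, 32 layers, kv_dim 1024, FP16 cache.
for batch in (1, 4, 16):
    tokens = max_context(24, 4.8, 32, 1024, batch=batch)
    print(f"batch {batch:>2}: up to {tokens} tokens per request")
```

The result is only a memory ceiling; the model's own trained context limit still applies, and larger batches divide the available tokens between requests.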

Frequently asked questions

Why is my real VRAM usage a bit different from this estimate?

The serving framework matters. vLLM pre-allocates a large KV cache block pool; llama.cpp allocates more lazily; flash-attention changes the activation footprint. Driver and CUDA-context overhead also vary by GPU and OS. This tool gives a conservative planning estimate — expect the real figure within roughly 10% for single-stream inference.

Does quantizing the weights also shrink the KV cache?

No. Weight quantization and KV-cache precision are independent. You can run 4-bit weights with an FP16 cache (the common default) or enable a separate FP8/INT8 KV cache to save more. Set the KV precision separately in the calculator.
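A short sketch makes the independence concrete: the weight figure stays fixed while only the cache element size changes. It assumes 1 byte per element for an FP8 or INT8 cache, ignores any per-block scaling metadata a real implementation may add, and uses the Llama 3.1 8B shape (32 layers, kv_dim 1024) at 32K context.

```python
def kv_cache_gb(layers: int, kv_dim: int, context: int,
                kv_bytes: int, batch: int = 1) -> float:
    """KV cache size in decimal GB for a given cache element size."""
    return 2 * layers * kv_dim * context * batch * kv_bytes / 1e9


weights_gb = 8 * 4.8 / 8  # 8B model at about 4.8 bits per weight

# Only the cache precision varies; the weights are untouched.
for label, kv_bytes in (("FP16", 2), ("FP8/INT8", 1)):
    kv = kv_cache_gb(32, 1024, 32_768, kv_bytes)
    print(f"{label:>8} cache: weights {weights_gb:.1f} GB + KV {kv:.2f} GB")
```

Halving the cache element size halves the KV row and leaves the weights row exactly where it was.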

What is "KV dimension" in the advanced panel?

It is the number of key/value heads multiplied by the head dimension. For Llama 3.1 8B that is 8 KV heads × 128 = 1024. Models with grouped-query attention (GQA) have far fewer KV heads than attention heads, which is why their cache is small.
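To see what GQA saves, compare the Llama 3.1 8B figure with a hypothetical multi-head variant in which each of the model's 32 attention heads keeps its own keys and values. The variant is an illustration only, not a real model.

```python
def kv_dim(n_kv_heads: int, head_dim: int) -> int:
    """KV dimension: key/value heads multiplied by the head dimension."""
    return n_kv_heads * head_dim


gqa = kv_dim(8, 128)    # Llama 3.1 8B: 8 KV heads shared by 32 attention heads
mha = kv_dim(32, 128)   # hypothetical: one KV head per attention head

print(f"GQA kv_dim: {gqa}")
print(f"MHA kv_dim: {mha} ({mha // gqa}x the cache per token)")
```

Because the cache is linear in kv_dim, the multi-head variant would need four times the KV memory at the same context length and batch size.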

Can I split a model across two GPUs?

Yes. Tensor or pipeline parallelism lets you pool the VRAM of multiple cards, minus some duplication and communication overhead. If the total here exceeds one card, the usual options are to split the model across GPUs whose combined memory holds it, or to offload part of the weights to system RAM at a cost in speed.

Why does long context cost so much memory?

The KV cache grows linearly with context length. Doubling the context doubles the cache. At very long contexts the cache can exceed the weights themselves, which is why 128K-context serving needs far more VRAM than the model size alone suggests.
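One way to see where the cache overtakes the weights is to solve for the context length at which the two are equal. The sketch below assumes an FP16 cache and decimal gigabytes, and reuses the Llama 3.1 8B shape (32 layers, kv_dim 1024); the function name is illustrative.

```python
import math


def crossover_context(weights_gb: float, layers: int, kv_dim: int,
                      kv_bytes: int = 2, batch: int = 1) -> int:
    """Smallest context length at which the KV cache is at least as large
    as the weights."""
    per_token = 2 * layers * kv_dim * kv_bytes * batch  # cache bytes per position
    return math.ceil(weights_gb * 1e9 / per_token)


# Same architecture, two weight sizes: about 4.8 bits per weight versus FP16.
for label, weights_gb in (("Q4_K_M", 4.8), ("FP16", 16.0)):
    tokens = crossover_context(weights_gb, 32, 1024)
    print(f"{label}: cache matches the weights at about {tokens} tokens")
```

The smaller the weights, the sooner the cache becomes the dominant cost, and batching brings the crossover point down further still.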