LLM VRAM Calculator
Work out how much GPU memory you need to run a large language model locally. Pick a model (or enter parameters), choose a quantization, and set your context length — the calculator splits the total into model weights, KV cache, and runtime overhead, then tells you which common GPUs can hold it. Everything is computed in your browser from the same arithmetic llama.cpp and vLLM use; nothing is sent anywhere.
Advanced — override architecture (for KV cache)
| Model weights | |
| KV cache | |
| Overhead (~10%) | |
| Total VRAM | |
Which GPUs fit?
How to use the LLM VRAM Calculator
Choose a model from the preset list — it fills in the parameter count, layer count, and KV dimension for you — or pick Custom and type the numbers in. Then set the quantization you plan to load the weights at, the context length you want to support, and how many requests you will run in parallel (batch). The four-row table updates instantly.
The first row, model weights, is just parameters × bytes-per-weight. The second, KV cache, grows with context length and batch size — this is the number most people forget, and it is why a model that "fits" at 2K context spills out of VRAM at 32K. The overhead row is a flat allowance for the CUDA context, activations, and fragmentation. The GPU list at the bottom marks each card green if the total fits with headroom, amber if it is close, and red if it will not load.
All math runs locally. Use the advanced panel to override the layer count and KV dimension when you are modelling an architecture that is not in the preset list.
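The green/amber/red marking described above can be sketched as a simple threshold check. The page does not state the cut-offs it uses, so the 10% headroom below is an assumption chosen for illustration, and `fit_status` is a hypothetical helper name.

```python
def fit_status(total_gb: float, card_gb: float, headroom: float = 0.10) -> str:
    """Classify a card as green, amber, or red for a given VRAM total.

    headroom is an assumed margin (10% of the card), not the
    calculator's documented threshold.
    """
    if total_gb <= card_gb * (1 - headroom):
        return "green"   # fits with room to spare
    if total_gb <= card_gb:
        return "amber"   # fits, but close to the limit
    return "red"         # will not load


if __name__ == "__main__":
    for total in (6.5, 23.0, 40.0):
        print(total, "GB on a 24 GB card:", fit_status(total, 24))
```

A stricter or looser headroom only moves the green/amber boundary; the red boundary is always the card's capacity.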
How LLM VRAM is calculated
The VRAM a model needs at inference time is the sum of three things. Weights dominate for short prompts: a model with P billion parameters loaded at b bytes per weight occupies P × b gigabytes. FP16 is 2 bytes, so an 8B model is ~16 GB; a 4-bit quant is roughly a quarter of that. The GGUF k-quants are not clean fractions of a byte: their real cost is the bits-per-weight figure llama.cpp reports (for example Q4_K_M averages about 4.8 bits, or ~0.6 bytes), which is what this calculator uses.
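As a minimal sketch of the weights arithmetic, the function below converts bits per weight to bytes and multiplies by the parameter count. The name `weight_gb` and the small lookup table are illustrative assumptions, not the calculator's own code; the Q4_K_M value is the approximate average quoted above.

```python
# Approximate bits per weight. Illustrative values only: FP16 is exact,
# Q4_K_M is the rough average mentioned in the text.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q4_K_M": 4.8,
}


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weight memory in decimal gigabytes: parameters x bytes per weight."""
    bytes_per_weight = bits_per_weight / 8
    # Billions of parameters times bytes each gives gigabytes directly.
    return params_billion * bytes_per_weight


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"8B at {name}: {weight_gb(8, bits):.1f} GB")
```

For an 8B model this gives about 16 GB at FP16 and about 4.8 GB at Q4_K_M, slightly more than a clean quarter because the k-quant stores more than 4 bits per weight.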
The KV cache is the per-token memory that attention reuses so it does not recompute past keys and values. Its size in bytes is 2 × layers × kv_dim × context × batch × bytes_per_element, where the leading 2 covers keys and values, kv_dim is the number of key/value heads times the head dimension, and bytes_per_element is the width of each cached value (2 for an FP16 cache). Models with grouped-query attention (most modern ones) have a small kv_dim and therefore a cheap cache; older multi-head models pay much more. Because the cache scales linearly with context length and batch size, long-context or high-concurrency serving is usually KV-bound, not weight-bound.
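The KV cache formula can be written out directly. This is a sketch under stated assumptions: `kv_cache_bytes` is a hypothetical helper, the cache is assumed to be FP16 (2 bytes per element) unless overridden, and the example shape (32 layers, head dimension 128, 8 KV heads with grouped-query attention versus 32 with full multi-head attention) is chosen for illustration.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, batch: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """KV cache size in bytes: 2 x layers x kv_dim x context x batch x bytes."""
    kv_dim = kv_heads * head_dim          # width of one key (or value) vector
    # The factor 2 accounts for storing both keys and values.
    return 2 * layers * kv_dim * context * batch * bytes_per_elem


if __name__ == "__main__":
    gqa = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context=8192)
    mha = kv_cache_bytes(layers=32, kv_heads=32, head_dim=128, context=8192)
    print(f"GQA cache: {gqa / 1e9:.2f} GB")
    print(f"MHA cache: {mha / 1e9:.2f} GB")
```

With these numbers the grouped-query cache is a quarter the size of the multi-head one, and quadrupling the context or the batch size would quadruple either figure.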
Finally there is overhead: the CUDA/Metal runtime context, intermediate activations, and allocator fragmentation. A flat 10% allowance is a reasonable planning figure for single-stream inference; heavy batching and very long sequences push it higher. Add the three together and compare against your card. The estimate is deliberately conservative so that a model marked "fits" really loads — but driver versions, the serving framework, and flash-attention settings all move the real number by a few percent.
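Putting the three rows together might look like the sketch below. It assumes the 10% allowance is taken on the sum of weights and KV cache; the text does not say what the percentage is applied to, so treat that base, and the helper name `total_vram_gb`, as assumptions.

```python
def total_vram_gb(weights_gb: float, kv_gb: float,
                  overhead_fraction: float = 0.10) -> dict:
    """Return the breakdown: weights, KV cache, overhead, and total (GB).

    Assumes overhead is a flat fraction of (weights + KV cache).
    """
    overhead_gb = (weights_gb + kv_gb) * overhead_fraction
    return {
        "weights": weights_gb,
        "kv_cache": kv_gb,
        "overhead": overhead_gb,
        "total": weights_gb + kv_gb + overhead_gb,
    }


if __name__ == "__main__":
    # Example inputs: ~4.8 GB of 4-bit weights and ~1.07 GB of KV cache.
    rows = total_vram_gb(weights_gb=4.8, kv_gb=1.07)
    for name, value in rows.items():
        print(f"{name:>9}: {value:.2f} GB")
```

Raising `overhead_fraction` is the simplest way to model heavy batching or very long sequences, where a flat 10% tends to be too low.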
Common use cases
- Buying or renting a GPU. Check whether a 24 GB card holds the model you want before you spend money, and see how much context that leaves you.
- Choosing a quantization. Compare FP16 against Q4_K_M to see exactly how much VRAM dropping to 4-bit frees up for a longer context window.
- Planning context length. Find the longest context your card can serve once the weights are loaded — the KV cache row shows the trade-off directly.
- Sizing a serving box. Raise the batch size to see how the KV cache grows with concurrent requests (the weights are loaded once and shared), then pick single-GPU versus multi-GPU.
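For the context-planning and serving cases above, the same formulas can be inverted to find the longest context a card can hold. This is a rough sketch, assuming decimal gigabytes, an FP16 cache, and a 10% overhead applied to weights plus KV cache; `max_context` is a hypothetical helper, not part of the calculator.

```python
def max_context(card_gb: float, weights_gb: float, layers: int, kv_dim: int,
                batch: int = 1, bytes_per_elem: int = 2,
                overhead_fraction: float = 0.10) -> int:
    """Longest context (tokens per request) whose KV cache still fits.

    Solves (weights + kv) * (1 + overhead) <= card for the context length.
    """
    # Memory left for the KV cache after weights and the overhead allowance.
    budget_bytes = (card_gb / (1 + overhead_fraction) - weights_gb) * 1e9
    if budget_bytes <= 0:
        return 0  # the weights alone do not fit
    # Bytes of cache per token position, summed over the whole batch.
    per_token = 2 * layers * kv_dim * batch * bytes_per_elem
    return int(budget_bytes // per_token)


if __name__ == "__main__":
    # Example: 16 GB of FP16 weights on a 24 GB card, 32 layers, kv_dim 1024.
    for batch in (1, 4):
        tokens = max_context(24, 16, layers=32, kv_dim=1024, batch=batch)
        print(f"batch={batch}: about {tokens} tokens per request")
```

Because only the cache scales with batch size, quadrupling the number of parallel requests cuts the per-request context to roughly a quarter while the weights stay the same.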