GGUF VRAM & Size Calculator
Pick a model size and context length to see the on-disk file size and the VRAM needed for every GGUF quant at once — from tiny Q2_K up to Q8_0 and full F16. The bits-per-weight figures are the real averages llama.cpp produces, so the file-size column lines up with what you actually download from Hugging Face. Use it to find the largest quant that fits your card. Runs entirely in your browser.
Advanced — override architecture (for KV cache)
File size in bytes = parameters × bits-per-weight ÷ 8. Total VRAM adds an FP16 KV cache sized for your context length, plus ~10% runtime overhead. Enter your VRAM to highlight which quants fit.
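The estimate described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the calculator's actual code: the function names are hypothetical, the KV cache uses the common 2 × layers × KV heads × head dimension × context × 2 bytes estimate, the 10% overhead is assumed to apply to weights plus cache, and sizes are printed in decimal GB. The architecture values (32 layers, 8 KV heads, head dimension 128) describe a Llama-3-8B-style model and stand in for whatever a preset would supply.

```python
def file_size_bytes(params: float, bits_per_weight: float) -> float:
    """On-disk size: each weight costs bits_per_weight / 8 bytes."""
    return params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context: int, bytes_per_value: int = 2) -> int:
    """FP16 KV cache: one key and one value vector per layer, KV head and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_value


def total_vram_bytes(params, bits_per_weight, n_layers, n_kv_heads, head_dim,
                     context, overhead=0.10):
    """Weights plus KV cache, scaled by an assumed ~10% runtime overhead."""
    weights = file_size_bytes(params, bits_per_weight)
    cache = kv_cache_bytes(n_layers, n_kv_heads, head_dim, context)
    return (weights + cache) * (1 + overhead)


# Illustrative input: an 8B model at 4.8 bits per weight with an 8192-token context.
weights = file_size_bytes(8e9, 4.8)
cache = kv_cache_bytes(32, 8, 128, 8192)
total = total_vram_bytes(8e9, 4.8, 32, 8, 128, 8192)
print(f"weights {weights / 1e9:.2f} GB, cache {cache / 1e9:.2f} GB, "
      f"total {total / 1e9:.2f} GB")
```

With these inputs the weights come to about 4.8 GB and the cache to exactly 1 GiB (about 1.07 GB), so the cache matters most at long contexts or on small models.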
How to use the GGUF VRAM & Size Calculator
Select a model preset or type a parameter count, set the context length you want to run, and — optionally — enter your card's VRAM so the table can highlight what fits. Every row is a GGUF quant: the file size column is what you download, and the total VRAM column adds the KV cache and runtime overhead on top.
Read the table top-down: Q8_0 is nearly lossless but large, the K-quants in the middle (Q6_K, Q5_K_M, Q4_K_M) are the sweet spot most people run, and Q3_K_M / Q2_K trade quality for the ability to squeeze a bigger model onto a small card. If you entered your VRAM, fitting rows are marked green — pick the highest one that fits for the best quality your hardware allows.
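The "pick the highest one that fits" rule can be sketched as a simple top-down scan. This is an illustrative sketch only: `best_fitting_quant` is a hypothetical helper, the list covers just the four quants whose average bits-per-weight are quoted on this page, the KV cache size is passed in as a plain number, and the 10% overhead is assumed to apply to weights plus cache.

```python
# Quants ordered from highest quality to lowest, with the average
# bits-per-weight figures quoted on this page.
QUANTS = [("Q8_0", 8.5), ("Q6_K", 6.6), ("Q5_K_M", 5.7), ("Q4_K_M", 4.8)]


def best_fitting_quant(params, kv_cache_gb, vram_gb, overhead=0.10):
    """Return (name, estimated GB) for the first quant that fits, or None."""
    for name, bpw in QUANTS:
        weights_gb = params * bpw / 8 / 1e9
        total_gb = (weights_gb + kv_cache_gb) * (1 + overhead)
        if total_gb <= vram_gb:
            return name, round(total_gb, 2)
    return None  # nothing in the list fits this card


# Illustrative input: an 8B model, a 1 GB KV cache, an 8 GB card.
print(best_fitting_quant(8e9, kv_cache_gb=1.0, vram_gb=8.0))
```

Because the list is ordered by quality, the first hit is also the best one, which is why the table can be read from the top until a row fits.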
GGUF quants explained
GGUF is the file format used by llama.cpp and Ollama. A "quant" is a specific way of compressing the model's weights to fewer bits. The naming looks cryptic but follows a pattern: the number is the nominal bits-per-weight, and the suffix describes the scheme. Q4_0 is a legacy fixed 4-bit format; the _K quants (Q4_K_M, Q5_K_M, Q6_K) are "k-quants" that allocate bits non-uniformly, keeping more precision in the tensors that matter most, and the _S/_M/_L suffix selects the small, medium, or large variant.
Because k-quants mix bit widths, their real cost is not a round number. Q4_K_M averages about 4.8 bits per weight, Q5_K_M about 5.7, Q6_K about 6.6, and Q8_0 about 8.5. This calculator uses those measured averages, so the file-size column matches the GGUF files published on Hugging Face rather than a naive estimate based on the nominal bit count in the quant's name.
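To see how far the nominal bit count drifts from the measured average, the two estimates can be compared side by side. A rough sketch: `size_gb` and `compare` are hypothetical helper names, the bits-per-weight values are the ones quoted above, and sizes are in decimal GB.

```python
# Average bits per weight quoted in the text above.
MEASURED_BPW = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}
# Nominal bit count read off the quant name.
NOMINAL_BITS = {"Q4_K_M": 4, "Q5_K_M": 5, "Q6_K": 6, "Q8_0": 8}


def size_gb(params: float, bits: float) -> float:
    """File size in decimal GB for a given parameter count and bits per weight."""
    return params * bits / 8 / 1e9


def compare(params: float):
    """Return (quant, naive GB, measured GB) rows for every listed quant."""
    return [
        (name, size_gb(params, NOMINAL_BITS[name]), size_gb(params, bpw))
        for name, bpw in MEASURED_BPW.items()
    ]


for name, naive, measured in compare(8e9):
    print(f"{name:7s} naive {naive:5.2f} GB   measured {measured:5.2f} GB")
```

For an 8B model, Q4_K_M comes out at 4.0 GB from the nominal bit count and 4.8 GB from the measured average, so the naive figure underestimates the download by about a sixth.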
Quality degrades slowly from Q8 down to about Q4 and then more sharply below it. For most models Q4_K_M is the standard recommendation: at about 4.8 bits per weight it cuts the file to roughly 30% of the FP16 size while raising perplexity only slightly. Q5_K_M and Q6_K are worth it if they still fit; Q3_K_M and Q2_K are for when you would otherwise not be able to run the model at all. The VRAM column here lets you make that trade-off with real numbers instead of guessing.
Common use cases
- Choosing which GGUF to download. See every quant's size and VRAM side by side so you grab the right file the first time.
- Maximising quality on fixed hardware. Enter your VRAM and pick the highest quant that still fits with your context length.
- Planning context vs. quant. Watch the total VRAM climb with context length to decide between a bigger quant and a longer window.
- Comparing model sizes. Check whether stepping up to a 14B or 32B at a lower quant beats an 8B at Q8 for your card.
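The context-versus-quant trade-off in the list above can also be turned around: fix the VRAM and ask how long a context each quant leaves room for. The following is a sketch under assumptions, not the calculator's own logic: the helper names are hypothetical, the KV cache is FP16, the 10% overhead is applied to weights plus cache, and the architecture values (32 layers, 8 KV heads, head dimension 128) are those of a Llama-3-8B-style model.

```python
import math


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """FP16 KV cache cost of one token: keys and values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def total_vram_bytes(params, bpw, per_token, context, overhead=0.10):
    """Weights plus KV cache, scaled by the assumed runtime overhead."""
    return (params * bpw / 8 + per_token * context) * (1 + overhead)


def max_context(params, bpw, vram_bytes, per_token, overhead=0.10):
    """Largest context that fits; 0 if the weights alone exceed the budget."""
    budget = vram_bytes / (1 + overhead) - params * bpw / 8
    return max(0, math.floor(budget / per_token))


per_token = kv_bytes_per_token(32, 8, 128)  # 131072 bytes per token

# Illustrative input: an 8B model on a 12 GB card (decimal GB).
for name, bpw in [("Q4_K_M", 4.8), ("Q8_0", 8.5)]:
    print(name, max_context(8e9, bpw, 12e9, per_token), "tokens")
```

Dropping from Q8_0 to Q4_K_M frees several gigabytes of weights, and every freed gigabyte buys roughly 7,600 extra tokens of context at 131,072 bytes per token.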