GGUF VRAM & Size Calculator

Pick a model size and context length to see the on-disk file size and the VRAM needed for every GGUF quant at once — from tiny Q2_K up to Q8_0 and full F16. The bits-per-weight figures are the real averages llama.cpp produces, so the file-size column lines up with what you actually download from Hugging Face. Use it to find the largest quant that fits your card. Runs entirely in your browser.

File size in bytes ≈ parameters × bits-per-weight ÷ 8. Total VRAM adds an FP16 KV cache at your context length plus ~10% runtime overhead. Enter your VRAM to highlight which quants fit.
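As a rough illustration of that formula, here is a minimal Python sketch. The function name, the choice to apply the 10% overhead to weights plus KV cache, and the example architecture (32 layers, 8 KV heads, head dimension 128) are assumptions made for illustration, not the calculator's actual code.

```python
def estimate_vram_gb(params_billion, bpw, n_layers, n_kv_heads, head_dim,
                     context, overhead=0.10):
    """Return (weights_gb, kv_cache_gb, total_gb), with 1 GB = 1e9 bytes."""
    # Weights: parameters x bits-per-weight, converted from bits to bytes.
    weights_bytes = params_billion * 1e9 * bpw / 8
    # FP16 KV cache: one K and one V vector (factor 2) per layer, per KV head,
    # per token, at 2 bytes per value.
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * context * 2
    # Assumption: the ~10% overhead is applied to weights + KV cache together.
    total_bytes = (weights_bytes + kv_bytes) * (1 + overhead)
    return weights_bytes / 1e9, kv_bytes / 1e9, total_bytes / 1e9


# Hypothetical 8B-class model at Q4_K_M (~4.8 bpw) with an 8192-token context.
weights, kv, total = estimate_vram_gb(8.0, 4.8, 32, 8, 128, 8192)
print(f"weights {weights:.2f} GB, KV cache {kv:.2f} GB, total {total:.2f} GB")
```

With these inputs the weights come to about 4.8 GB, the KV cache to about 1.07 GB, and the total to about 6.46 GB.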

How to use the GGUF VRAM & Size Calculator

Select a model preset or type a parameter count, set the context length you want to run, and — optionally — enter your card's VRAM so the table can highlight what fits. Every row is a GGUF quant: the file size column is what you download, and the total VRAM column adds the KV cache and runtime overhead on top.

Read the table top-down: Q8_0 is nearly lossless but large, the K-quants in the middle (Q6_K, Q5_K_M, Q4_K_M) are the sweet spot most people run, and Q3_K_M / Q2_K trade quality for the ability to squeeze a bigger model onto a small card. If you entered your VRAM, fitting rows are marked green — pick the highest one that fits for the best quality your hardware allows.
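The "pick the highest one that fits" rule can be sketched in a few lines of Python. The VRAM figures below are made-up placeholders rather than calculator output, and `best_fit` is a hypothetical helper name.

```python
# Rows ordered from highest to lowest quality: (quant name, total VRAM in GB).
# The numbers are illustrative placeholders only.
rows = [
    ("Q8_0", 10.5),
    ("Q6_K", 8.4),
    ("Q5_K_M", 7.5),
    ("Q4_K_M", 6.5),
    ("Q3_K_M", 5.6),
    ("Q2_K", 4.6),
]


def best_fit(rows, vram_gb):
    """Return the first (highest-quality) quant whose total VRAM fits, else None."""
    for name, needed_gb in rows:
        if needed_gb <= vram_gb:
            return name
    return None


print(best_fit(rows, 8.0))
```

For an 8 GB card and these placeholder numbers, the first row that fits is Q5_K_M. If nothing fits, the function returns `None`.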

GGUF quants explained

GGUF is the file format used by llama.cpp and Ollama. A "quant" is a specific way of compressing the model's weights to fewer bits. The naming looks cryptic but follows a pattern: the number is the target bits-per-weight, and the suffix describes the scheme. Q4_0 is a legacy fixed 4-bit format; the _K quants (Q4_K_M, Q5_K_M, Q6_K) are "k-quants" that allocate bits non-uniformly — keeping more precision in the layers that matter — and the _S/_M/_L suffix is the small/medium/large variant.

Because k-quants mix bit widths, their real cost is not a round number. Q4_K_M averages about 4.8 bits per weight, Q5_K_M about 5.7, Q6_K about 6.6, and Q8_0 about 8.5. This calculator uses those measured averages, so the file-size column matches the GGUF files published on Hugging Face rather than an estimate based on the nominal bit count in the quant's name.
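To see how much the measured averages matter, the following Python sketch compares the two estimates for an 8-billion-parameter model. The bits-per-weight values are the ones quoted above. The dictionary and function names are hypothetical, and sizes use 1 GB = 1e9 bytes.

```python
# Measured averages quoted in the text, and the nominal bits from each name.
MEASURED_BPW = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}
NOMINAL_BITS = {"Q4_K_M": 4, "Q5_K_M": 5, "Q6_K": 6, "Q8_0": 8}


def size_gb(params, bits_per_weight):
    """File size in GB for a given parameter count and bits per weight."""
    return params * bits_per_weight / 8 / 1e9


params = 8e9  # example model size
for quant, bpw in MEASURED_BPW.items():
    measured = size_gb(params, bpw)
    nominal = size_gb(params, NOMINAL_BITS[quant])
    print(f"{quant:7s} measured {measured:5.2f} GB   nominal {nominal:5.2f} GB")
```

For Q4_K_M the nominal estimate is 4.0 GB while the measured average gives about 4.8 GB, a gap of roughly 20%.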

Quality degrades slowly from Q8 down to about Q4 and then more sharply below it. For most models Q4_K_M is the standard recommendation: at about 4.8 bits per weight it is roughly 30% of the FP16 size while keeping perplexity close to the original. Q5_K_M and Q6_K are worth it if they still fit; Q3_K_M and Q2_K are for when you would otherwise not be able to run the model at all. The VRAM column here lets you make that trade-off with real numbers instead of guessing.

Common use cases

  • Choosing which GGUF to download. See every quant's size and VRAM side by side so you grab the right file the first time.
  • Maximising quality on fixed hardware. Enter your VRAM and pick the highest quant that still fits with your context length.
  • Planning context vs. quant. Watch the total VRAM climb with context length to decide between a bigger quant and a longer window.
  • Comparing model sizes. Check whether stepping up to a 14B or 32B at a lower quant beats an 8B at Q8 for your card.
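For the model-size comparison in the last item, a small Python sketch can show which combinations land in the same weight footprint. It only compares sizes and says nothing about which option gives better output quality. The candidate list and helper name are illustrative assumptions.

```python
def weights_gb(params_billion, bpw):
    """Weight size in GB (1 GB = 1e9 bytes): billions of params x bpw / 8."""
    return params_billion * bpw / 8


# (label, parameters in billions, quant, approximate bits per weight)
candidates = [
    ("8B", 8, "Q8_0", 8.5),
    ("14B", 14, "Q4_K_M", 4.8),
    ("32B", 32, "Q4_K_M", 4.8),
]

for label, params_b, quant, bpw in candidates:
    print(f"{label:4s} {quant:7s} {weights_gb(params_b, bpw):5.1f} GB")
```

Under these figures an 8B at Q8_0 (about 8.5 GB) and a 14B at Q4_K_M (about 8.4 GB) need nearly the same space for weights, while a 32B at Q4_K_M needs about 19.2 GB.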

Frequently asked questions

Why does Q4_K_M cost more than a quarter of F16?

Q4_K_M is not exactly 4 bits. K-quants keep some tensors (often the attention and embedding layers) at higher precision, so the average works out to roughly 4.8 bits per weight rather than 4.0. The calculator uses these measured averages, which is why the size is a little above one-quarter of F16.
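The arithmetic behind that answer is simply the average bits per weight divided by F16's 16 bits. A short Python sketch, using the averages quoted on this page and a hypothetical helper name:

```python
F16_BITS = 16

# Approximate measured bits per weight, as quoted on this page.
BPW = {"Q8_0": 8.5, "Q6_K": 6.6, "Q5_K_M": 5.7, "Q4_K_M": 4.8}


def fraction_of_f16(bpw):
    """Size of a quant relative to F16, as a fraction between 0 and 1."""
    return bpw / F16_BITS


for quant, bpw in BPW.items():
    print(f"{quant:7s} {fraction_of_f16(bpw):.0%} of F16")
```

A true 4.0-bit format would be exactly 25% of F16; at about 4.8 bits, Q4_K_M works out to roughly 30%.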

Which quant should I pick?

Q4_K_M is the usual default: close to F16 quality at roughly 30% of the size. If Q5_K_M or Q6_K still fit your VRAM, prefer them. Drop to Q3_K_M or Q2_K only when a higher quant will not fit at all; quality falls off faster below 4 bits.

Does the file size equal the VRAM I need?

No — the file size is the weights only. At runtime you also pay for the KV cache (which grows with context length) and a runtime overhead allowance. The total VRAM column adds both, which is why it is larger than the file size.
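To get a feel for how the KV cache scales, here is a minimal Python sketch of an FP16 KV cache estimate. The architecture values (32 layers, 8 KV heads, head dimension 128) are assumptions for a generic 8B-class model, and real runtimes may allocate somewhat differently.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context, bytes_per_value=2):
    """FP16 KV cache size in GB (1 GB = 1e9 bytes).

    Each token stores one K and one V vector (factor 2) per layer and KV head.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context * bytes_per_value
    return total_bytes / 1e9


# Hypothetical architecture; only the context length changes between rows.
for context in (4096, 8192, 32768):
    print(f"context {context:6d}: {kv_cache_gb(32, 8, 128, context):.2f} GB")
```

The cache grows linearly with context: under these assumptions it is about 1.07 GB at 8192 tokens and about 4.29 GB at 32768 tokens, regardless of which quant the weights use.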

Can I offload some layers to CPU?

Yes. llama.cpp can keep some layers in system RAM and only the rest on the GPU, trading speed for the ability to run a model that does not fully fit. If a quant is just over your VRAM, partial offload is often practical, though tokens per second will drop.
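A very rough way to guess how many layers might fit on the GPU is sketched below in Python. It assumes all layers are the same size, ignores the embedding and output tensors, and reserves a fixed amount of VRAM for the KV cache and overhead. These are simplifications for illustration, not how llama.cpp itself decides. The result is the kind of number you would try as a starting value for llama.cpp's `--n-gpu-layers` option and then adjust by experiment.

```python
import math


def layers_on_gpu(weights_gb, n_layers, vram_gb, reserved_gb):
    """Estimate how many equal-sized layers fit in VRAM after a reservation."""
    per_layer_gb = weights_gb / n_layers  # assumption: layers are equal in size
    budget_gb = vram_gb - reserved_gb     # VRAM left over for weights
    if budget_gb <= 0:
        return 0
    return min(n_layers, math.floor(budget_gb / per_layer_gb))


# Hypothetical case: 8.5 GB of weights in 32 layers, an 8 GB card,
# with 1.5 GB set aside for KV cache and runtime overhead.
print(layers_on_gpu(8.5, 32, 8.0, 1.5))
```

In this example about 24 of the 32 layers would fit, with the remainder left in system RAM.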

Are these sizes exact?

They are close estimates based on published bits-per-weight averages. The exact size of a specific GGUF depends on the model architecture, vocabulary size, and the quantizer settings the uploader used, so treat the numbers as accurate to within a few percent.