LLM Download Size Calculator

Work out how big a model is to download before you start pulling tens of gigabytes. Enter the parameter count, pick a precision or GGUF quant (F16, Q8_0, Q6_K, Q5_K_M, Q4_K_M, Q3_K_M, Q2_K…), and get the file size in GB and GiB, based on each quant's measured average bits-per-weight. Pair it with the VRAM calculator to check that the model also fits your GPU.

How to use the LLM Download Size Calculator

Enter the model's parameter count in billions (e.g. 8 for an 8B model, 70 for 70B) and choose the precision or GGUF quant you intend to download. The size is computed as parameters × bits-per-weight ÷ 8, then shown in GB (decimal, what your ISP and download manager report) and GiB (binary, what your OS shows for disk).
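The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source; the function name download_size and the constants are assumptions made for the example.

```python
GB = 1_000_000_000    # decimal gigabyte, as reported by download managers
GIB = 1_073_741_824   # binary gibibyte (2**30 bytes), as reported by most OSes

def download_size(params_billions, bits_per_weight):
    """Return (size_gb, size_gib) for a model file.

    Hypothetical helper: size = parameters x bits-per-weight / 8 bytes.
    """
    size_bytes = params_billions * 1e9 * bits_per_weight / 8
    return size_bytes / GB, size_bytes / GIB

gb, gib = download_size(8, 16)  # an 8B model at F16 (16 bits per weight)
print(f"{gb:.2f} GB, {gib:.2f} GiB")  # prints "16.00 GB, 14.90 GiB"
```

The same file is about 7% smaller when expressed in GiB, which is why the two figures never match.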

The bits-per-weight figures are measured averages for llama.cpp GGUF quants, not flat round numbers, so a Q4_K_M of an 8B model comes out at about 4.8GB, close to the ~4.9GB you actually see on Hugging Face, rather than a naive 4GB.

How download size relates to quantization

A model's file size is set almost entirely by how many bits each weight uses. Half precision (F16), the usual unquantized baseline, is 16 bits per weight, so an 8B model is about 16GB. Quantization shrinks that: Q8_0 is ~8.5 bits, Q4_K_M ~4.83 bits, Q2_K ~3.35 bits. Fewer bits mean a smaller download and less VRAM, at some cost to quality. Q4_K_M is the usual sweet spot, Q8_0 is near-lossless, and Q2_K is for when you are desperate to fit a big model on small hardware.

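The 'K' quants (Q4_K_M, Q5_K_M, Q6_K) are k-quants that spend slightly more bits on the most important weights. Together with the per-block scale factors that quantized formats store alongside the weights (this is also why Q8_0 averages 8.5 bits rather than 8), that is why the bits-per-weight figures are not whole numbers. Multiplying by the parameter count gives the download size; the tokenizer and metadata add a little on top, which is negligible at these scales.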
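To compare quants for one model, the bits-per-weight figures quoted above can be put in a small lookup table. The sketch below only includes the four values stated in this section; the dictionary name and the size_gb helper are illustrative assumptions, and other quants would need their own measured averages.

```python
# Average bits per weight, taken from the figures quoted in the text above.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.83,
    "Q2_K": 3.35,
}

def size_gb(params_billions, quant):
    """Approximate file size in decimal GB (hypothetical helper).

    Billions of parameters x bits-per-weight / 8 gives billions of bytes,
    which is the size in decimal GB.
    """
    return params_billions * BITS_PER_WEIGHT[quant] / 8

# Side-by-side sizes for a 70B model.
for quant in BITS_PER_WEIGHT:
    print(f"{quant:8s} {size_gb(70, quant):7.2f} GB")
```

For a 70B model this puts F16 at 140GB and Q4_K_M at roughly 42GB, before the small tokenizer and metadata overhead.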

Common use cases

  • Before you pull a model — check whether the Q4 or Q5 fits your disk and your patience.
  • Picking a quant — compare F16 vs Q8 vs Q4 sizes for the same model at a glance.
  • Capacity planning — size a disk or a cache for a set of models.
  • Bandwidth budgeting — estimate the download on a metered or slow connection.
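For the bandwidth case, a rough download time follows from the size in decimal GB and the link speed in megabits per second. The sketch below assumes the connection sustains its nominal speed for the whole transfer, which real links rarely do, so treat the result as a lower bound. Both function names are hypothetical.

```python
def download_seconds(size_gb, link_mbps):
    """Ideal transfer time in seconds for size_gb decimal gigabytes.

    Assumes the link sustains link_mbps megabits per second throughout.
    """
    bits = size_gb * 1e9 * 8
    return bits / (link_mbps * 1e6)

def format_duration(seconds):
    """Format a duration in seconds as 'Hh MMm SSs'."""
    minutes, secs = divmod(round(seconds), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"

# A 4.83GB Q4_K_M file on a 100 Mbps connection.
print(format_duration(download_seconds(4.83, 100)))  # prints "0h 06m 26s"
```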

Frequently asked questions

Why is the F16 size in GB about twice the parameter count in billions?

F16 is 2 bytes per parameter, so an 8B model is ~16GB. The decimal-GB and binary-GiB figures differ slightly because GB uses 1,000,000,000 bytes and GiB uses 1,073,741,824.

Are the bits-per-weight exact?

They are the standard measured averages for llama.cpp GGUF quants and match Hugging Face file sizes within a percent or two. Exact size varies a little by model architecture and tokenizer.

Which quant should I download?

Q4_K_M is the common quality/size sweet spot. Choose Q5_K_M or Q6_K if you have room, Q8_0 for near-lossless quality, and Q2_K or Q3_K_M only to squeeze a large model onto limited hardware.

Does this include VRAM at run time?

No, this is the file/download size. Running the model also needs memory for the KV cache and context. Use the GGUF VRAM calculator for that.