GGUF Quantization Explained: Q4_K_M vs Q5_K_M vs Q8_0 — Which to Pick

You open a model's GGUF page and there are fifteen files. Q2_K, Q3_K_S, Q3_K_M, Q4_0, Q4_K_S, Q4_K_M, Q5_K_M, Q6_K, Q8_0, and a few more. No README tells you which to click. The file sizes span roughly 6x from smallest to largest, and somewhere in that range is the one that fits your GPU and still answers well.

The suffixes are not random. Once you can read them, the choice stops being a gamble. This guide decodes the GGUF naming scheme, explains the real quality-versus-size tradeoff, and then hands the sizing to the GGUF VRAM Calculator so you pick a quant by how much memory you have left over.

The short version: for most people on a single consumer GPU, Q4_K_M is the floor for "good," Q5_K_M or Q6_K if they fit with room to spare, Q8_0 when VRAM is abundant. The reason that ladder works is in the suffix, so start there.

Reading the suffix: digit, K, and the size letter

A name like Q4_K_M has three parts, and each one tells you something different.

  • The digit (Q4, Q5, Q8) is the nominal bit-width per weight. It is a floor, not the real number. A Q4 file does not weigh exactly 4 bits per weight; it lands higher once you account for the mix and the bookkeeping.
  • The _K marks a K-quant. These store weights in super-blocks of 256 values, with sub-block scale and minimum factors that are themselves quantized. The effect is more accuracy per bit than the older block formats. When you see _K, you are looking at the modern method.
  • The size letter (_S, _M, _L) picks how aggressively the file mixes precisions. _S (small) keeps things uniform. _M (medium) and _L (large) promote the tensors that hurt quality most when squashed, typically part of the attention value projection (attention.wv) and the feed-forward down projection (ffn_down), to a higher-bit K-quant while leaving the rest cheap.

So Q4_K_M and Q4_K_S are both 4-bit K-quants, but the M spends extra bits on the sensitive tensors. That is why _M files are slightly bigger and noticeably better behaved than _S at the same digit. The size letter is the lever most people overlook.

Legacy _0 / _1 versus K-quants

Some files have no _K at all: Q4_0, Q4_1, Q8_0. These are the original llama.cpp quant formats. They use a single per-block scale (the _0 variants) or a scale plus an offset (the _1 variants), with no per-tensor mixing. One affine map per block is simple and fast to decode, but it cannot model skewed weight distributions as well as a K-quant.

That shows up at 4-bit. By the calculator's constants, Q4_K_M stores about 0.58 bytes per weight and Q4_0 about 0.57, practically the same size. The K-quant is more accurate for those bits because of its super-block structure and selective mixing. Same footprint, better output. That is why _0 and _1 are considered legacy: at 4-bit they are outclassed for almost no size saving.

The one legacy format still worth knowing is Q8_0. There is no Q8_K_M. Q8_0 is the 8-bit quant. At ~1.06 bytes per weight it is so close to full precision that the K-quant machinery buys almost nothing, so the simple format stays. If you have the VRAM, Q8_0 is the practically-lossless choice.

The digit is a floor: real bits per weight

The number in the name undersells the actual size. The mix promotes some tensors, and the quantized scales add overhead, so the effective bits per weight always run above the digit. Using the calculator's bytes-per-weight constants (multiply by 8 for bits), here is roughly where the common quants land:

Quant~Bytes/weight~Bits/weightSize vs FP16
F162.0016100%
Q8_01.06~8.5~53%
Q6_K0.82~6.6~41%
Q5_K_M0.73~5.8~37%
Q4_K_M0.58~4.6~29%
Q4_00.57~4.6~29%
Q3_K_M0.46~3.7~23%
Q2_K0.35~2.8~18%

These are estimates. The exact figure shifts with architecture, vocabulary size, and which tensors the uploader promoted, and the smallest quants vary most. llama.cpp's own published numbers for Llama-3.1-8B run a touch higher than the table above, for example Q4_K_M near 4.9 bits and Q2_K above 3 bits rather than 2.8, so treat the column as a neighborhood, not an address. The lesson holds either way: Q2_K is well above 2 bits and Q4_K_M is closer to 5 than 4. Q4_K_M at roughly 29% of full precision is why it is the default, about a quarter of the FP16 size with quality most tasks cannot distinguish.

What you actually lose as you go smaller

Quality degrades as the bits drop, but not evenly. The first cut from FP16 down to Q8_0 is essentially free for inference. Q8_0 to Q6_K to Q5_K_M stays very hard to tell apart on normal work. Q4_K_M is the point where most people stop noticing differences but the file is small enough to fit comfortably.

Below 4-bit, the curve steepens. Q3_K_M starts to slip on reasoning and instruction-following; Q2_K slips harder and can get noticeably worse at code and structured output. The degradation also depends on model size: a 70B model at Q3 often still beats a 7B model at Q8, because there is so much more redundancy to spare. A small model has less slack, so a 3B at Q2_K can fall apart in ways a 70B at Q2_K does not.

That gives a clean rule: spend your bits on the larger model first, then on a higher quant. If you can run a 13B at Q4_K_M or an 8B at Q6_K in the same memory budget, the 13B usually wins.

Pick by VRAM headroom, not by name

The mistake is choosing a quant from the file size alone. Weights are only part of what lands in VRAM. The KV cache grows with your context length and can rival or exceed the weights on long prompts. A quant whose file fits your card can still run out of memory once you fill the context. So the right question is not whether the file fits, but whether the file plus the context you will actually use fits, with room left over.

That is what the GGUF VRAM Calculator is for. Enter the model's parameter count, your context length, and your card's VRAM. It computes the file size for every quant, adds an FP16 KV cache and about 10% overhead, and labels each row Fits, Tight, or leaves it unmarked when it will not fit. Picking a quant becomes a one-line decision: take the largest quant that shows Fits at the context length you plan to run.

A worked procedure:

  1. Set the context to what you will actually use. If you load 32K-token documents, size for 32K, not 4K.
  2. Read the table top-down. The first quant that shows Fits with headroom is your ceiling.
  3. If only Tight shows up, you are one bad allocation from an out-of-memory error mid-generation. Drop one quant level, or shorten the context.
  4. If even Q4_K_M will not fit, you need a smaller model, a bigger card, or a lower context. Going below Q4 on a small model is the last resort, not the first.

Because the calculator runs the math live, you size against your hardware and your context instead of trusting a hard-coded gigabyte figure from a blog.

A default ladder that works

If you want a starting point before you open the calculator, this is the order to reach for on a single consumer GPU:

  • Q4_K_M — the floor for "good." Smallest quant that still feels like the model. Start here.
  • Q5_K_M or Q6_K — if they show Fits with headroom for your context. A small, mostly-free quality bump over Q4_K_M.
  • Q8_0 — when VRAM is abundant and you want as close to lossless as quantization gets. Returns diminish above this; FP16 is rarely worth double the memory for inference.
  • Q3_K_M and below — only on large models, or when nothing else fits and you accept the quality hit. Avoid Q2_K on small models.

Two footnotes. Smaller files move less data per token, which can help throughput, but dequantization adds work and the net effect on tokens per second depends on your hardware, so measure it with the tokens-per-second calculator rather than assuming smaller is faster. And if you are deciding whether to run locally at all versus an API, the self-hosting vs API calculator is a better place to start than agonizing over quant levels.

Use the tool

Skip the manual work. The companion tool runs this in your browser, with nothing uploaded.

GGUF VRAM Calculator

Frequently asked questions

What does the M in Q4_K_M actually stand for?

It marks the medium tensor mix. A Q4_K_M file keeps most weights at 4-bit but promotes the quality-sensitive tensors, part of the attention value projection (attention.wv) and the feed-forward down projection (ffn_down), to a higher-bit K-quant (Q6_K). The _S (small) variant skips most of that promotion, which is why Q4_K_M is slightly larger and noticeably better than Q4_K_S at the same digit. The letter describes the mix, not the base bit-width.

Is Q4_K_M or Q5_K_M better?

Q5_K_M is more accurate; Q4_K_M is smaller. By the calculator's constants Q5_K_M is roughly 0.73 bytes per weight versus 0.58 for Q4_K_M, so it costs about 25% more memory for a modest quality gain. If Q5_K_M fits your card with headroom for the context you run, take it. If it only shows Tight in the GGUF VRAM Calculator, stay on Q4_K_M. The safety margin is worth more than the small bump.

Is there a Q8_K_M?

No. Q8_0 is the 8-bit quant, full stop. The K-quant mixing scheme exists to squeeze accuracy out of low bit-widths, and at 8 bits the file is already so close to full precision that mixing buys almost nothing. So 8-bit stays in the simple legacy format. If you see a Q8 file, it is Q8_0.

Should I ever use the legacy Q4_0 instead of Q4_K_M?

Rarely. Q4_0 and Q4_K_M are nearly the same size, about 0.57 versus 0.58 bytes per weight, but the K-quant is more accurate for those bits thanks to its super-block structure and selective mixing. Same footprint, better output. Reach for Q4_0 only if a specific runtime or hardware path does not support K-quants, which is uncommon on modern llama.cpp builds.

Does a smaller quant make inference faster?

Sometimes, but it is not guaranteed. A smaller file moves less data from memory per token, which helps on bandwidth-bound setups. But dequantizing the weights adds compute, and the net effect on tokens per second swings with your GPU, the quant format, and the runtime. Do not assume smaller is faster; measure it with the tokens-per-second calculator on your own hardware.

What about the imatrix quants I keep seeing?

An importance matrix (imatrix) is a separate calibration step, not something the Q4_K_M-style suffix encodes. It runs the model over sample data to learn which weights matter most, then steers quantization to preserve them, typically cutting perplexity meaningfully at low bit-widths. It is most associated with the IQ-family (IQ2, IQ3) but can be applied alongside K-quants too. Treat it as an orthogonal accuracy boost, not a different bit-width.

Related tools