Which LLM Can You Actually Run on Your GPU? A VRAM-Tier Fit Guide

Most "how much VRAM does this model need" guides make you start from the model. You pick Llama or Qwen, do the math, and then find out your card is too small. That is backwards from how anyone actually shops. You already own a GPU. The real question is the reverse: given this card, what is the biggest, smartest model I can realistically run?

This guide answers it tier by tier, from 8 GB consumer cards up to an 80 GB data-center accelerator. For each tier you get a headline verdict, the models that fit with room to spare, the ones that scrape in, and the ones you should stop trying to force. Every number here comes from the same formula the GPU Requirements Checker uses, so the prose and the tool agree. When you want an exact verdict for a model, quant, and context length the table does not cover, the checker gives it instantly in your browser.

The one piece of math you need first

Three things eat your VRAM: the model weights, the KV cache (attention's memory of past tokens), and a fixed runtime overhead. Weights dominate, so for a GPU-first lookup they are what you compute first.

Here is a quick estimate for a 4-bit model: a little over half a gigabyte per billion parameters. That is enough to rank cards against models in your head. For the real verdict, the checker uses the precise weight size at Q4_K_M (about 0.58 bytes per weight), plus the KV cache for your context, plus roughly 10% overhead, then rounds up. The per-model weights at Q4_K_M land where you would expect:

weights at Q4_K_M (binary GB, as your GPU reports them)

8B   ~ 4.3 GB
14B  ~ 7.9 GB
32B  ~ 17.6 GB
70B  ~ 37.8 GB
405B ~ 219 GB

Those are binary gigabytes, which is what your GPU actually reports, so they run a touch under the napkin estimate. Add the KV cache for your context and the totals climb. A 7B-to-9B model at Q4 with an 8K context lands near 5 to 8.5 GB total. A 70B at Q4 with the same context lands near 44 GB. Those totals, not the raw weights, decide the verdict.

One rule makes this usable: leave headroom. A model whose total equals your card's memory will technically load and then choke the moment context grows. Treat a model as a comfortable fit only if it needs noticeably less than your VRAM. The checker requires about 12% slack before it calls something a clean fit.

8 GB cards (RTX 3070, 4060): the 7B-to-8B tier

Verdict: one good 7B to 8B model at 4-bit, short to medium context. That is the ceiling, and it is a fine ceiling.

An 8 GB card is the entry point for serious local inference, and 7B-class models were practically designed for it. At Q4_K_M with an 8K context, Llama 3.1 8B needs about 5.9 GB, Mistral 7B about 5.4 GB, and Qwen 2.5 7B about 5.0 GB. All three load with a couple of gigabytes to spare for a longer prompt or a second app on the GPU.

  • Fits comfortably: Llama 3.1 8B, Mistral 7B, Qwen 2.5 7B at Q4, 8K context.
  • Fits only if you trim: Gemma 2 9B. It has more layers and a larger attention head dimension than Llama 3.1 8B, so its KV cache grows faster as context climbs. At Q4 with a modest context it fits 8 GB; push the context long and it can spill over. Drop to a shorter context or a leaner quant and it slips back in.
  • Won't fit: anything 13B and up at a usable quant. A 14B at Q4 wants roughly 8 GB of weights before you add any context.

Gemma 2 9B is the instructive case on this tier. On parameter count it looks like a near-twin of Llama 3.1 8B, but the two diverge once context grows, so the verdict depends on the context length you set, not the parameter count alone.

12 GB cards (RTX 3060, 4070): the comfortable 14B tier

Verdict: every 7B-to-9B model loads with room, and 14B-class models become genuinely usable.

Twelve gigabytes is the sweet spot a lot of people land on, and it unlocks the next quality step. Qwen 2.5 14B at Q4 with an 8K context needs about 10.4 GB, which fits with a little headroom. Gemma 2 9B, tight on 8 GB at long context, now sits comfortably. The whole 7B-to-9B field runs without you thinking about it.

  • Fits comfortably: all 7B to 9B models, plus 14B-class models at 4-bit, 8K context.
  • Fits with care: 14B at longer context. As context grows past 8K the KV cache climbs and you lose the slack, so watch the total.
  • Won't fit: 32B and up. Qwen 2.5 32B at Q4 needs about 21.5 GB, well past 12 GB.

If you are buying a card mainly to run local models and your budget tops out here, 12 GB buys the jump from 7B to 14B. That is a real intelligence upgrade, not just a bigger number.

16 GB cards (RTX 4080, 4060 Ti 16GB): more headroom, same class

Verdict: the same 14B class as 12 GB, but with breathing room for long context and bigger batches. It does not reach 32B at a usable quant.

This tier is about comfort more than a new model class. A 14B at Q4 that was a careful fit on 12 GB now has several gigabytes to spare, so you can push context to 32K or run a longer document without an out-of-memory crash mid-generation.

  • Fits comfortably: all 7B to 14B models at 4-bit with generous context.
  • Worth trying: a 14B at higher precision (Q5 or Q8) if you want more quality and can accept shorter context. Q8 roughly doubles the weight memory versus Q4.
  • Won't fit: Gemma 2 27B (about 19.3 GB at 8K) and Qwen 2.5 32B (about 21.5 GB). Both sit just over the line, which is the most frustrating place to be. The fix is a 24 GB card, a smaller quant, or shorter context.

If you keep landing on "so close" with 27B-class models, run them through the checker at a lower quant before you give up. Dropping from Q4_K_M to a leaner 4-bit format can shave off the gigabyte or two you are short.

24 GB cards (RTX 3090, 4090): the 27B-to-32B tier

Verdict: 27B runs cleanly, 32B scrapes in, and a 70B will not fit no matter how you squint.

Twenty-four gigabytes is where local inference stops feeling cramped. Gemma 2 27B at Q4 with an 8K context needs about 19.3 GB and fits with room. Qwen 2.5 32B at the same settings needs about 21.5 GB, which loads but with almost no slack, so trim context or quant if you want safety.

Model (Q4, 8K ctx)Approx. totalVerdict on 24 GB
Gemma 2 27B~19.3 GBFits
Qwen 2.5 32B~21.5 GBTight
Mixtral 8x7B~28.8 GBWon't fit
Llama 3.1 70B~44.3 GBWon't fit

Two things surprise people here. First, Mixtral 8x7B is a mixture-of-experts model: it activates only a couple of experts per token (about 12.9B active parameters), but it still stores all 46.7B parameters, so at Q4 it needs roughly 28.8 GB and does not fit a single 24 GB card. Active parameters drive speed; total parameters drive memory. Second, a 70B does not fit a 4090. At Q4 it wants about 44 GB. People ask this constantly, and the answer is a 48 GB card, two 24 GB cards, or heavy CPU offload that runs slowly.

48 GB and 80 GB: the 70B tier and the multi-GPU wall

Verdict: 48 GB runs a 70B tightly, 80 GB runs it with room, and a 405B needs more than any single card here.

A 48 GB card (an L40S, or two 24 GB cards pooled) is the smallest practical home for a 70B-class model. Llama 3.1 70B at Q4 with 8K context needs about 44.3 GB and Qwen 2.5 72B about 45.9 GB, so both load on 48 GB but tightly. Push context hard and you will need to drop a quant level or shorten it.

An 80 GB accelerator (A100, H100) runs those same 70B-to-72B models comfortably, with headroom for long context and larger batches. It also handles Mixtral 8x7B and every smaller model trivially.

  • 48 GB: 70B to 72B at Q4 fit tightly; 32B and below are comfortable; Mixtral 8x7B fits.
  • 80 GB: 70B to 72B fit comfortably; everything below is easy.
  • Won't fit on either, single card: Llama 3.1 405B. At Q4 its weights alone are about 219 GB and the total is roughly 245 GB, so you are into multi-GPU territory: several 80 GB cards, or a quant aggressive enough that quality suffers. There is no single-card shortcut.

The 405B is the cleanest illustration of why the GPU-first question matters. Its weights alone, at 4-bit, are about 219 GB. No headroom trick changes that. Some models are simply a different class of hardware problem, and knowing the boundary before you start a 200 GB download saves a long, pointless wait.

Levers that move a model across the line

When a model lands one tier above your card, you have four levers before you give up. The checker spells out which one closes your specific gap and by how much.

  • Lower the quantization. This is the strongest lever because weights dominate. FP16 is 2 bytes per weight; Q4_K_M is about 0.58. Going from FP16 to Q4 cuts weight memory by roughly 70%. Going from Q8 to Q4 cuts it by nearly half. Quality drops a little; capacity drops a lot.
  • Shorten the context. The KV cache grows linearly with context length. A model that overflows at 32K can fit at 4K. If your prompts are short, you can run a bigger model than the default-context numbers suggest.
  • Offload layers to system RAM. Runtimes like llama.cpp keep some layers on the CPU and the rest on the GPU. This lets you run a model larger than your VRAM, but generation slows because the offloaded layers cross the slower CPU-to-GPU link. Fine for occasional use, painful for heavy use.
  • Add a second GPU. Splitting a model across cards sums their memory. Two 24 GB cards behave, for capacity, like one 48 GB card. This is the standard route to 70B-class models.

Apple Silicon is its own case. Unified memory shares one pool between CPU and GPU, so a 64 GB Mac can devote far more to a model than a discrete consumer GPU at a similar price. Capacity stops being the bottleneck; memory bandwidth does. Enter your Mac's usable memory as the GB value in the checker and read the verdict the same way.

Use the tool

Skip the manual work. The companion tool runs this in your browser, with nothing uploaded.

Open the tool

Frequently asked questions

I have a 12 GB card. What is the largest model I can actually run?

A 14B-class model at 4-bit (Q4_K_M) with an 8K context. Qwen 2.5 14B needs about 10.4 GB that way, which fits with a little headroom. The whole 7B-to-9B field, including Gemma 2 9B, fits comfortably. A 32B is out of reach at a usable quant. Run your exact model and context through the GPU Requirements Checker to confirm before downloading.

Why won't a 70B model fit on my RTX 4090 (24 GB)?

At Q4_K_M a 70B model's weights alone are about 38 GB, and with KV cache and overhead the total is roughly 44 GB. That is far past 24 GB. You need a 48 GB card, two 24 GB cards pooled, or CPU offload that runs slowly. Very aggressive low-bit quants get closer but cost noticeable quality.

Can I run a 7B model on an 8 GB GPU?

Yes, comfortably. Llama 3.1 8B, Mistral 7B, and Qwen 2.5 7B all need about 5 to 6 GB at Q4 with an 8K context, leaving room to spare. The one to watch is Gemma 2 9B, whose KV cache grows faster with context, so at long context it can spill over 8 GB unless you shorten it.

Does the answer change if I want a long context?

Yes. The KV cache grows linearly with context length, so a model that fits at 8K can overflow at 32K or 128K. For some models the cache at very long context rivals the weights. Always check fit at the context length you actually plan to use, not the default.

Mixtral has 8 experts. Why does it need so much VRAM if it only uses a couple per token?

Active parameters drive speed; total parameters drive memory. A mixture-of-experts model activates only a fraction of its weights per token, but it must store all of them in VRAM. Mixtral 8x7B is about 46.7B total parameters (around 12.9B active per token), so at Q4 it needs roughly 29 GB and will not fit a single 24 GB card.

What is the difference between 'fits' and 'fits with offload'?

'Fits' means the whole model lives in VRAM with headroom. 'Fits with offload' means it is slightly too big, so the runtime keeps some layers in system RAM and computes them over the slower CPU-to-GPU link. It works and lets you run a larger model than your card alone allows, but generation is slower. The checker tells you which case you are in and how far over the line you are.

Related tools