LLM GPU Requirements Checker

Find out whether a model will actually run on your GPU before you download it. Pick the model, the quantization, and the context length you want, then choose your card. The checker compares the required VRAM (weights plus KV cache plus overhead) against your card's memory and gives a plain verdict: fits, fits with offload, or won't fit, with concrete advice for each. The check runs entirely locally.

How to use the LLM GPU Requirements Checker

Choose the model you are considering, the quantization you intend to load it at, and the context length you want to support. Then pick the GPU you own (or are thinking of buying). The checker adds up the model weights, the KV cache for your context, and a runtime overhead allowance, then compares the total to your card's memory.

You will get one of three verdicts. Fits means the model loads with comfortable headroom. Tight / fits with offload means it is close — you may need to offload a few layers to system RAM, shorten the context, or accept little spare memory. Won't fit means you need a lower quant, a smaller model, or a bigger card. In every case the result lists the specific levers — drop to a smaller quant, reduce context, or split across GPUs — with the numbers that would make it fit.
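
The three verdicts can be sketched as a simple comparison. This is an illustration only, not the checker's actual code: the function name, the 10% headroom, and the 25% offload margin are assumptions chosen for the example.

```python
def verdict(weights_gb, kv_cache_gb, overhead_gb, vram_gb,
            headroom=0.10, offload_margin=0.25):
    """Classify a model/GPU pairing. Thresholds are illustrative assumptions."""
    required = weights_gb + kv_cache_gb + overhead_gb
    if required <= vram_gb * (1 - headroom):
        return "fits"                        # loads with spare memory
    if required <= vram_gb * (1 + offload_margin):
        return "tight / fits with offload"   # close to, or slightly over, the card
    return "won't fit"                       # needs a lower quant, smaller model, or more VRAM

print(verdict(4.5, 0.5, 1.0, vram_gb=24))    # small model on a 24 GB card
print(verdict(20.0, 2.0, 1.5, vram_gb=24))   # close to the limit
print(verdict(39.4, 1.3, 1.5, vram_gb=24))   # 70B-class model at 4-bit
```

With these assumed thresholds the three calls print "fits", "tight / fits with offload", and "won't fit" in that order.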

What determines whether a model fits

Three things decide whether an LLM fits in your GPU's memory. The biggest is usually the weights: parameter count times bytes per weight. At FP16 that is 2 bytes each, so a 70B model needs ~140 GB and will never fit a single consumer card — but the same model at 4-bit drops to ~40 GB, which two 24 GB cards or one 48 GB card can hold. Quantization is the single most powerful lever, and it is why the quant selector changes the verdict so dramatically.
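
The weight arithmetic above is easy to reproduce. The sketch below assumes 1 GB = 1e9 bytes and treats "4-bit" as roughly 4.5 bits per weight to allow for the scale data that quantized formats store alongside the weights; both the helper name and the 4.5 figure are assumptions for illustration.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # Billions of parameters times bytes per parameter gives GB directly.
    return params_billions * bytes_per_weight

fp16_gb = weight_memory_gb(70, 16)    # 2 bytes per weight
q4_gb = weight_memory_gb(70, 4.5)     # assumed effective size of a 4-bit quant

print(f"70B at FP16:  {fp16_gb:.1f} GB")
print(f"70B at 4-bit: {q4_gb:.1f} GB")
```

This gives 140.0 GB at FP16 and about 39.4 GB at the assumed 4.5 bits per weight, in line with the ~140 GB and ~40 GB figures quoted above.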

The second factor is the KV cache, the memory attention uses to remember past tokens. It is small at short context but grows linearly with the number of tokens, so a model that fits at 4K context can overflow at 32K or 128K. If you only need short prompts, you can run a bigger model; if you need long context, budget for the cache. The third is a fixed overhead for the runtime and activations, which this checker includes automatically.
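
To see how quickly the cache grows, here is a rough sketch of the standard estimate: two tensors (keys and values) per layer, per KV head, per token. The layer count, KV head count, and head dimension below are illustrative assumptions rather than the values of any particular model, and the cache is assumed to be stored at FP16 (2 bytes per value).

```python
def kv_cache_gb(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size in GB (1 GB = 1e9 bytes)."""
    # Factor 2: one key vector and one value vector per layer, per KV head, per token.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 1e9

# Hypothetical architecture, chosen only for illustration.
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (4096, 32768, 131072):
    print(f"{ctx:>7} tokens: {kv_cache_gb(ctx, **SHAPE):.2f} GB")
```

For this assumed shape the cache is about 0.54 GB at 4K, 4.29 GB at 32K, and 17.18 GB at 128K: eight times the context costs eight times the memory.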

When a model does not fit on one card you have options. Lower quantization trades a little quality for a lot of memory. CPU offload keeps some layers in system RAM, which works but slows generation — fine for occasional use, painful for heavy use. Multi-GPU splits the model across cards and sums their memory, the standard approach for 70B-plus models. The verdict here tells you which of these you will need and roughly how far over the line you are.
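
A back-of-the-envelope version of the offload and multi-GPU options might look like the sketch below. It assumes every layer is the same size, that about 90% of each card is usable, and that multi-GPU memory simply adds up; real runtimes reserve some memory on each card, so treat the results as rough estimates. Both helper names are hypothetical.

```python
import math

def gpus_needed(required_gb, vram_per_gpu_gb, usable_fraction=0.9):
    """Identical cards needed if memory simply adds up (an assumption)."""
    return math.ceil(required_gb / (vram_per_gpu_gb * usable_fraction))

def layers_on_gpu(weights_gb, n_layers, vram_gb, reserved_gb):
    """Layers that fit on one card; the rest would stay in system RAM.

    reserved_gb is the VRAM set aside for the KV cache and runtime overhead.
    Assumes all layers are the same size.
    """
    per_layer_gb = weights_gb / n_layers
    budget_gb = max(vram_gb - reserved_gb, 0.0)
    return min(n_layers, int(budget_gb // per_layer_gb))

print(gpus_needed(42, 24))             # 70B-class model at 4-bit on 24 GB cards
print(layers_on_gpu(40, 80, 24, 4))    # 80-layer model, 40 GB of weights, one 24 GB card
```

Under these assumptions the first call reports 2 cards, and the second shows that only 40 of 80 layers fit on a single 24 GB card, which is why heavy offload on a 70B-class model is slow.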

Common use cases

  • Before downloading. Confirm a model will load on your card so you do not waste a large download.
  • Before buying a GPU. Check whether 16 GB is enough or you really need 24 GB for the models you want.
  • Picking a quant. See exactly which quantization turns a "won't fit" into a "fits" on your hardware.
  • Deciding on multi-GPU. Find out whether you need a second card and how much combined memory the model wants.

Frequently asked questions

What does "fits with offload" mean?

The model is slightly too big for your VRAM, but llama.cpp or a similar runtime can keep some layers in system RAM and run the rest on the GPU. It works and lets you use a larger model than your card alone allows, but generation is slower because the layers held in system RAM typically run on the CPU, or have to be streamed to the GPU across the slower CPU-GPU link.

Why does the verdict change so much with quantization?

Weights dominate VRAM, and quantization changes them directly. FP16 is 2 bytes per weight; a pure 4-bit format is a quarter of that, and Q4_K_M, which averages a little under 5 bits per weight, is closer to 30%. So switching from FP16 to Q4_K_M cuts the weight memory by about 70%, which often turns a model that will not fit into one that does.

Does a longer context need a bigger GPU?

It can. The KV cache grows linearly with context length, so a long context adds VRAM on top of the weights. For some models the cache at 128K context rivals the weights themselves, so check the verdict at the context length you actually plan to use.

Can I run a 70B model on a 24 GB card?

Not fully in VRAM at a usable quant — 70B at 4-bit is around 40 GB. You would need two 24 GB cards, a 48 GB card, or heavy CPU offload (which is slow). At very low quants it gets closer, but quality suffers. The checker shows the exact gap.

Is unified memory (Apple Silicon) handled?

Yes: use the GB option matching your Mac's usable memory. Apple Silicon shares memory between CPU and GPU, so a 64 GB Mac can devote a large share of it to the model, often more than a discrete consumer GPU offers. Bandwidth, not capacity, then limits speed.