LLM GPU Requirements Checker
Find out whether a model will actually run on your GPU before you download it. Pick the model, the quantization, and the context length you want, then choose your card. The checker compares the required VRAM (weights plus KV cache plus overhead) against your card's memory and gives a plain verdict: fits, fits with offload, or won't fit, with concrete advice for each. Everything runs locally.
How to use the LLM GPU Requirements Checker
Choose the model you are considering, the quantization you intend to load it at, and the context length you want to support. Then pick the GPU you own (or are thinking of buying). The checker adds up the model weights, the KV cache for your context, and a runtime overhead allowance, then compares the total to your card's memory.
You will get one of three verdicts. Fits means the model loads with comfortable headroom. Tight / fits with offload means it is close: you may need to offload a few layers to system RAM, shorten the context, or accept little spare memory. Won't fit means you need a lower-bit quant, a smaller model, or a bigger card. In every case the result lists the specific levers (drop to a smaller quant, reduce context, or split across GPUs) with the numbers that would make it fit.
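The comparison behind the three verdicts can be sketched in a few lines of Python. This is only an illustration, not the checker's actual code: the function name `check_fit`, the 10% headroom, and the 20% offload tolerance are all hypothetical values chosen for the example.

```python
def check_fit(weights_gb, kv_cache_gb, overhead_gb, vram_gb,
              headroom_fraction=0.10, offload_tolerance=0.20):
    """Return (verdict, shortfall_gb) for one card.

    headroom_fraction and offload_tolerance are assumed thresholds,
    not values taken from the checker itself.
    """
    required_gb = weights_gb + kv_cache_gb + overhead_gb
    if required_gb <= vram_gb * (1 - headroom_fraction):
        verdict = "fits"
    elif required_gb <= vram_gb * (1 + offload_tolerance):
        verdict = "tight / fits with offload"
    else:
        verdict = "won't fit"
    # Positive shortfall means the model is over the card's memory.
    return verdict, required_gb - vram_gb


print(check_fit(39.4, 1.3, 1.5, 48))  # comfortable on a 48 GB card
print(check_fit(20.0, 3.0, 2.0, 24))  # slightly over a 24 GB card
print(check_fit(39.4, 1.3, 1.5, 24))  # far over a 24 GB card
```

The shortfall value is what lets a tool say how far over the line you are, rather than only giving a yes or no.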
What determines whether a model fits
Three things decide whether an LLM fits in your GPU's memory. The biggest is usually the weights: parameter count times bytes per weight. At FP16 that is 2 bytes each, so a 70B model needs ~140 GB and will never fit a single consumer card. The same model at 4-bit drops to roughly 35 to 40 GB, depending on how many extra bits the format spends on scaling metadata, which two 24 GB cards or one 48 GB card can hold. Quantization is the single most powerful lever, and it is why the quant selector changes the verdict so dramatically.
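The weight arithmetic above is easy to reproduce. The sketch below assumes decimal gigabytes (10^9 bytes) and uses 4.5 bits per weight as an illustrative effective size for a 4-bit format; the exact figure varies by quantization scheme, and the function name is hypothetical.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight memory in GB (10^9 bytes).

    Billions of parameters times bytes per parameter gives GB directly.
    """
    return params_billions * bits_per_weight / 8


print(weight_memory_gb(70, 16))   # FP16: 140.0
print(weight_memory_gb(70, 4))    # nominal 4-bit: 35.0
print(weight_memory_gb(70, 4.5))  # assumed effective 4.5 bits: 39.375
```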
The second factor is the KV cache, the memory attention uses to remember past tokens. It is small at short context but grows linearly with context length, so a model that fits at 4K context can overflow at 32K or 128K. If you only need short prompts, you can run a bigger model; if you need long context, budget for the cache. The third is a roughly fixed overhead for the runtime and activations, which this checker includes automatically.
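A common way to estimate the KV cache is two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below uses that standard formula; whether the checker computes it exactly this way is an assumption, and the model shape (32 layers, 8 KV heads, head dimension 128, FP16 cache) is a hypothetical example rather than any specific model's published configuration.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens,
                bytes_per_value=2, batch_size=1):
    """Approximate KV cache size in GB (10^9 bytes)."""
    # Factor of 2 covers keys and values.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens * batch_size / 1e9


# Hypothetical model shape; the cache scales linearly with context.
for context in (4096, 32768, 131072):
    print(context, round(kv_cache_gb(32, 8, 128, context), 2))
# 4096 0.54
# 32768 4.29
# 131072 17.18
```

For this assumed shape the cache is negligible at 4K but rivals the weights of a small quantized model at 128K, which is why the context selector can flip a verdict on its own.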
When a model does not fit on one card you have options. A lower-bit quantization trades a little quality for a lot of memory. CPU offload keeps some layers in system RAM, which works but slows generation: fine for occasional use, painful for heavy use. Multi-GPU splits the model across cards and pools their memory, the standard approach for 70B-plus models. The verdict here tells you which of these you will need and roughly how far over the line you are.
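To make the offload and multi-GPU options concrete, here is a rough sketch of how each could be sized. It assumes all layers are the same size and reserves a hypothetical 1 GB per card for runtime overhead; both function names and both assumptions are illustrative, not taken from the checker.

```python
import math


def gpus_needed(required_gb, vram_per_gpu_gb, per_gpu_overhead_gb=1.0):
    """Number of identical cards needed to hold required_gb in total."""
    usable_gb = vram_per_gpu_gb - per_gpu_overhead_gb
    if usable_gb <= 0:
        raise ValueError("per-GPU overhead exceeds card memory")
    return math.ceil(required_gb / usable_gb)


def layers_to_offload(weights_gb, n_layers, other_gb, vram_gb):
    """Layers that must live in system RAM, assuming equal-size layers.

    other_gb is the KV cache plus overhead that stays on the GPU.
    """
    per_layer_gb = weights_gb / n_layers
    available_gb = vram_gb - other_gb
    on_gpu = max(0, min(n_layers, math.floor(available_gb / per_layer_gb)))
    return n_layers - on_gpu


print(gpus_needed(42.2, 24))                # 2
print(layers_to_offload(39.4, 80, 2.8, 24)) # 37
```

In this example a 4-bit 70B-class model needs two 24 GB cards, or, on a single 24 GB card, nearly half of its 80 layers in system RAM, which is the kind of heavy offload that makes generation slow.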
Common use cases
- Before downloading. Confirm a model will load on your card so you do not waste a large download.
- Before buying a GPU. Check whether 16 GB is enough or you really need 24 GB for the models you want.
- Picking a quant. See exactly which quantization turns a "won't fit" into a "fits" on your hardware.
- Deciding on multi-GPU. Find out whether you need a second card and how much combined memory the model wants.