LoRA Memory Calculator

Estimate how many parameters a LoRA fine-tune will train, and the memory the adapter, gradients and optimizer states add on top of the frozen base model. Set the model dimensions (or pick a preset), the rank, and which target modules to adapt, and get the trainable parameter count, its fraction of the base model, and a memory breakdown. Everything is computed in your browser, and the maths accounts for grouped-query attention (GQA).

How to use the LoRA Memory Calculator

Pick a model preset to fill in the architecture (hidden size, FFN size, layer count, and attention and key/value head counts) or enter your own. Set the LoRA rank (commonly 8, 16, 32 or 64) and tick the modules you'll adapt: the attention projections q/k/v/o and the MLP projections gate/up/down. For each adapted matrix the calculator takes the r·(in + out) parameter cost, multiplies it by the number of layers, and shows the total trainable count, its percentage of the base model, and a per-module table.
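The counting described above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's own code: the `Arch` class, the function names and the example dimensions (hidden 4096, FFN 14336, 32 layers, 32 heads, 8 KV heads, roughly an 8B-class Llama-style model) are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arch:
    hidden: int    # model (hidden) size
    ffn: int       # MLP intermediate size
    layers: int    # number of transformer layers
    heads: int     # attention (query) heads
    kv_heads: int  # key/value heads (fewer than heads under GQA)


def module_shapes(a: Arch) -> dict[str, tuple[int, int]]:
    """Return the (in, out) shape of each adaptable projection in one layer."""
    head_dim = a.hidden // a.heads
    kv_dim = a.kv_heads * head_dim  # smaller than hidden when GQA is used
    return {
        "q_proj": (a.hidden, a.hidden),
        "k_proj": (a.hidden, kv_dim),
        "v_proj": (a.hidden, kv_dim),
        "o_proj": (a.hidden, a.hidden),
        "gate_proj": (a.hidden, a.ffn),
        "up_proj": (a.hidden, a.ffn),
        "down_proj": (a.ffn, a.hidden),
    }


def lora_params(a: Arch, rank: int, targets: list[str]) -> dict[str, int]:
    """Trainable parameters per targeted module: r * (in + out) * layers."""
    shapes = module_shapes(a)
    return {name: rank * sum(shapes[name]) * a.layers for name in targets}


# Assumed example dimensions, not taken from the calculator's presets.
example = Arch(hidden=4096, ffn=14336, layers=32, heads=32, kv_heads=8)
table = lora_params(example, 16, list(module_shapes(example)))

for name, count in table.items():
    print(f"{name:<10} {count:>12,}")
print(f"{'total':<10} {sum(table.values()):>12,}")
```

With these assumed dimensions, rank 16 on all seven projections comes to 41,943,040 trainable parameters, and restricting the targets to q_proj and v_proj brings that down to 6,815,744.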

The memory cards estimate the overhead LoRA adds on top of the frozen base weights: the adapter parameters themselves, the gradients (one per adapter parameter, in the same dtype as the adapter), and the AdamW optimizer states (two fp32 moments per trainable parameter). The calculator accounts for grouped-query attention, where the key and value projections are smaller than the query projection, so the figures match modern Llama-3- and Mistral-style models. The base model weights, activations and KV cache are not included: they depend on your batch size, sequence length and how the base is loaded (full precision, 8-bit, or 4-bit as in QLoRA).
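As a rough sketch of that breakdown, the snippet below turns a trainable parameter count into the three overhead figures. The function name and the 2-byte (bf16/fp16) adapter dtype are assumptions for illustration, and the count of 41,943,040 is an example value (about what rank 16 on all seven projections gives for an 8B-class model with hidden size 4096, FFN 14336, 32 layers and 8 KV heads).

```python
MIB = 1024 ** 2


def lora_memory_bytes(trainable: int, adapter_bytes: int = 2) -> dict[str, int]:
    """Adapter-specific training overhead in bytes.

    adapter_bytes is the size of one adapter parameter (2 for bf16/fp16,
    4 for fp32); it is an assumed default, not the calculator's setting.
    """
    return {
        "adapter": trainable * adapter_bytes,    # the A and B matrices
        "gradients": trainable * adapter_bytes,  # same dtype as the adapter
        "optimizer": trainable * 2 * 4,          # two fp32 AdamW moments
    }


mem = lora_memory_bytes(41_943_040)
for part, size in mem.items():
    print(f"{part:<10} {size / MIB:8.1f} MiB")
print(f"{'total':<10} {sum(mem.values()) / MIB:8.1f} MiB")
```

Under these assumptions the adapter and gradients take 80 MiB each and the optimizer states 320 MiB, for 480 MiB in total, before any base-model or activation memory.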

Why LoRA trains so few parameters

LoRA (Low-Rank Adaptation) is one of the most widely used methods for fine-tuning large models on modest hardware. Instead of updating a weight matrix W directly, it freezes W and learns a small additive correction expressed as the product of two thin matrices: ΔW = B·A, where A is r × in and B is out × r, for a chosen rank r far smaller than the layer's dimensions. The trainable cost of adapting one matrix therefore drops from in × out to just r × (in + out), often a hundredfold reduction or more at small ranks. The frozen original is left untouched, and the correction is added to it in the forward pass (or merged into W once training is done).
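To make the per-matrix saving concrete, here is a small sketch that compares the full and low-rank parameter counts for a single square projection. The 4096 × 4096 shape and the function names are assumptions picked for the example.

```python
def full_matrix_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole weight matrix W is trained."""
    return d_in * d_out


def lora_matrix_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


def reduction_factor(d_in: int, d_out: int, rank: int) -> float:
    """How many times fewer parameters LoRA trains for this matrix."""
    return full_matrix_params(d_in, d_out) / lora_matrix_params(d_in, d_out, rank)


# Assumed example: one 4096 x 4096 attention projection.
for rank in (8, 16, 64):
    factor = reduction_factor(4096, 4096, rank)
    print(f"rank {rank:>2}: {lora_matrix_params(4096, 4096, rank):>9,} params, "
          f"{factor:.0f}x fewer than full")
```

For this shape the factor is 256 at rank 8, 128 at rank 16 and 32 at rank 64, so the saving shrinks in proportion as the rank grows.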

That arithmetic is why a 7-to-8-billion-parameter model can be fine-tuned with only tens of millions of trainable parameters, often well under 1% of the whole. The savings compound through memory: the optimizer is usually the biggest consumer in full fine-tuning, because Adam-style optimizers such as AdamW keep two fp32 moment estimates per trainable parameter, and gradients add another copy. By shrinking the trainable set, LoRA shrinks gradients and optimizer state in proportion, which is what lets the job fit on a single consumer GPU. This is especially true of the QLoRA variant, where the frozen base is additionally quantized to 4-bit, so even its static weights take about a quarter of the room they would need in fp16.
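The comparison can be put in numbers with a short sketch. The round base size of 8,000,000,000 parameters, the LoRA count of 41,943,040 and the 2-byte gradient dtype are all assumed example values, and the function counts only gradients and AdamW moments (no weights, fp32 master copies or activations).

```python
def training_state_gb(trainable: int, grad_bytes: int = 2) -> float:
    """Gradients plus two fp32 AdamW moments, in decimal gigabytes.

    Deliberately ignores weights, master copies and activations.
    """
    optimizer_bytes = 2 * 4  # first and second moment, fp32 each
    return trainable * (grad_bytes + optimizer_bytes) / 1e9


base_params = 8_000_000_000  # hypothetical round figure for an 8B model
lora_trainable = 41_943_040  # assumed: rank 16 on all seven projections

fraction = lora_trainable / base_params
print(f"trainable fraction: {fraction:.3%}")
print(f"full fine-tune state: {training_state_gb(base_params):.1f} GB")
print(f"LoRA state:           {training_state_gb(lora_trainable):.2f} GB")
```

On these assumptions the trainable set is about half a percent of the model, and the gradient-plus-optimizer state falls from 80 GB to well under 1 GB.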

The figures here are an estimate of the adapter-specific overhead, which depends on three choices: the rank, which modules you target, and the architecture's dimensions. Adapting only the attention projections is cheaper than also adapting the wide MLP matrices; a higher rank gives the adapter more capacity at a linear cost in parameters. What this calculator does not include — because it varies with your run — is the frozen base weights, the activations held for the backward pass (which scale with batch size and sequence length), and the KV cache. Use it to size the trainable footprint and compare configurations; budget the base-model and activation memory separately on top.

Common use cases

  • Fits-on-my-GPU checks. See the trainable-parameter and optimizer overhead before launching a run.
  • Rank selection. Compare how rank 8 vs 64 changes the parameter budget.
  • Module choices. Weigh attention-only LoRA against adapting the MLP layers too.
  • QLoRA planning. Separate the small adapter overhead from the quantized base weights.

Frequently asked questions

How is the trainable parameter count computed?

For every targeted matrix of shape (in, out), LoRA adds r·(in + out) parameters — the two low-rank factors A and B. The tool sums this across the modules you select and multiplies by the number of layers. It is GQA-aware, so the k_proj and v_proj costs use the smaller key/value dimension when KV heads are fewer than attention heads.
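The GQA adjustment comes down to one extra step: deriving the key/value width from the head counts. A minimal sketch, with a hypothetical function name and assumed example dimensions (hidden size 4096, 32 attention heads, rank 16):

```python
def kv_proj_lora_params(hidden: int, heads: int, kv_heads: int, rank: int) -> int:
    """LoRA parameters for one k_proj (or v_proj) matrix in one layer."""
    head_dim = hidden // heads
    kv_dim = kv_heads * head_dim  # equals hidden only when kv_heads == heads
    return rank * (hidden + kv_dim)


# Assumed example: same model width, with and without grouped-query attention.
mha = kv_proj_lora_params(hidden=4096, heads=32, kv_heads=32, rank=16)
gqa = kv_proj_lora_params(hidden=4096, heads=32, kv_heads=8, rank=16)
print(f"multi-head (32 KV heads):  {mha:,} per layer")
print(f"grouped-query (8 KV heads): {gqa:,} per layer")
```

Treating a GQA model as if it had full-width key and value projections would overcount each of those adapters, here 131,072 parameters per layer instead of 81,920.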

Does this include the base model memory?

No. The figures are the LoRA-specific overhead: the adapter weights, gradients and optimizer states. The frozen base model weights, the activations stored for backprop, and the KV cache are separate and depend on batch size, sequence length and whether you load the base in fp16, 8-bit or 4-bit.

Why are the optimizer states 8 bytes per parameter?

AdamW keeps two running statistics per trainable parameter, the first and second moment, and they're typically stored in fp32 (4 bytes each), giving 8 bytes per parameter. In full fine-tuning, optimizer state is usually the largest per-parameter cost, which is why reducing the trainable set helps so much.

What rank should I use?

Common choices are 8, 16, 32 and 64. Higher rank gives the adapter more capacity at a cost that grows linearly in parameters; many tasks work well at 16 or 32. The right value depends on the task and dataset size, so it's worth comparing a couple of settings.
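Because the cost is linear in the rank, a comparison across settings is a one-line loop. In this sketch the per-layer sum of (in + out) over all targeted matrices, 81,920, and the layer count of 32 are assumed example values for an 8B-class model with all seven projections adapted; the function name is hypothetical.

```python
def trainable_for_rank(rank: int, in_plus_out_per_layer: int, layers: int) -> int:
    """Total LoRA parameters: rank * sum(in + out over targets) * layers."""
    return rank * in_plus_out_per_layer * layers


# Assumed example values, not read from any preset.
IN_PLUS_OUT = 81_920
LAYERS = 32

counts = {r: trainable_for_rank(r, IN_PLUS_OUT, LAYERS) for r in (8, 16, 32, 64)}
for rank, count in counts.items():
    print(f"rank {rank:>2}: {count:>12,} trainable parameters")
```

Going from rank 8 to rank 64 multiplies the trainable count, and with it the gradient and optimizer memory, by exactly eight.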

Are the preset numbers exact?

The presets use published architecture dimensions, and the parameter maths is exact for the LoRA formula. Treat the memory figures as close estimates for planning — real usage also depends on framework overhead, gradient checkpointing and your exact training configuration.