Full Fine-Tuning VRAM Calculator

Work out how much GPU memory a full fine-tune really needs — the kind where every weight is trainable, not a LoRA adapter. Full fine-tuning is dominated not by the weights but by the gradients and optimizer states that training adds: a plain AdamW run needs roughly sixteen bytes per parameter before a single activation is stored. Pick a model, choose your optimizer and parallelism, and the calculator breaks the total into weights, gradients, optimizer, and activations, then shows which GPUs can hold it. Everything is computed in your browser.

Model weights (fp16)
Gradients (fp16)
Optimizer states
Activations
Total — single GPU, no sharding
Per-GPU with chosen parallelism

Which GPUs fit (per GPU)?

How to use the Full Fine-Tuning VRAM Calculator

Choose a model preset — it fills in the parameter count, hidden size, and layer count — or pick Custom and enter them by hand. Then set the optimizer you will train with, your per-GPU batch size, and the sequence length. The breakdown updates as you type.

The four memory rows show where the GPU is spent. Weights and gradients are each two bytes per parameter at mixed precision. Optimizer states are usually the largest single line — AdamW keeps a 32-bit master copy of the weights plus two 32-bit moment buffers, which is twelve bytes per parameter on its own. Activations depend on batch and sequence length; tick Gradient checkpointing to recompute them in the backward pass and cut that row dramatically.

To make a large model fit, use the Parallelism dropdown. ZeRO-1 shards the optimizer states across your GPUs, ZeRO-2 also shards gradients, and ZeRO-3 (equivalently PyTorch FSDP) shards the weights too — so the per-GPU figure falls roughly in proportion to the GPU count. Set the number of GPUs and watch the per-GPU total drop, then check the fit table at the bottom. The estimate adds an 8% allowance for the CUDA context and communication buffers; real usage varies with the framework and flash-attention settings.

Where full fine-tuning memory goes

Inference only needs the weights. Training needs the weights and everything required to update them, which is why a model that runs happily on one card can be impossible to fully fine-tune on the same hardware. The standard accounting for mixed-precision AdamW comes to sixteen bytes per parameter: two bytes for the fp16 weights used in the forward and backward passes, two bytes for the fp16 gradients, and twelve bytes of optimizer state — a 32-bit master copy of the weights (four bytes) plus Adam's first and second moment estimates (four bytes each). For an 8-billion-parameter model that is about 128 GB before any activations, which is why full fine-tuning an 8B model overflows a single 80 GB GPU.

The optimizer choice moves this number a lot. 8-bit Adam stores the two moment buffers in int8 rather than fp32, cutting roughly six bytes per parameter. SGD with momentum keeps only one extra buffer, and Adafactor factorises the second moment to almost nothing. None of these touch the weights or gradients, so the floor for full fine-tuning is still higher than for parameter-efficient methods like LoRA, which freeze the base weights and train a tiny number of new ones.

Activations are the other variable. During the forward pass every layer stores intermediate tensors needed for the backward pass, and that memory scales with batch size, sequence length, hidden size, and layer count. For long sequences it can rival the optimizer states. Gradient checkpointing trades compute for memory by keeping only each layer's input and recomputing the rest during the backward pass, shrinking activations by an order of magnitude at the cost of roughly a third more training time. Combine checkpointing with ZeRO sharding and 8-bit Adam and a full fine-tune that nominally needs hundreds of gigabytes can be squeezed onto a handful of GPUs.

Common use cases

  • Deciding full fine-tune vs LoRA. See the true memory floor of full fine-tuning before choosing a parameter-efficient method instead.
  • Sizing a training box. Find out how many GPUs and which type you need to fit a given model, batch, and sequence length.
  • Tuning the memory knobs. Compare AdamW against 8-bit Adam, and full activations against gradient checkpointing, to find a configuration that fits.
  • Planning ZeRO / FSDP sharding. Check how many GPUs ZeRO-3 needs to bring per-GPU memory under your card's capacity.
  • Avoiding out-of-memory crashes. Validate a configuration on paper before launching a multi-hour run that would otherwise crash at the first backward pass.

Frequently asked questions

Why does full fine-tuning need 16 bytes per parameter?

Mixed-precision AdamW keeps fp16 weights (2 bytes), fp16 gradients (2 bytes), and 12 bytes of optimizer state: a 32-bit master copy of the weights (4) plus the first and second Adam moments (4 each). That is 16 bytes per parameter for weights and optimizer combined, before activations. It is the standard figure from the ZeRO paper and the reason training memory dwarfs inference memory.

How much does gradient checkpointing actually save?

It removes most of the activation memory. Instead of storing every layer's intermediate tensors, it keeps only each layer's input and recomputes the rest during the backward pass. For long sequences that can cut activation memory by 5–10×, at the cost of roughly 20–35% more training time from the recomputation. Toggle the checkbox to see the difference for your configuration.

What is the difference between ZeRO-1, ZeRO-2 and ZeRO-3?

They shard progressively more state across your GPUs. ZeRO-1 shards the optimizer states, ZeRO-2 also shards the gradients, and ZeRO-3 (the same idea as PyTorch FSDP) also shards the parameters themselves. The more you shard, the lower the per-GPU memory, but the more inter-GPU communication you add. ZeRO-3 lets very large models fit but needs fast interconnect to stay efficient.

Is this for full fine-tuning or LoRA?

Full fine-tuning, where every weight is trainable. LoRA and QLoRA freeze the base weights and train a small set of adapter parameters, so their optimizer and gradient memory is tiny by comparison — only the activations and the frozen base weights matter. Use a LoRA-specific calculator for those; this tool deliberately models the heavier full-tuning case.

Why is my real usage higher than the estimate?

Frameworks allocate temporary buffers for communication, mixed-precision casting, and flash-attention workspaces that vary by version and configuration. Memory fragmentation also leaves gaps the allocator cannot reuse. The 8% overhead here is a planning allowance; for tight fits, leave a few extra gigabytes of headroom and enable expandable memory segments if your framework supports it.