LLM Fine-Tuning Cost Calculator
Estimate what it costs to fine-tune a model. Training compute follows a simple rule — roughly 6 × parameters × tokens floating-point operations for a full pass — and from there it is just your GPU's throughput and hourly price. Choose a method (LoRA, QLoRA, or full), enter your dataset and model size, and get GPU-hours and a dollar estimate. All local; no data leaves your browser.
| Total tokens processed | |
| Total compute | |
| GPU-hours | |
| Estimated cost |
How to use the LLM Fine-Tuning Cost Calculator
Pick a fine-tuning method: full fine-tuning updates every weight and uses the most compute; LoRA freezes the base model and trains small adapters, cutting compute by roughly a third; QLoRA is LoRA on top of a 4-bit quantized base, which slashes memory so it fits on smaller cards but runs a little slower. Enter your model size, the number of training examples and their average length, and how many epochs you will run.
Choose a GPU to fill in its throughput and a typical rental price, or type your own price per GPU-hour. The MFU (model FLOPs utilization) slider is the fraction of the GPU's peak you actually achieve — 30-50% is realistic for well-tuned training, so 40% is a sensible default. The result gives total tokens, total compute, GPU-hours, and the dollar cost. Wall-clock time is GPU-hours divided by the number of GPUs you run in parallel.
The arithmetic of fine-tuning cost
Training compute is remarkably predictable. A forward pass through a dense transformer costs about 2 × N floating-point operations per token, where N is the parameter count; the backward pass costs about twice that, giving the well-known 6ND rule — six times parameters times tokens for a full training run. Multiply by epochs and you have the total compute. Divide by what your GPU can actually deliver (its peak throughput times the utilization you achieve) and you get GPU-hours; multiply by the hourly price and you get dollars.
The method changes the constant. Full fine-tuning pays the whole 6ND because it computes gradients for every weight. LoRA freezes the base model and only trains small low-rank adapters, so it skips the weight-gradient computation — roughly a 4ND profile — while still running a full forward and backward through the frozen network. QLoRA uses the same adapter approach but keeps the frozen base in 4-bit, which dramatically lowers memory (letting you fine-tune a 70B model on a single 48 GB card) at the cost of some speed from dequantizing weights on the fly.
The number people most often get wrong is utilization. A GPU's advertised TFLOPS is a peak that real training never reaches — data loading, communication, attention overhead, and imperfect kernels mean 30-50% is typical, and large distributed runs can be lower. That is the MFU figure here. Note too that this estimates compute cost only: it does not include data preparation, failed runs, hyperparameter sweeps, or storage. For a single clean run it is accurate; budget a multiple of it for a real project with experimentation.
Common use cases
- Budgeting a fine-tune. Get a dollar figure before you rent GPUs, so there are no surprises on the cloud bill.
- Choosing LoRA vs full. See how much compute LoRA saves over full fine-tuning for your model and dataset.
- Comparing GPUs. Weigh a cheap RTX 4090 against an H100 — higher price per hour but far more throughput.
- Sizing a dataset. See how cost scales as you add examples or epochs, and find the point of diminishing returns.