LLM Fine-Tuning Cost Calculator

Estimate what it costs to fine-tune a model. Training compute follows a simple rule — roughly 6 × parameters × tokens floating-point operations for a full pass — and from there it is just your GPU's throughput and hourly price. Choose a method (LoRA, QLoRA, or full), enter your dataset and model size, and get GPU-hours and a dollar estimate. All local; no data leaves your browser.

Total tokens processed
Total compute
GPU-hours
Estimated cost

How to use the LLM Fine-Tuning Cost Calculator

Pick a fine-tuning method: full fine-tuning updates every weight and uses the most compute; LoRA freezes the base model and trains small adapters, cutting compute by roughly a third; QLoRA is LoRA on top of a 4-bit quantized base, which slashes memory so it fits on smaller cards but runs a little slower. Enter your model size, the number of training examples and their average length, and how many epochs you will run.

Choose a GPU to fill in its throughput and a typical rental price, or type your own price per GPU-hour. The MFU (model FLOPs utilization) slider is the fraction of the GPU's peak you actually achieve — 30-50% is realistic for well-tuned training, so 40% is a sensible default. The result gives total tokens, total compute, GPU-hours, and the dollar cost. Wall-clock time is GPU-hours divided by the number of GPUs you run in parallel.

The arithmetic of fine-tuning cost

Training compute is remarkably predictable. A forward pass through a dense transformer costs about 2 × N floating-point operations per token, where N is the parameter count; the backward pass costs about twice that, giving the well-known 6ND rule — six times parameters times tokens for a full training run. Multiply by epochs and you have the total compute. Divide by what your GPU can actually deliver (its peak throughput times the utilization you achieve) and you get GPU-hours; multiply by the hourly price and you get dollars.

The method changes the constant. Full fine-tuning pays the whole 6ND because it computes gradients for every weight. LoRA freezes the base model and only trains small low-rank adapters, so it skips the weight-gradient computation — roughly a 4ND profile — while still running a full forward and backward through the frozen network. QLoRA uses the same adapter approach but keeps the frozen base in 4-bit, which dramatically lowers memory (letting you fine-tune a 70B model on a single 48 GB card) at the cost of some speed from dequantizing weights on the fly.

The number people most often get wrong is utilization. A GPU's advertised TFLOPS is a peak that real training never reaches — data loading, communication, attention overhead, and imperfect kernels mean 30-50% is typical, and large distributed runs can be lower. That is the MFU figure here. Note too that this estimates compute cost only: it does not include data preparation, failed runs, hyperparameter sweeps, or storage. For a single clean run it is accurate; budget a multiple of it for a real project with experimentation.

Common use cases

  • Budgeting a fine-tune. Get a dollar figure before you rent GPUs, so there are no surprises on the cloud bill.
  • Choosing LoRA vs full. See how much compute LoRA saves over full fine-tuning for your model and dataset.
  • Comparing GPUs. Weigh a cheap RTX 4090 against an H100 — higher price per hour but far more throughput.
  • Sizing a dataset. See how cost scales as you add examples or epochs, and find the point of diminishing returns.

Frequently asked questions

What is the 6ND rule?

A standard estimate that a full training pass costs about six times the parameter count times the number of tokens in floating-point operations: roughly 2N for the forward pass and 4N for the backward pass per token. It is the basis of most training-compute and cost estimates, including this one.

Why is LoRA cheaper than full fine-tuning?

LoRA freezes the base model and trains only small adapter matrices, so it does not compute or store gradients for the billions of frozen weights. That removes a large part of the backward-pass work and most of the optimizer memory, giving roughly a 4ND profile instead of 6ND, plus far lower memory.

What MFU should I use?

Model FLOPs utilization is the fraction of a GPU's peak throughput you actually achieve. Well-optimized single-node training reaches 40-50%; large multi-node runs are often 30-40% or lower. The 40% default is reasonable; lower it if your pipeline is data-bound or poorly optimized.

Does this include the cost of failed runs and tuning?

No. It estimates one clean training run's compute. Real projects involve data preparation, hyperparameter sweeps, and the occasional crashed run, so multiply the figure by a comfortable factor when budgeting a whole project rather than a single run.

Why does QLoRA show a time penalty?

QLoRA stores the frozen base model in 4-bit and dequantizes weights on the fly during the forward and backward passes. That extra work makes each step roughly 30% slower than standard LoRA, which is reflected here, in exchange for a large reduction in memory that lets big models fit on small GPUs.
Embed this tool on your site

Free to embed, no attribution required (but appreciated). Paste this where you want the tool to appear: