LLM Training FLOPs Calculator

Estimate the compute needed to train or run a large language model. Training cost follows the well-known 6 × N × D rule (six FLOPs per parameter per token); inference costs about two FLOPs per parameter per token. Enter the model size and token count, pick a GPU, set a model-FLOPs-utilization (MFU), and the calculator returns total FLOPs, GPU-hours, wall-clock time, and dollar cost. Runs entirely in your browser.

FLOPs use the 6ND approximation (forward + backward) for training and 2N per token for inference. TFLOPS figures are dense BF16 peaks; real throughput is peak × MFU. GPU rates are representative on-demand prices — edit to match your host.

How to use the LLM Training FLOPs Calculator

Choose training or inference. For training, enter the parameter count in billions and the number of training tokens; the calculator multiplies their product by six to get total FLOPs. For inference, the token field becomes the number of tokens you want to process, costed at two FLOPs per parameter per token. Then set the hardware: pick a GPU to load its peak BF16 throughput, set how many you run in parallel, and set the model-FLOPs-utilization (MFU), the fraction of peak you actually sustain. The result is total FLOPs, the GPU-hours required, the wall-clock time with the work split evenly across all GPUs, and the dollar cost at your hourly rate.
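The steps above can be sketched in a few lines of Python. This is a minimal illustration of the arithmetic, not the calculator's actual source: the function name, the field names, the 7-billion-parameter model, the 1-trillion-token budget, the 64 GPUs, and the $2.50 hourly rate are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    total_flops: float
    gpu_hours: float
    wall_clock_hours: float
    cost_usd: float


def estimate(params_billions, tokens, peak_tflops, num_gpus, mfu,
             usd_per_gpu_hour, training=True):
    """Hypothetical helper mirroring the calculator's inputs and outputs."""
    if not 0 < mfu <= 1:
        raise ValueError("mfu must be in (0, 1]")
    # 6 FLOPs per parameter per token for training, 2 for inference.
    flops_per_param_token = 6 if training else 2
    total_flops = flops_per_param_token * params_billions * 1e9 * tokens
    # Sustained throughput of one GPU in FLOPs per second.
    per_gpu_flops_per_sec = peak_tflops * 1e12 * mfu
    gpu_hours = total_flops / per_gpu_flops_per_sec / 3600
    # Assumes the work splits evenly across GPUs at the same MFU.
    wall_clock_hours = gpu_hours / num_gpus
    return Estimate(total_flops, gpu_hours, wall_clock_hours,
                    gpu_hours * usd_per_gpu_hour)


# Illustrative inputs only (assumptions, not recommendations).
example = estimate(params_billions=7, tokens=1e12, peak_tflops=990,
                   num_gpus=64, mfu=0.40, usd_per_gpu_hour=2.50)
print(f"{example.total_flops:.2e} FLOPs, {example.gpu_hours:,.0f} GPU-hours, "
      f"{example.wall_clock_hours / 24:.1f} days, ${example.cost_usd:,.0f}")
```

Note that the GPU count changes wall-clock time but not GPU-hours or cost in this sketch, because it assumes MFU stays constant as GPUs are added.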

MFU is what keeps the estimate honest: real training runs typically sustain only 30–55% of a GPU's peak FLOPs because of memory movement, communication, and pipeline bubbles. Time and cost scale inversely with MFU (halving MFU doubles both), so use a realistic figure rather than the theoretical peak.
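Because time and cost are total FLOPs divided by sustained throughput, both scale with 1/MFU. The short sketch below shows that relationship relative to a 40% baseline; the function name and the baseline are assumptions made for illustration.

```python
def cost_multiplier(mfu, baseline_mfu=0.40):
    """How many times longer (and costlier) a run is at `mfu`
    than the same run at `baseline_mfu`. Hypothetical helper."""
    if not 0 < mfu <= 1:
        raise ValueError("mfu must be in (0, 1]")
    return baseline_mfu / mfu


for mfu in (0.30, 0.40, 0.55):
    print(f"MFU {mfu:.0%}: {cost_multiplier(mfu):.2f}x the 40% baseline")
```

Dropping from 40% to 20% MFU gives a multiplier of 2, which is why an optimistic MFU can understate a budget by a wide margin.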

The 6ND rule and model-FLOPs-utilization

The compute to train a dense transformer is captured remarkably well by a single formula: FLOPs ≈ 6 × N × D, where N is the parameter count and D is the number of training tokens. The factor of six comes from roughly two FLOPs per parameter per token for the forward pass and four for the backward pass (computing gradients with respect to both activations and weights). It is the approximation behind most headline training-compute numbers, from GPT-3's 3.1 × 10²³ FLOPs to the 10²⁵-plus budgets of recent frontier models.
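As a quick sanity check of the formula, the sketch below applies 6ND to GPT-3. The 175 billion parameters and roughly 300 billion training tokens are the commonly cited figures and are not stated in the text above, so treat them as assumptions.

```python
def training_flops(params, tokens):
    """6ND approximation: about 2 FLOPs (forward) + 4 FLOPs (backward)
    per parameter per token."""
    return 6 * params * tokens


# Assumed inputs: commonly cited GPT-3 size and token count.
gpt3 = training_flops(params=175e9, tokens=300e9)
print(f"GPT-3 estimate: {gpt3:.2e} FLOPs")
```

The result lands close to the 3.1 × 10²³ figure quoted above; the small gap reflects rounding and the approximate nature of the rule.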

Inference is cheaper per token: a forward pass is about two FLOPs per parameter per token, so serving 1,000 tokens through a 70-billion-parameter model is roughly 1.4 × 10¹⁴ FLOPs. That asymmetry is why training a model once can cost millions of dollars while a single inference call costs a fraction of a cent.
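The example in the paragraph above can be reproduced directly. The sketch also compares one call against a training run of the same model; the 1-trillion-token training budget is a hypothetical value picked only to illustrate the asymmetry.

```python
def inference_flops(params, tokens):
    """About 2 FLOPs per parameter per token (forward pass only)."""
    return 2 * params * tokens


def training_flops(params, tokens):
    """6ND approximation (forward + backward)."""
    return 6 * params * tokens


per_call = inference_flops(params=70e9, tokens=1_000)
# Hypothetical training budget for the same model (assumption).
training = training_flops(params=70e9, tokens=1e12)

print(f"one 1,000-token call: {per_call:.1e} FLOPs")
print(f"training is about {training / per_call:.0e} times one call")
```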

Turning FLOPs into time and money requires model-FLOPs-utilization (MFU) — the fraction of a GPU's theoretical peak you actually achieve. Memory bandwidth limits, all-reduce communication across GPUs, and pipeline gaps mean even well-tuned training runs sustain only 30–55% of peak; inference is often lower. Multiplying the GPU's peak TFLOPS by your MFU and the number of GPUs gives the effective throughput, and dividing total FLOPs by that gives the wall-clock time. This calculator does that arithmetic and applies your hourly GPU rate to get a cost.
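Written out as equations, the conversion described above looks roughly like this. The symbols are notation introduced here for illustration: C is total FLOPs, G the number of GPUs, P_peak the per-GPU peak in FLOPs per second, t_wall the wall-clock time in seconds, and r the hourly rate per GPU.

```latex
\begin{align*}
R_{\text{eff}} &= G \cdot P_{\text{peak}} \cdot \mathrm{MFU}
  && \text{effective throughput (FLOPs/s)} \\
t_{\text{wall}} &= \frac{C}{R_{\text{eff}}}
  && \text{wall-clock time (s)} \\
\text{GPU-hours} &= \frac{G \cdot t_{\text{wall}}}{3600}
  = \frac{C}{3600 \cdot P_{\text{peak}} \cdot \mathrm{MFU}} \\
\text{cost} &= \text{GPU-hours} \times r
\end{align*}
```

G cancels out of the GPU-hours expression, so under the constant-MFU assumption adding GPUs shortens the run without changing its cost.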

Common use cases

  • Budgeting a training run. Estimate the GPU-hours and dollars to pre-train or fine-tune a model.
  • Sizing a cluster. See how many GPUs you need to finish training within a target number of days.
  • Reproducing paper figures. Check a model's reported training compute against its size and token count.
  • Inference capacity planning. Estimate the compute to serve a given monthly token volume.
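For the cluster-sizing case, the same arithmetic can be inverted to solve for the GPU count. The sketch below is one way to do it; the function name and the 13-billion-parameter, 1-trillion-token, 30-day scenario are hypothetical, and it assumes MFU does not degrade as the cluster grows, which is optimistic for large clusters.

```python
import math


def gpus_needed(params, tokens, peak_tflops, mfu, target_days):
    """Smallest GPU count that finishes a 6ND training run in `target_days`.
    Hypothetical helper; assumes constant MFU at any cluster size."""
    if target_days <= 0 or not 0 < mfu <= 1:
        raise ValueError("target_days must be > 0 and mfu in (0, 1]")
    total_flops = 6 * params * tokens
    # FLOPs one GPU delivers over the whole target window.
    flops_per_gpu = peak_tflops * 1e12 * mfu * target_days * 86_400
    return math.ceil(total_flops / flops_per_gpu)


# Illustrative scenario (assumed values).
n = gpus_needed(params=13e9, tokens=1e12, peak_tflops=990,
                mfu=0.40, target_days=30)
print(f"GPUs needed: {n}")
```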

Frequently asked questions

Where does the factor of 6 come from?

A forward pass costs about two FLOPs per parameter per token (one multiply and one add). The backward pass costs about twice that — roughly four FLOPs — because it computes gradients with respect to both activations and weights. Together that is six FLOPs per parameter per token, giving the 6ND training rule.

What is a realistic MFU?

Well-optimized large-scale training runs report 30–55% MFU on modern GPUs; 40% is a reasonable default. Smaller or communication-heavy setups can be lower, and inference MFU is often below training because batch sizes and sequence lengths vary. Use a number you can actually sustain, not the theoretical peak.

Why does my cost differ from a published training cost?

Published costs include data preparation, failed runs, checkpointing overhead, and idle time, and often use reserved or owned hardware at very different hourly rates. This calculator estimates the ideal compute cost for a single clean run, which is a lower bound on the real project cost.

Does the 6ND rule work for mixture-of-experts models?

Use the active parameter count, not the total. An MoE model routes each token through only a few experts, so its training and inference FLOPs scale with active parameters. Enter the active count to get a meaningful estimate.
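To see how much the choice matters, the sketch below compares the two estimates for a made-up MoE model. The 100 billion total and 15 billion active parameters are hypothetical numbers, not figures for any real model.

```python
def training_flops(params, tokens):
    """6ND approximation; pass the ACTIVE parameter count for MoE models."""
    return 6 * params * tokens


# Hypothetical MoE model (assumed values for illustration only).
total_params = 100e9
active_params = 15e9
tokens = 1e12

with_active = training_flops(active_params, tokens)
with_total = training_flops(total_params, tokens)
overestimate = with_total / with_active

print(f"active-parameter estimate: {with_active:.1e} FLOPs")
print(f"total-parameter estimate:  {with_total:.1e} FLOPs")
print(f"using total parameters overstates compute by {overestimate:.1f}x")
```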

Are the TFLOPS numbers peak or sustained?

They are dense BF16 peak figures from the GPU vendors. Sustained throughput is peak times your MFU, which is why the MFU field matters so much — a 990-TFLOPS H100 at 40% MFU delivers about 396 effective TFLOPS for training.