LLM Training FLOPs Calculator
Estimate the compute needed to train or run a large language model. Training cost follows the well-known 6 × N × D rule (six FLOPs per parameter per token); inference costs about two FLOPs per parameter per token. Enter the model size and token count, pick a GPU and its model-FLOPs-utilization, and the calculator returns total FLOPs, GPU-hours, wall-clock time, and dollar cost. Runs entirely in your browser.
FLOPs use the 6ND approximation (forward + backward) for training and 2N per token for inference. TFLOPS figures are dense BF16 peaks; real throughput is peak × MFU. GPU rates are representative on-demand prices — edit to match your host.
How to use the LLM Training FLOPs Calculator
Choose training or inference. For training, enter the parameter count in billions and the number of training tokens; the calculator multiplies their product by six to get total FLOPs. For inference, the token field becomes the number of tokens you want to process, costed at two FLOPs per parameter per token. Then set the hardware: pick a GPU to load its peak BF16 throughput, set how many you run in parallel, and set the model-FLOPs-utilization (MFU), the fraction of peak you actually sustain. The result is total FLOPs, the GPU-hours required, the wall-clock time when the work is split across all GPUs, and the dollar cost at your hourly rate.
MFU is the input that most affects the realism of the estimate: real training runs sustain only 30–55% of a GPU's peak FLOPs because of memory movement, communication, and pipeline bubbles. Time and cost scale inversely with MFU, so halving MFU doubles both. Use a realistic figure rather than the theoretical peak.
The 6ND rule and model-FLOPs-utilization
The compute to train a dense transformer is captured remarkably well by a single formula: FLOPs ≈ 6 × N × D, where N is the parameter count and D is the number of training tokens. The factor of six comes from roughly two FLOPs per parameter per token for the forward pass and four for the backward pass (computing gradients with respect to both activations and weights). It is the formula behind most headline training-compute numbers, from GPT-3's roughly 3.1 × 10²³ FLOPs (175 billion parameters, 300 billion tokens) to the 10²⁵-plus budgets of recent frontier models.
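As a quick check of the rule, here is a minimal Python sketch. The function name `training_flops` is illustrative and is not taken from the calculator's own code; the GPT-3 inputs are its published size of 175 billion parameters and 300 billion training tokens.

```python
def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute for a dense transformer: 6 * N * D."""
    forward = 2 * params * tokens   # ~2 FLOPs per parameter per token
    backward = 4 * params * tokens  # ~4 FLOPs per parameter per token
    return forward + backward


# GPT-3: 175 billion parameters, 300 billion training tokens
gpt3 = training_flops(175e9, 300e9)
print(f"{gpt3:.2e}")  # 3.15e+23
```

The result lands within a couple of percent of the commonly quoted figure, which is about as much precision as the approximation supports.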
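Inference is cheaper per token: a forward pass is about two FLOPs per parameter per token, so serving 1,000 tokens through a 70-billion-parameter model is roughly 1.4 × 10¹⁴ FLOPs. The per-token gap is only a factor of three; the much larger difference is token volume, since a training run processes billions to trillions of tokens while a single request processes a few thousand. Together these explain why training a model once can cost millions of dollars while a single inference call costs a fraction of a cent.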
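The inference figure can be reproduced the same way. The sketch below is illustrative only: `inference_flops` is a made-up name, and the one-trillion-token training budget is an assumed value chosen to show the scale gap, not a figure for any real model.

```python
def inference_flops(params: float, tokens: float) -> float:
    """Approximate forward-pass compute: 2 FLOPs per parameter per token."""
    return 2 * params * tokens


# One request: 1,000 tokens through a 70-billion-parameter model
call = inference_flops(70e9, 1_000)
print(f"{call:.1e}")  # 1.4e+14

# Assumed training budget of 1 trillion tokens for the same model (hypothetical)
train = 6 * 70e9 * 1e12
print(f"{train / call:.1e}")  # 3.0e+09 requests' worth of compute
```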
Turning FLOPs into time and money requires model-FLOPs-utilization (MFU) — the fraction of a GPU's theoretical peak you actually achieve. Memory bandwidth limits, all-reduce communication across GPUs, and pipeline gaps mean even well-tuned training runs sustain only 30–55% of peak; inference is often lower. Multiplying the GPU's peak TFLOPS by your MFU and the number of GPUs gives the effective throughput, and dividing total FLOPs by that gives the wall-clock time. This calculator does that arithmetic and applies your hourly GPU rate to get a cost.
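The arithmetic described above can be sketched in a few lines of Python. This is an approximation of the steps, not the calculator's actual source: the names `RunEstimate` and `estimate_run` are hypothetical, the 989 TFLOPS value is the published dense BF16 peak of an H100 SXM, and the 40% MFU, 256-GPU cluster, and $2.50 hourly rate are assumed inputs.

```python
from typing import NamedTuple


class RunEstimate(NamedTuple):
    gpu_hours: float
    wall_clock_hours: float
    cost_usd: float


def estimate_run(total_flops: float, peak_tflops: float, mfu: float,
                 num_gpus: int, usd_per_gpu_hour: float) -> RunEstimate:
    """Convert a FLOPs budget into GPU-hours, wall-clock time, and cost."""
    # Effective per-GPU throughput: peak (TFLOPS -> FLOPS) times utilization
    effective_flops_per_s = peak_tflops * 1e12 * mfu
    gpu_hours = total_flops / effective_flops_per_s / 3600
    # Assumes the work divides evenly across GPUs at the same MFU
    wall_clock_hours = gpu_hours / num_gpus
    cost_usd = gpu_hours * usd_per_gpu_hour
    return RunEstimate(gpu_hours, wall_clock_hours, cost_usd)


# Hypothetical run: 7B parameters, 1T tokens -> 6 * N * D = 4.2e22 FLOPs
est = estimate_run(6 * 7e9 * 1e12, peak_tflops=989, mfu=0.40,
                   num_gpus=256, usd_per_gpu_hour=2.50)
print(f"{est.gpu_hours:,.0f} GPU-hours, "
      f"{est.wall_clock_hours / 24:.1f} days, ${est.cost_usd:,.0f}")
```

With these assumed inputs the run comes out to roughly 29,500 GPU-hours, a little under five days of wall-clock time. Because MFU sits in the denominator, rerunning with `mfu=0.20` doubles both GPU-hours and cost.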
Common use cases
- Budgeting a training run. Estimate the GPU-hours and dollars to pre-train or fine-tune a model.
- Sizing a cluster. See how many GPUs you need to finish training within a target number of days.
- Reproducing paper figures. Check a model's reported training compute against its size and token count.
- Inference capacity planning. Estimate the compute to serve a given monthly token volume.
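For the cluster-sizing case, the same relationship can be inverted to solve for GPU count given a deadline. The sketch below is a rough guide under stated assumptions: `gpus_needed` is a hypothetical helper, 312 TFLOPS is the published dense BF16 peak of an A100, and the model size, token count, MFU, and 30-day target are illustrative inputs. It also assumes MFU stays constant as the cluster grows, which is optimistic because communication overhead usually rises with scale.

```python
import math


def gpus_needed(params: float, tokens: float, peak_tflops: float,
                mfu: float, target_days: float) -> int:
    """Smallest GPU count that finishes a 6*N*D training run in target_days."""
    total_flops = 6 * params * tokens
    effective_flops_per_s = peak_tflops * 1e12 * mfu  # per GPU
    gpu_seconds = total_flops / effective_flops_per_s
    # Round up: a fractional GPU is not available
    return math.ceil(gpu_seconds / (target_days * 86_400))


# Hypothetical: 7B parameters, 1T tokens, A100 at 40% MFU, 30-day deadline
print(gpus_needed(7e9, 1e12, peak_tflops=312, mfu=0.40, target_days=30))  # 130
```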