Chinchilla Compute-Optimal Calculator

Apply the Chinchilla scaling laws to size an LLM training run. Give the calculator one of three things — a parameter count, a token budget, or a raw compute budget in FLOPs — and it returns the compute-optimal counterpart, the total training FLOPs from the 6ND rule, and an optional GPU-hour and dollar estimate for the run. It is the fastest way to sanity-check whether a model size and dataset are balanced before anyone reserves a cluster.

Calculator outputs: compute-optimal parameters, compute-optimal tokens, tokens : parameters, training compute (6ND, also in PFLOP-days), and, from the optional training-cost panel, GPU-hours, wall-clock time, and estimated cost.

How to use the Chinchilla Compute-Optimal Calculator

Pick what you already know in the Solve from dropdown. If you have a target model size, choose Parameter count and enter it in billions — the tool returns the token budget Chinchilla says that model should see. If you instead have a fixed dataset, choose Token budget to get the largest model that data can train optimally. If all you have is a compute allocation, choose Compute budget and enter it in ZFLOP (1 ZFLOP = 10²¹ FLOPs) to get both the optimal parameter count and token count at once.
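The three modes can be sketched in a few lines of Python. This is an illustrative reconstruction from the 20:1 ratio and the 6ND rule described on this page, not the calculator's actual source; the function name `solve` and the mode strings are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20.0       # Chinchilla rule of thumb quoted on this page
FLOPS_PER_PARAM_TOKEN = 6.0   # from the C ≈ 6ND approximation


def solve(mode: str, value: float) -> dict:
    """Return the compute-optimal parameters, tokens, and training FLOPs.

    mode is "params" (value in parameters), "tokens" (value in tokens),
    or "compute" (value in FLOPs). The names are hypothetical.
    """
    if mode == "params":
        n = value
    elif mode == "tokens":
        n = value / TOKENS_PER_PARAM
    elif mode == "compute":
        # C = 6 * N * (20 * N) = 120 * N^2, so N = sqrt(C / 120)
        n = math.sqrt(value / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    d = TOKENS_PER_PARAM * n
    return {"params": n, "tokens": d, "flops": FLOPS_PER_PARAM_TOKEN * n * d}


# Parameter-count mode: a 70B model (the UI takes billions; here raw counts).
plan = solve("params", 70e9)
print(f"{plan['tokens'] / 1e12:.2f}T tokens, {plan['flops'] / 1e21:.0f} ZFLOP")
```

All three modes reduce to finding N first, which is why a single compute figure is enough to recover both halves of the pair.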

The output table fills in immediately. The first two rows are the compute-optimal pair; the tokens : parameters row shows the roughly 20-to-1 ratio that defines the Chinchilla point. Training compute is the total work in FLOPs from the standard C ≈ 6ND approximation, also shown in PFLOP-days for easy comparison with published runs.
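For reference, one PFLOP-day is 10¹⁵ FLOP/s sustained for 86,400 seconds, or 8.64 × 10¹⁹ FLOPs. A minimal sketch of the conversion follows; the helper names are hypothetical and the 70B / 1.4T figures are the example used elsewhere on this page.

```python
SECONDS_PER_DAY = 86_400
PFLOP_PER_SECOND = 1e15  # one petaFLOP per second


def training_flops(params: float, tokens: float) -> float:
    """Total training work from the C ≈ 6ND approximation."""
    return 6.0 * params * tokens


def to_pflop_days(flops: float) -> float:
    """Convert raw FLOPs to PFLOP-days (1 PFLOP/s held for one day)."""
    return flops / (PFLOP_PER_SECOND * SECONDS_PER_DAY)


flops = training_flops(70e9, 1.4e12)
print(f"{flops:.3e} FLOPs = {to_pflop_days(flops):,.0f} PFLOP-days")
```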

To turn FLOPs into time and money, fill in the optional cost panel: choose a GPU, set a realistic model FLOPs utilisation (40% is a good default for large dense training; very well-tuned runs reach 50–55%), the number of GPUs, and your hourly rental rate. The calculator then estimates GPU-hours, wall-clock duration, and total dollar cost. All arithmetic runs in your browser.
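The cost panel's arithmetic can be approximated as below. This is a sketch under stated assumptions rather than the tool's own code: the peak throughput of 10¹⁵ FLOP/s and the $2.00 hourly rate are round placeholder numbers, not the specification or price of any particular GPU, and `estimate_run` is a hypothetical name.

```python
def estimate_run(total_flops: float, peak_flops_per_sec: float, mfu: float,
                 num_gpus: int, usd_per_gpu_hour: float) -> dict:
    """Estimate GPU-hours, wall-clock days, and dollar cost for a run."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("mfu must be a fraction in (0, 1]")
    # Throughput each GPU actually sustains after utilisation losses.
    sustained = peak_flops_per_sec * mfu
    gpu_hours = total_flops / sustained / 3600.0
    return {
        "gpu_hours": gpu_hours,
        # Assumes perfect scaling across GPUs; MFU is meant to absorb the losses.
        "wall_clock_days": gpu_hours / num_gpus / 24.0,
        "cost_usd": gpu_hours * usd_per_gpu_hour,
    }


# Placeholder inputs: 5.88e23 FLOPs, 1e15 FLOP/s peak, 40% MFU, 1024 GPUs, $2/h.
run = estimate_run(5.88e23, 1e15, 0.40, 1024, 2.00)
print(f"{run['gpu_hours']:,.0f} GPU-hours, "
      f"{run['wall_clock_days']:.1f} days, ${run['cost_usd']:,.0f}")
```

GPU-hours and cost do not depend on the GPU count in this model; adding GPUs only shortens the wall-clock time.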

What the Chinchilla scaling laws say

In 2022 DeepMind's Training Compute-Optimal Large Language Models paper (Hoffmann et al., the "Chinchilla" paper) showed that the models of the day were badly undertrained. For a fixed compute budget, the best loss comes from scaling parameters and training tokens in roughly equal proportion — about 20 tokens per parameter. A 70B model is compute-optimal at around 1.4 trillion tokens; doubling the model should come with doubling the data.

The other half of the arithmetic is the cost of a training step. For a dense transformer, one forward-and-backward pass over a token costs about 6N FLOPs, where N is the parameter count: two FLOPs per parameter for the forward multiply-add, plus about four for the backward pass, which costs roughly twice the forward pass. Multiply by the number of tokens D and the whole run costs C ≈ 6ND FLOPs. Substituting the optimal D = 20N gives C ≈ 120N², which is how the compute-budget mode inverts a FLOP figure back into an optimal model size: N ≈ √(C / 120).
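Written out, the substitution and its inversion look like this. The worked numbers reuse the 70B example from this page; the derivation assumes the 20:1 ratio holds exactly.

```latex
\begin{align*}
C &\approx 6ND, \qquad D = 20N \\
C &\approx 6N(20N) = 120N^{2} \\
N &\approx \sqrt{C/120}, \qquad D = 20N \approx \sqrt{10C/3} \\[4pt]
\text{Example: } C &= 5.88 \times 10^{23}
  \;\Rightarrow\; N \approx \sqrt{4.9 \times 10^{21}} = 7 \times 10^{10},
  \quad D \approx 1.4 \times 10^{12}
\end{align*}
```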

Two caveats matter in 2026. First, Chinchilla optimises training cost only; it ignores inference. Because a smaller model is cheaper to serve, teams deliberately "overtrain" past the 20:1 point: Llama 3 8B saw 15 trillion tokens, nearly 1,900 tokens per parameter and almost 100 times the Chinchilla ratio, trading extra training compute for a model that is far cheaper to run at scale. Second, the 6ND rule is an approximation that omits attention's quadratic term and assumes a dense model; mixture-of-experts and very long context shift the constant. Treat the numbers here as a well-grounded planning estimate, not an exact budget.
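To see how far a given run sits from the 20:1 point, divide tokens by parameters and compare. The sketch below uses the 8B-parameter, 15-trillion-token figures quoted above; the function names are hypothetical.

```python
def tokens_per_param(params: float, tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


def overtraining_factor(params: float, tokens: float,
                        optimal_ratio: float = 20.0) -> float:
    """How many times past the Chinchilla ratio a run goes (1.0 = optimal)."""
    return tokens_per_param(params, tokens) / optimal_ratio


params, tokens = 8e9, 15e12
print(f"{tokens_per_param(params, tokens):,.0f} tokens per parameter")
print(f"{overtraining_factor(params, tokens):.1f}x the 20:1 ratio")
print(f"{6.0 * params * tokens:.2e} training FLOPs")
```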

Common use cases

  • Sizing a pre-training run. Decide how many tokens to collect for a target model size, or how big a model your existing corpus can justify.
  • Sanity-checking a plan. If someone proposes a 30B model on 300B tokens, the 20:1 rule flags it as undertrained at a glance.
  • Estimating budget. Convert a model-and-data plan into FLOPs, GPU-hours, and a dollar figure before reserving compute.
  • Comparing to published models. Express a run in PFLOP-days to line it up against GPT-3, Chinchilla, or Llama training compute.
  • Teaching scaling intuition. See directly how compute grows with the square of model size at the optimal point.
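The sanity check in the second use case is easy to automate. The following is a minimal sketch; the 25% tolerance band around 20:1 is an arbitrary illustrative choice, not something the Chinchilla paper or this calculator defines.

```python
def check_plan(params: float, tokens: float,
               optimal_ratio: float = 20.0, tolerance: float = 0.25):
    """Classify a (params, tokens) plan against the 20:1 rule of thumb.

    tolerance is a hypothetical band: ratios within +/-25% of optimal
    are reported as near compute-optimal.
    """
    ratio = tokens / params
    if ratio < optimal_ratio * (1.0 - tolerance):
        verdict = "undertrained"
    elif ratio > optimal_ratio * (1.0 + tolerance):
        verdict = "overtrained"
    else:
        verdict = "near compute-optimal"
    return ratio, verdict


# The 30B-on-300B plan from the list above.
ratio, verdict = check_plan(30e9, 300e9)
print(f"{ratio:.0f}:1 -> {verdict}; 20:1 would need {20 * 30e9 / 1e9:.0f}B tokens")
```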

Frequently asked questions

Why 20 tokens per parameter?

That is the empirical ratio the Chinchilla paper found minimises loss for a fixed training-compute budget, by fitting scaling curves across hundreds of runs. It balances spending compute on a bigger model against spending it on more data. The exact figure drifts a little with architecture and data quality, but 20:1 is the standard rule of thumb.

Where does the 6ND formula come from?

A forward pass through a dense transformer costs about 2N FLOPs per token (one multiply and one add per parameter), and the backward pass costs roughly twice the forward pass, giving about 6N FLOPs per token end to end. Multiplying by D tokens yields C ≈ 6ND. The rule ignores the attention-score computation, whose cost grows with context length rather than parameter count, so it is most accurate when the model is large relative to its context; for typical configurations it is accurate to within a few percent.

Should I always train at the Chinchilla-optimal point?

Not necessarily. Chinchilla minimises training compute, but if a model will serve billions of inference tokens, a smaller model trained well past 20:1 is cheaper overall. Most recent open models are deliberately overtrained for this reason. Use this tool to find the optimal baseline, then decide how far to overtrain based on your inference volume.
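One rough way to frame the trade-off is to add inference compute, at about 2N FLOPs per generated or processed token (the forward-pass cost), to the 6ND training cost. The sketch below finds the inference volume at which a smaller, overtrained plan becomes cheaper overall than a larger one. It assumes the two plans reach comparable quality, which it does not check, and both example plans are hypothetical.

```python
def lifetime_flops(params: float, train_tokens: float,
                   inference_tokens: float) -> float:
    """Training (6ND) plus inference (2N per token) compute."""
    return 6.0 * params * train_tokens + 2.0 * params * inference_tokens


def break_even_inference_tokens(big: tuple, small: tuple) -> float:
    """Inference tokens beyond which the smaller plan costs less in total.

    big and small are (params, train_tokens) pairs.
    """
    n_big, d_big = big
    n_small, d_small = small
    if n_small >= n_big:
        raise ValueError("small plan must have fewer parameters")
    extra_training = 6.0 * (n_small * d_small - n_big * d_big)
    saving_per_token = 2.0 * (n_big - n_small)
    # If the small plan is already cheaper to train, it wins from token zero.
    return max(0.0, extra_training / saving_per_token)


# Hypothetical plans: 70B on 1.4T tokens versus 30B on 4T tokens.
t = break_even_inference_tokens((70e9, 1.4e12), (30e9, 4e12))
print(f"break-even at about {t / 1e12:.2f}T inference tokens")
```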

What model FLOPs utilisation should I use?

MFU is the fraction of a GPU's peak FLOP/s that your training run actually sustains once communication, memory stalls, and pipeline bubbles are accounted for. Large dense pre-training typically lands between 35% and 55%; 40% is a safe default. Smaller models, long context, and heavy parallelism push it lower. The time and dollar estimates scale inversely with MFU, so it is the biggest lever on both.
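Because the estimate is inversely proportional to MFU, the effect of a different utilisation can be read off as a simple ratio against the 40% default. A small sketch, with a hypothetical helper name:

```python
def relative_cost(mfu: float, baseline_mfu: float = 0.40) -> float:
    """Cost (or time) multiplier relative to a run at baseline_mfu."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("mfu must be a fraction in (0, 1]")
    return baseline_mfu / mfu


for mfu in (0.35, 0.40, 0.55):
    print(f"MFU {mfu:.0%}: {relative_cost(mfu):.2f}x the cost at 40%")
```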

Does this work for mixture-of-experts models?

Only loosely. MoE models activate a fraction of their parameters per token, so the effective N for the 6ND compute is the active parameter count, not the total. Enter the active parameters for a closer FLOP estimate, but be aware that MoE scaling laws differ from the dense Chinchilla fit, so treat the result as a rough guide.
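As a rough illustration of why the active count matters, the sketch below compares the 6ND estimate using active versus total parameters. The 100B-total, 15B-active model and the 2T-token budget are made-up numbers, not a real architecture.

```python
def training_flops(params: float, tokens: float) -> float:
    """C ≈ 6ND; for an MoE model, pass the active parameter count as params."""
    return 6.0 * params * tokens


total_params = 100e9   # hypothetical MoE: all experts counted
active_params = 15e9   # hypothetical: parameters used per token
tokens = 2e12

print(f"using active params: {training_flops(active_params, tokens):.2e} FLOPs")
print(f"using total params:  {training_flops(total_params, tokens):.2e} FLOPs")
```

Entering the total count here would overstate the compute by the ratio of total to active parameters.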