Chinchilla Compute-Optimal Calculator
Apply the Chinchilla scaling laws to size an LLM training run. Give the calculator one of three things — a parameter count, a token budget, or a raw compute budget in FLOPs — and it returns the compute-optimal counterpart, the total training FLOPs from the 6ND rule, and an optional GPU-hour and dollar estimate for the run. It is the fastest way to sanity-check whether a model size and dataset are balanced before anyone reserves a cluster.
Training cost (optional)
| Compute-optimal parameters | |
| Compute-optimal tokens | |
| Tokens : parameters | |
| Training compute (6ND) | |
| In PFLOP-days | |
| GPU-hours | |
| Wall-clock time | |
| Estimated cost | |
How to use the Chinchilla Compute-Optimal Calculator
Pick what you already know in the Solve from dropdown. If you have a target model size, choose Parameter count and enter it in billions — the tool returns the token budget Chinchilla says that model should see. If you instead have a fixed dataset, choose Token budget to get the largest model that data can train optimally. If all you have is a compute allocation, choose Compute budget and enter it in ZFLOP (1 ZFLOP = 10²¹ FLOPs) to get both the optimal parameter count and token count at once.
The output table fills in immediately. The first two rows are the compute-optimal pair; the tokens : parameters row shows the roughly 20-to-1 ratio that defines the Chinchilla point. Training compute is the total work in FLOPs from the standard C ≈ 6ND approximation. It is also shown in PFLOP-days (one PFLOP-day is 10¹⁵ FLOP/s sustained for 24 hours, or 8.64 × 10¹⁹ FLOPs) for easy comparison with published runs.
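The three Solve from modes can be sketched in a few lines of Python. This is a minimal illustration of the arithmetic described above, not the calculator's actual source; the function name `solve`, the mode strings, and the returned keys are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20.0               # Chinchilla rule of thumb from the text
FLOPS_PER_PFLOP_DAY = 1e15 * 86_400   # 1 PFLOP/s sustained for one day


def solve(mode: str, value: float) -> dict:
    """Return the compute-optimal pair from one known quantity.

    mode "params":  value is the parameter count N
    mode "tokens":  value is the token budget D
    mode "compute": value is the compute budget C in FLOPs
    """
    if mode == "params":
        n = value
    elif mode == "tokens":
        n = value / TOKENS_PER_PARAM
    elif mode == "compute":
        # C = 6 * N * (20 * N) = 120 * N^2, so N = sqrt(C / 120)
        n = math.sqrt(value / (6.0 * TOKENS_PER_PARAM))
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    d = TOKENS_PER_PARAM * n   # compute-optimal tokens
    c = 6.0 * n * d            # training compute from the 6ND rule
    return {
        "params": n,
        "tokens": d,
        "flops": c,
        "pflop_days": c / FLOPS_PER_PFLOP_DAY,
    }


if __name__ == "__main__":
    result = solve("params", 70e9)
    print(f"tokens:     {result['tokens']:.3e}")
    print(f"flops:      {result['flops']:.3e}")
    print(f"PFLOP-days: {result['pflop_days']:.0f}")
```

For a 70B model this gives 1.4 trillion tokens, about 5.88 × 10²³ FLOPs, and roughly 6,800 PFLOP-days. Inputs are in raw units here (parameters, tokens, FLOPs) rather than the billions and ZFLOP the form uses.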
To turn FLOPs into time and money, fill in the optional cost panel: choose a GPU, set a realistic model FLOPs utilisation (40% is a good default for large dense training; very well-tuned runs reach 50–55%), the number of GPUs, and your hourly rental rate. The calculator then estimates GPU-hours, wall-clock duration, and total dollar cost. All arithmetic runs in your browser.
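The cost panel's arithmetic can be approximated as follows. This is a sketch under stated assumptions: the function and parameter names are hypothetical, and the peak throughput of 10¹⁵ FLOP/s used in the example is a round illustrative number, not the specification of any particular GPU.

```python
def training_cost(
    total_flops: float,
    peak_flops_per_gpu: float,   # peak FLOP/s of one GPU (illustrative input)
    mfu: float,                  # model FLOPs utilisation, e.g. 0.40
    num_gpus: int,
    usd_per_gpu_hour: float,
) -> dict:
    """Estimate GPU-hours, wall-clock hours, and dollar cost for a run."""
    if not 0.0 < mfu <= 1.0:
        raise ValueError("mfu must be in (0, 1]")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    sustained = peak_flops_per_gpu * mfu           # FLOP/s actually achieved per GPU
    gpu_hours = total_flops / sustained / 3600.0   # total across all GPUs
    wall_clock_hours = gpu_hours / num_gpus        # assumes perfect scaling across GPUs
    return {
        "gpu_hours": gpu_hours,
        "wall_clock_hours": wall_clock_hours,
        "cost_usd": gpu_hours * usd_per_gpu_hour,
    }


if __name__ == "__main__":
    # Assumed inputs: 5.88e23 FLOPs, 1e15 peak FLOP/s, 40% MFU, 1024 GPUs, $2/GPU-hour
    est = training_cost(5.88e23, 1e15, 0.40, 1024, 2.0)
    print(f"GPU-hours:  {est['gpu_hours']:,.0f}")
    print(f"Wall-clock: {est['wall_clock_hours'] / 24:.1f} days")
    print(f"Cost:       ${est['cost_usd']:,.0f}")
```

With those assumed inputs the estimate comes to about 408,000 GPU-hours, roughly 16.6 days of wall-clock time, and about $817,000. Note that the GPU count changes the duration but not the GPU-hours or the cost, since the sketch assumes utilisation stays constant as the cluster grows.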
What the Chinchilla scaling laws say
In 2022 DeepMind's Training Compute-Optimal Large Language Models paper (Hoffmann et al., the "Chinchilla" paper) showed that the models of the day were badly undertrained. For a fixed compute budget, the best loss comes from scaling parameters and training tokens in roughly equal proportion — about 20 tokens per parameter. A 70B model is compute-optimal at around 1.4 trillion tokens; doubling the model should come with doubling the data.
The other half of the arithmetic is the cost of a training step. For a dense transformer, one forward-and-backward pass over a token costs about 6N FLOPs, where N is the parameter count: roughly 2N FLOPs for the forward pass (one multiply-add, i.e. two FLOPs, per parameter) plus about 4N for the backward pass, which costs roughly twice the forward. Multiply by the number of tokens D and the whole run costs C ≈ 6ND FLOPs. Substituting the optimal D = 20N gives C ≈ 120N², so N ≈ √(C / 120), which is how the compute-budget mode inverts a FLOP figure back into an optimal model size.
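Written out, the derivation runs as below. The per-token split into 2N forward and 4N backward FLOPs is the usual accounting convention for dense transformers, and the numeric check uses the 70B example from the text.

```latex
\begin{align*}
C_{\text{token}} &\approx \underbrace{2N}_{\text{forward}} + \underbrace{4N}_{\text{backward}} = 6N \\
C &\approx 6ND \\
D = 20N \;\Rightarrow\; C &\approx 6N \cdot 20N = 120N^{2} \\
N &\approx \sqrt{\frac{C}{120}}, \qquad D = 20N \approx \sqrt{\frac{10\,C}{3}} \\[4pt]
\text{Check: } N = 7 \times 10^{10} &\;\Rightarrow\; C \approx 120 \cdot 4.9 \times 10^{21} = 5.88 \times 10^{23}\ \text{FLOPs}
\end{align*}
```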
Two caveats matter in 2026. First, Chinchilla optimises training cost only and ignores inference. Because a smaller model is cheaper to serve, teams deliberately "overtrain" past the 20:1 point: Llama 3 8B saw about 15 trillion tokens, nearly 1,900 tokens per parameter and close to 100 times the Chinchilla ratio, trading extra training compute for a model that is far cheaper to run at scale. Second, the 6ND rule is an approximation that omits attention's quadratic term and assumes a dense model; mixture-of-experts and very long context shift the constant. Treat the numbers here as a well-grounded planning estimate, not an exact budget.
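How far a run sits from the Chinchilla point is a one-line ratio. The sketch below is illustrative only; the function name and returned keys are hypothetical, and the Llama 3 8B inputs are the rounded figures of 8 billion parameters and 15 trillion tokens.

```python
CHINCHILLA_RATIO = 20.0   # tokens per parameter at the compute-optimal point


def overtraining(params: float, tokens: float) -> dict:
    """Compare a run's tokens-per-parameter ratio with the 20:1 point."""
    ratio = tokens / params
    return {
        "tokens_per_param": ratio,
        "multiple_of_chinchilla": ratio / CHINCHILLA_RATIO,
        "flops": 6.0 * params * tokens,   # 6ND estimate for the actual run
    }


if __name__ == "__main__":
    run = overtraining(params=8e9, tokens=15e12)
    print(f"tokens per parameter:   {run['tokens_per_param']:.0f}")
    print(f"multiple of 20:1 point: {run['multiple_of_chinchilla']:.1f}x")
    print(f"training compute:       {run['flops']:.2e} FLOPs")
```

With these rounded inputs the ratio is 1,875 tokens per parameter, about 94 times the 20:1 point, for roughly 7.2 × 10²³ training FLOPs.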
Common use cases
- Sizing a pre-training run. Decide how many tokens to collect for a target model size, or how big a model your existing corpus can justify.
- Sanity-checking a plan. If someone proposes a 30B model on 300B tokens, that is only 10 tokens per parameter, so the 20:1 rule flags it as undertrained at a glance (about 600B tokens would be compute-optimal).
- Estimating budget. Convert a model-and-data plan into FLOPs, GPU-hours, and a dollar figure before reserving compute.
- Comparing to published models. Express a run in PFLOP-days to line it up against GPT-3, Chinchilla, or Llama training compute.
- Teaching scaling intuition. See directly how compute grows with the square of model size at the optimal point.
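Two of the use cases above, sanity-checking a plan and seeing the quadratic growth of compute, can be sketched together. The helper names and the 25% tolerance band around 20:1 are assumptions made for illustration, not thresholds the calculator defines.

```python
OPTIMAL_RATIO = 20.0   # tokens per parameter at the Chinchilla point


def classify_plan(params: float, tokens: float, tolerance: float = 0.25) -> str:
    """Label a plan relative to 20:1; the tolerance band is an arbitrary choice."""
    ratio = tokens / params
    if ratio < OPTIMAL_RATIO * (1.0 - tolerance):
        return "undertrained"
    if ratio > OPTIMAL_RATIO * (1.0 + tolerance):
        return "overtrained"
    return "near-optimal"


def optimal_flops(params: float) -> float:
    """Training compute at the optimal point: 6 * N * (20 * N) = 120 * N^2."""
    return 6.0 * params * (OPTIMAL_RATIO * params)


if __name__ == "__main__":
    print(classify_plan(30e9, 300e9))    # the 30B-on-300B plan from the list above
    print(classify_plan(70e9, 1.4e12))   # the Chinchilla 70B configuration

    # Doubling the model size at the optimal point quadruples the compute.
    for n in (1e9, 2e9, 4e9, 8e9):
        print(f"N = {n:.0e}  ->  C = {optimal_flops(n):.2e} FLOPs")
```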