Transformer FLOPs Calculator

Estimate how much arithmetic a transformer does. Enter the sequence length, model dimension, layer count and feed-forward size and get the total FLOPs for a forward pass — or a training step using the ≈3× forward-plus-backward rule — with a per-component breakdown that separates the attention projections, the quadratic score-and-value terms, the MLP and the output logits. It shows how much of the cost is the quadratic attention term that grows with context length, supports SwiGLU/GLU MLPs, and runs entirely in your browser.

How to use the Transformer FLOPs Calculator

Enter the architecture: the sequence length you're processing, the model (hidden) dimension d, the number of transformer layers, the feed-forward width d_ff (leave it blank to default to 4·d), the vocabulary size for the final projection, and a batch size. Choose whether the MLP is a SwiGLU/GLU block — three matmuls, as in Llama and Mistral — or a standard two-matmul MLP, and whether you want the cost of a forward pass alone or a full training step. The breakdown updates immediately: total FLOPs, cost per forward pass, attention and MLP per layer, cost per token, and the logits projection.

The estimate counts each multiply-add as two FLOPs and sums the standard matrix-multiply costs: the QKV and output projections (6·s·d² and 2·s·d² per layer), the two quadratic attention terms (2·s²·d each, for the scores and the value aggregation), and the MLP (2·s·d·d_ff per matmul). Training multiplies the forward cost by three to approximate the backward pass. The figures cover the dense matmul arithmetic that dominates real runtime; they exclude softmax, normalisation, activation functions and elementwise operations, which are comparatively cheap, so treat the result as a tight lower bound for planning rather than an exact hardware counter.

Where a transformer spends its FLOPs

Almost all of a transformer's arithmetic is matrix multiplication, and counting it tells you most of what you need to know about cost. Each layer has two halves. The attention block projects the input into queries, keys and values (three d×d matmuls), computes a score matrix by multiplying queries against keys, uses it to take a weighted sum of values, and projects the result back out. The MLP block expands the dimension to a wider feed-forward size and contracts it again — two matmuls for a classic MLP, three for the gated SwiGLU variant used in most modern models. Stack that over many layers, add a final projection from the hidden state to vocabulary-sized logits, and you have the whole forward pass.

The crucial structural fact is that two of those terms scale with the square of the sequence length. The score matrix is sequence-by-sequence, and so is the value aggregation, so their cost grows as s²·d while everything else grows linearly in s. At short context the linear projection and MLP terms dominate and attention looks cheap; as the context grows into the tens of thousands of tokens, the quadratic terms take over and become the bottleneck. This single quadratic is why long-context inference is expensive, why the KV cache matters so much, and why a whole research literature exists on linear and sparse attention approximations. The breakdown here makes the crossover visible: raise the sequence length and watch the share of each layer climb.

For training, the rule of thumb is that a backward pass costs roughly twice a forward pass — it computes gradients with respect to both the inputs and the weights — so a full step is about three times the forward FLOPs. Multiply by the number of tokens and you get the familiar ≈6·N·D heuristic for training a model of N parameters on D tokens. These counts are not a substitute for profiling real hardware, where memory bandwidth, kernel efficiency and overlap matter, but they are the right first-order model for comparing architectures, estimating training budgets, and understanding why context length is the dominant cost lever in modern LLMs.

Common use cases

  • Training budget estimates. Multiply per-step FLOPs by steps to size a training run before committing GPUs.
  • Context-length cost. See exactly how the quadratic s² term takes over as sequences grow long.
  • Architecture comparison. Weigh attention vs. MLP cost, or SwiGLU vs. standard MLP, across model shapes.
  • Inference sizing. Estimate the arithmetic per forward pass and per token for a deployment.

Frequently asked questions

How are the FLOPs counted?

Each matrix multiply is counted as two FLOPs per multiply-add. The tool sums the QKV projections (6·s·d²), the output projection (2·s·d²), the two quadratic attention terms (2·s²·d each), the MLP (2·s·d·d_ff per matmul), and the final logits projection (2·s·d·V). Softmax, normalisation and activations are excluded as they are small by comparison.

Why does attention cost grow with the square of sequence length?

The attention score matrix has one entry per pair of positions, so it is sequence-by-sequence in size, and both computing it and applying it to the values cost on the order of s²·d. Every other term scales linearly with sequence length, so the quadratic part dominates as context grows.

What does the training mode do?

It multiplies the forward-pass FLOPs by three, the standard approximation that a backward pass costs about twice the forward pass (gradients w.r.t. both activations and weights). This gives the familiar ≈6·N·D estimate for training a model of N parameters on D tokens.

What is the difference between SwiGLU and a standard MLP here?

A standard MLP uses two matmuls (up then down projection). SwiGLU/GLU, used in Llama, Mistral and most recent models, uses three because it has a separate gate projection. Selecting it raises the MLP cost by 50% relative to the standard block at the same d_ff.

Will these match what I measure on a GPU?

The arithmetic is exact for the matmuls counted, but real wall-clock time also depends on memory bandwidth, kernel efficiency, achieved utilisation and overlap. Use these figures for relative comparisons and budget estimates, not as a prediction of measured throughput.