Transformer FLOPs Calculator
Estimate how much arithmetic a transformer does. Enter the sequence length, model dimension, layer count and feed-forward size and get the total FLOPs for a forward pass — or a training step using the ≈3× forward-plus-backward rule — with a per-component breakdown that separates the attention projections, the quadratic s² score-and-value terms, the MLP and the output logits. It shows how much of the cost is the quadratic attention term that grows with context length, supports SwiGLU/GLU MLPs, and runs entirely in your browser.
How to use the Transformer FLOPs Calculator
Enter the architecture: the sequence length you're processing, the model (hidden) dimension d, the number of transformer layers, the feed-forward width d_ff (leave it blank to default to 4·d), the vocabulary size for the final projection, and a batch size. Choose whether the MLP is a SwiGLU/GLU block — three matmuls, as in Llama and Mistral — or a standard two-matmul MLP, and whether you want the cost of a forward pass alone or a full training step. The breakdown updates immediately: total FLOPs, cost per forward pass, attention and MLP per layer, cost per token, and the logits projection.
The estimate counts each multiply-add as two FLOPs and sums the standard matrix-multiply costs: the QKV and output projections (6·s·d² and 2·s·d² per layer), the two quadratic attention terms (2·s²·d each, for the scores and the value aggregation), the MLP (2·s·d·d_ff per matmul), and the final logits projection. Training multiplies the forward cost by three, which accounts for the forward pass plus a backward pass of roughly twice its cost. The figures cover the dense matmul arithmetic, which accounts for nearly all of the FLOPs; they exclude softmax, normalisation, activation functions and elementwise operations, which are comparatively cheap. Treat the result as a close lower bound on arithmetic for planning, not as an exact hardware counter or a runtime prediction.
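The per-component sums described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the calculator's actual source: the `TransformerShape` record, the `flops_breakdown` function and all field names are hypothetical, and the logits term assumes logits are computed at every position of the sequence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TransformerShape:
    """Hypothetical input record; field names are illustrative only."""
    seq_len: int
    d_model: int
    n_layers: int
    vocab_size: int
    d_ff: Optional[int] = None  # None means "default to 4 * d_model"
    batch_size: int = 1
    gated_mlp: bool = False     # True for SwiGLU/GLU (three matmuls)


def flops_breakdown(shape: TransformerShape, training: bool = False) -> dict:
    """Matmul-only FLOPs, counting each multiply-add as two FLOPs."""
    s, d = shape.seq_len, shape.d_model
    d_ff = 4 * d if shape.d_ff is None else shape.d_ff
    mlp_matmuls = 3 if shape.gated_mlp else 2

    # Costs for one layer and one sequence.
    per_layer = {
        "qkv_proj": 6 * s * d * d,      # three d x d projections
        "out_proj": 2 * s * d * d,      # attention output projection
        "attn_scores": 2 * s * s * d,   # queries against keys
        "attn_values": 2 * s * s * d,   # weighted sum of values
        "mlp": mlp_matmuls * 2 * s * d * d_ff,
    }
    # Assumption: logits are computed at every position (2 * s * d * V).
    logits_per_sequence = 2 * s * d * shape.vocab_size

    per_sequence = shape.n_layers * sum(per_layer.values()) + logits_per_sequence
    forward = shape.batch_size * per_sequence
    quadratic = shape.batch_size * shape.n_layers * (
        per_layer["attn_scores"] + per_layer["attn_values"]
    )
    # Training step: forward plus a backward pass of about twice the cost.
    total = 3 * forward if training else forward
    return {
        "per_layer": per_layer,
        "logits_per_sequence": logits_per_sequence,
        "forward": forward,
        "total": total,
        "quadratic_share": quadratic / forward,
        "per_token": total / (shape.batch_size * s),
    }
```

Calling `flops_breakdown(TransformerShape(seq_len=2048, d_model=4096, n_layers=32, vocab_size=32000), training=True)` returns the same kind of breakdown the page describes: totals, per-layer components, cost per token and the share taken by the quadratic terms.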
Where a transformer spends its FLOPs
Almost all of a transformer's arithmetic is matrix multiplication, and counting it tells you most of what you need to know about cost. Each layer has two halves. The attention block projects the input into queries, keys and values (three d×d matmuls), computes a score matrix by multiplying queries against keys, uses it to take a weighted sum of values, and projects the result back out. The MLP block expands the dimension to a wider feed-forward size and contracts it again — two matmuls for a classic MLP, three for the gated SwiGLU variant used in most modern models. Stack that over many layers, add a final projection from the hidden state to vocabulary-sized logits, and you have the whole forward pass.
The crucial structural fact is that two of those terms scale with the square of the sequence length. The score matrix is sequence-by-sequence, and so is the value aggregation, so their cost grows as s²·d while everything else grows linearly in s. At short context the linear projection and MLP terms dominate and attention looks cheap. Within a layer, the quadratic terms (4·s²·d) overtake the linear ones once s exceeds a few multiples of d: for a standard MLP with d_ff = 4·d the linear terms total 24·s·d², so the two are equal at s = 6·d, which is about 24,576 tokens for d = 4096. Beyond that point the quadratic terms become the bottleneck. This quadratic growth is why long-context inference is expensive, why the KV cache matters so much, and why a whole research literature exists on linear and sparse attention approximations. The breakdown here makes the crossover visible: raise the sequence length and watch the s² share of each layer climb.
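To make the crossover concrete, the sketch below solves for the sequence length at which the quadratic terms of a layer equal its linear terms, using the per-layer costs quoted earlier (8·s·d² for the projections, 2·s·d·d_ff per MLP matmul, 4·s²·d for scores plus values). The function names are hypothetical, and the comparison is per layer, so it leaves out the logits projection.

```python
from typing import Optional


def crossover_seq_len(d_model: int, d_ff: Optional[int] = None,
                      gated_mlp: bool = False) -> float:
    """Sequence length where quadratic and linear per-layer FLOPs are equal.

    Setting 4*s^2*d = s * (8*d^2 + 2*k*d*d_ff) and solving for s gives
    s = 2*d + k*d_ff/2, where k is the number of MLP matmuls.
    """
    d_ff = 4 * d_model if d_ff is None else d_ff
    k = 3 if gated_mlp else 2
    return 2 * d_model + k * d_ff / 2


def quadratic_share(seq_len: int, d_model: int, d_ff: Optional[int] = None,
                    gated_mlp: bool = False) -> float:
    """Fraction of one layer's matmul FLOPs spent in the two s^2 terms."""
    d_ff = 4 * d_model if d_ff is None else d_ff
    k = 3 if gated_mlp else 2
    linear = 8 * seq_len * d_model**2 + 2 * k * seq_len * d_model * d_ff
    quadratic = 4 * seq_len**2 * d_model
    return quadratic / (linear + quadratic)


if __name__ == "__main__":
    d = 4096
    print("crossover:", crossover_seq_len(d))
    for s in (2048, 8192, 32768, 131072):
        print(s, round(quadratic_share(s, d), 3))
```

Algebraically the share reduces to s / (s + s*), where s* is the crossover length, so it is exactly one half at the crossover and approaches one as the context grows.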
For training, the rule of thumb is that a backward pass costs roughly twice a forward pass, because it computes gradients with respect to both the inputs and the weights, so a full step is about three times the forward FLOPs. Ignoring the quadratic attention terms, the forward pass costs about 2 FLOPs per weight per token (one multiply-add per weight), so a training step costs about 6·N FLOPs per token; multiply by the number of tokens and you get the familiar ≈6·N·D heuristic for training a model of N parameters on D tokens. These counts are not a substitute for profiling real hardware, where memory bandwidth, kernel efficiency and overlap matter, but they are the right first-order model for comparing architectures, estimating training budgets, and understanding why context length becomes such a large cost lever at long sequences.
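The link between the per-component count and the 6·N·D heuristic can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and N is taken to be the weights that take part in matmuls (attention projections, MLP and the logits projection), ignoring biases, normalisation parameters and the input embedding lookup.

```python
from typing import Optional, Tuple


def matmul_params(d_model: int, n_layers: int, vocab_size: int,
                  d_ff: Optional[int] = None, gated_mlp: bool = False) -> int:
    """Weights used in matmuls: 4*d^2 + k*d*d_ff per layer, plus d*V for logits."""
    d_ff = 4 * d_model if d_ff is None else d_ff
    k = 3 if gated_mlp else 2
    per_layer = 4 * d_model**2 + k * d_model * d_ff
    return n_layers * per_layer + d_model * vocab_size


def training_flops_per_token(seq_len: int, d_model: int, n_layers: int,
                             vocab_size: int, d_ff: Optional[int] = None,
                             gated_mlp: bool = False) -> Tuple[int, int]:
    """Return (6*N term, quadratic attention term) for one training token."""
    n = matmul_params(d_model, n_layers, vocab_size, d_ff, gated_mlp)
    # 2 FLOPs per weight per token forward, times 3 for forward plus backward.
    linear = 6 * n
    # 4*s^2*d per layer spread over s tokens is 4*s*d per token, times 3.
    quadratic = 3 * n_layers * 4 * seq_len * d_model
    return linear, quadratic


if __name__ == "__main__":
    linear, quadratic = training_flops_per_token(
        seq_len=2048, d_model=4096, n_layers=32, vocab_size=32000)
    print("6N per token:", linear)
    print("attention s^2 per token:", quadratic)
    print("heuristic covers:", round(linear / (linear + quadratic), 3))
```

The gap between the two returned terms is what the 6·N·D heuristic leaves out, and it widens linearly with sequence length.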
Common use cases
- Training budget estimates. Multiply per-step FLOPs by steps to size a training run before committing GPUs.
- Context-length cost. See exactly how the quadratic s² term takes over as sequences grow long.
- Architecture comparison. Weigh attention vs. MLP cost, or SwiGLU vs. standard MLP, across model shapes.
- Inference sizing. Estimate the arithmetic per forward pass and per token for a deployment.
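For the training-budget case, the arithmetic from per-step FLOPs to a rough hardware estimate looks like the sketch below. Everything here is an assumption for illustration: the function names are hypothetical, and the peak throughput and utilisation figures in the example are placeholders you would replace with measured values for your own hardware.

```python
def training_run_flops(flops_per_step: float, n_steps: int) -> float:
    """Total arithmetic for a run: per-step training FLOPs times step count."""
    return flops_per_step * n_steps


def gpu_hours(total_flops: float, peak_flops_per_gpu: float,
              utilization: float) -> float:
    """Rough GPU-hours, given peak FLOP/s per GPU and an achieved fraction of it."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    seconds = total_flops / (peak_flops_per_gpu * utilization)
    return seconds / 3600


if __name__ == "__main__":
    # Placeholder numbers, not measurements or vendor specifications.
    total = training_run_flops(flops_per_step=1e15, n_steps=100_000)
    print(gpu_hours(total, peak_flops_per_gpu=3e14, utilization=0.4))
```

Because the FLOP count is a lower bound on arithmetic and the achieved utilisation varies with kernels and memory bandwidth, the result is best read as an order-of-magnitude planning figure.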