Transformer Parameter Calculator

Compute the exact parameter count of a decoder-only transformer from its architecture hyperparameters. Enter the vocabulary size, hidden size, layer count, attention and key/value head counts, feed-forward width, and whether the embeddings are tied, and the calculator returns the total plus a full breakdown across embeddings, attention, MLP, and normalization. Presets reproduce the published counts for Llama 3, Mistral, Qwen, and GPT-3, up to the small architectural differences noted below. Everything runs in your browser.

Assumes a standard decoder-only transformer with RMSNorm (two per layer, no bias) and no attention output bias. Real models differ in small details (extra norms, MoE experts, learned position embeddings), so the count is exact for this architecture and very close for most modern dense LLMs.

How to use the Transformer Parameter Calculator

Pick a preset to load a known architecture, or choose "Custom" and enter each hyperparameter yourself. The result shows the total parameter count and a breakdown into token embeddings, the output (un-embedding) matrix, attention projections, MLP weights, and normalization parameters, plus the non-embedding count that papers often quote. Toggle GQA by setting fewer key/value heads than attention heads, switch the MLP between gated (SwiGLU, three matrices) and standard (two matrices), and tie the embeddings to share one matrix for input and output.

The presets reproduce published figures — Llama 3 8B returns about 8.03 billion, Mistral 7B about 7.24 billion — so you can trust the formula and then vary one field at a time to see how each choice moves the total.

Where a transformer's parameters live

A decoder-only transformer's parameters fall into three buckets: embeddings, attention, and MLP, plus a negligible amount of normalization. The embedding matrices map tokens to vectors and back: the input embedding is vocabulary size times hidden size, and unless the model ties weights, the output projection is the same size again. For models with large vocabularies and modest hidden sizes (small models especially), these matrices can dominate the count.

Inside each layer, attention contributes the query, key, value, and output projections. With standard multi-head attention, where the heads together span the full hidden size, that is four hidden-size-squared matrices; with grouped-query attention (GQA) the key and value projections shrink in proportion to the number of KV heads, saving parameters and, more importantly, key-value cache memory at inference. The MLP block is usually the largest per-layer cost: a gated SwiGLU MLP uses three matrices of hidden size times feed-forward size, while an older GELU MLP uses two. Normalization adds only two hidden-size vectors per layer, a negligible amount.
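The per-layer formulas above can be sketched in a few lines of Python. This is a minimal sketch, not the calculator's actual source: the function names are made up for illustration, and it assumes the head dimension equals hidden size divided by attention heads and that no projection carries a bias.

```python
def attention_params(hidden: int, n_heads: int, n_kv_heads: int) -> int:
    """Q, K, V, and O projection weights for one layer (no biases)."""
    head_dim = hidden // n_heads               # assumes hidden is divisible by n_heads
    q = hidden * hidden                        # query projection
    kv = 2 * hidden * (n_kv_heads * head_dim)  # key and value projections
    o = hidden * hidden                        # output projection
    return q + kv + o


def mlp_params(hidden: int, ffn: int, gated: bool = True) -> int:
    """Three matrices for a gated (SwiGLU) MLP, two for a standard one."""
    return (3 if gated else 2) * hidden * ffn


def norm_params(hidden: int) -> int:
    """Two RMSNorm weight vectors per layer."""
    return 2 * hidden


print(attention_params(4096, 32, 32))  # multi-head: 67108864 (four 4096 x 4096 matrices)
print(attention_params(4096, 32, 8))   # GQA with 8 KV heads: 41943040
print(mlp_params(4096, 14336))         # gated MLP: 176160768
print(norm_params(4096))               # 8192
```

With these example sizes the gated MLP is roughly four times the size of the GQA attention block, while the norms are lost in the rounding.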

The total is the embedding parameters plus the layer count times the per-layer cost. The "non-embedding" parameter count (everything except the token embedding and un-embedding) is what scaling-law papers typically report, because it is the part that grows with depth and width rather than with vocabulary. This calculator shows both so you can match whichever figure a paper or model card quotes.
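As a worked example, the sum can be written out for a Llama 3 8B shaped model. This is a sketch under stated assumptions rather than the calculator's own code: the `Config` class and `count_params` function are hypothetical names, and the sketch counts one final RMSNorm after the last layer and includes norms in the non-embedding figure, neither of which this page specifies.

```python
from dataclasses import dataclass


@dataclass
class Config:
    vocab: int
    hidden: int
    layers: int
    n_heads: int
    n_kv_heads: int
    ffn: int
    gated: bool = True
    tied: bool = False


def count_params(c: Config) -> dict:
    head_dim = c.hidden // c.n_heads
    embed = c.vocab * c.hidden
    unembed = 0 if c.tied else c.vocab * c.hidden
    # Q and O are hidden x hidden; K and V are hidden x (n_kv_heads * head_dim).
    attn = c.layers * (2 * c.hidden * c.hidden + 2 * c.hidden * c.n_kv_heads * head_dim)
    mlp = c.layers * (3 if c.gated else 2) * c.hidden * c.ffn
    # Two norms per layer, plus one final norm (an assumption of this sketch).
    norms = c.layers * 2 * c.hidden + c.hidden
    non_embedding = attn + mlp + norms
    return {
        "embed": embed,
        "unembed": unembed,
        "attention": attn,
        "mlp": mlp,
        "norms": norms,
        "non_embedding": non_embedding,
        "total": embed + unembed + non_embedding,
    }


llama3_8b = Config(vocab=128256, hidden=4096, layers=32, n_heads=32, n_kv_heads=8, ffn=14336)
result = count_params(llama3_8b)
print(f"{result['total']:,}")          # 8,030,261,248
print(f"{result['non_embedding']:,}")  # 6,979,588,096
```

Swapping the vocabulary to 32000 with everything else unchanged gives 7,241,732,096, in line with the Mistral 7B figure of about 7.24 billion.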

Common use cases

  • Verifying a model card. Check whether a claimed parameter count matches the stated architecture.
  • Designing a model. See how depth, width, and FFN ratio trade off against the parameter budget.
  • Understanding GQA savings. Watch the attention parameter count drop as you reduce KV heads.
  • Teaching transformers. Make the embedding-vs-layer parameter split concrete with real numbers.
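For the model-design use case, one way to explore the trade-off is to fix a parameter budget and ask how many layers fit at a given width. The sketch below is illustrative only: `layers_for_budget` is a hypothetical helper, and it assumes a gated MLP, untied embeddings by default, two norms per layer, and one final norm.

```python
def layers_for_budget(budget: int, vocab: int, hidden: int,
                      ffn_ratio: float = 3.5, kv_fraction: float = 0.25,
                      tied: bool = False) -> int:
    """Largest layer count whose total stays within the budget."""
    ffn = int(ffn_ratio * hidden)
    # Q and O are full size; K and V are scaled by KV heads / attention heads.
    attn = int((2 + 2 * kv_fraction) * hidden * hidden)
    per_layer = attn + 3 * hidden * ffn + 2 * hidden
    # Embeddings plus one final norm do not depend on depth.
    fixed = vocab * hidden * (1 if tied else 2) + hidden
    return max(0, (budget - fixed) // per_layer)


budget = 8_030_261_248
print(layers_for_budget(budget, vocab=128256, hidden=4096))  # 32
print(layers_for_budget(budget, vocab=128256, hidden=5120))  # 19
```

Widening the model from 4096 to 5120 at the same budget costs 13 layers here, because both the per-layer cost and the embedding matrices grow with width.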

Frequently asked questions

Why does my count differ slightly from the official one?

Small architectural details vary: some models add extra normalization layers, use learned positional embeddings, include biases, or pad the vocabulary. This calculator models the common modern recipe (RMSNorm, no bias, rotary positions), which matches Llama, Mistral, and Qwen closely. Differences are usually well under one percent.

What does tying embeddings do?

Tied embeddings share a single matrix for the input token embedding and the output projection, removing one vocabulary-by-hidden-size matrix. Small models often tie to save parameters; many larger models keep them separate. Toggle the checkbox to see the difference.
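The effect is easy to quantify. In this small sketch the model sizes are made-up illustrative values, not a real model, and `embedding_share` is a hypothetical helper name.

```python
def embedding_share(vocab: int, hidden: int, non_embedding: int, tied: bool) -> float:
    """Fraction of all parameters that sit in the embedding matrices."""
    emb = vocab * hidden * (1 if tied else 2)
    return emb / (emb + non_embedding)


# Hypothetical small model: 32,000-token vocabulary, hidden size 512,
# and 25 million non-embedding parameters.
vocab, hidden, body = 32_000, 512, 25_000_000

print(vocab * hidden)                                         # saved by tying: 16384000
print(f"{embedding_share(vocab, hidden, body, False):.1%}")   # untied: 56.7%
print(f"{embedding_share(vocab, hidden, body, True):.1%}")    # tied: 39.6%
```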

How does GQA change the count?

Grouped-query attention reduces the number of key and value heads, so the K and V projection matrices shrink by the ratio of KV heads to attention heads. Set KV heads below attention heads to model it. Llama 3, for example, uses 8 KV heads against 32 attention heads (8B) or 64 (70B), which trims attention parameters and shrinks the KV cache at inference.
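Both savings can be estimated with a short sketch. The helper names are hypothetical, and the cache figure assumes 2 bytes per stored value (for example 16-bit floats) and counts only the keys and values for a single token.

```python
def kv_projection_params(hidden: int, n_heads: int, n_kv_heads: int) -> int:
    """K and V projection weights for one layer."""
    head_dim = hidden // n_heads
    return 2 * hidden * n_kv_heads * head_dim


def kv_cache_bytes_per_token(layers: int, hidden: int, n_heads: int,
                             n_kv_heads: int, bytes_per_value: int = 2) -> int:
    """Keys plus values stored across all layers for one token."""
    head_dim = hidden // n_heads
    return 2 * layers * n_kv_heads * head_dim * bytes_per_value


# Hidden size 4096, 32 attention heads, 32 layers.
print(kv_projection_params(4096, 32, 32))          # multi-head: 33554432
print(kv_projection_params(4096, 32, 8))           # 8 KV heads: 8388608
print(kv_cache_bytes_per_token(32, 4096, 32, 32))  # multi-head: 524288 bytes
print(kv_cache_bytes_per_token(32, 4096, 32, 8))   # 8 KV heads: 131072 bytes
```

Going from 32 to 8 KV heads cuts both the K/V weights and the per-token cache by a factor of four, while the Q and O projections are unchanged.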

Why is the MLP usually the biggest block?

The feed-forward width is typically 3.5 to 4 times the hidden size, and a gated MLP uses three matrices of hidden size by FFN size. At a ratio of 3.5 that is 10.5 hidden-size-squared parameters per layer, against at most 4 for attention. That makes the MLP the largest per-layer contributor in most modern LLMs, ahead of attention.
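Measured in units of hidden size squared, the comparison reduces to a one-line ratio. This is an illustrative sketch with a hypothetical function name; `kv_fraction` stands for KV heads divided by attention heads.

```python
def mlp_to_attention_ratio(ffn_ratio: float, gated: bool = True,
                           kv_fraction: float = 1.0) -> float:
    """Per-layer MLP parameters divided by per-layer attention parameters."""
    mlp = (3 if gated else 2) * ffn_ratio  # in units of hidden^2
    attn = 2 + 2 * kv_fraction             # Q and O, plus scaled K and V
    return mlp / attn


print(mlp_to_attention_ratio(4.0, gated=False))                   # standard MLP, multi-head: 2.0
print(mlp_to_attention_ratio(3.5, gated=True))                    # gated MLP, multi-head: 2.625
print(mlp_to_attention_ratio(3.5, gated=True, kv_fraction=0.25))  # gated MLP, GQA: 4.2
```

GQA widens the gap further, since it shrinks attention without touching the MLP.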

Does this work for mixture-of-experts models?

Not directly. MoE models replicate the MLP across many experts and route each token to only a few, so total parameters are far higher than active parameters. You can estimate the dense (non-expert) part here; to approximate total MoE parameters, multiply the MLP block by the number of experts. The small per-layer router matrices are not included in that estimate.
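A rough extension for MoE can be sketched as follows. Everything here is an assumption for illustration: `moe_params` is a hypothetical helper, every layer is assumed to use experts, each expert is a gated MLP, the router is a single bias-free hidden-by-experts matrix per layer, and the example configuration is not tied to any particular published model.

```python
def moe_params(vocab: int, hidden: int, layers: int, n_heads: int, n_kv_heads: int,
               ffn: int, n_experts: int, top_k: int, tied: bool = False) -> tuple:
    """Return (total, active) parameter counts for a simple MoE transformer."""
    head_dim = hidden // n_heads
    attn = 2 * hidden * hidden + 2 * hidden * n_kv_heads * head_dim
    expert = 3 * hidden * ffn    # one gated MLP expert
    router = hidden * n_experts  # assumption: one linear router per layer
    norms = 2 * hidden
    # Embeddings plus one final norm (an assumption of this sketch).
    fixed = vocab * hidden * (1 if tied else 2) + hidden
    total = fixed + layers * (attn + n_experts * expert + router + norms)
    active = fixed + layers * (attn + top_k * expert + router + norms)
    return total, active


total, active = moe_params(vocab=32000, hidden=4096, layers=32, n_heads=32,
                           n_kv_heads=8, ffn=14336, n_experts=8, top_k=2)
print(f"{total:,}")   # 46,702,792,704
print(f"{active:,}")  # 12,879,925,248
```

Only the expert MLPs are multiplied; attention, embeddings, and norms are shared, which is why eight experts give far less than eight times the dense total.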