Transformer Parameter Calculator
Compute the exact parameter count of a decoder-only transformer from its architecture hyperparameters. Enter the vocabulary size, hidden size, layer count, attention and key/value heads, feed-forward width, and whether the embeddings are tied — and the calculator returns the total and a full breakdown across embeddings, attention, and MLP. Presets reproduce the published counts for Llama 3, Mistral, Qwen, and GPT-3. Everything runs in your browser.
Assumes a standard decoder-only transformer with RMSNorm (two per layer, no bias) and no attention output bias. Real models differ in small details (extra norms, MoE experts, learned position embeddings), so the count is exact for this architecture and very close for most modern dense LLMs.
How to use the Transformer Parameter Calculator
Pick a preset to load a known architecture, or choose "Custom" and enter each hyperparameter yourself. The result shows the total parameter count and a breakdown into token embeddings, the output (un-embedding) matrix, attention projections, MLP weights, and normalization parameters, plus the non-embedding count that papers often quote. Toggle GQA by setting fewer key/value heads than attention heads, switch the MLP between gated (SwiGLU, three matrices) and standard (two matrices), and tie the embeddings to share one matrix for input and output.
The presets reproduce published figures — Llama 3 8B returns about 8.03 billion, Mistral 7B about 7.24 billion — so you can trust the formula and then vary one field at a time to see how each choice moves the total.
Where a transformer's parameters live
A decoder-only transformer's parameters fall into three main buckets: embeddings, attention, and the MLP. The embedding matrices map tokens to vectors and back: the input embedding has vocabulary size times hidden size parameters, and unless the model ties weights, the output projection is the same size again. For models with large vocabularies and modest hidden sizes (small models especially), these matrices can dominate the count.
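As a minimal sketch of the embedding arithmetic above, the Python snippet below counts embedding parameters with and without weight tying. The function name `embedding_params` and the example shape (a 128,256-token vocabulary with hidden size 4096) are illustrative assumptions, not values taken from the calculator's own code.

```python
def embedding_params(vocab_size: int, hidden_size: int, tied: bool) -> int:
    """Token embedding plus, when untied, a separate output projection.

    Hypothetical helper for illustration only.
    """
    matrix = vocab_size * hidden_size  # one vocab x hidden matrix
    return matrix if tied else 2 * matrix


# Assumed example shape: 128,256 tokens, hidden size 4096.
print(f"{embedding_params(128_256, 4096, tied=False):,}")  # 1,050,673,152
print(f"{embedding_params(128_256, 4096, tied=True):,}")   # 525,336,576
```

Tying the weights halves the embedding cost, which matters most when the vocabulary is large relative to the rest of the model.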
Inside each layer, attention contributes the query, key, value, and output projections. With standard multi-head attention, where the head dimension is the hidden size divided by the number of heads, that is four hidden-size-squared matrices; with grouped-query attention (GQA) the key and value projections shrink in proportion to the number of KV heads, saving parameters and, more importantly, key-value cache memory at inference. The MLP block is usually the largest per-layer cost: a gated SwiGLU MLP uses three matrices of hidden size times feed-forward size, while an older GELU MLP uses two. Normalization adds only two hidden-size vectors per layer, which is negligible.
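The per-layer arithmetic described above can be sketched in a few lines of Python. This assumes bias-free projections and a head dimension equal to hidden size divided by the number of attention heads; the helper names `attention_params` and `mlp_params` and the example shape (hidden size 4096, 32 heads, feed-forward size 14336) are chosen for illustration.

```python
def attention_params(hidden_size: int, n_heads: int, n_kv_heads: int) -> int:
    """Q, K, V, and output projections, no biases (assumption)."""
    head_dim = hidden_size // n_heads  # assumed: hidden = heads * head_dim
    q = hidden_size * n_heads * head_dim
    k = hidden_size * n_kv_heads * head_dim  # shrinks with fewer KV heads
    v = hidden_size * n_kv_heads * head_dim
    o = n_heads * head_dim * hidden_size
    return q + k + v + o


def mlp_params(hidden_size: int, ffn_size: int, gated: bool = True) -> int:
    """Three matrices for a gated (SwiGLU) MLP, two for a standard MLP."""
    return (3 if gated else 2) * hidden_size * ffn_size


# Assumed example shape: hidden 4096, 32 heads, FFN width 14336.
print(f"{attention_params(4096, 32, 32):,}")        # MHA: 67,108,864
print(f"{attention_params(4096, 32, 8):,}")         # GQA, 8 KV heads: 41,943,040
print(f"{mlp_params(4096, 14336, gated=True):,}")   # SwiGLU: 176,160,768
print(f"{mlp_params(4096, 14336, gated=False):,}")  # standard: 117,440,512
```

In this example the gated MLP is roughly four times the size of the GQA attention block, which is why the feed-forward width moves the total so strongly.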
The total is the embedding parameters plus the number of layers times the per-layer cost. The "non-embedding" parameter count, meaning everything except the token embedding and un-embedding, is what scaling-law papers typically report, because it reflects depth and width and is independent of vocabulary size. This calculator shows both so you can match whichever figure a paper or model card quotes.
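Putting the pieces together, here is a hedged Python sketch of the whole count. It is not the calculator's source: the `Config` class and `count_params` function are hypothetical, the sketch assumes bias-free projections, two RMSNorm weight vectors per layer, and one final norm before the output projection, and the hyperparameters shown are assumed from the publicly released Llama 3 8B and Mistral 7B configurations.

```python
from dataclasses import dataclass


@dataclass
class Config:
    vocab_size: int
    hidden_size: int
    n_layers: int
    n_heads: int
    n_kv_heads: int
    ffn_size: int
    tied: bool = False
    gated_mlp: bool = True


def count_params(c: Config) -> dict:
    head_dim = c.hidden_size // c.n_heads
    embed = c.vocab_size * c.hidden_size
    unembed = 0 if c.tied else embed
    # Q and O use all heads; K and V use only the KV heads.
    attn = 2 * c.hidden_size * c.n_heads * head_dim
    attn += 2 * c.hidden_size * c.n_kv_heads * head_dim
    mlp = (3 if c.gated_mlp else 2) * c.hidden_size * c.ffn_size
    norms = 2 * c.hidden_size  # two RMSNorm weight vectors per layer
    final_norm = c.hidden_size  # assumption: one norm after the last layer
    non_embedding = c.n_layers * (attn + mlp + norms) + final_norm
    return {
        "embedding": embed + unembed,
        "non_embedding": non_embedding,
        "total": embed + unembed + non_embedding,
    }


# Hyperparameters assumed from the public model configurations.
llama3_8b = Config(128_256, 4096, 32, 32, 8, 14336)
mistral_7b = Config(32_000, 4096, 32, 32, 8, 14336)

print(f"{count_params(llama3_8b)['total'] / 1e9:.2f}B")   # 8.03B
print(f"{count_params(mistral_7b)['total'] / 1e9:.2f}B")  # 7.24B
```

The two models share every per-layer hyperparameter in this sketch, so their non-embedding counts are identical and the whole difference in totals comes from vocabulary size.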
Common use cases
- Verifying a model card. Check whether a claimed parameter count matches the stated architecture.
- Designing a model. See how depth, width, and FFN ratio trade off against the parameter budget.
- Understanding GQA savings. Watch the attention parameter count drop as you reduce KV heads.
- Teaching transformers. Make the embedding-vs-layer parameter split concrete with real numbers.