MoE Active Parameters Calculator

A Mixture-of-Experts model stores many experts but routes each token to only a few, so its total parameter count and its active (per-token) count are very different numbers — and they tell you different things. Enter the number of experts, how many are used per token (top-k), the parameters per expert and the shared/dense parameters, and this calculator shows the total weights you must hold in memory, the active parameters that drive inference cost, the sparsity, and the compute saving versus an equally large dense model. Pick a preset for Mixtral, DeepSeek-V3, Qwen3 or Grok, or enter your own — all in your browser.

How to use the MoE Active Parameters Calculator

Pick a preset to fill in a published model, or enter the four numbers yourself: the total number of experts, how many are activated per token (the router's top-k), the parameters in a single expert, and the shared parameters that every token passes through (attention, embeddings and any always-on dense layers). The calculator computes total = shared + (experts × params-per-expert) and active = shared + (top-k × params-per-expert), then shows both figures along with the active fraction, the sparsity, and the compute saving relative to a dense model of the same total size.
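The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual source: the function and field names are hypothetical, and it assumes sparsity means 1 minus the active fraction and that the compute saving is the ratio total / active.

```python
from dataclasses import dataclass


@dataclass
class MoECounts:
    total: float            # parameters that must be held in memory
    active: float           # parameters used for a single token
    active_fraction: float  # active / total
    sparsity: float         # assumed definition: 1 - active_fraction
    compute_saving: float   # assumed definition: total / active


def moe_counts(num_experts, top_k, params_per_expert, shared_params):
    """Apply total = shared + E*P and active = shared + k*P."""
    if not 0 < top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    fraction = active / total
    return MoECounts(total, active, fraction, 1.0 - fraction, total / active)


# Illustrative inputs in billions of parameters (not a published model).
counts = moe_counts(num_experts=8, top_k=2, params_per_expert=5.0, shared_params=2.0)
print(counts)
```

With these made-up inputs the model holds 42B parameters but touches only 12B per token, so a dense model of the same total size would do 3.5 times the per-token work under this simple accounting.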

Total parameters tell you how much memory (VRAM or system RAM) you need to hold the weights, since every expert must be loaded even though only a few fire per token. Active parameters tell you the inference cost: the FLOPs and memory bandwidth per token scale with the active count, not the total. That gap is the whole point of MoE — you get the knowledge capacity of a huge model at the per-token running cost of a much smaller one. The params-per-expert figure is an approximation that folds the expert's feed-forward matrices into one number; real architectures split it across gate, up and down projections, but the active/total ratio is what matters here.
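To make the memory-versus-compute split concrete, here is a rough sketch. It rests on two assumptions that are not part of the calculator: weight memory is total parameters times bytes per parameter (ignoring KV cache, activations and runtime overhead), and a forward pass costs about two floating-point operations per active weight per token (ignoring attention over the context). The function names are hypothetical.

```python
def weight_memory_gb(total_params, bytes_per_param=2.0):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


def forward_flops_per_token(active_params):
    """Rule of thumb: one multiply and one add per active weight."""
    return 2.0 * active_params


# Round figures in the neighbourhood of Mixtral 8x7B.
total_params = 47e9
active_params = 13e9

print(weight_memory_gb(total_params))            # 16-bit weights
print(weight_memory_gb(total_params, 0.5))       # 4-bit weights
print(forward_flops_per_token(active_params))    # per generated token
```

Memory follows the 47B figure and per-token arithmetic follows the 13B figure, which is the split the paragraph above describes.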

Total vs. active parameters in a sparse model

A dense transformer runs every parameter for every token. A Mixture-of-Experts model replaces the feed-forward block in some or all layers with a set of parallel "experts" and a small router network that, for each token, picks the top-k experts to run and ignores the rest. Because only a fraction of the experts fire at any position, the model can hold far more parameters, and therefore far more learned knowledge, than it actually computes on for a given token. That decoupling of capacity from compute is why so many of the largest recent open-weight models are sparse.

The consequence is two parameter counts that must not be confused. The total count includes every expert and determines the memory footprint: all the weights have to be resident, because the router might select any expert for the next token. The active count is shared + top-k experts, and it governs the arithmetic and the bandwidth per token, so it sets the latency and the cost of generation. Mixtral 8x7B, for instance, holds roughly 47B parameters in total but activates only about 13B per token; DeepSeek-V3 stores 671B parameters while activating about 37B per token. The ratio between the two is the model's sparsity, and it is the lever designers pull to trade memory for speed.
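Published model cards usually quote only the total and active counts, so it can help to work backwards to the calculator's inputs. The sketch below inverts the two formulas, assuming the simple two-term model (shared + experts × params-per-expert) holds exactly; the function name is hypothetical. For architectures with always-on shared experts or dense layers, the result is only a rough fit.

```python
def infer_inputs(total, active, num_experts, top_k):
    """Solve total = S + E*P and active = S + k*P for S and P."""
    if not 0 < top_k < num_experts:
        raise ValueError("need 0 < top_k < num_experts to separate the two terms")
    per_expert = (total - active) / (num_experts - top_k)
    shared = active - top_k * per_expert
    return shared, per_expert


# Mixtral 8x7B: 8 experts, top-2 routing, rounded counts in billions.
shared, per_expert = infer_inputs(total=47.0, active=13.0, num_experts=8, top_k=2)
print(f"shared ~ {shared:.2f}B, per expert ~ {per_expert:.2f}B")
```

With the rounded figures this gives roughly 1.7B shared and 5.7B per expert, which suggests the "7B" in the model's name describes neither quantity on its own.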

What this means in practice: you provision hardware for the total count but you predict throughput from the active count. A sparse model that needs eight GPUs to hold can still generate at the speed of a model a quarter its size, because most of the weights sit idle on any single step. The trade-offs are real — routing adds complexity, experts can be load-imbalanced, and the full weight set still has to be stored and moved — but for serving large models efficiently the sparse design has become the default. Use this tool to see, for any configuration, exactly how much of the model each token actually touches.

Common use cases

  • Memory planning. Use the total parameter count to size the VRAM or RAM needed to hold a sparse model.
  • Throughput estimation. Use the active count to predict per-token compute and latency, not the headline size.
  • Comparing architectures. See how Mixtral, DeepSeek-V3, Qwen3 and Grok trade total capacity against active cost.
  • Designing your own MoE. Sweep expert count and top-k to hit a target active/total ratio.
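The design sweep in the last item can be sketched as a small grid search. The candidate lists, the target and the tolerance below are all illustrative assumptions, and the helper names are hypothetical. Note that with a fixed expert size, adding experts also grows the total.

```python
def active_fraction(num_experts, top_k, params_per_expert, shared_params):
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return active / total


def sweep(expert_counts, top_ks, params_per_expert, shared_params, target, tolerance):
    """Return (experts, top_k, fraction) for configs near the target active fraction."""
    matches = []
    for n in expert_counts:
        for k in top_ks:
            if k > n:
                continue  # cannot route to more experts than exist
            frac = active_fraction(n, k, params_per_expert, shared_params)
            if abs(frac - target) <= tolerance:
                matches.append((n, k, round(frac, 3)))
    return matches


# Illustrative sizes in billions: 1B per expert, 2B shared, aiming for ~10% active.
hits = sweep(
    expert_counts=[8, 16, 32, 64],
    top_ks=[1, 2, 4, 8],
    params_per_expert=1.0,
    shared_params=2.0,
    target=0.10,
    tolerance=0.02,
)
for n, k, frac in hits:
    print(f"{n} experts, top-{k}: active fraction {frac}")
```

For these inputs the matches are 32 experts with top-1 or top-2 routing and 64 experts with top-4 routing.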

Frequently asked questions

What is the difference between total and active parameters?

Total is every parameter in the model, including all experts — it sets the memory needed to hold the weights. Active is the parameters used for a single token: the shared layers plus the top-k experts the router selects. Active drives the per-token compute and latency, while total drives the storage footprint.

How is the active count calculated?

Active = shared parameters + (experts-per-token × parameters-per-expert). The shared part (attention, embeddings, any always-on dense layers) runs for every token; on top of it only the top-k routed experts fire, so the active count is far below the total when there are many experts.

Why does an MoE model still need so much memory if it only uses a few experts per token?

Because the router can pick any expert for the next token, all experts must be resident in memory. You save compute and bandwidth per token, but not storage — the full weight set is always loaded, which is why total parameters determine your hardware footprint.

What does "params per expert" mean here?

It is an approximation of the parameters in one expert's feed-forward block. Real models split this across gate, up and down projection matrices, and only the MLP is expert-specific while attention is shared. The tool folds the expert MLP into a single figure because the total-vs-active ratio is what the calculation depends on.
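If you know a model's layer dimensions, the single params-per-expert figure can be estimated from the three projection matrices. The sketch below assumes a gated feed-forward block without biases, counts one expert's MLP across all MoE layers, and ignores router and normalization weights. The function name is hypothetical, and the dimensions are those of Mixtral 8x7B's published configuration (hidden size 4096, intermediate size 14336, 32 layers); treat them as an assumption if your checkpoint differs.

```python
def expert_params(d_model, d_ff, num_moe_layers):
    """Parameters in one expert, summed over every MoE layer."""
    # gate and up are d_model x d_ff, down is d_ff x d_model: three equal-sized matrices
    per_layer = 3 * d_model * d_ff
    return per_layer * num_moe_layers


p = expert_params(d_model=4096, d_ff=14336, num_moe_layers=32)
print(f"{p:,} parameters per expert (~{p / 1e9:.2f}B)")
```

That comes to about 5.64B per expert, so 8 experts account for about 45B of the roughly 47B total and the remainder is the shared part.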

Are the preset numbers exact?

They use published or widely reported figures for each model and are accurate enough for capacity and cost planning. Exact counts vary slightly with how shared layers, router parameters and embeddings are tallied, so treat the totals as close estimates rather than to-the-parameter values.