MoE Active Parameters Calculator
A Mixture-of-Experts model stores many experts but routes each token to only a few, so its total parameter count and its active (per-token) count are very different numbers — and they tell you different things. Enter the number of experts, how many are used per token (top-k), the parameters per expert and the shared/dense parameters, and this calculator shows the total weights you must hold in memory, the active parameters that drive inference cost, the sparsity, and the compute saving versus an equally large dense model. Pick a preset for Mixtral, DeepSeek-V3, Qwen3 or Grok, or enter your own — all in your browser.
How to use the MoE Active Parameters Calculator
Pick a preset to fill in a published model, or enter the four numbers yourself: the total number of experts, how many are activated per token (the router's top-k), the parameters in a single expert, and the shared parameters that every token passes through — attention, embeddings and any always-on dense layers. The calculator computes the total = shared + (experts × params-per-expert) and the active = shared + (top-k × params-per-expert), then shows both figures along with the active fraction, the sparsity, and the compute saving relative to a dense model of the same total size.
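The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual source: the names `MoECounts` and `moe_counts` are hypothetical, and the definitions of sparsity (1 minus the active fraction) and compute saving (total divided by active) are assumptions about how those figures are derived.

```python
from dataclasses import dataclass


@dataclass
class MoECounts:
    total: float            # shared + experts * params_per_expert
    active: float           # shared + top_k * params_per_expert
    active_fraction: float  # active / total
    sparsity: float         # assumed definition: 1 - active_fraction
    compute_saving: float   # assumed definition: total / active


def moe_counts(experts: int, top_k: int,
               params_per_expert: float, shared: float) -> MoECounts:
    """Apply the total/active formulas to one MoE configuration."""
    if not 1 <= top_k <= experts:
        raise ValueError("top_k must be between 1 and experts")
    total = shared + experts * params_per_expert
    active = shared + top_k * params_per_expert
    fraction = active / total
    return MoECounts(total, active, fraction, 1.0 - fraction, total / active)


# Illustrative inputs only (not a published model):
# 8 experts, top-2 routing, 5B parameters per expert, 2B shared.
counts = moe_counts(experts=8, top_k=2, params_per_expert=5e9, shared=2e9)
print(f"total={counts.total / 1e9:.1f}B  active={counts.active / 1e9:.1f}B  "
      f"active fraction={counts.active_fraction:.1%}")
```

Keeping the shared parameters as a separate input matters: they are counted once in both totals, so the active fraction is always higher than top-k divided by the number of experts.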
Total parameters tell you how much memory (VRAM or system RAM) you need to hold the weights, since every expert must be loaded even though only a few fire per token. Active parameters tell you the inference cost: the FLOPs and memory bandwidth per token scale with the active count, not the total. That gap is the whole point of MoE — you get the knowledge capacity of a huge model at the per-token running cost of a much smaller one. The params-per-expert figure is an approximation that folds the expert's feed-forward matrices into one number; real architectures split it across gate, up and down projections, but the active/total ratio is what matters here.
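To turn a total parameter count into a rough memory figure, multiply by the bytes each weight occupies. The sketch below assumes common storage widths (4 bytes for fp32, 2 for fp16, 1 for int8, 0.5 for 4-bit quantization) and counts weights only; the KV cache, activations and runtime overhead are deliberately left out. The function name `weight_memory_gib` is hypothetical.

```python
# Assumed bytes per stored weight for a few common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(total_params: float, precision: str = "fp16") -> float:
    """Approximate memory for the weights alone, in GiB."""
    return total_params * BYTES_PER_PARAM[precision] / 2**30


# Example: a model with roughly 47B total parameters.
for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gib(47e9, precision):.1f} GiB")
```

Note that the input is the total count, not the active count: every expert has to be resident regardless of how few are used per token.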
Total vs. active parameters in a sparse model
A dense transformer runs every parameter for every token. A Mixture-of-Experts model replaces the feed-forward block in some or all layers with a set of parallel "experts" and a small router network that, for each token, picks the top-k experts to run and ignores the rest. Because only a fraction of the experts fire at any position, the model can hold far more parameters — and therefore far more learned knowledge — than it actually computes on for a given token. That decoupling of capacity from compute is why nearly every frontier-scale open model released recently is sparse.
The consequence is two parameter counts that must not be confused. The total count includes every expert and determines the memory footprint: all the weights have to be resident, because the router might select any expert for the next token. The active count is shared + top-k experts, and it governs the arithmetic and the bandwidth per token, so it sets the latency and the cost of generation. Mixtral 8x7B, for instance, holds roughly 47B parameters in total but activates only about 13B per token; DeepSeek-V3 stores about 671B parameters while activating about 37B per token. The ratio between the two is the model's sparsity, and it is the lever designers pull to trade memory for speed.
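The same formulas can be run backwards to recover calculator inputs from published headline numbers. The sketch below takes the rounded Mixtral 8x7B figures quoted above (about 47B total, about 13B active) and assumes 8 experts with top-2 routing; the helper `solve_expert_and_shared` is hypothetical, and because the inputs are rounded the outputs are only approximate.

```python
def solve_expert_and_shared(total: float, active: float,
                            experts: int, top_k: int) -> tuple[float, float]:
    """Invert total = shared + experts * e and active = shared + top_k * e."""
    if experts <= top_k:
        raise ValueError("experts must exceed top_k to separate the two counts")
    # Subtracting the two equations cancels the shared term.
    per_expert = (total - active) / (experts - top_k)
    shared = active - top_k * per_expert
    return per_expert, shared


per_expert, shared = solve_expert_and_shared(47e9, 13e9, experts=8, top_k=2)
print(f"params per expert ~ {per_expert / 1e9:.2f}B, shared ~ {shared / 1e9:.2f}B")
```

The per-expert figure obtained this way is the folded, model-wide number the calculator expects (all layers' feed-forward weights for one expert slot), not the size of a single layer's expert.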
What this means in practice: you provision hardware for the total count but you predict throughput from the active count. A sparse model that needs eight GPUs to hold its weights can still generate at roughly the speed of a dense model a quarter its size, because most of the weights sit idle on any single step. The trade-offs are real: routing adds complexity, experts can be load-imbalanced, and the full weight set still has to be stored and moved. For serving large models efficiently, though, the sparse design has become the default. Use this tool to see, for any configuration, exactly how much of the model each token actually touches.
Common use cases
- Memory planning. Use the total parameter count to size the VRAM or RAM needed to hold a sparse model.
- Throughput estimation. Use the active count to predict per-token compute and latency, not the headline size.
- Comparing architectures. See how Mixtral, DeepSeek-V3, Qwen3 and Grok trade total capacity against active cost.
- Designing your own MoE. Sweep expert count and top-k to hit a target active/total ratio.
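The last use case, sweeping expert count and top-k toward a target ratio, can be sketched as a small grid search. Everything here is illustrative: the function names, the candidate grids, the tolerance and the 1B-per-expert / 2B-shared sizes are assumptions chosen for the example, not values from any published model.

```python
def active_fraction(experts: int, top_k: int,
                    params_per_expert: float, shared: float) -> float:
    """active / total for one configuration."""
    total = shared + experts * params_per_expert
    active = shared + top_k * params_per_expert
    return active / total


def sweep(target: float, params_per_expert: float, shared: float,
          expert_options, top_k_options, tolerance: float = 0.03):
    """Return (experts, top_k, fraction) tuples within tolerance of target."""
    matches = []
    for experts in expert_options:
        for top_k in top_k_options:
            if top_k > experts:
                continue  # cannot route to more experts than exist
            frac = active_fraction(experts, top_k, params_per_expert, shared)
            if abs(frac - target) <= tolerance:
                matches.append((experts, top_k, frac))
    # Closest match first.
    return sorted(matches, key=lambda m: abs(m[2] - target))


matches = sweep(target=0.25, params_per_expert=1e9, shared=2e9,
                expert_options=(8, 16, 32, 64), top_k_options=(1, 2, 4, 8))
for experts, top_k, frac in matches:
    print(f"{experts} experts, top-{top_k}: active fraction {frac:.1%}")
```

Widening the tolerance or the grids returns more candidates; each one can then be checked against the memory budget implied by its total count.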