Tokens per Second Calculator
Estimate how fast a model will generate text on your hardware. For single-stream decoding, an LLM is memory-bandwidth bound: every token requires reading the entire set of active weights from memory once, so tokens per second is roughly bandwidth divided by model size. Pick a GPU, set the model size and quantization, and get a realistic tokens-per-second estimate. All calculations run locally.
| Weights read per token | |
| Theoretical max (100%) | |
| Realistic tokens/sec | |
| ~Words per second | |
How to use the Tokens per Second Calculator
Choose your GPU to fill in its memory bandwidth, or pick Custom and type the figure from the spec sheet. Set the model's parameter count and the quantization you will run it at — together these give the number of bytes that must be read from memory for each generated token. The realistic tokens/sec figure applies an efficiency factor (default 80%) because no kernel achieves 100% of peak bandwidth.
If you are running a mixture-of-experts model, enter its active parameter count too — MoE models only read the active experts per token, so a 47B model with 13B active runs at roughly the speed of a 13B dense model. Leave it at 0 for a normal dense model. The result is the decode (generation) speed for a single request; prompt processing and batched serving behave differently, as explained below.
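The inputs described above can be sketched as a small function. This is a minimal illustration of the bandwidth-divided-by-bytes estimate, not the calculator's actual source; the function name, parameter names, and the 1000 GB/s bandwidth figure are assumptions chosen for the example.

```python
def estimate_tokens_per_sec(
    bandwidth_gb_s: float,
    params_billion: float,
    bits_per_weight: float,
    efficiency: float = 0.80,
    active_params_billion: float = 0.0,
) -> dict:
    """Estimate single-stream decode speed from memory bandwidth.

    All names here are hypothetical; they mirror the calculator's inputs.
    """
    # For an MoE model only the active parameters are read per token.
    # 0 means "dense model": read every parameter.
    read_params = active_params_billion if active_params_billion > 0 else params_billion

    # Billions of parameters x bytes per parameter = gigabytes read per token.
    gb_per_token = read_params * bits_per_weight / 8
    if gb_per_token <= 0:
        raise ValueError("parameter count and bits per weight must be positive")

    theoretical = bandwidth_gb_s / gb_per_token
    return {
        "gb_read_per_token": gb_per_token,
        "theoretical_tps": theoretical,
        "realistic_tps": theoretical * efficiency,
    }


# Hypothetical 1000 GB/s card, 4-bit weights.
dense = estimate_tokens_per_sec(1000, 13, 4)
moe = estimate_tokens_per_sec(1000, 47, 4, active_params_billion=13)
print(dense)
print(moe)
```

With these inputs the 47B MoE with 13B active parameters produces the same estimate as the 13B dense model, because both read 6.5 GB per token.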
Why inference speed is bandwidth-bound
When a transformer generates one token during decoding, it must multiply the input by every weight matrix in the model. At batch size 1 each weight is used exactly once, so the work is dominated not by arithmetic but by the time it takes to read the weights out of memory. That makes single-stream generation memory-bandwidth bound: the GPU's compute units sit partly idle waiting for data. The simple, surprisingly accurate model is tokens/sec ≈ memory_bandwidth / bytes_read_per_token, where the bytes read are the model size in your chosen quantization.
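Written out with symbols, the estimate looks as follows. The symbol names are labels chosen here for illustration: B is memory bandwidth, P the number of parameters read per token, b the bits per weight, and η the efficiency factor. The worked numbers use a round, hypothetical 1000 GB/s card.

```latex
\[
\text{tokens/sec} \approx \eta \cdot \frac{B}{P \cdot b / 8}
\]
% Worked example (illustrative numbers):
% B = 1000 GB/s, P = 7 \times 10^9, b = 16 bits (FP16), \eta = 0.8
\[
P \cdot \frac{b}{8} = 7 \times 10^{9} \times 2\ \text{bytes} = 14\ \text{GB per token}
\]
\[
\frac{1000\ \text{GB/s}}{14\ \text{GB}} \approx 71\ \text{tokens/s (theoretical)},
\qquad
0.8 \times 71 \approx 57\ \text{tokens/s (realistic)}
\]
```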
This is why quantization speeds models up as well as shrinking them: a 4-bit model is about a quarter the bytes of FP16, so it generates roughly four times faster on the same card (slightly less in practice, because most 4-bit formats also store per-block scale factors). It is also why a Mac with unified memory can be competitive for inference despite modest compute: what matters is the 400-800 GB/s of bandwidth on the higher-end chips, not the TFLOPS. And it explains mixture-of-experts: an MoE only activates a fraction of its weights per token, so it reads far less memory and runs at roughly the speed of its active-parameter count, not its total size.
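The quantization effect can be checked with a few lines of arithmetic. This sketch assumes a hypothetical 1000 GB/s card and a 7B dense model, and it treats each format as exactly 16, 8, or 4 bits per weight, ignoring any per-block scale overhead.

```python
BANDWIDTH_GB_S = 1000.0  # hypothetical card, not a specific product
PARAMS_BILLION = 7.0     # hypothetical dense model size


def theoretical_tps(bits_per_weight: float) -> float:
    """Bandwidth divided by gigabytes of weights read per token."""
    gb_per_token = PARAMS_BILLION * bits_per_weight / 8
    return BANDWIDTH_GB_S / gb_per_token


formats = [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]
speeds = {name: theoretical_tps(bits) for name, bits in formats}
# Speed-up of each format relative to FP16 on the same card.
speedup_vs_fp16 = {name: tps / speeds["FP16"] for name, tps in speeds.items()}

for name, _ in formats:
    print(f"{name}: {speeds[name]:.1f} tok/s, {speedup_vs_fp16[name]:.1f}x FP16")
```

Halving the bits per weight halves the bytes read per token, so the estimate doubles at each step.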
The efficiency factor accounts for the gap between a card's advertised peak bandwidth and what real kernels sustain, typically 70-85% for well-optimised inference. Two important caveats apply. First, this estimates decode speed (generating new tokens), which is what you feel as "typing speed"; prompt processing (reading your input) is compute-bound and much faster per token. Second, batched serving reads each weight once for many requests at the same time, so throughput per GPU rises far above this single-stream number while each individual user still sees roughly this speed.
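The batching caveat can be illustrated with a simple roofline-style sketch. It is not part of the calculator and rests on several assumptions: roughly 2 FLOPs per parameter per generated token, no KV-cache traffic, no efficiency factor, and made-up hardware figures (1000 GB/s, 100 TFLOPS). Each decode step takes whichever is longer, reading the weights once or doing the arithmetic for the whole batch.

```python
def batched_decode_tps(
    batch_size: int,
    bandwidth_gb_s: float,
    peak_tflops: float,
    params_billion: float,
    bits_per_weight: float,
) -> tuple[float, float]:
    """Return (per-user tokens/sec, aggregate tokens/sec) for one decode step.

    Hypothetical helper; a rough roofline model, not a benchmark.
    """
    # Weights are read once per step, however many requests share the step.
    weight_gb = params_billion * bits_per_weight / 8
    memory_time_s = weight_gb / bandwidth_gb_s

    # Assumption: about 2 FLOPs per parameter per token, times the batch.
    flops_per_step = 2 * params_billion * 1e9 * batch_size
    compute_time_s = flops_per_step / (peak_tflops * 1e12)

    # The slower of the two limits sets the step time.
    step_time_s = max(memory_time_s, compute_time_s)
    per_user = 1.0 / step_time_s
    return per_user, per_user * batch_size


for batch in (1, 32, 256):
    per_user, aggregate = batched_decode_tps(batch, 1000, 100, 7, 16)
    print(f"batch {batch}: {per_user:.1f} tok/s per user, {aggregate:.0f} tok/s total")
```

In this toy model the per-user speed stays at the single-stream figure while the batch is small, aggregate throughput grows with batch size, and both level off once the arithmetic takes longer than the weight read.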
Common use cases
- Setting expectations before you buy. See whether a card will feel snappy or sluggish for the model you want to run, before spending money.
- Comparing quantizations. Watch tokens per second roughly quadruple moving from FP16 to 4-bit on the same GPU.
- Evaluating Apple Silicon. Compare a Mac's unified-memory bandwidth against a discrete GPU for local inference.
- Sizing MoE models. Estimate the real speed of a mixture-of-experts model from its active parameters, not its full size.