Tokens per Second Calculator

Estimate how fast a model will generate text on your hardware. Single-stream LLM decoding is memory-bandwidth bound: generating each token requires reading the full set of active weights from memory once, so tokens per second is roughly memory bandwidth divided by the size of those weights in bytes. Pick a GPU, set the model size and quantization, and get a realistic tokens-per-second estimate. All calculations run locally.

Outputs: weights read per token, theoretical max (100% efficiency), realistic tokens/sec, and approximate words per second.

How to use the Tokens per Second Calculator

Choose your GPU to fill in its memory bandwidth, or pick Custom and type the figure from the spec sheet. Set the model's parameter count and the quantization you will run it at — together these give the number of bytes that must be read from memory for each generated token. The realistic tokens/sec figure applies an efficiency factor (default 80%) because no kernel achieves 100% of peak bandwidth.
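The arithmetic behind these inputs can be sketched in a few lines of Python. This is a minimal illustration rather than the calculator's actual source: the function names, the example GPU figure of 1000 GB/s, and the convention that 1 GB means 10^9 bytes are all assumptions.

```python
def bytes_per_token(params_billion: float, bits_per_weight: float) -> float:
    """Bytes of weights read from memory for each generated token."""
    return params_billion * 1e9 * bits_per_weight / 8


def tokens_per_second(bandwidth_gb_s: float, params_billion: float,
                      bits_per_weight: float, efficiency: float = 0.80) -> float:
    """Bandwidth-bound decode estimate: usable bandwidth / bytes read per token."""
    usable_bytes_per_s = bandwidth_gb_s * 1e9 * efficiency
    return usable_bytes_per_s / bytes_per_token(params_billion, bits_per_weight)


# Hypothetical example: a 7B model at 4 bits on a card with 1000 GB/s.
weights_gb = bytes_per_token(7, 4) / 1e9            # 3.5 GB read per token
theoretical = tokens_per_second(1000, 7, 4, 1.0)    # about 286 tokens/s
realistic = tokens_per_second(1000, 7, 4)           # about 229 tokens/s at 80%
print(f"{weights_gb:.1f} GB/token, {theoretical:.0f} max, {realistic:.0f} realistic")
```

Setting efficiency to 1.0 gives the theoretical maximum; the default of 0.80 gives the realistic figure.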

If you are running a mixture-of-experts model, enter its active parameter count too — MoE models only read the active experts per token, so a 47B model with 13B active runs at roughly the speed of a 13B dense model. Leave it at 0 for a normal dense model. The result is the decode (generation) speed for a single request; prompt processing and batched serving behave differently, as explained below.
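The "0 means dense" convention for the MoE field could be handled roughly as follows. This is a sketch under assumptions: the function and parameter names are hypothetical, parameter counts are in billions, and 1 GB is taken as 10^9 bytes.

```python
def weights_read_gb(total_params_b: float, bits_per_weight: float,
                    active_params_b: float = 0.0) -> float:
    """GB of weights read per token; active_params_b = 0 means a dense model."""
    params_b = active_params_b if active_params_b > 0 else total_params_b
    return params_b * bits_per_weight / 8


def weights_resident_gb(total_params_b: float, bits_per_weight: float) -> float:
    """GB needed to hold the weights: every expert stays in memory."""
    return total_params_b * bits_per_weight / 8


# The 47B model with 13B active from the text, at 4 bits:
read_gb = weights_read_gb(47, 4, active_params_b=13)   # 6.5 GB read per token
held_gb = weights_resident_gb(47, 4)                   # 23.5 GB held in memory
print(f"reads {read_gb} GB per token, holds {held_gb} GB")
```

Speed follows the smaller number and memory footprint follows the larger one, which is why the two are computed separately here.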

Why inference speed is bandwidth-bound

When a transformer generates one token during decoding, it must multiply the input by every active weight matrix in the model. At batch size 1 each weight is used exactly once, so the work is dominated not by arithmetic but by the time it takes to read the weights out of memory. That makes single-stream generation memory-bandwidth bound: the GPU's compute units sit partly idle waiting for data. The simple, surprisingly accurate model is tokens/sec ≈ memory_bandwidth / bytes_read_per_token, where the bytes read are the size of the model (or of its active parameters, for an MoE) in your chosen quantization.
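A rough comparison of the two costs shows how lopsided they are at batch size 1. The sketch below assumes about 2 floating-point operations per weight per token (one multiply, one add) and a hypothetical GPU with 1000 GB/s of bandwidth and 100 TFLOPS of peak compute; neither figure comes from a real spec sheet.

```python
def decode_step_times_ms(params_billion: float, bits_per_weight: float,
                         bandwidth_gb_s: float, peak_tflops: float):
    """Return (memory_ms, compute_ms) for generating one token at batch size 1."""
    bytes_read = params_billion * 1e9 * bits_per_weight / 8
    flops = 2 * params_billion * 1e9   # assumed: one multiply + one add per weight
    memory_ms = bytes_read / (bandwidth_gb_s * 1e9) * 1e3
    compute_ms = flops / (peak_tflops * 1e12) * 1e3
    return memory_ms, compute_ms


# Hypothetical: 7B model in FP16 on a 1000 GB/s, 100 TFLOPS card.
memory_ms, compute_ms = decode_step_times_ms(7, 16, 1000, 100)
print(f"memory {memory_ms:.2f} ms, compute {compute_ms:.2f} ms")
```

Under these assumptions reading the weights takes about 14 ms while the arithmetic takes about 0.14 ms, so the memory term sets the pace.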

This is why quantization speeds models up as well as shrinking them: a 4-bit model has a quarter of the bytes of an FP16 model, so it can generate up to roughly four times faster on the same card. It is also why a Mac with unified memory can be competitive for inference despite modest compute: what matters is the 400-800 GB/s of bandwidth on higher-end Apple Silicon, not the TFLOPS. And it explains mixture-of-experts speed: an MoE activates only a fraction of its weights per token, so it reads far less memory and runs at roughly the speed of its active-parameter count, not its total size.
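To see the quantization and bandwidth effects side by side, the same formula can be tabulated. The 8B model size is a hypothetical choice that keeps the numbers round, the two bandwidth values are the ends of the range quoted above, and the results are theoretical ceilings at 100% efficiency.

```python
def max_tokens_per_second(bandwidth_gb_s: float, params_billion: float,
                          bits_per_weight: float) -> float:
    """Theoretical ceiling: bandwidth (GB/s) / weights read per token (GB)."""
    return bandwidth_gb_s / (params_billion * bits_per_weight / 8)


# Hypothetical 8B dense model at three precisions and two bandwidths.
for bandwidth in (400, 800):
    for bits in (16, 8, 4):
        rate = max_tokens_per_second(bandwidth, 8, bits)
        print(f"{bandwidth} GB/s, {bits:>2}-bit: {rate:.0f} tokens/s")
```

Halving the bits or doubling the bandwidth doubles the ceiling. Measured gains from quantization are usually somewhat smaller, since real low-bit formats typically store extra scaling data alongside the weights.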

The efficiency factor accounts for the gap between a card's advertised peak bandwidth and what real kernels sustain, typically 70-85% for well-optimised inference. Two caveats apply. First, this estimates decode speed (generating new tokens), which is what you feel as "typing speed"; prompt processing (reading your input) is compute-bound and much faster per token. Second, batched serving reads each weight once for many requests at the same time, so throughput per GPU rises far above this single-stream number while each individual user still sees roughly this per-stream speed.

Common use cases

  • Setting expectations before you buy. See whether a card will feel snappy or sluggish for the model you want to run, before spending money.
  • Comparing quantizations. Watch tokens per second roughly quadruple moving from FP16 to 4-bit on the same GPU.
  • Evaluating Apple Silicon. Compare a Mac's unified-memory bandwidth against a discrete GPU for local inference.
  • Sizing MoE models. Estimate the real speed of a mixture-of-experts model from its active parameters, not its full size.

Frequently asked questions

Why tokens per second and not FLOPS?

During single-stream decoding the GPU spends most of its time reading weights from memory, not doing arithmetic. So the bottleneck is memory bandwidth, and a bandwidth-based estimate predicts real generation speed far better than a compute (FLOPS) figure does.

My real speed is higher than this. Why?

A few possible reasons. Speculative decoding can emit several tokens per pass over the weights, and well-tuned kernels (including optimised attention such as flash-attention) can sustain an efficiency above the default 80%. Batched serving also reads each weight once for many requests, so aggregate throughput far exceeds this single-stream estimate. Raise the efficiency slider to match a benchmark you trust.

Does this include prompt processing time?

No. This estimates decode speed, meaning the generation of new tokens. Reading your prompt (prefill) is compute-bound and processes many tokens in parallel, so it is much faster per token. Total latency is the prefill time plus the decode time: latency ≈ prefill_time + output_tokens / tokens_per_sec.
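As a worked example of combining the two phases, the sketch below adds a prefill term to the decode estimate. The prefill rate is an input you would have to measure yourself, since this calculator does not estimate it, and all the numbers shown are hypothetical.

```python
def total_latency_s(prompt_tokens: int, output_tokens: int,
                    prefill_tok_s: float, decode_tok_s: float) -> float:
    """Time to read the prompt plus time to generate the reply, in seconds."""
    prefill_s = prompt_tokens / prefill_tok_s   # compute-bound phase
    decode_s = output_tokens / decode_tok_s     # bandwidth-bound phase
    return prefill_s + decode_s


# Hypothetical: 2000-token prompt at 4000 tokens/s, 500-token reply at 50 tokens/s.
latency = total_latency_s(2000, 500, prefill_tok_s=4000, decode_tok_s=50)
print(f"{latency:.1f} s")   # 0.5 s prefill + 10.0 s decode
```

With these numbers the decode phase accounts for almost all of the wait, which is typical when the reply is long relative to the prompt.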

How do I handle a mixture-of-experts model?

Enter the active parameter count in the MoE field. An MoE only reads its active experts per token, so a 47B model with ~13B active reads memory like a 13B dense model and runs at roughly that speed, even though it occupies far more VRAM.

What efficiency should I use?

Real inference kernels sustain about 70-85% of a GPU's advertised peak bandwidth, so 80% is a reasonable default. If you have benchmarked your exact setup, set the slider to whatever reproduces your measured tokens per second.
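Working backwards from a benchmark is a single division. In this sketch the function name and the measured figure of 210 tokens/s are hypothetical, and the bandwidth is the spec-sheet peak in GB/s.

```python
def calibrated_efficiency(measured_tok_s: float, bandwidth_gb_s: float,
                          params_billion: float, bits_per_weight: float) -> float:
    """Efficiency setting that makes the estimate match a measured decode speed."""
    theoretical_tok_s = bandwidth_gb_s / (params_billion * bits_per_weight / 8)
    return measured_tok_s / theoretical_tok_s


# Hypothetical benchmark: 210 tokens/s for a 7B model at 4 bits on a 1000 GB/s card.
efficiency = calibrated_efficiency(210, 1000, 7, 4)
print(f"{efficiency:.1%}")   # roughly 73.5%
```

A result well above 100% usually suggests the inputs do not describe the run, or that a technique such as speculative decoding or batching is in play.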