Speculative Decoding Speedup Calculator
Estimate how much faster speculative decoding makes LLM inference. A small draft model proposes several tokens at once and the large target model verifies them in a single pass, so accepted tokens come almost for free. The actual gain depends on three numbers: the draft's acceptance rate, how many tokens it drafts per step, and how cheap the draft is relative to the target. Enter them and the calculator returns the expected tokens per step, the net wall-clock speedup, and the draft length that maximises throughput. The math runs in your browser.
| Expected tokens per step | |
| Cost per step | |
| Net speedup | |
| Effective output speed | |
| Best draft length | |
Speedup by draft length
How to use the Speculative Decoding Speedup Calculator
Set the three parameters that govern speculative decoding. Acceptance rate α is the probability that the target model accepts a token the draft proposed — a well-matched draft on in-distribution text often lands between 0.6 and 0.8. Draft tokens per step γ is how many tokens the draft generates before each verification. Draft : target cost ratio c is how long one draft forward pass takes relative to one target forward pass; a 1B draft against a 70B target is roughly 0.05–0.15.
The results update instantly. Expected tokens per step is how many tokens you get from one target verification on average — higher acceptance and longer drafts push it up. Cost per step measures the work of one iteration in units of a target forward pass, including the γ draft passes. Net speedup is the ratio of the two: how much faster than ordinary one-token-at-a-time decoding you run. If you fill in the optional target speed, the tool multiplies it by the speedup to give an effective tokens-per-second figure.
The table at the bottom sweeps the draft length from 1 to 10 and marks both your current choice and the length that maximises speedup. Longer is not always better: each extra draft token costs compute whether or not it is accepted, so beyond a point the wasted drafts outweigh the gain. Use the highlighted best row to tune γ for your acceptance rate and cost ratio.
How speculative decoding speedup is calculated
Ordinary autoregressive decoding generates one token per forward pass of the large model, and each pass is memory-bound: the GPU spends most of its time reading the weights, not computing. Speculative decoding exploits the spare compute. A cheap draft model proposes γ tokens; the target model then runs once over all γ proposals in parallel and checks them. Thanks to a rejection-sampling scheme, every token that is emitted, whether an accepted draft or the target's replacement after a rejection, is provably distributed exactly as the target would have produced on its own, so the output distribution is identical and only the speed changes.
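The text above does not spell out the rejection-sampling rule. The sketch below assumes the commonly described one: keep a drafted token x with probability min(1, p(x)/q(x)), and on rejection resample from the normalised residual max(0, p − q). The distributions `p` and `q` and the helper names are hypothetical, chosen only to illustrate that the emitted tokens follow the target distribution.

```python
import random

def speculative_token(p, q, rng):
    """Draw one token via draft-then-verify; returns (token, accepted_flag)."""
    tokens = range(len(p))
    # The draft proposes a token from its own distribution q.
    x = rng.choices(tokens, weights=q)[0]
    # The target keeps it with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # Otherwise resample from the residual max(0, p - q).
    # random.choices normalises the weights for us.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(tokens, weights=residual)[0], False

def simulate(p, q, trials=100_000, seed=0):
    """Return (empirical output frequencies, empirical acceptance rate)."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    accepted = 0
    for _ in range(trials):
        token, ok = speculative_token(p, q, rng)
        counts[token] += 1
        accepted += ok
    return [n / trials for n in counts], accepted / trials

if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # hypothetical target distribution
    q = [0.3, 0.5, 0.2]  # hypothetical draft distribution
    freqs, alpha = simulate(p, q)
    print("output frequencies:", freqs)
    print("acceptance rate:", alpha)
```

For this toy pair the output frequencies should sit close to `p`, and the measured acceptance rate should sit close to the sum of min(p, q) over tokens, which is 0.8 here. That measured rate is the quantity the calculator calls α.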
The expected number of tokens produced per target pass follows directly from the acceptance rate. If each token is accepted independently with probability α, the expected number of accepted tokens before the first rejection, plus the one token the target always contributes (a correction after a rejection, or a bonus token when every draft is accepted), is 1 + α + … + α^γ = (1 − α^(γ+1)) / (1 − α). With α = 0.8 and γ = 5 that is about 3.7 tokens per verification, versus exactly 1 for plain decoding. The cost of that iteration is one target pass plus γ draft passes, which in target-pass units is γ·c + 1. The net speedup is the expected tokens divided by the cost: (1 − α^(γ+1)) / ((1 − α)(γ·c + 1)).
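The three quantities above translate directly into code. This is a minimal sketch of the formulas as stated, not the calculator's actual source. The function names are illustrative, and the α = 1 branch is an added assumption that uses the limit of the geometric sum to avoid dividing by zero.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens per target pass: (1 - alpha^(gamma+1)) / (1 - alpha)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        # Limit case: all gamma drafts accepted, plus the target's own token.
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def step_cost(gamma: int, c: float) -> float:
    """Cost of one iteration in target-pass units: gamma draft passes plus one target pass."""
    return gamma * c + 1.0

def net_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected tokens per step divided by the cost of that step."""
    return expected_tokens(alpha, gamma) / step_cost(gamma, c)

if __name__ == "__main__":
    alpha, gamma, c = 0.7, 4, 0.1  # example inputs, not measurements
    print(f"expected tokens per step: {expected_tokens(alpha, gamma):.3f}")
    print(f"cost per step:            {step_cost(gamma, c):.3f}")
    print(f"net speedup:              {net_speedup(alpha, gamma, c):.3f}")
```

With the example inputs α = 0.7, γ = 4, c = 0.1, this gives about 2.77 tokens per step at a cost of 1.4 target passes, for a net speedup of about 1.98×.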
That formula explains the trade-offs. Raising γ increases the tokens you might accept but also raises the guaranteed draft cost, so there is an optimum that depends on α and c — which is what the table finds. A faster, cheaper draft (lower c) makes longer drafts worthwhile. A better-aligned draft (higher α) raises the ceiling on achievable speedup. In practice 2–3× is typical, with well-tuned setups reaching higher; the gain is real but bounded by how often the small model guesses what the big one would say. This estimate models the standard single-draft scheme and ignores second-order effects like batching, tree-based drafting, and verification overhead.
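The optimum can be found by brute force, since the sweep covers only draft lengths 1 to 10. The sketch below assumes the same closed-form speedup and uses hypothetical function names; it is one way to reproduce the sweep, not necessarily how the tool does it.

```python
def net_speedup(alpha: float, gamma: int, c: float) -> float:
    """(1 - alpha^(gamma+1)) / ((1 - alpha) * (gamma*c + 1)), with the alpha = 1 limit."""
    tokens = float(gamma + 1) if alpha == 1.0 else (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    return tokens / (gamma * c + 1.0)

def sweep(alpha: float, c: float, max_gamma: int = 10):
    """Return a list of (gamma, speedup) rows for gamma = 1..max_gamma."""
    return [(g, net_speedup(alpha, g, c)) for g in range(1, max_gamma + 1)]

def best_draft_length(alpha: float, c: float, max_gamma: int = 10) -> int:
    """Draft length with the highest speedup in the swept range."""
    return max(sweep(alpha, c, max_gamma), key=lambda row: row[1])[0]

if __name__ == "__main__":
    alpha, c = 0.8, 0.1  # example inputs, not measurements
    best = best_draft_length(alpha, c)
    for g, s in sweep(alpha, c):
        marker = "  <- best" if g == best else ""
        print(f"gamma={g:2d}  speedup={s:.3f}{marker}")
```

For α = 0.8 and c = 0.1 the speedup peaks at γ = 6 and then declines. Halving c to 0.05 moves the peak out to γ = 8, and dropping α to 0.6 (at c = 0.1) pulls it in to γ = 3, which matches the trade-offs described above.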
Common use cases
- Deciding whether to adopt it. Estimate the speedup for your model pair before investing in a speculative-decoding serving setup.
- Choosing a draft model. Compare a tiny fast draft (low c, lower α) against a larger closer one (higher c, higher α) to see which wins.
- Tuning draft length. Find the γ that maximises throughput for your measured acceptance rate instead of guessing.
- Setting throughput expectations. Convert a known baseline tokens-per-second into the speculative figure for capacity planning.
- Explaining the technique. Show the acceptance-rate-to-speedup curve to make the cost-benefit concrete for a team.
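Two of these use cases, choosing a draft model and setting throughput expectations, can be combined in a few lines. The sketch below compares two made-up candidate drafts at their respective best draft lengths and scales a baseline speed by the winner's speedup. All numbers (α, c, and the 30 tokens/s baseline) are hypothetical placeholders to be replaced with measured values.

```python
def net_speedup(alpha: float, gamma: int, c: float) -> float:
    """Closed-form speedup, with the alpha = 1 limit handled separately."""
    tokens = float(gamma + 1) if alpha == 1.0 else (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    return tokens / (gamma * c + 1.0)

def best_config(alpha: float, c: float, max_gamma: int = 10):
    """Return (best gamma, speedup at that gamma) over gamma = 1..max_gamma."""
    return max(((g, net_speedup(alpha, g, c)) for g in range(1, max_gamma + 1)),
               key=lambda row: row[1])

# Hypothetical candidates: name -> (acceptance rate, draft:target cost ratio).
CANDIDATES = {
    "tiny draft": (0.6, 0.05),
    "larger draft": (0.8, 0.15),
}

def compare(candidates, baseline_tps: float):
    """Return {name: (best gamma, speedup, effective tokens/s)}."""
    results = {}
    for name, (alpha, c) in candidates.items():
        gamma, speedup = best_config(alpha, c)
        results[name] = (gamma, speedup, baseline_tps * speedup)
    return results

if __name__ == "__main__":
    for name, (gamma, speedup, tps) in compare(CANDIDATES, baseline_tps=30.0).items():
        print(f"{name}: best gamma={gamma}, speedup={speedup:.2f}x, ~{tps:.1f} tokens/s")
```

Under these placeholder numbers the larger, better-aligned draft comes out ahead (about 2.1× at γ = 5 versus about 1.9× at γ = 4), but the ranking can flip with different measured values of α and c.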