Speculative Decoding Speedup Calculator

Estimate how much faster speculative decoding makes LLM inference. A small draft model proposes several tokens at once and the large target model verifies them in a single pass, so accepted tokens come almost for free. The actual gain depends on three numbers: the draft's acceptance rate, how many tokens it drafts per step, and how cheap the draft is relative to the target. Enter them and the calculator returns the expected tokens per step, the net wall-clock speedup, and the draft length that maximises throughput. The math runs in your browser.

How to use the Speculative Decoding Speedup Calculator

Set the three parameters that govern speculative decoding. Acceptance rate α is the probability that the target model accepts a token the draft proposed — a well-matched draft on in-distribution text often lands between 0.6 and 0.8. Draft tokens per step γ is how many tokens the draft generates before each verification. Draft : target cost ratio c is how long one draft forward pass takes relative to one target forward pass; a 1B draft against a 70B target is roughly 0.05–0.15.

The results update instantly. Expected tokens per step is how many tokens you get from one target verification on average; higher acceptance and longer drafts push it up. Cost per step measures the work of one iteration in units of a target forward pass: one target pass plus the γ draft passes. Net speedup is expected tokens per step divided by cost per step, that is, how much faster you run than ordinary one-token-at-a-time decoding. If you fill in the optional target speed, the tool multiplies it by the speedup to give an effective tokens-per-second figure.
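
The three outputs described above can be reproduced in a few lines. This is a minimal sketch under the calculator's stated model, not the tool's actual source; the function names and the baseline speed of 20 tokens per second are assumptions chosen for illustration.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens per target verification: accepted drafts plus one bonus token."""
    if alpha >= 1.0:  # limit of the formula when every draft is accepted
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def step_cost(gamma: int, c: float) -> float:
    """Cost of one iteration in target-pass units: gamma draft passes plus one target pass."""
    return gamma * c + 1.0


def net_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected tokens per step divided by cost per step."""
    return expected_tokens(alpha, gamma) / step_cost(gamma, c)


alpha, gamma, c = 0.8, 5, 0.1   # illustrative inputs
target_tps = 20.0               # hypothetical baseline speed, tokens per second

speedup = net_speedup(alpha, gamma, c)
print(f"expected tokens per step: {expected_tokens(alpha, gamma):.2f}")  # 3.69
print(f"cost per step:            {step_cost(gamma, c):.2f}")            # 1.50
print(f"net speedup:              {speedup:.2f}x")                       # 2.46x
print(f"effective output speed:   {target_tps * speedup:.1f} tok/s")     # 49.2 tok/s
```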

The table at the bottom sweeps the draft length from 1 to 10 and marks both your current choice and the length that maximises speedup. Longer is not always better: each extra draft token costs compute whether or not it is accepted, so beyond a point the wasted drafts outweigh the gain. Use the highlighted best row to tune γ for your acceptance rate and cost ratio.
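
The sweep behind that table can be sketched as follows. The helper names are hypothetical and the code assumes α is strictly below 1; it simply evaluates the speedup formula for γ from 1 to 10 and picks the largest value.

```python
def net_speedup(alpha: float, gamma: int, c: float) -> float:
    """Speedup under the i.i.d. acceptance model (requires alpha < 1)."""
    return (1.0 - alpha ** (gamma + 1)) / ((1.0 - alpha) * (gamma * c + 1.0))


def sweep(alpha: float, c: float, max_gamma: int = 10) -> list[tuple[int, float]]:
    """Return (gamma, speedup) rows for gamma = 1..max_gamma."""
    return [(g, net_speedup(alpha, g, c)) for g in range(1, max_gamma + 1)]


def best_draft_length(alpha: float, c: float, max_gamma: int = 10) -> int:
    """Draft length with the highest speedup in the swept range."""
    return max(sweep(alpha, c, max_gamma), key=lambda row: row[1])[0]


alpha, c, current_gamma = 0.8, 0.1, 5   # illustrative inputs
best = best_draft_length(alpha, c)
for g, s in sweep(alpha, c):
    marks = ("  <- current" if g == current_gamma else "") + ("  <- best" if g == best else "")
    print(f"gamma={g:2d}  speedup={s:.3f}x{marks}")
```

With these inputs the peak lands at γ = 6, one token longer than the current choice of 5, and the speedup declines from γ = 7 onward.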

How speculative decoding speedup is calculated

Ordinary autoregressive decoding generates one token per forward pass of the large model, and each pass is memory-bound: the GPU spends most of its time reading the weights, not computing. Speculative decoding exploits the spare compute. A cheap draft model proposes γ tokens; the target model then runs once over all γ proposals in parallel and checks them. Thanks to a rejection-sampling scheme, every emitted token, whether an accepted draft or the target's replacement after a rejection, is provably distributed exactly as if the target had produced it on its own, so the output distribution is identical and only the speed changes.
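
The rejection-sampling check for a single drafted token can be sketched like this. It follows the standard rule from the speculative-sampling literature (accept with probability min(1, p/q), otherwise resample from the normalised positive part of p − q); the function name and the dictionary representation of the distributions are assumptions for illustration, and a real implementation would work on logit tensors.

```python
import random


def verify_token(token: str, p: dict[str, float], q: dict[str, float],
                 rng: random.Random) -> tuple[bool, str]:
    """Check one drafted token.

    p and q map each token to its probability under the target and the draft
    at this position. Returns (accepted, emitted_token).
    """
    # Accept the draft's token with probability min(1, p(token) / q(token)).
    accept_prob = min(1.0, p.get(token, 0.0) / q[token])
    if rng.random() < accept_prob:
        return True, token
    # On rejection, sample a replacement from the residual max(0, p - q).
    # A rejection implies p(token) < q(token), so the residual has positive mass.
    residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
    tokens = list(residual)
    weights = [residual[t] for t in tokens]
    return False, rng.choices(tokens, weights=weights, k=1)[0]


# Toy check: emitted tokens should follow p even though drafts come from q.
rng = random.Random(0)
p = {"a": 0.7, "b": 0.3}   # hypothetical target distribution
q = {"a": 0.4, "b": 0.6}   # hypothetical draft distribution
n = 20_000
count_a = 0
for _ in range(n):
    drafted = rng.choices(list(q), weights=list(q.values()), k=1)[0]
    _, emitted = verify_token(drafted, p, q, rng)
    count_a += emitted == "a"
print(f"empirical P(a) = {count_a / n:.3f}, target P(a) = {p['a']}")
```

In this toy case the draft's "a" is always accepted, its "b" is accepted half the time, and every rejected "b" is replaced by "a", which gives 0.4 + 0.3 = 0.7 for "a", matching the target.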

The expected number of tokens produced per target pass follows directly from the acceptance rate. If each token is accepted independently with probability α, the expected number of accepted tokens before the first rejection, plus one bonus token the target always contributes, is (1 − α^(γ+1)) / (1 − α). With α = 0.8 and γ = 5 that is about 3.7 tokens per verification, versus exactly 1 for plain decoding. The cost of that iteration is one target pass plus γ draft passes, which in target-pass units is γ·c + 1. The net speedup is the expected tokens divided by the cost: (1 − α^(γ+1)) / ((1 − α)(γ·c + 1)).
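
For readers who want the intermediate step, the closed form is a finite geometric sum. The derivation below assumes the same independent-acceptance model; the symbol N for the number of tokens emitted in one step is introduced here for illustration.

```latex
% The i-th drafted token is emitted only if drafts 1..i are all accepted
% (probability \alpha^i); the bonus token is always emitted (the k = 0 term).
\begin{align*}
\mathbb{E}[N]
  &= 1 + \sum_{i=1}^{\gamma} \alpha^{i}
   = \sum_{k=0}^{\gamma} \alpha^{k}
   = \frac{1 - \alpha^{\gamma+1}}{1 - \alpha}, \qquad 0 \le \alpha < 1, \\[4pt]
\text{speedup}
  &= \frac{\mathbb{E}[N]}{\gamma c + 1}
   = \frac{1 - \alpha^{\gamma+1}}{(1 - \alpha)(\gamma c + 1)} .
\end{align*}
% Check with \alpha = 0.8, \gamma = 5:
% 1 + 0.8 + 0.64 + 0.512 + 0.4096 + 0.32768 = 3.68928.
```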

That formula explains the trade-offs. Raising γ increases the tokens you might accept but also raises the guaranteed draft cost, so there is an optimum that depends on α and c, which is what the table finds. A faster, cheaper draft (lower c) makes longer drafts worthwhile. A better-aligned draft (higher α) raises the ceiling on achievable speedup: because expected tokens per step never exceed 1 / (1 − α) and the cost per step is at least 1, the speedup stays below 1 / (1 − α) for any γ. In practice 2–3× is typical, with well-tuned setups reaching higher; the gain is real but bounded by how often the small model guesses what the big one would say. This estimate models the standard single-draft scheme, treats verifying γ drafted tokens as costing the same as one ordinary target pass, and ignores second-order effects like batching and tree-based drafting.

Common use cases

  • Deciding whether to adopt it. Estimate the speedup for your model pair before investing in a speculative-decoding serving setup.
  • Choosing a draft model. Compare a tiny fast draft (low c, lower α) against a larger closer one (higher c, higher α) to see which wins.
  • Tuning draft length. Find the γ that maximises throughput for your measured acceptance rate instead of guessing.
  • Setting throughput expectations. Convert a known baseline tokens-per-second into the speculative figure for capacity planning.
  • Explaining the technique. Show the acceptance-rate-to-speedup curve to make the cost-benefit concrete for a team.
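
The draft-model comparison from the list above can be sketched as follows. Both candidates and their (α, c) values are hypothetical placeholders, not measurements; substitute the numbers you observe for your own model pair.

```python
def net_speedup(alpha: float, gamma: int, c: float) -> float:
    """Speedup under the i.i.d. acceptance model (requires alpha < 1)."""
    return (1.0 - alpha ** (gamma + 1)) / ((1.0 - alpha) * (gamma * c + 1.0))


def best_config(alpha: float, c: float, max_gamma: int = 10) -> tuple[int, float]:
    """Return (best gamma, speedup at that gamma) over gamma = 1..max_gamma."""
    rows = ((g, net_speedup(alpha, g, c)) for g in range(1, max_gamma + 1))
    return max(rows, key=lambda row: row[1])


# Hypothetical candidates: name -> (acceptance rate alpha, cost ratio c).
candidates = {
    "tiny draft": (0.6, 0.05),
    "larger draft": (0.8, 0.15),
}

results = {name: best_config(a, c) for name, (a, c) in candidates.items()}
winner = max(results, key=lambda name: results[name][1])

for name, (g, s) in results.items():
    print(f"{name}: best gamma={g}, speedup={s:.2f}x")
print(f"winner: {winner}")
```

With these placeholder values the larger draft wins (about 2.11× at γ = 5 against about 1.92× at γ = 4), because its higher acceptance rate more than pays for its higher cost. A different pair of numbers can easily reverse the result.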

Frequently asked questions

Does speculative decoding change the model output?

No. The verification step uses a rejection-sampling rule that guarantees every emitted token follows exactly the target model's distribution. The output is statistically identical to what the target would have generated on its own: speculative decoding only makes it faster and does not approximate or degrade quality.

What acceptance rate should I expect?

It depends on how well the draft model predicts the target. A draft from the same family and tokenizer on in-distribution text often reaches 0.6–0.8. Out-of-distribution prompts, code, or a poorly matched draft can drop it well below 0.5. The only reliable way to know is to measure on your own traffic, but those ranges are good starting points.
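
If your serving logs record how many drafted tokens were accepted in each step, α can be estimated from them. The sketch below assumes the same independent-acceptance model as the calculator and a hypothetical log format (one accepted-token count per step). Note that dividing accepted tokens by γ × steps underestimates α, because drafts after the first rejection are never actually tested.

```python
def estimate_alpha(accepted_per_step: list[int], gamma: int) -> float:
    """Maximum-likelihood estimate of alpha under the i.i.d. acceptance model.

    Each step tests drafts in order until one is rejected or all gamma pass,
    so the number of tested drafts is the accepted count plus one rejection
    whenever the step stopped early.
    """
    successes = sum(accepted_per_step)
    rejections = sum(1 for n in accepted_per_step if n < gamma)
    trials = successes + rejections
    if trials == 0:
        raise ValueError("no verification steps recorded")
    return successes / trials


# Hypothetical log: accepted drafts in five consecutive steps with gamma = 5.
log = [5, 3, 0, 5, 2]
print(f"estimated alpha: {estimate_alpha(log, gamma=5):.3f}")      # 15 / 18
print(f"naive ratio:     {sum(log) / (5 * len(log)):.3f}")         # 15 / 25
```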

How do I estimate the draft-to-target cost ratio c?

Roughly, it scales with the ratio of the models' active parameter counts and their relative efficiency. A 1B draft against a 70B target is on the order of 0.05–0.15 per forward pass; a 7B draft against a 70B target is higher. Measuring each model's single-token latency on your hardware and dividing gives the most accurate value.
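
The measurement approach can be sketched as below. The latency samples are made-up placeholders, and using the median to damp outliers is a choice made here, not something the calculator prescribes.

```python
from statistics import median


def cost_ratio(draft_latencies_ms: list[float], target_latencies_ms: list[float]) -> float:
    """Estimate c as median draft latency over median target latency per token."""
    target = median(target_latencies_ms)
    if target <= 0:
        raise ValueError("target latency must be positive")
    return median(draft_latencies_ms) / target


# Hypothetical single-token decode latencies measured on the same hardware.
draft_ms = [2.9, 3.1, 3.0]
target_ms = [30.5, 29.8, 30.1]
print(f"c = {cost_ratio(draft_ms, target_ms):.3f}")
```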

Why is longer draft length not always better?

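Every drafted token costs a draft forward pass whether or not it is accepted. A drafted token only counts if every token before it in the same step was also accepted, so under this model the k-th draft is kept with probability α^k, which shrinks as k grows while its cost stays at c. Once that probability gets low, the extra draft work outweighs the chance of accepting it. That is why the speedup curve typically rises, peaks, and then falls; the table marks the peak so you can pick the most efficient γ.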
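
The turning point can be written as a simple condition. This is a short derivation under the same model, with S(γ) denoting the net speedup at draft length γ (a symbol introduced here for illustration).

```latex
% Adding one more draft token raises expected tokens by \alpha^{\gamma+1}
% and raises the cost by c.
\begin{align*}
S(\gamma) &= \frac{E(\gamma)}{\gamma c + 1},
\qquad E(\gamma + 1) = E(\gamma) + \alpha^{\gamma+1}, \\[4pt]
S(\gamma + 1) > S(\gamma)
  &\iff \frac{E(\gamma) + \alpha^{\gamma+1}}{\gamma c + 1 + c} > \frac{E(\gamma)}{\gamma c + 1}
   \iff \frac{\alpha^{\gamma+1}}{c} > S(\gamma).
\end{align*}
% Example with \alpha = 0.8, c = 0.1:
%   \gamma = 5: 0.8^6 / 0.1 \approx 2.62 > S(5) \approx 2.46, so a 6th draft helps;
%   \gamma = 6: 0.8^7 / 0.1 \approx 2.10 < S(6) \approx 2.47, so a 7th does not.
```

In words: another draft token is worth adding only while its expected yield per unit of cost still beats the speedup you already have.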

Does this account for batching and tree drafting?

No. This models the classic single-sequence, linear-draft scheme from the original speculative-decoding papers. Continuous batching, tree-based or Medusa-style multi-branch drafting, and self-speculative methods change the arithmetic and can do better. Treat the result as a clear baseline for the standard technique rather than an upper bound on every variant.