Perplexity Calculator

Compute perplexity from a list of per-token log-probabilities. Paste the values your model returned (log-probabilities in natural log, log2 or log10, or raw probabilities) and get the perplexity along with cross-entropy in both nats and bits, the average and total log-probability, and the overall sequence probability. Perplexity is the standard intrinsic measure of how well a language model predicts a text, and everything is calculated in your browser.

How to use the Perplexity Calculator

Paste one value per token. Most APIs return log-probabilities in natural log, so that's the default — but if yours uses log2 or log10, switch the log base and the maths adjusts. If you have raw probabilities between 0 and 1 instead, choose "Probabilities" and they'll be converted. The result updates as you type: perplexity, cross-entropy in nats and bits per token, the average and total log-probability, and the probability of the whole sequence.
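The base conversion described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the calculator's actual code: the function name `to_natural_log` and the mode names `"ln"`, `"log2"`, `"log10"` and `"prob"` are hypothetical, and rejecting a probability of exactly 0 is a choice made here because its logarithm is undefined.

```python
import math

# Hypothetical mode names; the calculator's own option labels may differ.
# A log in base b becomes a natural log when multiplied by ln(b).
_LN_SCALE = {"ln": 1.0, "log2": math.log(2.0), "log10": math.log(10.0)}


def to_natural_log(values, mode="ln"):
    """Convert per-token values to natural-log probabilities."""
    if mode == "prob":
        for p in values:
            # ln(0) is undefined, so a zero probability is rejected here.
            if not 0.0 < p <= 1.0:
                raise ValueError(f"probability must be in (0, 1]: {p}")
        return [math.log(p) for p in values]
    if mode not in _LN_SCALE:
        raise ValueError(f"unknown mode: {mode}")
    scale = _LN_SCALE[mode]
    return [v * scale for v in values]
```

Converting everything to natural log first means the rest of the arithmetic only has to handle one unit.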

Perplexity is defined as the exponential of the average negative log-probability, so a lower number is better: it's the effective number of equally-likely options the model was choosing between at each step. A perplexity of 1 means perfect certainty; a value equal to your vocabulary size means the model did no better than guessing uniformly at random over the whole vocabulary. Comparing perplexity is only meaningful between models that share the same tokenizer and are measured on the same text, since both the token count and the segmentation affect the score.
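The two boundary cases above can be checked directly. The following is a minimal sketch that assumes natural-log inputs; the vocabulary size of 50,000 is an illustrative number, not a value taken from any particular model.

```python
import math


def perplexity(logprobs):
    """Exponential of the average negative natural-log probability."""
    if not logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(logprobs) / len(logprobs))


# Perfect certainty: every token had probability 1, so every log-prob is 0.
perfect = perplexity([0.0] * 5)

# Uniform guessing: every token had probability 1 / vocab_size.
vocab_size = 50_000  # illustrative vocabulary size
uniform = perplexity([math.log(1 / vocab_size)] * 5)
```

Here `perfect` comes out as 1 and `uniform` comes out as the vocabulary size, up to floating-point rounding.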

What perplexity measures

Perplexity is the most common intrinsic metric for a language model: a single number summarising how surprised the model is by a piece of text. Formally it's the exponential of the cross-entropy between the model's predicted distribution and the actual tokens, perplexity = exp(−(1/N) Σ ln p(token_i)), where N is the number of tokens and each p(token_i) is the probability the model assigned to the token that actually occurred. Because it's an exponential of an average log-probability, it has an intuitive reading: a perplexity of 20 means the model was, on average, as uncertain as if it had been choosing uniformly among 20 possibilities at each position.
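The "choosing among 20 possibilities" reading follows from rewriting the definition as a geometric mean. The short derivation below uses PPL as shorthand for perplexity and assumes natural logarithms throughout.

```latex
\[
\mathrm{PPL}
  = \exp\!\Big(-\frac{1}{N}\sum_{i=1}^{N} \ln p(\text{token}_i)\Big)
  = \Big(\prod_{i=1}^{N} p(\text{token}_i)\Big)^{-1/N}
\]
% Uniform case: if p(token_i) = 1/20 at every position, then
\[
\mathrm{PPL} = \Big(\big(\tfrac{1}{20}\big)^{N}\Big)^{-1/N} = 20 .
\]
```

In other words, perplexity is the reciprocal of the geometric mean of the per-token probabilities, and the product inside the brackets is the probability of the whole sequence.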

The connection to cross-entropy is why perplexity is reported in two related ways. Cross-entropy measured in nats (natural log) or bits (log base 2) is the average surprise per token; perplexity is simply that quantity exponentiated, turning an additive information measure into a multiplicative "branching factor". Bits-per-token and perplexity carry the same information (perplexity = 2^bits), so you'll see both in papers and dashboards depending on convention. Lower cross-entropy means the probability mass landed on the right tokens, which means lower perplexity.
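The relationship between these quantities can be sketched as one small function. This is an illustration assuming natural-log inputs; the function name `summarize`, the dictionary keys and the four sample log-probs are made up for the example.

```python
import math


def summarize(logprobs):
    """Derive the related metrics from natural-log token probabilities."""
    n = len(logprobs)
    total = sum(logprobs)                  # total log-probability (nats)
    ce_nats = -total / n                   # cross-entropy in nats per token
    ce_bits = ce_nats / math.log(2.0)      # same quantity in bits per token
    return {
        "avg_logprob": total / n,
        "total_logprob": total,
        "cross_entropy_nats": ce_nats,
        "cross_entropy_bits": ce_bits,
        "perplexity": math.exp(ce_nats),
        "sequence_probability": math.exp(total),
    }


# Made-up log-probs for a four-token sequence.
stats = summarize([-0.25, -1.2, -0.05, -2.3])
```

For any input, `math.exp(stats["cross_entropy_nats"])` and `2 ** stats["cross_entropy_bits"]` agree up to rounding, which is the "same information, different units" point in code form.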

The important caveat is comparability. Perplexity is a per-token average, so it depends on the tokenizer: a model that splits the same text into more, smaller tokens spreads the total log-probability over more positions than one using larger units. It also depends on the exact text evaluated. Cross-model perplexity comparisons are therefore valid only when the tokenizer and evaluation set are held fixed, which is why benchmark suites pin both. Used carefully, though, perplexity is an efficient, label-free way to track whether a model (or a fine-tune, or a quantization) is getting better or worse at modelling your data, long before you run more expensive task-specific evaluations.
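A toy calculation shows why the tokenizer matters. The numbers below are invented: two hypothetical models assign the same total log-probability to the same text, but one tokenizer splits it into 4 tokens and the other into 6.

```python
import math


def perplexity_from_total(total_logprob, n_tokens):
    """Per-token perplexity from a total natural-log probability."""
    return math.exp(-total_logprob / n_tokens)


total_logprob = -12.0  # invented total, in nats, for one piece of text

ppl_coarse = perplexity_from_total(total_logprob, 4)  # fewer, larger tokens
ppl_fine = perplexity_from_total(total_logprob, 6)    # more, smaller tokens
```

Both models found the text equally likely overall, yet `ppl_fine` is well below `ppl_coarse` simply because the same total is divided by a larger token count.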

Common use cases

  • Model comparison. Rank checkpoints or fine-tunes on a held-out set when the tokenizer is fixed.
  • Quantization checks. Measure how much perplexity rises after 8-bit or 4-bit quantization.
  • Domain fit. See how well a base model predicts text from your specific domain before fine-tuning.
  • Sanity checks. Turn a stream of API log-probs into an interpretable quality number.
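As one way to run the quantization check above, the sketch below compares two sets of log-probs scored on the same tokens. The helper name `relative_increase` and the sample values are hypothetical, and reporting a relative change is just one reasonable convention.

```python
import math


def perplexity(logprobs):
    """Exponential of the average negative natural-log probability."""
    return math.exp(-sum(logprobs) / len(logprobs))


def relative_increase(baseline_logprobs, quantized_logprobs):
    """Fractional rise in perplexity from baseline to quantized model."""
    # Both runs must score the same tokens for the comparison to be valid.
    if len(baseline_logprobs) != len(quantized_logprobs):
        raise ValueError("both runs must score the same tokens")
    base = perplexity(baseline_logprobs)
    quant = perplexity(quantized_logprobs)
    return (quant - base) / base


# Invented log-probs for the same three tokens under each model.
baseline = [-0.9, -1.1, -1.0]
quantized = [-1.0, -1.2, -1.1]
change = relative_increase(baseline, quantized)
```

With these invented values the average log-prob drops by 0.1 nats, so `change` equals e^0.1 − 1, a rise of roughly 10%.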

Frequently asked questions

What input does it expect?

One value per token, either as log-probabilities (the log of the probability the model gave each actual token) or as raw probabilities between 0 and 1. Log-probabilities are what most APIs return; choose the matching log base (natural, log2 or log10) so the conversion is correct.
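Reading pasted input into numbers might look like the sketch below. The function name `parse_values` is hypothetical, and accepting commas or semicolons as separators in addition to line breaks is an assumption made for convenience, not documented behaviour of the calculator.

```python
import re


def parse_values(text):
    """Split pasted text on whitespace, commas or semicolons into floats."""
    pieces = [p for p in re.split(r"[\s,;]+", text.strip()) if p]
    # float() raises ValueError on anything that is not a number.
    return [float(p) for p in pieces]
```

For example, `parse_values("-0.25\n-1.2\n\n-0.05")` skips the blank line and returns three values.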

Why is lower perplexity better?

Perplexity is the exponential of the model's average negative log-probability on the text, which you can read as the effective number of choices it was deciding between per token. A perfect model that always assigned probability 1 to the right token would have a perplexity of 1; higher numbers mean more uncertainty.

What's the relationship between perplexity and cross-entropy?

They're the same information in different units. Cross-entropy is the average negative log-probability per token (in nats for natural log, bits for log2); perplexity is that value exponentiated. So perplexity equals e^(cross-entropy in nats) and equals 2^(cross-entropy in bits).

Can I compare perplexity across different models?

Only if they use the same tokenizer and you evaluate on the same text. Perplexity depends on how text is split into tokens and on the specific dataset, so comparisons are valid only when both are held constant — otherwise you're comparing different things.

Is my data sent to a server?

No. The calculation is pure arithmetic done entirely in your browser. Nothing you paste is uploaded or stored.