Perplexity Calculator
Compute perplexity from a list of per-token log-probabilities. Paste the log-probs your model returned (in natural log, log2 or log10, or as raw probabilities) and get the perplexity along with cross-entropy in both nats and bits, the average and total log-probability, and the overall sequence probability. Perplexity is the standard intrinsic measure of how well a language model predicts a text, and everything is calculated in your browser.
How to use the Perplexity Calculator
Paste one value per token. Most APIs return log-probabilities in natural log, so that's the default — but if yours uses log2 or log10, switch the log base and the maths adjusts. If you have raw probabilities between 0 and 1 instead, choose "Probabilities" and they'll be converted. The result updates as you type: perplexity, cross-entropy in nats and bits per token, the average and total log-probability, and the probability of the whole sequence.
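The base switch described above amounts to rescaling every value into natural log before anything else is computed. The following is a minimal sketch of that step, not the calculator's actual code; the function name `to_natural_log` and the `kind` labels are hypothetical.

```python
import math

def to_natural_log(values, kind="ln"):
    """Convert per-token inputs to natural-log probabilities.

    `kind` is a hypothetical label for the input format:
    "ln", "log2", "log10", or "prob" (raw probabilities).
    """
    if kind == "ln":
        return list(values)
    if kind == "log2":
        # ln p = log2 p * ln 2
        return [v * math.log(2) for v in values]
    if kind == "log10":
        # ln p = log10 p * ln 10
        return [v * math.log(10) for v in values]
    if kind == "prob":
        # log(0) is undefined, so zero and out-of-range values are rejected.
        for p in values:
            if not 0.0 < p <= 1.0:
                raise ValueError(f"probability out of range: {p}")
        return [math.log(p) for p in values]
    raise ValueError(f"unknown input kind: {kind}")
```

Once everything is in natural log, the rest of the calculation does not need to know which format the input arrived in.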
Perplexity is defined as the exponential of the average negative log-probability, so a lower number is better: it's the effective number of equally likely options the model was choosing between at each step. A perplexity of 1 means perfect certainty; a value equal to your vocabulary size means the model did no better than guessing uniformly at random over the vocabulary. Comparing perplexity is only meaningful between models that share the same tokenizer and are measured on the same text, since both the token count and the segmentation affect the score.
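The two boundary cases can be checked directly. This is a small sketch; the vocabulary size of 50,000 is an illustrative assumption, not a property of any particular model.

```python
import math

def perplexity(log_probs):
    """Exponential of the average negative natural-log probability."""
    return math.exp(-sum(log_probs) / len(log_probs))

vocab_size = 50_000  # illustrative assumption

# Every token predicted with probability 1: perfect certainty.
certain = [math.log(1.0)] * 5

# Every token predicted with probability 1/V: uniform guessing.
uniform = [math.log(1.0 / vocab_size)] * 5

print(perplexity(certain))  # 1.0
print(perplexity(uniform))  # approximately 50000
```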
What perplexity measures
Perplexity is the most common intrinsic metric for a language model: a single number summarising how surprised the model is by a piece of text. Formally it's the exponential of the cross-entropy between the model's predicted distribution and the actual tokens: perplexity = exp(−(1/N) Σ_i ln p(token_i)), where N is the number of tokens and each p(token_i) is the probability the model assigned to the token that actually occurred at position i. Because it's an exponential of an average log-probability, it has an intuitive reading: a perplexity of 20 means the model was, on average, as uncertain as if it had been choosing uniformly among 20 possibilities at each position.
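The formula can be sketched in a few lines. The function name `summarize` and the three example probabilities are assumptions made for illustration; the inputs are taken to be natural-log probabilities already.

```python
import math

def summarize(log_probs):
    """Summary statistics from natural-log probabilities, one per token."""
    if not log_probs:
        raise ValueError("need at least one token")
    n = len(log_probs)
    total = sum(log_probs)   # ln of the whole-sequence probability
    average = total / n      # average log-probability per token
    return {
        "tokens": n,
        "total_log_prob": total,
        "avg_log_prob": average,
        "perplexity": math.exp(-average),
        "sequence_prob": math.exp(total),
    }

# Illustrative tokens with probabilities 0.5, 0.25 and 0.125.
stats = summarize([math.log(0.5), math.log(0.25), math.log(0.125)])
print(stats["perplexity"])     # approximately 4
print(stats["sequence_prob"])  # approximately 0.015625 (1/64)
```

The product of the three probabilities is 1/64, whose geometric mean per token is 1/4, so the model behaves as if it were choosing uniformly among 4 options at each position.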
The connection to cross-entropy is why perplexity is reported in two related ways. Cross-entropy measured in nats (natural log) or bits (log base 2) is the average surprise per token; perplexity is simply that quantity exponentiated, turning an additive information measure into a multiplicative "branching factor". Bits-per-token and perplexity carry the same information, since perplexity = 2^bits (equivalently e^nats), so you'll see both in papers and dashboards depending on convention. Lower cross-entropy means the probability mass landed on the right tokens, which means lower perplexity.
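The relationship between the two units can be verified numerically. This is a sketch with made-up input (four tokens, each predicted with probability 0.25); `cross_entropy` is a hypothetical helper name.

```python
import math

def cross_entropy(log_probs):
    """Return per-token cross-entropy as (nats, bits) from natural-log probs."""
    nats = -sum(log_probs) / len(log_probs)
    bits = nats / math.log(2)  # 1 nat = 1/ln(2) bits
    return nats, bits

nats, bits = cross_entropy([math.log(0.25)] * 4)

# Exponentiating in the matching base gives the same perplexity either way.
ppl_from_nats = math.exp(nats)
ppl_from_bits = 2.0 ** bits
print(bits)           # approximately 2 bits per token
print(ppl_from_nats)  # approximately 4
print(ppl_from_bits)  # approximately 4
```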
The important caveat is comparability. Perplexity depends on the tokenizer and on the exact text evaluated: a model that splits text into more, smaller tokens averages its log-probability over a larger token count than one using larger units, so the two per-token scores are not on the same scale. That makes cross-model perplexity comparisons valid only when the tokenizer and evaluation set are held fixed, which is why benchmark suites pin both. Used carefully, though, perplexity is an efficient, label-free way to track whether a model (or a fine-tune, or a quantization) is getting better or worse at modelling your data, long before you run more expensive task-specific evaluations.
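A toy calculation shows why the tokenizer matters. Suppose, purely for illustration, that two models assign the same total log-probability of -12 nats to the same text, but one tokenizer splits it into 4 tokens and the other into 6. The numbers are invented and not measured from any real model.

```python
import math

def perplexity_from_total(total_log_prob, n_tokens):
    """Per-token perplexity from a total natural-log probability."""
    return math.exp(-total_log_prob / n_tokens)

total_log_prob = -12.0  # same text, same overall probability (assumed)

ppl_a = perplexity_from_total(total_log_prob, n_tokens=4)  # exp(3)
ppl_b = perplexity_from_total(total_log_prob, n_tokens=6)  # exp(2)

print(round(ppl_a, 2))  # 20.09
print(round(ppl_b, 2))  # 7.39
```

Both hypothetical models found the text equally likely overall, yet their per-token perplexities differ by almost a factor of three, only because the token counts differ.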
Common use cases
- Model comparison. Rank checkpoints or fine-tunes on a held-out set when the tokenizer is fixed.
- Quantization checks. Measure how much perplexity rises after 8-bit or 4-bit quantization.
- Domain fit. See how well a base model predicts text from your specific domain before fine-tuning.
- Sanity checks. Turn a stream of API log-probs into an interpretable quality number.
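
For the comparison and quantization cases above, the quantity of interest is usually the relative change in perplexity over the same tokens. Here is a minimal sketch, assuming both log-prob lists are in natural log and were produced with the same tokenizer on the same text; the function name and the example values are hypothetical.

```python
import math

def perplexity_change(baseline_log_probs, candidate_log_probs):
    """Return (baseline_ppl, candidate_ppl, percent_change)."""
    if len(baseline_log_probs) != len(candidate_log_probs):
        # Different token counts suggest a different tokenizer or text.
        raise ValueError("both runs must score the same tokens")
    n = len(baseline_log_probs)
    base = math.exp(-sum(baseline_log_probs) / n)
    cand = math.exp(-sum(candidate_log_probs) / n)
    return base, cand, 100.0 * (cand - base) / base

# Invented values: the candidate (e.g. a quantized model) is slightly
# less confident about each of the same four tokens.
baseline = [math.log(0.5)] * 4
candidate = [math.log(0.4)] * 4

base, cand, pct = perplexity_change(baseline, candidate)
print(base, cand, pct)  # approximately 2.0, 2.5 and 25.0
```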