Logprobs Visualizer

Turn a model's logprobs into something you can read at a glance. Paste the logprobs object an LLM API returns, in either the OpenAI chat format or the legacy completions format, or paste plain text with one token and its logprob per line. You get a colour-coded heatmap in which green tokens are high-probability and red tokens are where the model hesitated, along with the per-token probabilities, the sequence perplexity, and a list of the least-confident tokens. It shows where a generation was uncertain, not just how confident it was overall, and it runs entirely in your browser.

How to use the Logprobs Visualizer

Paste the logprobs your API call returned. The tool understands the OpenAI chat shape (a content array of {token, logprob, top_logprobs}, whether or not it's nested under choices[0].logprobs), the legacy completions shape (parallel tokens and token_logprobs arrays), a bare array of token/logprob objects, or plain text with one token logprob per line. It converts each log-probability back to a probability with p = e^logprob and renders the tokens as a continuous strip, each shaded from red through green by its probability, with whitespace and newlines shown as visible glyphs so the layout stays readable.
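As a rough illustration of the normalisation step described above, the following Python sketch reduces each accepted shape to a list of (token, logprob) pairs. It is not the tool's own parser: the names parse_logprobs and to_probability are hypothetical, and the plain-text branch assumes the logprob is the last whitespace-separated field on each line.

```python
import json
import math


def parse_logprobs(text):
    """Return (token, logprob) pairs from any of the four input shapes.

    Hypothetical helper for illustration; not the tool's actual parser.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Plain text: assume the last whitespace-separated field is the logprob.
        pairs = []
        for line in text.splitlines():
            parts = line.rsplit(None, 1)
            if len(parts) == 2:
                pairs.append((parts[0], float(parts[1])))
        return pairs

    if isinstance(data, dict) and "choices" in data:
        data = data["choices"][0]["logprobs"]  # full API response
    if isinstance(data, dict) and "content" in data:
        data = data["content"]  # chat-style logprobs object
    if isinstance(data, dict) and "tokens" in data:
        # Legacy completions: two parallel arrays.
        return list(zip(data["tokens"], data["token_logprobs"]))
    if isinstance(data, list):
        # Bare array, or the chat content array unwrapped above.
        return [(item["token"], item["logprob"]) for item in data]
    raise ValueError("unrecognised logprobs shape")


def to_probability(logprob):
    """Convert a natural-log probability back to a probability."""
    return math.exp(logprob)
```

Reducing everything to one list of pairs first means the heatmap and the summary numbers only ever have to deal with a single structure, whichever format was pasted.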

Above the heatmap you get the summary numbers: the token count, the average log-probability, the total log-probability of the sequence, and the perplexity — the exponential of the negative average log-probability, which reads as the effective number of choices the model felt it had per token. Below it, the lowest-confidence tokens are pulled out so you can jump straight to the moments of hesitation. Hover any token to see its exact probability and, when the data includes them, the top alternative tokens the model considered at that position. Nothing leaves your browser, so it's safe to paste real API responses.
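The summary numbers can be reproduced in a few lines. This is a minimal sketch assuming the input is already a list of natural-log probabilities; the function name summarise, its worst parameter, and the dictionary keys are illustrative choices, not the tool's interface.

```python
import math


def summarise(logprobs, worst=3):
    """Compute the headline statistics for a sequence of token logprobs."""
    n = len(logprobs)
    total = sum(logprobs)  # log-probability of the whole sequence
    mean = total / n       # average log-probability per token
    # Indices of the least-confident tokens, lowest logprob first.
    lowest = sorted(range(n), key=lambda i: logprobs[i])[:worst]
    return {
        "tokens": n,
        "total_logprob": total,
        "mean_logprob": mean,
        "perplexity": math.exp(-mean),
        "lowest_confidence": lowest,
    }
```

Returning token positions rather than token strings keeps repeated tokens distinguishable, which matters when the same word is confident in one place and shaky in another.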

What logprobs tell you about a generation

When a language model emits a token it also knows the probability it assigned to that token, and many APIs will return it on request as a log-probability: the natural log of the probability. It is never positive, with values close to zero meaning near-certainty and large negative values meaning the model was reaching. APIs return logs rather than raw probabilities because the numbers involved span many orders of magnitude, and because adding logs is the convenient way to get the probability of a whole sequence without multiplying many small numbers together. Logprobs are the most direct, token-level window into a model's confidence that an API exposes.
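The point about adding logs is easy to see numerically. In this small sketch, with made-up values, multiplying the raw probabilities of a long sequence underflows to zero in double precision, while the sum of the logprobs stays an ordinary number.

```python
import math

# Illustrative data: 500 tokens, each with logprob -2.0 (p is about 0.135).
logprobs = [-2.0] * 500

# Multiplying raw probabilities: the running product underflows to 0.0.
product = 1.0
for lp in logprobs:
    product *= math.exp(lp)

# Adding logprobs: the sequence log-probability stays representable.
total = sum(logprobs)

print(product)  # 0.0
print(total)    # -1000.0
```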

Visualised as a heatmap, that stream of numbers becomes diagnostic. A confident generation is mostly green with the occasional amber token at genuine choice points — a name, a number, the start of a clause. A hallucination or a wrong turn often shows up as a cluster of red: the model committed to something it had little probability mass behind. This is why logprobs underpin so many practical techniques. Confidence filtering flags low-probability spans for review; perplexity, the exponentiated average negative log-probability, summarises how surprised the model was by the whole text in a single comparable number; and inspecting the top alternatives at each position shows what the model nearly said instead, which is invaluable when debugging classification prompts or constrained outputs.
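Confidence filtering, as described above, might look something like the following sketch. It groups consecutive tokens whose probability falls below a cutoff into spans for review. The function name and the 0.2 threshold are assumptions chosen for illustration; a sensible cutoff depends on the model and the task.

```python
import math


def low_confidence_spans(pairs, threshold=0.2):
    """Group consecutive low-probability tokens into (start, end) index spans.

    `pairs` is a list of (token, logprob); `end` is inclusive.
    """
    spans, current = [], []
    for index, (_token, logprob) in enumerate(pairs):
        if math.exp(logprob) < threshold:
            current.append(index)
        elif current:
            # A confident token closes the open span.
            spans.append((current[0], current[-1]))
            current = []
    if current:
        spans.append((current[0], current[-1]))
    return spans


# Made-up example: two adjacent shaky tokens form one span.
example = [("The", -0.05), (" capital", -0.2), (" is", -0.1),
           (" Lyon", -2.5), (" ville", -3.0), (".", -0.1)]
print(low_confidence_spans(example))  # [(3, 4)]
```

Grouping into spans rather than flagging single tokens matches how the red clusters appear in the heatmap: a run of low-probability tokens is usually one decision worth one look.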

The important caveat is that confidence is not correctness. A model can be fluently, greenly confident and still be wrong, and it can be redly uncertain about a token that turns out fine — logprobs measure the model's own probability estimates, not ground truth. But within those limits they are one of the cheapest and most useful signals available: no extra model calls, no labels, just the numbers the generation already produced. Reading them well — spotting where the red appears, comparing perplexity across prompts, checking the alternatives at a fork — is a core skill for anyone evaluating or debugging LLM output, and this tool exists to make that reading immediate.

Common use cases

  • Spotting hallucinations. Find the red clusters where a model committed to low-probability tokens.
  • Prompt debugging. See which tokens a prompt makes the model unsure about, and what it nearly said instead.
  • Confidence filtering. Identify low-confidence spans worth flagging for human review.
  • Comparing outputs. Use perplexity to rank generations or prompts on how confident the model was.

Frequently asked questions

What logprobs formats does it accept?

The OpenAI chat format (a content array of {token, logprob, top_logprobs}, nested under choices[0].logprobs or not), the legacy completions format (parallel tokens and token_logprobs arrays), a bare array of {token, logprob} objects, and plain text with one token and its logprob per line, separated by whitespace.

How is a log-probability turned into a probability?

By exponentiating: p = e^logprob. Logprobs are natural logs, so a logprob of 0 is probability 1 (certainty), −0.69 is about 50%, and −4.6 is about 1%. The heatmap colours each token by this probability, green for high and red for low.
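One simple way to turn that probability into a red-to-green shade is to interpolate the hue of an HSL colour. The sketch below assumes a linear scale from hue 0 (red) to hue 120 (green) with fixed saturation and lightness. The tool's actual colour scale is not specified here, so treat the function name and every constant as hypothetical.

```python
import math


def token_colour(logprob):
    """Map a logprob to a CSS HSL colour: red for p near 0, green for p near 1."""
    p = min(1.0, max(0.0, math.exp(logprob)))  # clamp against stray positive logprobs
    hue = round(120 * p)                       # 0 = red, 60 = amber, 120 = green
    return f"hsl({hue}, 70%, 45%)"


print(token_colour(0.0))    # hsl(120, 70%, 45%)
print(token_colour(-0.69))  # hsl(60, 70%, 45%)
print(token_colour(-4.6))   # hsl(1, 70%, 45%)
```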

What is perplexity here?

Perplexity is the exponential of the negative average log-probability across the tokens: exp(−mean logprob). It reads as the effective number of equally-likely options the model felt it was choosing between per token — lower is more confident. It is a single-number summary of the whole sequence.
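The "equally-likely options" reading follows directly from the definition. As a short worked case, suppose every one of the N tokens was chosen with the same probability 1/k (the symbols N, k and p_i are introduced here for illustration):

```latex
\mathrm{PPL}
  = \exp\!\Big(-\frac{1}{N}\sum_{i=1}^{N}\log p_i\Big),
\qquad
p_i = \frac{1}{k}\ \text{for all } i
\;\Longrightarrow\;
\mathrm{PPL} = \exp\!\big(-\log\tfrac{1}{k}\big) = \exp(\log k) = k .
```

So a perplexity of 4 means the model was, on average, as unsure as if it had been picking uniformly among four tokens at each step, and a perplexity of 1 means it assigned probability 1 to every token it emitted.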

Does high confidence mean the output is correct?

No. Logprobs reflect the model's own probability estimates, not ground truth. A model can be confidently wrong or needlessly uncertain about a correct token. Use confidence as a signal for where to look, not as a guarantee of accuracy.

Is my data uploaded?

No. Parsing, the probability conversion and all rendering happen in your browser. You can safely paste real API responses — nothing is sent to a server or stored.