BLEU and ROUGE Calculator

Score a generated text against a reference with the two classic metrics for machine translation and summarisation. Enter a candidate and a reference and get BLEU (cumulative BLEU-1 through BLEU-4, with n-gram precisions and the brevity penalty) and ROUGE (ROUGE-1, ROUGE-2 and ROUGE-L, each with precision, recall and F1). Useful for a quick, dependency-free check of n-gram overlap — and it all runs in your browser.

How to use the BLEU and ROUGE Calculator

Paste the candidate text your system generated and the reference you're comparing against, then choose whether to lowercase both first (recommended, so capitalisation doesn't count as a mismatch). The scores update live. BLEU is shown as cumulative BLEU-1 to BLEU-4, alongside the individual n-gram precisions and the brevity penalty itself; BLEU-4, the geometric mean of the 1- to 4-gram precisions multiplied by the brevity penalty, is the figure usually quoted. ROUGE-1, ROUGE-2 and ROUGE-L each report precision, recall and F1.

Tokenisation here is simple whitespace-and-punctuation splitting against a single reference, which is enough for a fast sanity check and for understanding how the metrics respond. For leaderboard-grade numbers, established implementations (sacreBLEU for BLEU, the official ROUGE package for ROUGE) add standardised tokenisation, multiple references and smoothing; treat this tool as an intuition-builder and a quick comparison rather than a substitute for those when you're publishing results.
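As a rough illustration of what whitespace-and-punctuation splitting can look like, here is a minimal Python sketch. The function name `tokenize`, the regular expression, and the choice to keep each punctuation mark as its own token are assumptions made for this example; the calculator's exact rules may differ.

```python
import re

def tokenize(text: str, lowercase: bool = True) -> list[str]:
    """Hypothetical tokeniser: words become tokens, each punctuation mark is its own token."""
    if lowercase:
        text = text.lower()
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches a single character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The cat sat, quietly."))
# ['the', 'cat', 'sat', ',', 'quietly', '.']
```

Standardised tools make different choices here (for example, how they treat hyphens, apostrophes or non-Latin scripts), which is one reason absolute scores vary between implementations.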

BLEU and ROUGE: precision vs. recall of n-grams

BLEU and ROUGE are the two metrics that dominated automatic evaluation of text generation for two decades — BLEU born for machine translation, ROUGE for summarisation — and they remain useful baselines even in the LLM era. Both work by counting overlapping n-grams (contiguous runs of n words) between a generated text and one or more references, but they emphasise opposite things, which is the key to reading them correctly.

BLEU is precision-oriented: of the n-grams the candidate produced, how many appear in the reference? It computes this for 1- to 4-grams, clips each n-gram's count so a word repeated more often than it appears in the reference can't inflate the score, and takes the geometric mean. On its own, precision would reward a system for emitting a single correct word and nothing else, so BLEU multiplies in a brevity penalty that punishes candidates shorter than the reference. The result is a score from 0 to 1 (often reported ×100) that rises when the candidate uses the right phrases and is roughly the right length.

ROUGE flips the emphasis to recall: of the n-grams in the reference, how many did the candidate cover? ROUGE-1 and ROUGE-2 do this for unigrams and bigrams, while ROUGE-L uses the longest common subsequence, rewarding words that appear in the same order even with gaps between them. That softer match is well suited to summarisation, where capturing the reference's content matters more than exact phrasing.
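The BLEU recipe described above (clipped n-gram precision, geometric mean, brevity penalty) can be sketched in a few lines of Python. This is a minimal single-reference, unsmoothed version written for illustration; the function names are hypothetical and it is not the calculator's source code or a replacement for sacreBLEU.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous runs of n tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(cand, ref, n):
    """Share of candidate n-grams found in the reference, with counts clipped."""
    cand_counts = Counter(ngrams(cand, n))
    ref_counts = Counter(ngrams(ref, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each n-gram is credited at most as many times as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total

def brevity_penalty(cand_len, ref_len):
    """1.0 when the candidate is at least as long as the reference, below 1.0 otherwise."""
    if cand_len == 0:
        return 0.0
    if cand_len >= ref_len:
        return 1.0
    return math.exp(1 - ref_len / cand_len)

def bleu(cand, ref, max_n=4):
    """Cumulative BLEU-max_n with uniform weights and no smoothing."""
    precisions = [clipped_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one zero precision zeroes the geometric mean
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(cand), len(ref)) * math.exp(log_mean)

ref = "the cat is on the mat".split()
print(clipped_precision("the the the the".split(), ref, 1))  # 0.5, since "the" is clipped to 2 of 4
cand = "the cat sat on the mat".split()
print(round(bleu(cand, ref, max_n=2), 4))  # 0.7071
print(bleu(cand, ref, max_n=4))            # 0.0, no 4-gram matches
```

The last line shows why smoothing matters on short texts: a single sentence with no matching 4-gram gets a BLEU-4 of zero even though most of its words are right.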

Their shared limitation is that they measure surface overlap, not meaning: a perfect paraphrase that reuses none of the reference's words scores zero, and a fluent-but-wrong sentence sharing many words can score well. That's why modern evaluation supplements them with embedding-based metrics (BERTScore), learned metrics (COMET), and model-graded judging. Still, BLEU and ROUGE are fast, deterministic, interpretable and free of any model dependency, which keeps them valuable for regression-testing a pipeline, comparing systems on the same data, and building intuition about how close a generation is to its target.
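A small example makes the surface-overlap limitation concrete. The sketch below uses a hypothetical helper, `unigram_f1`, with plain whitespace splitting; the sentences are invented for illustration.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over shared words (clipped counts), a simplified stand-in for ROUGE-1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # & keeps the minimum count of each word
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the film was excellent"
print(unigram_f1("a superb movie", reference))         # 0.0  (right meaning, no shared words)
print(unigram_f1("the film was terrible", reference))  # 0.75 (opposite meaning, 3 of 4 words shared)
```

The sentence that contradicts the reference outscores the one that agrees with it, which is the gap that meaning-aware metrics try to close.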

Common use cases

  • Translation & summarisation. Score outputs against references the way the literature does.
  • Regression testing. Catch quality drops in a generation pipeline with a fast, deterministic number.
  • Prompt comparison. See which prompt produces output closer to a gold answer.
  • Learning the metrics. Watch how brevity penalty and n-gram clipping change the score.

Frequently asked questions

What's the difference between BLEU and ROUGE?

BLEU is precision-oriented — how much of the candidate appears in the reference — with a brevity penalty to stop short outputs scoring well. ROUGE is recall-oriented — how much of the reference the candidate covers. BLEU was designed for translation, ROUGE for summarisation, and the two are often reported together.
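To see the precision/recall split in numbers, here is a minimal ROUGE-N sketch in Python. The function name `rouge_n` is hypothetical, and the code assumes pre-tokenised input and a single reference.

```python
from collections import Counter

def rouge_n(cand, ref, n=1):
    """Return (precision, recall, F1) over n-grams with clipped overlap counts."""
    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum((cand_counts & ref_counts).values())  # minimum count per n-gram
    precision = overlap / sum(cand_counts.values()) if cand_counts else 0.0
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

cand = "the cat".split()
ref = "the cat sat on the mat".split()
p, r, f = rouge_n(cand, ref, n=1)
print(round(p, 3), round(r, 3), round(f, 3))  # 1.0 0.333 0.5
```

Everything the two-word candidate says is in the reference, so precision is perfect, but it covers only two of the six reference words, so recall is low. BLEU handles the same situation through its brevity penalty rather than through recall.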

What does the brevity penalty do?

It scales BLEU down when the candidate is shorter than the reference, because precision alone would reward a system for emitting only a few high-confidence words. When the candidate is at least as long as the reference, the penalty is 1 (no effect).
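In symbols, using the standard BLEU definition with candidate length c, reference length r, n-gram precisions p_n and uniform weights over N = 4 orders:

```latex
\[
\mathrm{BP} =
\begin{cases}
1 & \text{if } c \ge r,\\[4pt]
\exp\!\left(1 - \dfrac{r}{c}\right) & \text{if } c < r,
\end{cases}
\qquad
\mathrm{BLEU} = \mathrm{BP}\cdot\exp\!\left(\frac{1}{N}\sum_{n=1}^{N}\ln p_n\right)
\]
```

For example, a 5-token candidate against a 10-token reference gets BP = exp(1 - 10/5) = exp(-1) ≈ 0.37, so even perfect n-gram precision would yield a BLEU of about 0.37.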

What is ROUGE-L?

ROUGE-L is based on the longest common subsequence — the longest sequence of words appearing in both texts in the same order, allowing gaps. It rewards correct ordering without requiring contiguous matches, which suits summarisation where phrasing varies but content order is preserved.
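A minimal sketch of the idea in Python, assuming pre-tokenised input and a plain F1 (equal weight on precision and recall); the function names are hypothetical, and official ROUGE implementations may weight the F-measure differently.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)  # extend the match ending just before both tokens
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a token from a or from b
        prev = curr
    return prev[-1]

def rouge_l(cand, ref):
    """Return (precision, recall, F1) based on the LCS length."""
    lcs = lcs_length(cand, ref)
    precision = lcs / len(cand) if cand else 0.0
    recall = lcs / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return precision, recall, f1

cand = "the cat quietly sat on the mat".split()
ref = "the cat sat on the mat".split()
p, r, f = rouge_l(cand, ref)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.857 1.0 0.923
```

The inserted word "quietly" breaks the contiguous n-gram match but not the subsequence: all six reference words still appear in order, so recall stays at 1.0.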

Will these match sacreBLEU or the official ROUGE exactly?

Not necessarily. This tool uses simple whitespace-and-punctuation tokenisation and a single reference, so absolute numbers can differ from standardised implementations that apply specific tokenisation, multiple references and smoothing. Use it for quick comparison and intuition; use sacreBLEU or the official ROUGE for published results.

Should I lowercase the text?

Usually yes — case-insensitive matching stops "The" and "the" counting as different tokens, which is the common convention. Turn it off if capitalisation is meaningful for your comparison.