BLEU and ROUGE Calculator
Score a generated text against a reference with the two classic metrics for machine translation and summarisation. Enter a candidate and a reference and get BLEU (cumulative BLEU-1 through BLEU-4, with n-gram precisions and the brevity penalty) and ROUGE (ROUGE-1, ROUGE-2 and ROUGE-L, each with precision, recall and F1). Useful for a quick, dependency-free check of n-gram overlap — and it all runs in your browser.
How to use the BLEU and ROUGE Calculator
Paste the candidate text your system generated and the reference you're comparing against. Choose whether to lowercase first (recommended, so capitalisation doesn't count as a mismatch); the scores update live. BLEU is shown as cumulative BLEU-1 to BLEU-4, alongside the individual n-gram precisions and the brevity penalty itself. BLEU-4, the geometric mean of the 1- to 4-gram precisions multiplied by the brevity penalty, is the figure usually quoted. ROUGE-1, ROUGE-2 and ROUGE-L each report precision, recall and F1.
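For readers who prefer notation, the cumulative BLEU-N described above can be written out as follows. This is the textbook single-reference form with uniform weights and no smoothing; whether this tool smooths zero precisions is not stated, so read it as a reference formula rather than a specification. Here c is the candidate, r the reference, |c| and |r| their lengths in tokens, and G_n(c) the set of distinct n-grams in the candidate (symbols chosen for this sketch).

```latex
p_n = \frac{\sum_{g \in G_n(c)} \min\bigl(\mathrm{count}_c(g),\ \mathrm{count}_r(g)\bigr)}
           {\sum_{g \in G_n(c)} \mathrm{count}_c(g)}

\mathrm{BP} =
\begin{cases}
  1 & \text{if } |c| > |r| \\
  \exp\bigl(1 - |r| / |c|\bigr) & \text{otherwise}
\end{cases}

\mathrm{BLEU}\text{-}N = \mathrm{BP} \cdot \exp\Bigl(\frac{1}{N} \sum_{n=1}^{N} \ln p_n\Bigr)
```

Because the precisions are combined through a geometric mean, a single p_n of zero drives the unsmoothed score to zero, which is common for short texts at n = 4.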
Tokenisation here is simple whitespace-and-punctuation splitting, and each candidate is scored against a single reference. That is enough for a fast sanity check and for understanding how the metrics respond. For leaderboard-grade numbers, established implementations (sacreBLEU for BLEU, the official ROUGE package for ROUGE) add standardised tokenisation, multiple references and smoothing; treat this tool as an intuition-builder and a quick comparison rather than a substitute for those when you're publishing results.
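As an illustration of what whitespace-and-punctuation splitting can look like, here is a minimal Python sketch. The tool's exact rules are not documented here, so the `tokenize` function, its regular expression and the choice to keep punctuation marks as separate tokens are all assumptions made for this example.

```python
import re

def tokenize(text: str, lowercase: bool = True) -> list[str]:
    """Split text into word tokens and standalone punctuation marks.

    Hypothetical helper: an approximation of simple whitespace-and-punctuation
    splitting, not the tool's actual tokeniser.
    """
    if lowercase:
        text = text.lower()
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one punctuation character at a time.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The cat sat on the mat."))
# ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']
```

A splitter this simple breaks contractions apart ("don't" becomes three tokens), which is one reason standardised tokenisers matter when scores need to be comparable across papers.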
BLEU and ROUGE: precision vs. recall of n-grams
BLEU and ROUGE are the two metrics that dominated automatic evaluation of text generation for two decades — BLEU born for machine translation, ROUGE for summarisation — and they remain useful baselines even in the LLM era. Both work by counting overlapping n-grams (contiguous runs of n words) between a generated text and one or more references, but they emphasise opposite things, which is the key to reading them correctly.
BLEU is precision-oriented: of the n-grams the candidate produced, how many appear in the reference? It computes this for 1- to 4-grams, clips each n-gram's count so a word repeated more often than it appears in the reference can't inflate the score, and takes the geometric mean of the four precisions. On its own, precision would reward a system for emitting a single correct word and nothing else, so BLEU multiplies in a brevity penalty that punishes candidates shorter than the reference. The result is a score from 0 to 1 (often reported ×100) that rises when the candidate uses the right phrases and is roughly the right length.
ROUGE flips the emphasis to recall: of the n-grams in the reference, how many did the candidate cover? ROUGE-1 and ROUGE-2 do this for unigrams and bigrams, while ROUGE-L uses the longest common subsequence, rewarding words that appear in the same order even with gaps between them. That softer match is well suited to summarisation, where capturing the reference's content matters more than exact phrasing.
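The clipping, brevity penalty and longest-common-subsequence steps described above can be sketched in a few lines of Python. This is an illustrative version under stated assumptions (a single reference, pre-tokenised input, uniform weights, no smoothing), not the code behind this calculator; all function names are made up for the example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count every contiguous run of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap(cand, ref, n):
    """Return (clipped matches, candidate n-gram total, reference n-gram total)."""
    c, r = ngrams(cand, n), ngrams(ref, n)
    # Counter & Counter keeps the smaller count per n-gram: this is the clipping.
    return sum((c & r).values()), sum(c.values()), sum(r.values())

def clipped_precision(cand, ref, n):
    matched, cand_total, _ = overlap(cand, ref, n)
    return matched / cand_total if cand_total else 0.0

def brevity_penalty(cand_len, ref_len):
    if cand_len == 0:
        return 0.0
    return 1.0 if cand_len >= ref_len else math.exp(1 - ref_len / cand_len)

def bleu(cand, ref, max_n=4):
    """Cumulative BLEU up to max_n, uniform weights, no smoothing (assumption)."""
    precisions = [clipped_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # a single zero precision zeroes the geometric mean
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(cand), len(ref)) * math.exp(log_mean)

def prf(matched, cand_total, ref_total):
    """Precision, recall and F1 from a match count and the two totals."""
    p = matched / cand_total if cand_total else 0.0
    r = matched / ref_total if ref_total else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def rouge_n(cand, ref, n):
    return prf(*overlap(cand, ref, n))

def lcs_length(a, b):
    """Length of the longest common subsequence, one table row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(cand, ref):
    return prf(lcs_length(cand, ref), len(cand), len(ref))

cand = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
for n in range(1, 5):
    print(f"BLEU-{n}: {bleu(cand, ref, n):.4f}")
# BLEU-1: 0.8333
# BLEU-2: 0.7071
# BLEU-3: 0.5000
# BLEU-4: 0.0000
print("ROUGE-2:", tuple(round(x, 4) for x in rouge_n(cand, ref, 2)))  # (0.6, 0.6, 0.6)
print("ROUGE-L:", tuple(round(x, 4) for x in rouge_l(cand, ref)))     # (0.8333, 0.8333, 0.8333)
```

The example pair differs by one word, yet BLEU-4 drops to zero because no 4-gram survives the substitution. That is the behaviour smoothing is meant to soften in production implementations. Clipping is easy to see in isolation: `clipped_precision("the the the the".split(), "the cat".split(), 1)` gives 0.25, not 1.0.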
Their shared limitation is that they measure surface overlap, not meaning: a perfect paraphrase that reuses none of the reference's words scores poorly, and a fluent-but-wrong sentence sharing many words can score well. That's why modern evaluation supplements them with embedding-based metrics (BERTScore), learned metrics (COMET), and model-graded judging. Still, BLEU and ROUGE are fast, deterministic, interpretable and free of any model dependency, which keeps them valuable for regression-testing a pipeline, comparing systems on the same data, and building intuition about how close a generation is to its target.
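The surface-overlap limitation is easy to reproduce. The sketch below uses a bare unigram F1 (the idea behind ROUGE-1, with whitespace tokenisation) on three invented sentences; the `unigram_f1` helper and the example texts are made up for illustration.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over clipped unigram matches, lowercased and split on whitespace."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    matched = sum((Counter(cand) & Counter(ref)).values())
    if matched == 0:
        return 0.0
    precision, recall = matched / len(cand), matched / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "the drug reduced mortality in elderly patients"
paraphrase = "older people taking this medicine died less often"  # same meaning, no shared words
wrong = "the drug increased mortality in elderly patients"        # opposite meaning, one word changed

print(round(unigram_f1(paraphrase, reference), 3))  # 0.0
print(round(unigram_f1(wrong, reference), 3))       # 0.857
```

The faithful paraphrase scores zero while the sentence that reverses the finding scores about 0.86, which is the gap that embedding-based and learned metrics try to close.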
Common use cases
- Translation & summarisation. Score outputs against references the way the literature does.
- Regression testing. Catch quality drops in a generation pipeline with a fast, deterministic number.
- Prompt comparison. See which prompt produces output closer to a gold answer.
- Learning the metrics. Watch how brevity penalty and n-gram clipping change the score.