Confusion Matrix Calculator

Paste pairs of actual and predicted labels, one pair per line, to get a full confusion matrix with the metrics that go with it: overall accuracy; per-class precision, recall and F1; and macro, micro and weighted averages. It works for binary or multi-class problems with any label names, builds the matrix automatically, and computes everything in your browser.

Separators: comma, tab or semicolon. Labels can be any text.

How to use the Confusion Matrix Calculator

Provide one example per line as the true label, a separator (comma, tab or semicolon), then the predicted label. For instance, spam,ham records a message that was really spam but was predicted as ham. The tool collects every distinct label, sorts them, and tallies each (actual, predicted) combination into the matrix, where the diagonal holds correct predictions and off-diagonal cells show exactly which classes are being confused with which. Cells are shaded so the mistakes stand out at a glance.
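The parsing and tallying steps described above can be sketched in a few lines of Python. This is an illustration only, not the tool's actual code: the function names parse_pairs and build_matrix are hypothetical, and skipping malformed lines silently is an assumption.

```python
import re

def parse_pairs(text):
    """Parse 'actual<sep>predicted' lines; <sep> is a comma, tab or semicolon."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        parts = re.split(r"[,\t;]", line, maxsplit=1)
        if len(parts) != 2:
            continue  # assumption: malformed lines are skipped, not reported
        actual, predicted = (p.strip() for p in parts)
        pairs.append((actual, predicted))
    return pairs

def build_matrix(pairs):
    """Return (sorted labels, matrix) with rows = actual, columns = predicted."""
    labels = sorted({label for pair in pairs for label in pair})
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in pairs:
        matrix[index[actual]][index[predicted]] += 1
    return labels, matrix

text = "spam,spam\nspam,ham\nham,ham\nham;ham\nham\tspam"
labels, matrix = build_matrix(parse_pairs(text))
print(labels)  # ['ham', 'spam']
print(matrix)  # [[2, 1], [1, 1]]
```

Here the first row is actual ham (2 predicted ham, 1 predicted spam) and the second row is actual spam (1 predicted ham, 1 predicted spam), so the diagonal holds the three correct predictions.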

Below the matrix you get the metrics. Precision for a class is the fraction of items predicted as that class that really belong to it; recall is the fraction of the actual members of that class that were found; F1 is their harmonic mean. The summary cards show overall accuracy and the three standard ways of averaging F1 across classes: macro (treat every class equally), weighted (weight by how many true examples each class has), and micro (pool all decisions together). This closely mirrors the breakdown in scikit-learn's classification_report, computed instantly as you edit.
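As a rough sketch of how the per-class numbers fall out of the matrix, the snippet below computes precision, recall, F1 and support for a small two-class example. The function name per_class_metrics is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than something the tool is known to do.

```python
def per_class_metrics(labels, matrix):
    """Per-class precision, recall, F1 and support (rows = actual, cols = predicted)."""
    report = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]                                # diagonal cell
        predicted_total = sum(row[i] for row in matrix)  # column sum
        actual_total = sum(matrix[i])                    # row sum, i.e. support
        # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
        precision = tp / predicted_total if predicted_total else 0.0
        recall = tp / actual_total if actual_total else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": actual_total}
    return report

labels = ["cat", "dog"]
matrix = [[8, 2],   # actual cat: 8 predicted cat, 2 predicted dog
          [1, 9]]   # actual dog: 1 predicted cat, 9 predicted dog
report = per_class_metrics(labels, matrix)
for label in labels:
    print(label, {k: round(v, 3) for k, v in report[label].items()})
```

For cat, precision is 8/9 (nine items were predicted cat, eight correctly) and recall is 8/10 (ten items were really cat, eight were found).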

Reading a confusion matrix

A confusion matrix is a square table that lays a classifier's predictions against the truth. Each row is an actual class and each column a predicted class, so the cell at row i, column j counts the times something that was really class i got labelled class j. Correct predictions sit on the diagonal; everything off it is a specific, named mistake. That structure is far more informative than a single accuracy number, because it shows not just how often the model is wrong but how — whether it systematically confuses two similar classes, or floods one class with false positives.

From the matrix come the metrics that matter when accuracy alone misleads, which is most of the time on imbalanced data. Precision answers "when the model says this class, how often is it right?" and recall answers "of the real members of this class, how many did it catch?" — the two trade off against each other, and which you care about depends on whether false positives or false negatives are more costly. F1, their harmonic mean, is the usual single-number compromise. A spam filter wants high precision (don't junk real mail); a disease screen wants high recall (don't miss a case), and the confusion matrix makes that tension explicit per class.
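Written out for a single class c, with M_{ij} denoting the matrix cell at actual row i and predicted column j (the symbol M is introduced here only for illustration), the standard definitions are:

```latex
\[
\begin{aligned}
\mathrm{TP}_c &= M_{cc}, &
\mathrm{FP}_c &= \sum_{i \neq c} M_{ic}, &
\mathrm{FN}_c &= \sum_{j \neq c} M_{cj}, \\
\mathrm{Precision}_c &= \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c}, &
\mathrm{Recall}_c &= \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FN}_c}, &
F_{1,c} &= \frac{2\,\mathrm{Precision}_c\,\mathrm{Recall}_c}{\mathrm{Precision}_c + \mathrm{Recall}_c}.
\end{aligned}
\]
```

False positives for a class are read down its column and false negatives across its row, which is why a class can have excellent precision and poor recall, or the reverse.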

Averaging across classes is where people slip up. Macro averaging treats every class as equally important regardless of size, so a rare class counts as much as a common one, which is what you want when minority classes matter. Weighted averaging scales each class by its support (its number of true examples), so the result tracks overall performance on the data as it is actually distributed. Micro averaging pools the true positives, false positives and false negatives of all classes before computing the metric; for single-label problems every misclassification counts as one false positive and one false negative, so micro precision, recall and F1 all equal overall accuracy. Reporting the right average for your goal, and showing the matrix alongside it, is what turns a model score into an honest description of behaviour.
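To make the difference between the averages concrete, here is a minimal sketch that computes all three from one imbalanced matrix. The function name f1_averages is hypothetical, the matrix is assumed to be non-empty, and a class with no true positives, false positives or false negatives is assigned an F1 of 0.0 by assumption.

```python
def f1_averages(matrix):
    """Macro, weighted and micro F1 plus accuracy (rows = actual, cols = predicted)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    f1s, supports = [], []
    tp_sum = fp_sum = fn_sum = 0
    for c in range(n):
        tp = matrix[c][c]
        fp = sum(matrix[r][c] for r in range(n)) - tp  # rest of the column
        fn = sum(matrix[c]) - tp                       # rest of the row
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)   # assumed 0.0 when undefined
        supports.append(tp + fn)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    return {
        "macro": sum(f1s) / n,                                         # equal weight per class
        "weighted": sum(f * s for f, s in zip(f1s, supports)) / total, # weight by support
        "micro": 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum),          # pooled counts
        "accuracy": tp_sum / total,
    }

# 90 examples of the common class, 10 of the rare class.
matrix = [[90, 0],
          [8, 2]]
scores = f1_averages(matrix)
print({name: round(value, 3) for name, value in scores.items()})
# {'macro': 0.645, 'weighted': 0.895, 'micro': 0.92, 'accuracy': 0.92}
```

The rare class has an F1 of only 1/3, which drags the macro average down to about 0.65, while the weighted and micro figures stay near 0.9 because the common class dominates them.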

Common use cases

  • Classifier evaluation. Turn a list of predictions and truths into a full metrics report.
  • Error analysis. Spot which classes a model confuses so you know where to add data.
  • Imbalanced data. Look past accuracy to per-class precision, recall and the right average.
  • LLM classification. Score an LLM used as a labeller against a gold set.

Frequently asked questions

How do I format the input?

One example per line, with the true label first and the predicted label second, separated by a comma, tab or semicolon — for example "positive,negative". Labels can be any text (cat, 1, spam, class_3); the tool discovers them automatically and sorts them for the matrix.

What's the difference between macro, micro and weighted F1?

Macro averages the per-class F1 scores equally, so every class counts the same. Weighted averages them by how many true examples each class has, tracking overall performance on your distribution. Micro pools the true positives, false positives and false negatives across all classes before computing F1; for single-label classification it equals accuracy.

Which way round are rows and columns?

Rows are the actual (true) class and columns are the predicted class, matching the scikit-learn convention. The diagonal is correct predictions; a cell off the diagonal shows how many of one true class were predicted as another.

Why use precision and recall instead of accuracy?

On imbalanced data accuracy can be high while the model fails on the class you care about. Precision and recall describe per-class behaviour — how trustworthy a positive prediction is, and how many real positives are caught — so they reveal failures that a single accuracy figure hides.
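A small made-up example shows the effect. The labels and counts below are invented for illustration: a degenerate model that never predicts the rare class still scores 95% accuracy while catching none of the cases that matter.

```python
# Hypothetical data: 95 healthy patients, 5 sick ones.
actual = ["healthy"] * 95 + ["sick"] * 5
predicted = ["healthy"] * 100  # a model that never predicts "sick"

pairs = list(zip(actual, predicted))
accuracy = sum(a == p for a, p in pairs) / len(pairs)

tp = sum(a == "sick" and p == "sick" for a, p in pairs)  # sick, caught
fn = sum(a == "sick" and p != "sick" for a, p in pairs)  # sick, missed
sick_recall = tp / (tp + fn)

print(accuracy, sick_recall)  # 0.95 0.0
```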

Is my data uploaded?

No. Parsing and all metric calculations run entirely in your browser; nothing you paste leaves your machine.