Confusion Matrix Calculator
Paste your actual,predicted label pairs and get a full confusion matrix with the metrics that go with it: overall accuracy, per-class precision, recall and F1, and macro, micro and weighted averages. It works for binary or multi-class problems with any label names, builds the matrix automatically, and computes everything in your browser.
How to use the Confusion Matrix Calculator
Provide one example per line as the true label, a separator (comma, tab or semicolon), then the predicted label. For instance, spam,ham records an item that was really spam but was predicted as ham. The tool collects every distinct label, sorts them, and tallies each (actual, predicted) combination into the matrix, where the diagonal holds correct predictions and off-diagonal cells show exactly which classes are being mistaken for which. Cells are shaded so the mistakes stand out at a glance.
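The parsing and tallying steps above can be sketched in a few lines of Python. This is only an illustration, not the tool's actual code: the function name build_confusion_matrix is hypothetical, and skipping lines without a separator is an assumed behaviour. It also assumes labels do not themselves contain a comma, tab or semicolon.

```python
import re
from collections import Counter

def build_confusion_matrix(text):
    """Parse 'actual<sep>predicted' lines into (sorted labels, matrix).

    Hypothetical helper for illustration; rows are actual classes,
    columns are predicted classes.
    """
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the first comma, tab or semicolon only.
        parts = re.split(r"[,\t;]", line, maxsplit=1)
        if len(parts) != 2:
            continue  # assumption: lines without a separator are skipped
        pairs.append((parts[0].strip(), parts[1].strip()))

    # Every distinct label, from either column, in sorted order.
    labels = sorted({label for pair in pairs for label in pair})
    counts = Counter(pairs)
    matrix = [[counts[(actual, predicted)] for predicted in labels]
              for actual in labels]
    return labels, matrix

sample = "spam,spam\nspam,ham\nham,ham\nham;ham\nham\tspam\n"
labels, matrix = build_confusion_matrix(sample)
print(labels)  # ['ham', 'spam']
print(matrix)  # [[2, 1], [1, 1]]
```

Reading the result: the ham row is [2, 1], so two ham items were labelled correctly and one was mistaken for spam.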
Below the matrix you get the metrics. Precision for a class is the fraction of items predicted as that class that really belong to it; recall is the fraction of the actual members of that class that were found; F1 is their harmonic mean. The summary cards show overall accuracy and the three standard ways of averaging F1 across classes: macro (treat every class equally), weighted (weight by how many true examples each class has), and micro (pool all decisions together). This closely follows the breakdown in scikit-learn's classification report, computed instantly as you edit.
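As a rough sketch of those definitions, the per-class numbers can be read straight off the matrix: the diagonal cell is the true-positive count, the column sum is everything predicted as the class, and the row sum is everything that really belongs to it. The function name per_class_metrics and the choice to report 0.0 when a denominator is zero are assumptions made for this example, not a description of the tool's internals.

```python
def per_class_metrics(matrix):
    """Return precision, recall, F1 and support for each class.

    Assumes a square matrix with rows = actual, columns = predicted.
    """
    n = len(matrix)
    results = []
    for i in range(n):
        true_positives = matrix[i][i]
        predicted_total = sum(matrix[row][i] for row in range(n))  # column sum
        actual_total = sum(matrix[i])                              # row sum (support)

        # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
        precision = true_positives / predicted_total if predicted_total else 0.0
        recall = true_positives / actual_total if actual_total else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0

        results.append({"precision": precision, "recall": recall,
                        "f1": f1, "support": actual_total})
    return results

# Two classes: 10 actual members of class 0, 5 of class 1.
metrics = per_class_metrics([[8, 2],
                             [1, 4]])
for index, m in enumerate(metrics):
    print(index, round(m["precision"], 3), round(m["recall"], 3), round(m["f1"], 3))
```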
Reading a confusion matrix
A confusion matrix is a square table that lays a classifier's predictions against the truth. Each row is an actual class and each column a predicted class, so the cell at row i, column j counts the times something that was really class i got labelled class j. Correct predictions sit on the diagonal; everything off it is a specific, named mistake. That structure is far more informative than a single accuracy number, because it shows not just how often the model is wrong but how — whether it systematically confuses two similar classes, or floods one class with false positives.
From the matrix come the metrics that matter when accuracy alone misleads, which is most of the time on imbalanced data. Precision answers "when the model says this class, how often is it right?" and recall answers "of the real members of this class, how many did it catch?" — the two trade off against each other, and which you care about depends on whether false positives or false negatives are more costly. F1, their harmonic mean, is the usual single-number compromise. A spam filter wants high precision (don't junk real mail); a disease screen wants high recall (don't miss a case), and the confusion matrix makes that tension explicit per class.
Averaging across classes is where people slip up. Macro averaging treats every class as equally important regardless of size, so a rare class counts as much as a common one — good when minority classes matter. Weighted averaging scales each class by its support, so the result tracks overall performance on the data as it actually is distributed. Micro averaging pools all the true positives, false positives and false negatives before computing the metric, which for single-label problems equals overall accuracy. Reporting the right average for your goal — and showing the matrix alongside it — is what turns a model score into an honest description of behaviour.
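The three averages can be compared side by side with a short sketch. The function name averaged_f1 is hypothetical, and the example assumes a non-empty square matrix for a single-label problem in which every class is included. The imbalanced matrix is made up purely to show how far the macro and weighted figures can diverge.

```python
def averaged_f1(matrix):
    """Compute accuracy plus macro, weighted and micro F1.

    Assumes rows = actual, columns = predicted, and at least one example.
    """
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    f1_scores, supports = [], []
    tp_sum = fp_sum = fn_sum = 0

    for i in range(n):
        tp = matrix[i][i]
        fp = sum(matrix[row][i] for row in range(n)) - tp  # predicted i, actually other
        fn = sum(matrix[i]) - tp                           # actually i, predicted other
        denominator = 2 * tp + fp + fn
        # F1 written as 2TP / (2TP + FP + FN); 0.0 if undefined (assumption).
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
        supports.append(tp + fn)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn

    return {
        "accuracy": tp_sum / total,
        "macro": sum(f1_scores) / n,                        # every class counts equally
        "weighted": sum(f * s for f, s in zip(f1_scores, supports)) / total,
        "micro": 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum),  # pooled counts
    }

# Made-up imbalanced data: 100 examples of class 0, only 5 of class 1.
scores = averaged_f1([[95, 5],
                      [4, 1]])
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

On this data the macro figure falls well below the weighted one because the rare class, which the model mostly misses, counts for half of the macro average, while micro F1 matches accuracy.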
Common use cases
- Classifier evaluation. Turn a list of predictions and truths into a full metrics report.
- Error analysis. Spot which classes a model confuses so you know where to add data.
- Imbalanced data. Look past accuracy to per-class precision, recall and the right average.
- LLM classification. Score an LLM used as a labeller against a gold set.