Confusion Matrix

Evaluate classification model performance

Overview

A confusion matrix is a performance measurement tool for machine learning classification models. It displays the number of correct and incorrect predictions broken down by each class, showing true positives, true negatives, false positives, and false negatives in a matrix format.

Best used for:

  • Evaluating classification model accuracy
  • Understanding which classes are confused with each other
  • Identifying bias in model predictions
  • Comparing performance across different models
  • Analyzing precision, recall, and F1-score by class
  • Detecting overfitting or underfitting patterns

Common Use Cases

Machine Learning & AI

  • Binary classification evaluation (spam/not spam, fraud/legitimate)
  • Multi-class classification assessment
  • Model comparison and selection
  • Hyperparameter tuning evaluation
  • Feature importance validation

Medical & Diagnostics

  • Disease detection accuracy
  • Test result validation (positive/negative)
  • Screening program effectiveness
  • Diagnostic tool comparison

Quality Control

  • Defect detection system evaluation
  • Automated inspection accuracy
  • Classification system validation
  • Process control monitoring

Understanding the Confusion Matrix

Binary Classification (2×2 Matrix)

                 Predicted Negative   Predicted Positive
Actual Negative          TN                   FP
Actual Positive          FN                   TP
  • True Positive (TP): Correctly predicted positive
  • True Negative (TN): Correctly predicted negative
  • False Positive (FP): Incorrectly predicted positive (Type I error)
  • False Negative (FN): Incorrectly predicted negative (Type II error)
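The four cells above can be counted directly from paired label lists. A minimal sketch in plain Python (the `y_true`/`y_pred` names and the example labels are illustrative, not part of any specific API):

```python
def binary_confusion(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary classifier."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # correctly predicted positive
            else:
                fp += 1  # Type I error: predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # Type II error: predicted negative, actually positive
            else:
                tn += 1  # correctly predicted negative
    return tp, tn, fp, fn

binary_confusion([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
# → (2, 2, 1, 1): two hits, two correct rejections, one FP, one FN
```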

Multi-Class Classification (N×N Matrix)

Each cell shows how many times class i was predicted as class j.

  • Diagonal: Correct predictions
  • Off-diagonal: Misclassifications
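The same counting idea extends to N classes: tally each (actual, predicted) pair into an N×N grid. A sketch under the assumption that labels are plain strings:

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Build an N x N matrix: row i = actual class, column j = predicted class."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(i, j)] for j in labels] for i in labels]

m = confusion_matrix(["cat", "dog", "cat", "bird"],
                     ["cat", "cat", "cat", "bird"],
                     labels=["cat", "dog", "bird"])
# m == [[2, 0, 0],   diagonal entries are correct predictions;
#       [1, 0, 0],   the 1 here means one actual "dog" was predicted "cat"
#       [0, 0, 1]]
```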

Key Metrics Derived

Accuracy

(TP + TN) / Total

Overall correctness of the model.

Precision

TP / (TP + FP)

Of all positive predictions, how many were correct?

Recall (Sensitivity)

TP / (TP + FN)

Of all actual positives, how many did we catch?

Specificity

TN / (TN + FP)

Of all actual negatives, how many were correctly identified?

F1-Score

2 × (Precision × Recall) / (Precision + Recall)

Harmonic mean of precision and recall.
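All five formulas above follow mechanically from the four cell counts. A sketch (the example counts are made up for illustration):

```python
def metrics(tp, tn, fp, fn):
    """Derive the standard metrics from the four confusion-matrix cells."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy":    (tp + tn) / total,
        "precision":   precision,
        "recall":      recall,
        "specificity": tn / (tn + fp),
        "f1":          2 * precision * recall / (precision + recall),
    }

m = metrics(tp=40, tn=50, fp=10, fn=0)
# accuracy 0.9, precision 0.8, recall 1.0, specificity ≈ 0.833, F1 ≈ 0.889
```

Note the division-by-zero edge cases: precision is undefined when the model predicts no positives at all (TP + FP = 0), and recall when there are no actual positives.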

Settings

Normalize

Optional - Display values as proportions instead of counts.

When enabled, shows percentages or proportions instead of raw counts, making it easier to compare models trained on different dataset sizes.

Options:

  • Off: Show raw counts
  • On: Show normalized values (0-1 or percentages)
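What the setting does amounts to dividing each cell by a total. A sketch of row-wise normalization (dividing each row by its sum, so each row of counts becomes proportions of that actual class):

```python
def normalize_rows(matrix):
    """Divide each row by its row sum; empty rows become zeros."""
    out = []
    for row in matrix:
        s = sum(row)
        out.append([v / s if s else 0.0 for v in row])
    return out

normalize_rows([[8, 2], [1, 9]])
# → [[0.8, 0.2], [0.1, 0.9]]  — each row now sums to 1.0
```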

Annotate Cells

Optional - Display values in each cell.

Shows the numerical value (count or percentage) in each cell of the matrix.

Default: On

Tips for Interpreting Confusion Matrices

  1. Focus on Off-Diagonal Values:

    • High off-diagonal values indicate confusion between classes
    • Look for systematic patterns in misclassification
    • Consider class similarity when evaluating errors
  2. Check Class Balance:

    • Imbalanced datasets can have misleading accuracy
    • Look at per-class metrics, not just overall accuracy
    • Consider using normalization for imbalanced data
  3. Understand Cost of Errors:

    • False positives vs false negatives have different costs
    • Medical: False negatives (missing disease) often worse
    • Spam: False positives (blocking real email) often worse
    • Adjust decision threshold based on cost
  4. Use Normalization Wisely:

    • Normalize by row (true class) to see recall per class
    • Normalize by column (predicted class) to see precision
    • Normalize by total to see overall distribution
  5. Compare Multiple Models:

    • Same confusion matrix format makes comparison easy
    • Look for improvements in specific error types
    • Consider which errors matter most for your application
  6. Combine with Other Metrics:

    • Confusion matrix shows details, but not the full picture
    • Use with ROC curves, precision-recall curves
    • Consider business metrics alongside statistical ones
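Tip 3's threshold adjustment can be sketched in a few lines: most classifiers output a score, and the cutoff you choose trades false positives against false negatives (the scores here are invented for illustration):

```python
def classify(scores, threshold):
    """Turn model scores into 0/1 labels at a given cutoff."""
    return [1 if s >= threshold else 0 for s in scores]

scores = [0.2, 0.4, 0.6, 0.9]
classify(scores, 0.5)  # → [0, 0, 1, 1]  stricter: fewer positives, more FN risk
classify(scores, 0.3)  # → [0, 1, 1, 1]  looser: more positives, more FP risk
```

Lowering the threshold is the usual move when false negatives are the expensive error (as in the medical example above); raising it suits the spam case.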

Example Scenarios

Binary Classification (Fraud Detection)

High recall is critical—missing fraud is costly.

Multi-Class Classification (Product Categories)

Shows which product categories are commonly confused.

Normalized Confusion Matrix

Easier to compare when classes have different frequencies.

Medical Diagnosis

False negatives (missing disease) are more serious than false positives.

When to Use Different Metrics

Use Accuracy When:

  • Classes are balanced
  • All errors have equal cost
  • You need a simple single number

Use Precision When:

  • False positives are costly
  • You want confidence in positive predictions
  • Examples: spam detection, fraud detection

Use Recall When:

  • False negatives are costly
  • You want to catch all positives
  • Examples: disease screening, security threats

Use F1-Score When:

  • You need balance between precision and recall
  • Classes are imbalanced
  • You want a single metric better than accuracy

Troubleshooting

Issue: Model has high accuracy but performs poorly

  • Solution: Check if dataset is imbalanced. A model predicting all "negative" could have 95% accuracy if 95% of data is negative. Look at per-class metrics.
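The accuracy trap described above is easy to reproduce with a toy dataset: a model that always predicts "negative" on data that is 95% negative (the numbers are illustrative):

```python
y_true = [0] * 95 + [1] * 5  # 95% negative, 5% positive
y_pred = [0] * 100           # degenerate model: always predicts "negative"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
# accuracy == 0.95, yet true_positives == 0: the model never finds a positive
```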

Issue: Can't see cell values clearly

  • Solution: Enable "Annotate Cells" setting. Consider using normalization if numbers are very large or very small.

Issue: Hard to compare models with different sample sizes

  • Solution: Enable "Normalize" to show proportions instead of raw counts. This makes models directly comparable.

Issue: Confusion between similar classes

  • Solution: This is normal when classes are similar (e.g., "cat" vs "dog"). Consider combining similar classes or improving features that distinguish them.

Issue: Perfect diagonal (all correct)

  • Solution: Might indicate overfitting, especially if validation performance is poor. Check if test data leaked into training.

Issue: Almost no true positives

  • Solution: Model might be biased toward negative class. Check class balance, try resampling, or adjust decision threshold.
