Precision-Recall Curve
Evaluate model performance on imbalanced datasets by visualizing precision vs recall trade-offs
Use me when your dataset has far more negatives than positives — and a good-looking ROC curve is hiding how badly your model actually performs on the rare class you care about. I'm the honest evaluation tool for fraud detection, rare disease screening, anomaly detection, and any other problem where positives are precious and the baseline is easy to beat by simply predicting "no" every time.
Overview
A Precision-Recall (PR) curve plots Precision (the fraction of positive predictions that are actually positive) on the Y axis against Recall (the fraction of actual positives that were correctly found) on the X axis, sweeping across every possible classification threshold. Each point on the curve answers: "If I set my threshold here, how confident can I be in each positive prediction, and what proportion of real positives am I catching?"
The Average Precision (AP) summarises the curve as the weighted mean of precisions at each threshold, equivalent to the area under the PR curve:
- High AP (near 1.0) — model maintains high precision even at high recall; strong separator
- Low AP (near class prevalence) — model barely outperforms a random guesser that always predicts positive
The baseline for a PR curve is a horizontal line at the class prevalence (e.g., y = 0.1 if 10% of samples are positive). Any useful classifier must sit consistently above this line.
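As a concrete reference, the quantities above can be reproduced with scikit-learn. The dataset, model, and variable names below are illustrative placeholders rather than this tool's internal implementation; the sketch simply shows how the curve, the AP score, and the prevalence baseline are computed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset: roughly 5% positives (illustrative only)
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Any probabilistic classifier works; logistic regression keeps the sketch small
y_score = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# One (precision, recall) pair per candidate threshold
precision, recall, thresholds = precision_recall_curve(y_test, y_score)

# AP summarises the curve; the baseline is simply the positive-class prevalence
ap = average_precision_score(y_test, y_score)
baseline = y_test.mean()
print(f"AP = {ap:.3f}  (random baseline = {baseline:.3f})")
```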
Requires a trained model. This plot belongs to the evaluation category and uses training data. You must have a trained model node upstream in your pipeline before this plot can be generated.
Best used for:
- Evaluating classifiers on datasets with rare positive classes
- Choosing a threshold that balances precision against recall for your specific cost structure
- Comparing models in settings where false positives and false negatives have very different business costs
- Diagnosing whether a high ROC AUC is masking poor positive-class performance
- Communicating trade-offs between catching more cases vs. avoiding false alarms
Common Use Cases
Imbalanced Classification
- Fraud detection — fraudulent transactions may be < 0.1% of all transactions; ROC AUC can be high even when the model misses most fraud. AP reveals the real story.
- Medical screening for rare conditions — catching every positive case (high recall) may be mandatory, but generating too many false referrals (low precision) wastes clinical resources.
- Anomaly detection in manufacturing — defects are rare; precision tells you how many alarms require actual investigation.
- Content moderation — spam, abuse, or harmful content is a small fraction of all content; PR curves help tune the trade-off between over- and under-blocking.
Threshold Selection
Different applications have very different costs for false positives vs. false negatives. The PR curve lets you visualise every possible threshold and pick the operating point that fits your cost structure before deployment.
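One way to make this concrete is to score every candidate threshold against assumed per-error costs and keep the cheapest. The labels, scores, and the 10:1 cost ratio below are made-up values for the sketch, not recommendations.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Illustrative true labels and predicted positive-class probabilities
y_true  = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
y_score = np.array([.05, .10, .15, .20, .25, .30, .35, .40, .50, .60, .70, .75, .80, .90])

# Hypothetical cost structure: a missed positive (FN) costs 10x a false alarm (FP)
cost_fn, cost_fp = 10.0, 1.0

_, _, thresholds = precision_recall_curve(y_true, y_score)

# Evaluate the total cost at every candidate threshold and keep the cheapest
costs = []
for t in thresholds:
    pred = (y_score >= t).astype(int)
    fn = int(((y_true == 1) & (pred == 0)).sum())   # positives the model misses
    fp = int(((y_true == 0) & (pred == 1)).sum())   # false alarms it raises
    costs.append(cost_fn * fn + cost_fp * fp)

best = int(np.argmin(costs))
print(f"cost-optimal threshold = {thresholds[best]:.2f} (total cost = {costs[best]:.1f})")
```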
Model Comparison
Plot multiple models' PR curves on the same axes. The model with a curve consistently closer to the top-right corner — and a higher AP — better handles the positive class across all thresholds.
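A minimal matplotlib sketch of this comparison is shown below; the two model choices and the synthetic dataset are arbitrary stand-ins for whatever candidates you are evaluating.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.model_selection import train_test_split

# Illustrative imbalanced dataset and two candidate models
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in models.items():
    score = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    precision, recall, _ = precision_recall_curve(y_te, score)
    ap = average_precision_score(y_te, score)
    plt.plot(recall, precision, label=f"{name} (AP = {ap:.2f})")

# Baseline: horizontal line at the positive-class prevalence
plt.axhline(y_te.mean(), linestyle="--", color="grey", label="Random baseline")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
```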
Settings
Show Average Precision
Optional — Display the AP (area under the PR curve) score in the legend.
Default: On
When enabled, the AP score is appended to the trace name in the legend (e.g., PR Curve (AP = 0.74)). AP is a threshold-independent, single-number summary of positive-class performance and is the standard reporting metric for PR curve quality.
Interpreting the Precision-Recall Curve
Reading the Curve
Top-right corner (ideal): A model that achieves both high precision and high recall simultaneously sits near the top-right corner. In practice, there is always a trade-off — as recall increases (lower threshold), precision typically drops.
Baseline (horizontal dashed line): The baseline represents a classifier with no ranking ability, such as one that scores examples at random. Its precision equals the class prevalence at every recall level (e.g., 0.2 if 20% of samples are positive). Any curve above this line beats random chance; the higher above it, the better.
Steep drop-off: A curve that holds high precision until moderate recall, then drops sharply, indicates the model is highly confident about its top predictions but struggles to find the remaining positive cases without collecting many false positives.
The Precision-Recall Trade-off
Changing the classification threshold moves you along the PR curve in opposite directions:
| Threshold direction | Recall | Precision |
|---|---|---|
| Lower threshold (predict positive more often) | Increases | Tends to decrease |
| Higher threshold (predict positive less often) | Decreases | Tends to increase |
This fundamental trade-off means you cannot simply maximise both — you must choose based on which type of error is more costly in your application.
Choosing the Optimal Threshold
- Equal-cost criterion — pick the point closest to (Recall=1, Precision=1), the ideal corner.
- Precision-priority (few false alarms) — move to higher threshold; accept lower recall.
- Recall-priority (catch everything) — move to lower threshold; accept lower precision.
- F1-score maximum — find the point on the curve that maximises 2 × (P × R) / (P + R), as in the sketch below.
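Assuming predicted probabilities are available, the F1-maximum point can be located directly from the arrays returned by scikit-learn's `precision_recall_curve`; the labels and scores below are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Illustrative true labels and predicted positive-class probabilities
y_true  = np.array([0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1])
y_score = np.array([.02, .10, .20, .25, .30, .45, .50, .55, .60, .65, .70, .75, .80, .90])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Drop the final (precision=1, recall=0) point, which has no associated threshold
p, r = precision[:-1], recall[:-1]

# F1 = 2PR / (P + R); the small epsilon guards against division by zero
f1 = 2 * p * r / (p + r + 1e-12)
best = np.argmax(f1)
print(f"F1-optimal threshold = {thresholds[best]:.2f} "
      f"(P = {p[best]:.2f}, R = {r[best]:.2f}, F1 = {f1[best]:.2f})")
```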
Average Precision as a Single Number
AP is computed as the weighted mean of precisions at each recall level, weighted by the change in recall between consecutive thresholds. It is equivalent to the area under the PR curve and ranges from the class prevalence (worst) to 1.0 (best). AP is preferred over ROC AUC when reporting results on imbalanced benchmarks.
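To make the weighting explicit, the sketch below recomputes AP by hand from the `precision_recall_curve` output and checks it against scikit-learn's `average_precision_score` (toy data, illustrative only).

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true  = np.array([0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1])
y_score = np.array([.02, .10, .20, .25, .30, .45, .50, .55, .60, .65, .70, .75, .80, .90])

precision, recall, _ = precision_recall_curve(y_true, y_score)

# AP = sum over thresholds of (R_n - R_{n-1}) * P_n.
# recall is returned in decreasing order, so negate the differences.
ap_manual = -np.sum(np.diff(recall) * precision[:-1])

print(ap_manual, average_precision_score(y_true, y_score))  # the two values agree
```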
PR Curve vs ROC Curve
| Situation | Prefer |
|---|---|
| Balanced classes | ROC Curve |
| Highly imbalanced classes (rare positives) | PR Curve |
| You care about both classes equally | ROC Curve |
| You mostly care about the positive class | PR Curve |
| Comparing across datasets with different prevalence | PR Curve |
The key insight: on imbalanced data, a model can achieve high ROC AUC simply because it correctly classifies the abundant negative class. The PR curve ignores true negatives entirely, so it cannot be gamed this way.
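A quick synthetic experiment makes this concrete: on heavily imbalanced data, ROC AUC typically reads far more favourably than AP. The exact numbers depend on the random seed and class separability chosen below, both of which are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.model_selection import train_test_split

# Heavily imbalanced synthetic data: roughly 1% positives
X, y = make_classification(n_samples=20000, weights=[0.99, 0.01],
                           class_sep=0.8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

score = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# ROC AUC tends to look flattering here; AP exposes the positive-class weakness
print(f"ROC AUC = {roc_auc_score(y_te, score):.3f}")
print(f"AP      = {average_precision_score(y_te, score):.3f} "
      f"(prevalence = {y_te.mean():.3f})")
```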
Tips for Effective Use
- Always show the baseline — the horizontal line at class prevalence is your anchor. A curve that barely rises above it signals a model that has learned very little about the positive class.
- Report AP, not just the curve — stakeholders need a single number for model comparison. AP is the standard choice for imbalanced evaluation benchmarks.
- Cross-check with the ROC curve — if ROC AUC is high but AP is low, your model is doing well on the majority class but poorly on the minority class you care about.
- Use the F1-score to pick a threshold — for balanced precision/recall importance, the F1-maximum point on the PR curve is a principled default threshold.
- Consider class-weighted AP for multi-class problems — macro-average AP weights all classes equally; micro-average AP weights by class frequency. Choose based on whether rare classes matter as much as common ones (see the sketch after this list).
- Combine with the Confusion Matrix — once you select a threshold from the PR curve, validate TP/FP/FN counts in the Confusion Matrix to ensure the threshold performs as expected in absolute terms.
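As referenced in the class-weighted AP tip above, here is a sketch of macro- vs micro-averaged AP on an illustrative three-class problem with one rare class; the dataset and model are placeholders for your own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Illustrative 3-class problem with one rare class (~5% of samples)
X, y = make_classification(n_samples=6000, n_classes=3, n_informative=6,
                           weights=[0.7, 0.25, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)
y_bin = label_binarize(y_te, classes=[0, 1, 2])   # one-vs-rest indicator matrix

# Macro treats every class equally; micro pools decisions, so common classes dominate
print("macro AP:", average_precision_score(y_bin, proba, average="macro"))
print("micro AP:", average_precision_score(y_bin, proba, average="micro"))
```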
Related Visualizations
- ROC Curve — evaluates classifier quality across thresholds; more optimistic on imbalanced data; use alongside PR for a complete picture
- Confusion Matrix — shows the full error breakdown at a single chosen threshold
- SHAP Feature Impact — explains which features drive the classifier's positive-class predictions
- SHAP Dependence Plot — examines how individual feature values push predictions toward or away from the positive class