Example of Precision-Recall metric to evaluate classifier output quality.
Precision-Recall is a useful measure of success of prediction when the classes are very imbalanced. In information retrieval, precision is a measure of the fraction of relevant items among actually returned items, while recall is a measure of the fraction of items that were returned among all items that should have been returned. ‘Relevancy’ here refers to items that are positively labeled, i.e., true positives and false negatives.
Precision (\(P\)) is defined as the number of true positives (\(T_p\)) over the number of true positives plus the number of false positives (\(F_p\)).
\[P = \frac{T_p}{T_p+F_p}\]
Recall (\(R\)) is defined as the number of true positives (\(T_p\)) over the number of true positives plus the number of false negatives (\(F_n\)).
\[R = \frac{T_p}{T_p + F_n}\]
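As a minimal sketch of the two definitions above, the snippet below counts true positives, false positives and false negatives by hand. The labels and predictions are made up for illustration and do not come from the example dataset.

```python
import numpy as np

# Hypothetical ground-truth labels and hard predictions (illustration only).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

precision = tp / (tp + fp)  # P = T_p / (T_p + F_p)
recall = tp / (tp + fn)  # R = T_p / (T_p + F_n)

print(f"TP={tp} FP={fp} FN={fn}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these toy arrays there are 3 true positives, 1 false positive and 1 false negative, so both precision and recall come out at 0.75.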
The precision-recall curve shows the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision. High precision is achieved by having few false positives in the returned results, and high recall is achieved by having few false negatives in the relevant results. High scores for both show that the classifier is returning accurate results (high precision), as well as returning a majority of all relevant results (high recall).
A system with high recall but low precision returns most of the relevant items, but the proportion of returned results that are incorrectly labeled is high. A system with high precision but low recall is just the opposite, returning very few of the relevant items, but most of its predicted labels are correct when compared to the actual labels. An ideal system with high precision and high recall will return most of the relevant items, with most results labeled correctly.
The definition of precision (\(\frac{T_p}{T_p + F_p}\)) shows that lowering the threshold of a classifier may increase the denominator, by increasing the number of results returned. If the threshold was previously set too high, the new results may all be true positives, which will increase precision. If the previous threshold was about right or too low, further lowering the threshold will introduce false positives, decreasing precision.
Recall is defined as \(\frac{T_p}{T_p+F_n}\), where \(T_p+F_n\) does not depend on the classifier threshold. Changing the classifier threshold can only change the numerator, \(T_p\). Lowering the classifier threshold may increase recall, by increasing the number of true positive results. It is also possible that lowering the threshold may leave recall unchanged, while the precision fluctuates. Thus, precision does not necessarily decrease with recall.
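The threshold behaviour described above can be checked on a small example. The sketch below uses hypothetical scores and labels (not the iris data) and a helper, `precision_recall_at`, that is introduced here only for illustration.

```python
import numpy as np

# Hypothetical labels and classifier scores, sorted by decreasing score.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])


def precision_recall_at(threshold):
    """Illustrative helper: precision and recall when predicting positive
    for every sample whose score is at least `threshold`."""
    y_pred = y_score >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)


# Lower the threshold step by step and watch both metrics.
thresholds = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
results = [precision_recall_at(t) for t in thresholds]
for t, (p, r) in zip(thresholds, results):
    print(f"threshold={t:.1f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, recall never decreases as the threshold is lowered, while precision drops from 1.00 to 0.50 between the first two thresholds and then rises again, which is the fluctuation the paragraph refers to.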
The relationship between recall and precision can be observed in the stairstep area of the plot: at the edges of these steps, a small change in the threshold considerably reduces precision, with only a minor gain in recall.
Average precision (AP) summarizes such a plot as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight:
\[\text{AP} = \sum_n (R_n - R_{n-1}) P_n\]
where \(P_n\) and \(R_n\) are the precision and recall at the nth threshold. A pair \((R_k, P_k)\) is referred to as an operating point.
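As a rough check of the AP formula, the sum can be evaluated directly on the operating points returned by `precision_recall_curve` and compared with `average_precision_score`. The labels and scores below are hypothetical and chosen only to keep the example small.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Hypothetical labels and scores (illustration only).
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])

precision, recall, _ = precision_recall_curve(y_true, y_score)

# precision_recall_curve returns recall in decreasing order, so the recall
# increments (R_n - R_{n-1}) are the negated successive differences.
ap_manual = -np.sum(np.diff(recall) * precision[:-1])
ap_sklearn = average_precision_score(y_true, y_score)

print(f"manual AP:  {ap_manual:.4f}")
print(f"sklearn AP: {ap_sklearn:.4f}")
```

The two values should agree up to floating-point error, since each step in recall is weighted by the precision at the same threshold.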
AP and the trapezoidal area under the operating points (sklearn.metrics.auc) are common ways to summarize a precision-recall curve that lead to different results. Read more in the User Guide.
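To see that the two summaries can differ, the sketch below computes both on the same hypothetical labels and scores. The trapezoidal rule interpolates linearly between operating points, whereas AP does not interpolate.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Hypothetical labels and scores (illustration only).
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2])

precision, recall, _ = precision_recall_curve(y_true, y_score)

# Step-wise summary: no interpolation between operating points.
ap = average_precision_score(y_true, y_score)

# Trapezoidal summary: linear interpolation between operating points.
trapezoid_area = auc(recall, precision)

print(f"average precision: {ap:.4f}")
print(f"trapezoidal area:  {trapezoid_area:.4f}")
```

On this toy data the two numbers are not equal, so it is worth stating which summary is being reported.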
Precision-recall curves are typically used in binary classification to study the output of a classifier. In order to extend the precision-recall curve and average precision to multi-class or multi-label classification, it is necessary to binarize the output. One curve can be drawn per label, but one can also draw a precision-recall curve by considering each element of the label indicator matrix as a binary prediction (micro-averaging).
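A small sketch of what micro-averaging means in practice: the multi-class labels are binarized into a label indicator matrix, and every element of that matrix is then treated as one binary prediction. The labels are hypothetical and the scores are random placeholders, used only to show the mechanics.

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import label_binarize

# Hypothetical multi-class labels and placeholder scores (illustration only).
y = np.array([0, 1, 2, 2, 1, 0])
Y = label_binarize(y, classes=[0, 1, 2])  # label indicator matrix, shape (6, 3)
rng = np.random.RandomState(0)
scores = rng.rand(6, 3)  # one score per (sample, class) pair

# Micro-averaging over the indicator matrix ...
micro_ap = average_precision_score(Y, scores, average="micro")
# ... matches flattening both arrays and scoring a single binary problem.
flat_ap = average_precision_score(Y.ravel(), scores.ravel())

print(Y)
print(f"micro AP: {micro_ap:.4f}  flattened AP: {flat_ap:.4f}")
```

Both calls should give the same value, since flattening the indicator matrix is what treats each element as its own binary prediction.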
In binary classification settings#
Dataset and model#
We will use a Linear SVC classifier to differentiate two types of irises.
import numpy as np

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Add noisy features
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.concatenate([X, random_state.randn(n_samples, 200 * n_features)], axis=1)

# Limit to the two first classes, and split into training and test
X_train, X_test, y_train, y_test = train_test_split(
    X[y < 2], y[y < 2], test_size=0.5, random_state=random_state
)
Linear SVC will expect each feature to have a similar range of values. Thus, we will first scale the data using a StandardScaler.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

classifier = make_pipeline(StandardScaler(), LinearSVC(random_state=random_state))
classifier.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()), ('linearsvc', LinearSVC(random_state=RandomState(MT19937) at 0x7FCE95226240))])
Plot the Precision-Recall curve#
To plot the precision-recall curve, you should use PrecisionRecallDisplay. There are two methods available, depending on whether you have already computed the predictions of the classifier.
Let’s first plot the precision-recall curve without the classifier predictions. We use from_estimator, which computes the predictions for us before plotting the curve.
from sklearn.metrics import PrecisionRecallDisplay

display = PrecisionRecallDisplay.from_estimator(
    classifier, X_test, y_test, name="LinearSVC", plot_chance_level=True
)
_ = display.ax_.set_title("2-class Precision-Recall curve")
If we have already computed the estimated probabilities or scores for our model, then we can use from_predictions.
y_score = classifier.decision_function(X_test)

display = PrecisionRecallDisplay.from_predictions(
    y_test, y_score, name="LinearSVC", plot_chance_level=True
)
_ = display.ax_.set_title("2-class Precision-Recall curve")
In multi-label settings#
The precision-recall curve does not support the multilabel setting. However, one can decide how to handle this case. We show such an example below.
Create multi-label data, fit, and predict#
We create a multi-label dataset to illustrate precision-recall in multi-label settings.
from sklearn.preprocessing import label_binarize

# Use label_binarize to be multi-label like settings
Y = label_binarize(y, classes=[0, 1, 2])
n_classes = Y.shape[1]

# Split into training and test
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.5, random_state=random_state
)
We use OneVsRestClassifier for multi-label prediction.
from sklearn.multiclass import OneVsRestClassifier

classifier = OneVsRestClassifier(
    make_pipeline(StandardScaler(), LinearSVC(random_state=random_state))
)
classifier.fit(X_train, Y_train)
y_score = classifier.decision_function(X_test)
The average precision score in multi-label settings#
from sklearn.metrics import average_precision_score, precision_recall_curve

# For each class
precision = dict()
recall = dict()
average_precision = dict()
for i in range(n_classes):
    precision[i], recall[i], _ = precision_recall_curve(Y_test[:, i], y_score[:, i])
    average_precision[i] = average_precision_score(Y_test[:, i], y_score[:, i])

# A "micro-average": quantifying score on all classes jointly
precision["micro"], recall["micro"], _ = precision_recall_curve(
    Y_test.ravel(), y_score.ravel()
)
average_precision["micro"] = average_precision_score(Y_test, y_score, average="micro")
Plot the micro-averaged Precision-Recall curve#
from collections import Counter

display = PrecisionRecallDisplay(
    recall=recall["micro"],
    precision=precision["micro"],
    average_precision=average_precision["micro"],
    prevalence_pos_label=Counter(Y_test.ravel())[1] / Y_test.size,
)
display.plot(plot_chance_level=True)
_ = display.ax_.set_title("Micro-averaged over all classes")
Plot Precision-Recall curve for each class and iso-f1 curves#
from itertools import cycle

import matplotlib.pyplot as plt

# setup plot details
colors = cycle(["navy", "turquoise", "darkorange", "cornflowerblue", "teal"])

_, ax = plt.subplots(figsize=(7, 8))

f_scores = np.linspace(0.2, 0.8, num=4)
lines, labels = [], []
for f_score in f_scores:
    x = np.linspace(0.01, 1)
    y = f_score * x / (2 * x - f_score)
    (l,) = plt.plot(x[y >= 0], y[y >= 0], color="gray", alpha=0.2)
    plt.annotate("f1={0:0.1f}".format(f_score), xy=(0.9, y[45] + 0.02))

display = PrecisionRecallDisplay(
    recall=recall["micro"],
    precision=precision["micro"],
    average_precision=average_precision["micro"],
)
display.plot(ax=ax, name="Micro-average precision-recall", color="gold")

for i, color in zip(range(n_classes), colors):
    display = PrecisionRecallDisplay(
        recall=recall[i],
        precision=precision[i],
        average_precision=average_precision[i],
    )
    display.plot(ax=ax, name=f"Precision-recall for class {i}", color=color)

# add the legend for the iso-f1 curves
handles, labels = display.ax_.get_legend_handles_labels()
handles.extend([l])
labels.extend(["iso-f1 curves"])
# set the legend and the axes
ax.legend(handles=handles, labels=labels, loc="best")
ax.set_title("Extension of Precision-Recall curve to multi-class")

plt.show()
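The iso-f1 curves in the code above are drawn from the expression f_score * x / (2 * x - f_score). As a short derivation, assuming recall \(R\) is on the x-axis and precision \(P\) on the y-axis, fixing the F1 score at a value \(F\) and solving for precision gives:

\[F = \frac{2 P R}{P + R} \quad\Longrightarrow\quad F P + F R = 2 P R \quad\Longrightarrow\quad P = \frac{F R}{2 R - F}\]

This expression is positive only for \(R > F/2\), which is presumably why the plotting code keeps only the points with y >= 0.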