ML Fundamentals Cheat Sheet: Confusion Matrix, Accuracy, Precision, Recall, Sensitivity, Specificity, F score, Type I and Type II Errors, Precision-Recall Trade-Off, and ROC

A Whirlwind Tour of Classification Metrics

Shan Dou
6 min read · Dec 21, 2020

Confusion Matrix

The confusion matrix, as a visual tool, is a great jumping-off point for introducing classification metrics. How the row and column axes of the confusion matrix are defined can differ across domains. In this cheat sheet, we follow the convention used by scikit-learn's plot_confusion_matrix, where the rows are the ground truths (y_true) and the columns are the predictions (y_pred).

Figure 1: Example of a confusion matrix for a binary classifier

When a confusion matrix's off-diagonal elements are high, we have a confused classifier on our hands.
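As a quick illustration of this convention, here is a minimal sketch using sklearn.metrics.confusion_matrix with made-up toy labels:

```python
from sklearn.metrics import confusion_matrix

# Toy ground truths and predictions (made up for illustration)
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows follow y_true, columns follow y_pred:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]
#  [1 3]]
```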

Type I and Type II Errors

Making concepts as visceral as possible is a learning technique I try to apply whenever possible. One of the most memorable pieces of statistics humour on Type I and Type II errors is the pregnancy-test meme in Figure 2:

Figure 2: Great meme for internalizing type I and type II error concepts (image source)

Type I and Type II errors carry the same meaning here as they do in the context of hypothesis testing:

  • Type I error = false positive
  • Type II error = false negative

Accuracy

Accuracy can be a deceptively reasonable-sounding metric for classifiers: it is the proportion of correctly predicted labels among all predictions. However, because accuracy does not discriminate between positive and negative cases, it can lead to catastrophic failures when the prediction targets have imbalanced classes.

One great example to invoke is the spam filter. Spam is annoying but relatively rare. Assuming a true spam rate of 2%, we could build a dumb filter that simply labels every email as non-spam. In doing so, the filter is correct 98% of the time, which is equivalent to an impressive accuracy score of 0.98, yet this dumb filter is clearly useless.
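Here is a back-of-the-envelope sketch of that scenario, assuming 1,000 emails and the 2% spam rate above:

```python
from sklearn.metrics import accuracy_score, recall_score

# 1,000 emails with a 2% true spam rate (spam = 1, non-spam = 0)
y_true = [1] * 20 + [0] * 980

# A "dumb" filter that labels every email as non-spam
y_pred = [0] * 1000

print(accuracy_score(y_true, y_pred))  # 0.98 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0  -- recall (introduced below) exposes the flaw
```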

Therefore, accuracy is only useful when the classes are well balanced and we care equally about the positive and negative cases. I would even argue that accuracy is rarely useful; tellingly, scikit-learn's sklearn.metrics.classification_report centers on precision, recall, and F1-score rather than accuracy.

Precision

Precision addresses the question: “Among all the positive labels predicted by the model, how many are indeed positive?”

In other words, it reflects how many type I errors (false positives) the model makes.

By definition, precision is the proportion of correctly identified positive labels (TP) among all predicted positive labels (TP + FP). Because a low FP count yields high precision, precision is an excellent metric when minimizing false positives takes priority (e.g., a spam filter that misidentifies legitimate emails as spam). However, when positive cases are rare, precision alone is not enough to warn us against high false-negative counts.
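A minimal sketch of precision, reusing the toy labels from the confusion-matrix example above:

```python
from sklearn.metrics import confusion_matrix, precision_score

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fp))                   # 0.75 -- precision from its definition
print(precision_score(y_true, y_pred))  # 0.75 -- same number via scikit-learn
```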

Recall (=Sensitivity)

Recall addresses the question: “Among all the actual positive cases, how many are correctly captured by the model?”

In other words, it reflects how many type II errors (false negatives) the model makes.

By definition, recall is the proportion of all actual positive labels (TP + FN) that are correctly identified (TP). We can immediately see that if a classifier has a high false-negative count (e.g., the dumb spam filter that labels every email as non-spam), the recall score will reveal this flaw. The recall score is of particular interest when minimizing false negatives takes priority (e.g., screening for a fatal infectious disease; a false negative would send a patient home without timely treatment).
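And the matching sketch for recall on the same toy labels:

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp / (tp + fn))                # 0.75 -- recall from its definition
print(recall_score(y_true, y_pred))  # 0.75 -- same number via scikit-learn
```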

F1-Score

When we care about both precision and recall, the F1-score, which is the harmonic mean of precision and recall (refresher: the harmonic mean gives lower values a higher weighting), is the go-to metric for evaluating classifiers: F1 = 2 · precision · recall / (precision + recall).

The F-beta score is the weight-adjustable variant of the F1-score. As explained in the scikit-learn documentation:

The beta parameter determines the weight of recall in the combined score. beta < 1 lends more weight to precision, while beta > 1 favors recall (beta -> 0 considers only precision, beta -> +inf only recall).
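A small sketch of both scores, using made-up toy labels where precision (1.0) and recall (0.5) disagree so that the effect of beta is visible:

```python
from sklearn.metrics import f1_score, fbeta_score

# Toy labels: precision = 1.0 (no false positives), recall = 0.5 (half the positives missed)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]

print(f1_score(y_true, y_pred))               # 0.667 -- harmonic mean of 1.0 and 0.5
print(fbeta_score(y_true, y_pred, beta=0.5))  # 0.833 -- beta < 1 leans toward precision
print(fbeta_score(y_true, y_pred, beta=2.0))  # 0.556 -- beta > 1 leans toward recall
```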

Specificity

Specificity is the mirror image of recall (recall is also known as sensitivity): it tells us the proportion of correctly identified negative labels (TN) among all the negative labels (TN + FP). Specificity is also a key ingredient of the ROC curve covered in the next section: 1 - specificity (= FP / (TN + FP) = false positive rate) is the x-axis of the ROC curve.
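scikit-learn has no dedicated specificity function, but specificity can be computed directly from the confusion matrix (or, equivalently, as the recall of the negative class); a sketch on the same toy labels:

```python
from sklearn.metrics import confusion_matrix, recall_score

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn / (tn + fp))                             # 0.75 -- specificity = TN / (TN + FP)
print(recall_score(y_true, y_pred, pos_label=0))  # 0.75 -- i.e., recall of the negative class
print(fp / (tn + fp))                             # 0.25 -- false positive rate = 1 - specificity
```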

Intuition On Precision-Recall Trade-Off (or TPR-FPR Trade-Off)

Precision focuses on minimizing false positives, whereas recall focuses on minimizing false negatives. We cannot have the best of both: a trade-off exists between the two criteria. One useful mental image is to picture the positive and negative cases as two overlapping distributions:

Figure 3: Intuition of precision-recall trade-off illustrated with overlapping distributions and corresponding decision thresholds
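One way to see the trade-off in code is to sweep the decision threshold of a probabilistic classifier. The sketch below uses a synthetic dataset and logistic regression purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve

# Synthetic, imbalanced data standing in for a real problem
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

# Sweeping the decision threshold traces out the trade-off: raising the threshold
# typically boosts precision but lets more positives slip through (lower recall)
precision, recall, thresholds = precision_recall_curve(y_test, proba)
for p, r, t in list(zip(precision, recall, thresholds))[::20]:
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```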

The Receiver Operating Characteristic Curve (ROC)

The ROC curve can be viewed, roughly, as the PR curve rotated by 90 degrees (with recall now on the vertical axis) and then flipped horizontally (the analogy is not exact; the horizontal axis of the ROC curve is the false positive rate = 1 - specificity). Just like the precision-recall trade-off manifested by the PR curve, the ROC curve shows the trade-off between the true-positive rate and the false-positive rate: the more true positives we wish to capture, the more false positives we have to let in.

Why the odd name? The method was originally developed for operators of military radar receivers, hence “receiver operating characteristic.”
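A minimal sketch of tracing the curve with sklearn.metrics.roc_curve, using made-up scores:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Toy ground truths and predicted scores (made up for illustration)
y_true  = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.3, 0.6, 0.4, 0.35, 0.8, 0.9, 0.7]

# Each threshold yields one (FPR, TPR) point on the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr, tpr)))
print(roc_auc_score(y_true, y_score))  # 0.875 -- area under the curve
```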

How to choose between ROC curve and PR curve?

Since the ROC curve is so similar to the precision/recall (PR) curve, you may wonder how to decide which one to use. As a rule of thumb, you should prefer the PR curve whenever the positive class is rare or when you care more about the false positives than the false negatives. Otherwise, use the ROC curve. For example, looking at the previous ROC curve (and the ROC AUC score), you may think that the classifier is really good. But this is mostly because there are few positives (5s) compared to the negatives (non-5s). In contrast, the PR curve makes it clear that the classifier has room for improvement (the curve could be closer to the top-right corner).

— Geron, A., 2019, “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow”, 2nd Edition

How to remember this? One not-so-rigorous yet useful memory trick: what makes ROC less suitable in the above-mentioned “rare positives, high-stakes false positives” setting is that FP only ever appears next to TN on the ROC axes (x-axis = FP / (FP + TN); y-axis = TP / (TP + FN)). When negatives vastly outnumber positives, TN dwarfs FP and the false-positive rate barely moves. By contrast, FP stands on its own in the denominator of precision (precision = TP / (TP + FP)), which is small when positives are rare, so even a modest number of false positives drags precision down noticeably.
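A sketch of the rule of thumb above: on a heavily imbalanced synthetic dataset, the ROC AUC tends to look much more flattering than the PR-based average precision score (the exact numbers depend on the randomly generated data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# Heavily imbalanced synthetic data: roughly 1% positives, with some label noise
X, y = make_classification(n_samples=20000, weights=[0.99, 0.01],
                           flip_y=0.02, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]

# With rare positives, ROC AUC looks forgiving while average precision is far less so
print("ROC AUC:          ", roc_auc_score(y_test, proba))
print("Average precision:", average_precision_score(y_test, proba))
```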

Good Reads & Useful Resources

  1. Geron, A., 2019, “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow”, 2nd Edition
  2. Raschka, S. and Mirjalili, V., 2019, “Learning Best Practices for Model Evaluation and Hyperparameter Tuning”, Python Machine Learning — Third Edition
  3. Data School video, Nov 19, 2014, “ROC Curves and Area Under the Curve (AUC) Explained”
