Machine Learning Metrics - Precision, Recall, and Beyond
15th July 2018
When I was first taught machine learning, I was given some calculations for metrics. Accuracy was fairly obvious, but others such as Precision and Recall, or Sensitivity and Specificity, less so. These names didn't seem to carry much meaning, and I frequently mixed them up at first because I had no real-world context to associate each name with its function. In this blog post I intend to present them with some real-world context, to give them more meaning and make them easier to understand.
True and False, Positives and Negatives
At the core of understanding these metrics is an understanding of true and false positives and negatives. I will use a simple binary classifier, a spam filter, as a running example throughout this post to illustrate my points, so in this case the definitions would be as follows:
- True Positive - Spam that has been predicted as spam (correctly identified).
- False Positive (Type I error) - Not spam that has been predicted as spam (incorrectly identified).
- True Negative - Not spam that has been predicted as not spam (correctly rejected).
- False Negative (Type II error) - Spam that has been predicted as not spam (incorrectly rejected).
I have added the labels Type I and Type II error to give them their proper statistical names. The main pattern here is the relationship between the prediction and the actual designation. Our classifier is trying to find spam, so anything labelled as spam has been positively identified and is therefore a positive, with the true/false prefix depending on whether that positive identification is correct. If an email has been rejected as not spam then it is a negative result, and the true/false prefix indicates whether that rejection is correct.

I find this the easiest way to remember them: if a binary classifier labels something as the thing it is trying to find, it's always a positive; if it rejects it, it's always a negative. If it labels an email as spam that is not spam, that's a false positive; if it fails to label an email that is spam, that's a false negative. Just think of it backwards: the label first (positive/negative), then the correctness of that label (true/false).
With these terms clearly defined, we can use them to calculate the metrics that determine whether our spam filter is any good. I already mentioned Accuracy as the most obvious one, as in how accurate our classifier is. It is calculated by adding up the total number of correct predictions and dividing by the total number of predictions, given by the equation:

Accuracy = Correct Predictions / Total Predictions

Or more formally:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Where TP = True Positive, FP = False Positive, TN = True Negative, FN = False Negative.
Accuracy gives us the overall correctness, but it has its limits. For example, say we have a faulty spam filter that doesn't classify anything as spam, and that 2% of our email is spam with the remaining 98% genuine email.

This would give us an accuracy of 98%, and it would seem like the spam filter is great at its job, but it has missed all the spam. Unless our data are nearly balanced, accuracy doesn't tell us much. And while missing the spam would merely be inconvenient, there can be worse outcomes: predicting a fatal disease, for instance, or determining whether an athlete has used a banned substance. This raises the question of which is worse, a false positive or a false negative.
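To make this accuracy paradox concrete, here is a minimal sketch in Python; the counts mirror the 2%/98% example above, and the function name is my own:

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all predictions that were correct."""
    return (tp + tn) / (tp + fp + tn + fn)

# Faulty filter over 100 emails: nothing is ever flagged as spam,
# so the 98 genuine emails are true negatives and the 2 spam
# emails it missed are false negatives.
print(accuracy(tp=0, fp=0, tn=98, fn=2))  # 0.98
```

Despite catching no spam at all, the filter still scores 98% simply because the classes are so imbalanced.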
There is no easy answer, as it depends on the problem. One could be worse than the other, neither could be that bad, or both could be bad. In our spam filter example neither is really that harmful, but in predicting a potentially fatal disease a false negative could cost someone their life, while for an athlete tested for a banned substance a false positive could cost them a tournament and possibly their career. Clearly accuracy on its own is not enough, and we need to take into account other metrics that give us a more complete picture.
The Confusion Matrix
When assessing classifiers it helps to use a confusion matrix. This is simply a table, generated after testing, showing which data samples a classifier has predicted as positive or negative and whether those predictions are true or false. It can be visualised as such:

|                 | Predicted Spam | Predicted Not Spam |
|-----------------|----------------|--------------------|
| Actual Spam     | True Positive  | False Negative     |
| Actual Not Spam | False Positive | True Negative      |
Precision and Recall
Using a confusion matrix makes it easier to calculate two important metrics called precision and recall, which are defined as:

- Precision - how many of the predictions are relevant.
- Recall - how many of the relevant items were predicted.

Or as equations:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)
These both sound similar when defined like this, but think of them this way:
- Recall - how good our model is at capturing the samples it needs to.
- Precision - of those captured samples, how many are correct.
Highlighting the relevant sections of our confusion matrix makes the difference easier to see: precision only looks at the samples predicted as positive, while recall only looks at the samples that are actually positive.
Using our spam filter example where we predicted nothing as spam, precision measures how much of the predicted spam is actually spam, and recall tells us how much of the actual spam we have caught. With our faulty spam filter of 98% accuracy, where only 2% was actual spam, we have TP = 0, FP = 0, TN = 98, and FN = 2, so both metrics come out at zero (precision is strictly 0/0 here, which is conventionally treated as zero).

Now the fault becomes clear. Our seemingly brilliant 98% accurate spam filter does not catch any spam, and because it catches nothing it is not precise either: both precision and recall are zero.
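A small sketch of the two calculations, assuming the common convention that a zero denominator is reported as 0.0 (scikit-learn does the same with `zero_division=0`):

```python
def precision(tp, fp):
    """Of everything predicted as spam, how much really was spam."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Of all the actual spam, how much did we catch."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# The faulty filter never predicts spam: TP = 0, FP = 0, FN = 2.
print(precision(tp=0, fp=0))  # 0.0
print(recall(tp=0, fn=2))     # 0.0
```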
Let's look at an example where our spam filter isn't so faulty.

This time it has successfully caught some spam: out of 100 emails it has correctly classified 15 as spam and 70 as not spam, giving it a respectable accuracy of 85%. Now let's look at the precision and recall.

The precision is 1 (100%) because every email it classified as spam really was spam, but the recall is only 0.5 (50%) because it has caught only half of the total spam.
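Working through those numbers: the figures above pin the counts at TP = 15, FP = 0, TN = 70, FN = 15 (30 spam emails in total, half of them caught):

```python
tp, fp, tn, fn = 15, 0, 70, 15  # example one: 30 spam emails, half caught

accuracy = (tp + tn) / (tp + fp + tn + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(accuracy, precision, recall)  # 0.85 1.0 0.5
```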
If we use another example where the false positives and false negatives are more equally distributed, and the accuracy is the same 85%, we get more balanced precision and recall results.
Sensitivity and Specificity
Recall also goes by another name, Sensitivity; it is the True Positive Rate, as in how many of the positive samples have been identified. This can be contrasted with a related metric, the True Negative Rate, known as Specificity: how many of the negative samples have been identified.

Specificity = TN / (TN + FP)
Applying this to our spam filter for both cases: in example one the specificity is 1, as we identified all of the negative samples (no genuine email was marked as spam), while in example two the specificity is 0.89, as we identified 89% of the negative samples.
An F1-score (also called F-score or F-measure) can be calculated from precision and recall as the harmonic mean of the two metrics. A harmonic mean is used so the score is not overly influenced by extreme values; to raise the F1-score we need both a high precision and a high recall. The equation is given as:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
Applying this to our two examples we get a near equal result, which is to be expected seeing as we had the same number of false results in both. We can see here that the 100% precision in example one is tempered, so it does not produce an extreme result.

Looking at these, it would seem to make no difference which model we chose, if we had to pick one over the other using the F1-score alone.
Matthews Correlation Coefficient
Another metric, considered by some to be more informative than the F1-score, is the Matthews Correlation Coefficient, which takes into account all of the true/false positives/negatives and returns a result in the range -1 (hopeless) to 1 (perfect). The equation is given by:

MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
Applying this to both examples, example one comes out ahead. This brings us back, yet again, to the question of which is worse, a false positive or a false negative. Example one would see half of the spam land in our inbox, whereas example two would see some spam in our inbox and some genuine email in our spam folder. Rather than going purely by the calculations, it also makes sense to relate the results back to the actual task. Personally I would prefer example one, because marking the false negatives as spam is less work than doing that and also going into the spam folder to mark the false positives as genuine.
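A sketch of the MCC calculation, using the example-one counts from earlier (TP = 15, FP = 0, TN = 70, FN = 15):

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient, in the range -1 to 1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventionally 0 when any marginal sum is zero
    return (tp * tn - fp * fn) / denom

print(round(mcc(15, 0, 70, 15), 3))  # 0.642
```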
ROC Curves and AUC
Also sometimes used to assess a model are ROC (Receiver Operating Characteristic) curves. These plot the True Positive Rate (Sensitivity) on the y-axis against the False Positive Rate (1 - Specificity) on the x-axis over the entire operating range, so as to find the best threshold for identifying a positive. There is a trade-off between the two rates: lowering the decision threshold catches more of the positives but also wrongly flags more of the negatives, while raising it does the reverse. The idea is to find the best trade-off between the two.
Different ROC curves can be compared using the Area Under the Curve (AUC), giving a result between 0 and 1, where higher is better and 0.5 corresponds to a random classifier.
As we weren't tuning anything in our examples above, I have randomly generated some dummy data for two example models to demonstrate. We can use the ROC curves to compare the two models as well as to find the ideal threshold. The red line represents a random classifier, and the purple and blue lines the two different classifiers. The blue line appears to cover more area than the purple and looks slightly better; calculating the AUC for both allows us to confirm this in greater detail.

As well as comparing the models, we can see that the ideal threshold on the blue line is at 0.6. As a rule we want the point nearest to the top left corner.
The code for generating this ROC curve in R is available on my GitHub account. I have used ggplot2 in this example, although there are dedicated R packages for generating ROC curves such as pROC and ROCR.
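The plots above were made in R, but the underlying calculation is small enough to sketch in Python. The labels and scores below are made-up dummy data; the curve is built by sweeping the threshold over every score and the area is computed with the trapezium rule:

```python
def roc_points(labels, scores):
    """Sweep the threshold over every distinct score, collecting (FPR, TPR) points."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for l, s in zip(labels, scores) if l == 1 and s >= threshold)
        fp = sum(1 for l, s in zip(labels, scores) if l == 0 and s >= threshold)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve via the trapezium rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

labels = [1, 1, 1, 0, 1, 0, 0, 0]        # 1 = spam, 0 = not spam
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
points = roc_points(labels, scores)
print(auc(points))  # 0.9375
```

A classifier that ranks every positive above every negative would score an AUC of 1, while shuffled scores would drift towards 0.5.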
Precision and Recall for multiple classes
So far we have only seen examples for binary classification, but what about when we have multiple classes? Both precision and recall can be calculated for multiple classes, but with a slight difference: instead of applying to the whole model, they are calculated for each class individually. For example, say we have three classes labelled A, B, and C. We would start with class A and calculate precision and recall much as in the binary case, but summing across the row for precision and down the column for recall, and then repeat for classes B and C.
The formal mathematical definition would be:

Precision_i = M_ii / Σ_j M_ij

Recall_i = M_ii / Σ_j M_ji

Where M is the confusion matrix, i is the row (the class being scored), and j runs over the columns.
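A sketch with a made-up 3×3 confusion matrix, laid out with rows as predicted classes and columns as actual classes, so that row sums give precision and column sums give recall:

```python
# Rows: predicted A, B, C; columns: actual A, B, C (made-up counts).
matrix = [
    [8, 1, 1],   # predicted A
    [2, 7, 0],   # predicted B
    [0, 2, 9],   # predicted C
]

def per_class_precision(m, i):
    """Diagonal cell over its row sum (everything predicted as class i)."""
    return m[i][i] / sum(m[i])

def per_class_recall(m, i):
    """Diagonal cell over its column sum (everything actually in class i)."""
    return m[i][i] / sum(row[i] for row in m)

for i, name in enumerate("ABC"):
    print(name, round(per_class_precision(matrix, i), 2),
          round(per_class_recall(matrix, i), 2))
```

The per-class scores can then be combined into a single figure by macro-averaging (a plain mean over the classes) if one overall number is needed.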
I hope this post has clarified these different metrics, how they differ, and why they are used. As I have mentioned throughout, which ones are most important to optimise depends entirely on the problem, so it is important to understand each one and the information it provides in order to select the right metrics for your model. Machine learning, like the statistical methods that underpin it, deals with uncertainty, so rarely will anything be straightforward; these tools give us a way to navigate that uncertainty and achieve the best results.