Precision, Recall, Accuracy, or F1 score?

Milena Afeworki
4 min read · Jul 19, 2021

You want to build a model that classifies the data points in your dataset into two or more classes, and you need to evaluate how well it performs. But which metric do you choose?

This blog demonstrates how to evaluate the performance of a model via the Accuracy, Precision, Recall, and F1-score metrics. In this experiment, I used a boosted Random Forest algorithm on a three-class problem: predicting the functionality of water wells in Tanzania as ‘Functional’, ‘Non-functional’, or ‘Functional needs repair’. The dataset is part of a competition that runs until Nov 1st, 2021. To keep this blog simple, I drop the ‘Functional needs repair’ class and treat the project as binary classification.
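As a rough sketch of that setup, here is how the binary framing might look in scikit-learn. The tiny DataFrame, its column names, and the GradientBoostingClassifier standing in for the boosted ensemble are all illustrative assumptions, not the actual competition pipeline:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical data: in the real project this comes from the competition CSVs.
wells = pd.DataFrame({
    "gps_height": [1390, 1399, 686, 263, 0, 0, 1062, 1182],
    "population": [109, 280, 250, 58, 0, 1, 25, 300],
    "status_group": ["functional", "functional", "non functional",
                     "non functional", "functional needs repair",
                     "functional", "non functional", "functional"],
})

# Drop the third class to make the problem binary
wells = wells[wells["status_group"] != "functional needs repair"]

X = wells[["gps_height", "population"]]
y = (wells["status_group"] == "functional").astype(int)  # 1 = ‘Functional’

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = GradientBoostingClassifier().fit(X_train, y_train)
y_pred = model.predict(X_test)
```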

Confusion Matrix.

Before going into the metrics let’s first look at a brief explanation of the ‘Confusion Matrix’. And yes, it can be a little confusing sometimes!

[Image: Confusion matrix. Correct predictions (TP, TN) in green; incorrect predictions (FP, FN) in red]

True Positives and True Negatives are the observations the model predicts correctly, shown in green in the image. False Positives and False Negatives, shown in red, are the cases where the model fails to capture the actual values. Now let’s clarify what each of these terms means and see how they contribute to the interpretation of our metrics.

True Positives (TP) — These are the positive values our model was able to correctly predict. For example, in this case, an actual ‘Functional’ well would be predicted and labeled as ‘Functional’ by our model.

True Negatives (TN) — These are the negative values our model was able to correctly predict. For example, if the actual class of a well says it is ‘Non-functional’ then the predicted label is also ‘Non-functional’.

False Positives and False Negatives occur when the actual class contradicts the predicted class.

False Positive (FP) — These are the values our model predicted as positive when they were actually negative. In our case, a well would be predicted as ‘Functional’ when it was actually ‘Non-functional’.

False Negative (FN) — These are the values that our model predicted as negative when they were actually positive. Taking our wells example once again, our model would have predicted an actually ‘Functional’ well as ‘Non-functional’.
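With scikit-learn, you can pull these four counts straight out of `confusion_matrix`. A tiny hand-made example (toy labels, not the real wells data) makes the layout concrete:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = ‘Functional’, 0 = ‘Non-functional’
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=2, TN=5, FP=1, FN=2
```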

Now that we understand these four parameters, let’s move on to see how they contribute to calculating Accuracy, Precision, Recall, and F1-score.

Accuracy — Accuracy is one metric for evaluating classification models whose classes are equally important. Informally, accuracy is the fraction of predictions our model got right. Formally, it is defined as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Although Accuracy is the most intuitive performance measure, it can be misleading, especially when there is class imbalance in the dataset. It is only reliable for roughly symmetric datasets, where the classes are balanced and False Positives and False Negatives occur at similar rates.
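A quick toy illustration of why: if 95 of 100 wells are ‘Non-functional’, a model that blindly predicts ‘Non-functional’ every time still scores 95% accuracy while never finding a single functional well:

```python
from sklearn.metrics import accuracy_score

y_true = [0] * 95 + [1] * 5   # 0 = ‘Non-functional’, 1 = ‘Functional’
y_pred = [0] * 100            # always predict ‘Non-functional’

print(accuracy_score(y_true, y_pred))  # 0.95, yet the model is useless
```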

Precision — can be defined as the share of predicted positive cases that are actually positive:

Precision = TP / (TP + FP)

In our case, Precision gives us a measure of how many of the wells our model predicted as ‘Functional’ were actually ‘Functional’.
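Reusing the toy labels from the confusion-matrix example above (TP=2, FP=1):

```python
from sklearn.metrics import precision_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Precision = TP / (TP + FP) = 2 / (2 + 1) ≈ 0.67
print(precision_score(y_true, y_pred))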

Recall — is the measure of how many of the actual positive cases our model was able to capture:

Recall = TP / (TP + FN)

It is important when the cost of a False Negative is high.
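On the same toy labels (TP=2, FN=2), recall comes out lower than precision, which shows the two metrics really do measure different failure modes:

```python
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Recall = TP / (TP + FN) = 2 / (2 + 2) = 0.5
print(recall_score(y_true, y_pred))
```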

F1-score — is the harmonic mean of Precision and Recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

It gives a better picture of the misclassified cases (False Positives and False Negatives) than Accuracy does. More importantly, it is the metric to reach for when a dataset has class imbalance and the costs of False Positives and False Negatives are very different.
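Finishing the toy example, `f1_score` computes the harmonic mean directly, and `classification_report` prints precision, recall, and F1 for every class at once, which is handy for a multi-class problem like the full wells dataset:

```python
from sklearn.metrics import f1_score, classification_report

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# F1 = 2 * (precision * recall) / (precision + recall)
#    = 2 * (0.67 * 0.5) / (0.67 + 0.5) ≈ 0.57
print(f1_score(y_true, y_pred))

# Per-class breakdown of all three metrics
print(classification_report(y_true, y_pred,
                            target_names=["Non-functional", "Functional"]))
```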

In conclusion, I hope this blog helps you make an informed decision about which metrics to consider whenever you build a classification model and test its performance.

