
Beginner's Guide to Precision and Recall in Machine Learning

This article provides an in-depth look at precision and recall, two critical metrics in machine learning. It explains their importance, how to calculate them, and when to prioritize one over the other, working through practical examples and the trade-offs involved in balancing the two, particularly on imbalanced datasets.

Accuracy is the primary metric used for assessing AI models. On its own, however, it paints an incomplete and sometimes misleading picture of a model's performance.

This limitation is often referred to as the “accuracy paradox”:

“In the framework of imbalanced data sets, accuracy is no longer a proper measure since it does not distinguish between the numbers of correctly classified examples of different classes. Hence, it may lead to erroneous conclusions.”
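To see the paradox in action, consider a quick hypothetical sketch: an imbalanced dataset with 95 negative and 5 positive samples, scored with scikit-learn (the library and the numbers are illustrative assumptions, not part of the quoted statement).

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced dataset: 95 negatives (0), 5 positives (1).
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority (negative) class.
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))  # 0.95 -- looks impressive
print(recall_score(y_true, y_pred))    # 0.0  -- yet it finds none of the positives
```

The 95% accuracy figure says nothing about the five positive cases the model missed entirely.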

This leaves us with precision and recall, which address this gap. They provide a deeper understanding of a model's abilities by breaking down its performance for each class. Depending on the specific problem, focusing on precision might be more beneficial in some cases, while recall takes precedence in others.

This article defines these metrics, explains how to calculate them, and covers their common use cases.

What Are Precision and Recall in Machine Learning?

Precision refers to the percentage of positive predictions that are correct. For instance, if a model predicts 100 samples as positive, and 80 of those are genuinely positive (while the other 20 are incorrectly predicted as positive), the precision would be 80%.

Let’s put it into perspective with a relatable example. Imagine you have an email spam filter that classifies incoming emails as either "spam" or "not spam." In one day, the filter flags 50 emails as spam. Out of these, 40 are actual spam emails, and 10 are legitimate emails that were mistakenly marked as spam.

Precision in this context would be the proportion of emails correctly identified as spam out of all the emails flagged as spam. Here’s how you calculate it:

  • Total emails flagged as spam: 50
  • Actual spam emails correctly identified: 40

The spam filter's precision is calculated by dividing the number of correctly identified spam emails (40) by the total number of emails flagged as spam (50), resulting in a precision of 0.8 or 80%.

This means that 80% of the emails the filter marked as spam were spam, reflecting the spam filter's precision.
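For completeness, here is the same arithmetic written out in code; the variable names are illustrative, and the formula precision = TP / (TP + FP) is the standard definition.

```python
# Spam-filter example: 50 emails flagged as spam, 40 of them actually spam.
true_positives = 40   # spam emails correctly flagged as spam
false_positives = 10  # legitimate emails mistakenly flagged as spam

# Precision = TP / (TP + FP): of everything flagged as spam, how much really was spam?
precision = true_positives / (true_positives + false_positives)
print(precision)  # 0.8, i.e. 80%
```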

On the other hand, recall measures the percentage of actual positive samples that a model correctly identifies. For instance, if a test dataset has 100 positive samples and the model successfully identifies 60, the recall rate is 60%.

Returning to the spam filter, recall is the proportion of actual spam emails that the filter correctly identifies. If there are 100 spam emails in your inbox and the filter flags 70 of them as spam but misses the remaining 30, the recall is 70/100 = 0.7, or 70%. In other words, the filter correctly identifies 70% of the spam emails, reflecting its recall.
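The recall calculation follows the same pattern, with missed spam (false negatives) in the denominator instead of false alarms; again, a minimal sketch with illustrative names.

```python
# Spam-filter example: 100 actual spam emails, 70 flagged, 30 missed.
true_positives = 70   # spam emails correctly flagged
false_negatives = 30  # spam emails the filter let through

# Recall = TP / (TP + FN): of all actual spam, how much did the filter catch?
recall = true_positives / (true_positives + false_negatives)
print(recall)  # 0.7, i.e. 70%
```

In practice, you would normally compute both metrics directly from labels and predictions with scikit-learn's precision_score and recall_score rather than counting by hand.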

Pros and Cons of Precision

You must understand the main advantages and drawbacks before you pick the right metric. Let’s start with the pros.

  • Focus on Accuracy of Positive Predictions: Precision highlights how accurate the model is when it predicts a positive class, making it valuable in scenarios where false positives are costly or undesirable.
  • Useful for Imbalanced Datasets: Precision is particularly beneficial for datasets with a significant class imbalance, ensuring the model doesn't produce too many false positives.
  • High Precision Indicates Trustworthiness: A high precision score means that when the model predicts a positive class, it is likely correct, which is crucial in applications like medical diagnoses or fraud detection.

Precision has its drawbacks as well.

  • Ignores False Negatives: Precision does not account for false negatives, so it doesn't provide a complete picture of the model's performance, especially in cases where missing positive instances is critical.
  • Not Comprehensive Alone: Precision should be used with other metrics, such as recall, to understand the model's effectiveness better.
  • May Lead to Overly Conservative Models: Focusing too heavily on precision can produce models that are overly conservative about predicting the positive class, potentially missing many true positives.

Pros and Cons of Recall

Recall's pros and cons mirror those of precision, with some important distinctions.

  • Emphasis on Identifying Positives: Recall focuses on the model's ability to identify all positive instances, which is crucial in scenarios where missing positive cases is costly or harmful, such as medical diagnoses or safety systems.
  • Complements Precision: When used alongside precision, recall provides a more comprehensive view of a model's performance, helping to balance the trade-off between false positives and false negatives.

Its drawbacks are also similar in spirit.

  • Ignores False Positives: Recall does not consider false positives, which means it can be high even if the model incorrectly identifies many negative instances as positive.
  • May Lead to Overly Generous Models: Focusing too much on recall can result in models that frequently predict the positive class, leading to many incorrect positive predictions and reducing overall accuracy.

Considering these pros and cons can help you understand when recall is an appropriate metric for evaluating your machine-learning model.

Balancing Precision and Recall

Precision and recall are important metrics in classification. They often involve a trade-off that needs to be balanced based on the application's specific needs.

Improving precision usually means reducing recall, and the opposite is also true. For example, a model that only predicts the most certain positive cases will have high precision but may miss many positive cases, resulting in low recall.

The precision-recall curve is a graphical tool that shows the relationship between precision and recall at various classification thresholds. It helps visualize the trade-off between the two metrics and select a threshold that balances them for your application.

The curve is useful for imbalanced datasets, where one class is much less represented than others. In such cases, traditional metrics like accuracy can be misleading, as they often reflect the dominance of the majority class rather than the model's effectiveness at correctly identifying the minority class.

The precision-recall curve focuses on how well the model handles the minority (positive) class: precision captures how accurate the positive predictions are, and recall captures how many actual positives are detected. Together, they make the curve a crucial tool for balancing the two metrics on imbalanced datasets.

When the curve nears the top-right corner, the model achieves high precision and recall simultaneously, indicating strong performance in distinguishing between classes, regardless of their frequency in the dataset.
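As a sketch of how such a curve is typically produced, the snippet below trains a simple classifier on a synthetic imbalanced dataset and plots precision against recall across all thresholds; the dataset, model, and scikit-learn/matplotlib tooling are illustrative assumptions rather than anything prescribed by this article.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset: roughly 10% positive class.
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predicted probabilities for the positive class, evaluated at every threshold.
scores = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, scores)

plt.plot(recall, precision)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.show()
```

Each point on the curve corresponds to one decision threshold, so choosing a threshold from the curve is the practical way to trade precision against recall for a trained model.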

Precision vs. Recall

The choice between precision and recall depends on the specific application and the costs of errors. Both metrics provide unique insights, but their importance varies depending on the problem.

They’re valuable metrics for dealing with imbalanced datasets, common in most real-world applications. However, optimizing one metric over the other may be necessary depending on the specific use case.

For instance, avoiding false negatives is crucial in tuberculosis (TB) detection. Since TB spreads easily through the air, patients must take measures to prevent transmission. A false negative means a TB-positive patient is incorrectly assessed as disease-free, which can be harmful. In this case, false positives (healthy patients diagnosed as TB-positive) are far less costly than letting a contagious patient go undetected.

In most high-risk disease detection cases, such as cancer, recall is a more crucial evaluation metric than precision.

However, precision is more valuable when we need to confirm the accuracy of our model. For instance, minimizing false positives is crucial in job applicant screening systems. False positives represent unqualified candidates incorrectly identified as suitable, potentially leading to wasted time and resources during the hiring process.

False negatives (qualified candidates the system overlooks) are less critical in this scenario, since the screening system's job is to surface only candidates who are truly suitable for the position.

Similarly, in most automated marketing campaigns, high precision is essential: it ensures that the customers a model targets are genuinely likely to interact with the campaign or show interest in learning more.

Conclusion

Precision and recall are vital metrics in evaluating machine learning models, each serving a unique purpose based on the application.

Balancing these metrics often involves trade-offs, and the optimal balance depends on your application's specific needs and costs. Understanding and appropriately applying precision and recall can significantly enhance the effectiveness of your machine-learning models, leading to better decision-making and outcomes.

By carefully considering the context and requirements of your specific use case, you can choose the right metric to optimize, ensuring your model performs at its best and meets your goals.
