π― Precision & Recall β The twin metrics that never agree! βοΈπ
π Definition
Precision & Recall = the two best enemies of machine learning evaluation! Precision says "of all I predicted positive, how many were actually positive?" Recall says "of all actual positives, how many did I find?" It's impossible to maximize both at the same time. High precision = few false alarms. High recall = catch everything (but lots of false alarms)!
Principle:
- Precision = TP / (TP + FP) = quality of positive predictions
- Recall = TP / (TP + FN) = quantity of positives found
- Trade-off: improve one, the other drops
- F1-Score = harmonic mean (2 Γ Precision Γ Recall) / (Precision + Recall)
- Context matters: spam filter β cancer detection! π§
β‘ Advantages / Disadvantages / Limitations
β Advantages
- Class-specific: shows performance per class (unlike accuracy)
- Handles imbalance: works with 99% class A, 1% class B
- Interpretable: clear business meaning
- Complementary: together tell complete story
- Threshold control: adjust based on business needs
β Disadvantages
- Trade-off curse: can't maximize both simultaneously
- Threshold dependent: changing threshold changes metrics
- No negative class info: ignores true negatives (TN)
- Confusing: which one to prioritize?
- Single metric needed: F1 tries to combine but loses nuance
β οΈ Limitations
- Binary focus: designed for binary classification
- Aggregation issues: micro/macro/weighted averaging for multi-class
- Context required: 90% precision good or bad? Depends!
- Not comparable: precision=0.9 from different models β same performance
- Ignores confidence: predicting 51% vs 99% = same
π οΈ Practical Tutorial: My Real Case
π Setup
- Model: ResNet-18 binary classifier (Cat vs Dog)
- Dataset: 10k images (5k cats, 5k dogs) - balanced
- Hardware: GTX 1080 Ti 11GB (batch size 128)
- Task: Classify images as cat (positive) or dog (negative)
- Epochs: 50
π Results Obtained
Balanced Dataset (GTX 1080 Ti, ResNet-18):
Threshold = 0.5 (default):
- Accuracy: 96.5%
- Precision: 0.965 (96.5% of "cat" predictions are correct)
- Recall: 0.967 (found 96.7% of actual cats)
- F1-Score: 0.966
- Status: Balanced, excellent β
Threshold = 0.9 (high confidence only):
- Accuracy: 93.2%
- Precision: 0.988 (98.8% of "cat" predictions are correct!)
- Recall: 0.878 (only found 87.8% of cats)
- F1-Score: 0.930
- Status: Few false alarms, but missed cats β οΈ
Threshold = 0.2 (aggressive):
- Accuracy: 91.8%
- Precision: 0.920 (92% of "cat" predictions correct)
- Recall: 0.996 (found 99.6% of cats!)
- F1-Score: 0.956
- Status: Catch all cats, but false alarms π¨
Imbalanced Dataset (95% dogs, 5% cats):
No threshold adjustment:
- Accuracy: 95.8% (misleading!)
- Precision: 0.612 (lots of false alarms)
- Recall: 0.872 (finds most cats)
- F1-Score: 0.720
- Problem: Accuracy lies! β
With threshold adjustment (0.3):
- Accuracy: 92.1%
- Precision: 0.785
- Recall: 0.924
- F1-Score: 0.849
- Better balance! β
π§ͺ Real-world Testing on GTX 1080 Ti
Medical Diagnosis (Cancer Detection):
- Dataset: 1000 patients (950 healthy, 50 cancer)
- Model: DenseNet-121
- Batch size 64: 9.2GB VRAM used
Model A (high threshold 0.9):
- Precision: 0.95 (95% of cancer predictions correct)
- Recall: 0.68 (only found 68% of cancers)
- Missed: 16 cancer cases! β UNACCEPTABLE
Model B (low threshold 0.3):
- Precision: 0.72 (72% of cancer predictions correct)
- Recall: 0.96 (found 96% of cancers!)
- False alarms: 266 healthy β cancer
- Better: false alarm > missed cancer β
Spam Detection (Email Filter):
- Dataset: 50k emails (45k normal, 5k spam)
- Model: BERT-base
- Batch size 32: 10.1GB VRAM used
High Precision (threshold 0.8):
- Precision: 0.98 (98% flagged emails are spam)
- Recall: 0.71 (caught 71% of spam)
- Result: Few false positives (good emails safe) β
High Recall (threshold 0.2):
- Precision: 0.82 (82% flagged emails are spam)
- Recall: 0.95 (caught 95% of spam!)
- Result: Lots of good emails in spam folder β
Face Recognition (Security System):
- Dataset: 10k faces (1k authorized, 9k unauthorized)
- Model: FaceNet
- Batch size 128: 8.7GB VRAM
Balanced (threshold 0.5):
- Precision: 0.89
- Recall: 0.87
- F1-Score: 0.88
- Training time: 2h on GTX 1080 Ti β
Verdict: π― PRECISION & RECALL = BUSINESS DECISION, NOT JUST METRICS
π‘ Concrete Examples
Confusion Matrix explained
Predicted
Cat Dog
Actual Cat 850 50 β 900 actual cats
Dog 30 870 β 900 actual dogs
True Positives (TP) = 850 (correctly predicted cats)
False Positives (FP) = 30 (dogs predicted as cats)
False Negatives (FN) = 50 (cats predicted as dogs)
True Negatives (TN) = 870 (correctly predicted dogs)
Precision = TP / (TP + FP) = 850 / (850 + 30) = 0.966
"Of 880 cat predictions, 850 were actually cats"
Recall = TP / (TP + FN) = 850 / (850 + 50) = 0.944
"Of 900 actual cats, we found 850"
F1-Score = 2 Γ (0.966 Γ 0.944) / (0.966 + 0.944) = 0.955
Real-world scenarios
Cancer Detection π₯
Priority: HIGH RECALL (catch all cancers!)
Acceptable: Low precision (false alarms okay)
Why?
- Missing cancer = patient dies βββ
- False alarm = extra test (inconvenient but safe) β
Target: Recall > 0.95, Precision > 0.70
Threshold: Low (0.2-0.3) to catch everything
Spam Filter π§
Priority: HIGH PRECISION (don't lose important emails!)
Acceptable: Low recall (some spam gets through)
Why?
- False positive = lose important email β
- False negative = spam in inbox (annoying but not critical) β
Target: Precision > 0.95, Recall > 0.70
Threshold: High (0.7-0.8) to avoid false alarms
Fraud Detection π³
Priority: BALANCED (both matter!)
Trade-off: Block real transactions vs let fraud through
Why?
- False positive = angry customer β
- False negative = money lost β
Target: F1-Score > 0.85 (balance both)
Threshold: Medium (0.4-0.6) with human review
Face Recognition (Unlock Phone) π±
Priority: HIGH PRECISION (don't let strangers in!)
Acceptable: Low recall (owner tries again)
Why?
- False positive = stranger unlocks phone βββ
- False negative = owner retries (inconvenient) β
Target: Precision > 0.99, Recall > 0.80
Threshold: Very high (0.9+) for security
Precision-Recall Trade-off
Imagine: Binary classifier with confidence scores
Threshold = 0.9 (very strict):
β Only predicts "positive" when >90% sure
β Few predictions = High Precision (rarely wrong)
β Misses many positives = Low Recall (catches few)
Threshold = 0.5 (balanced):
β Predicts "positive" when >50% sure
β Moderate Precision and Recall
Threshold = 0.1 (very loose):
β Predicts "positive" when >10% sure
β Many predictions = Low Precision (lots of false alarms)
β Catches most positives = High Recall (misses few)
Visual:
Threshold Precision Recall F1
0.9 0.95 0.70 0.81
0.7 0.90 0.82 0.86
0.5 0.85 0.88 0.87 β Best F1
0.3 0.75 0.94 0.84
0.1 0.60 0.98 0.75
π Cheat Sheet: Precision & Recall
π Essential Formulas
True Positives (TP): Correctly predicted positive
False Positives (FP): Incorrectly predicted positive
False Negatives (FN): Missed actual positives
True Negatives (TN): Correctly predicted negative
Precision = TP / (TP + FP)
"Of all positive predictions, what % were correct?"
Recall = TP / (TP + FN)
"Of all actual positives, what % did we find?"
F1-Score = 2 Γ (Precision Γ Recall) / (Precision + Recall)
"Harmonic mean balancing both"
Accuracy = (TP + TN) / (TP + FP + FN + TN)
"Overall correctness (misleading with imbalance!)"
π οΈ When to prioritize what
HIGH PRECISION needed:
β
Spam filter (don't lose important emails)
β
Security alerts (reduce false alarms)
β
Product recommendations (don't annoy users)
β
Ad targeting (waste money on wrong audience)
β Use high threshold (0.7-0.9)
HIGH RECALL needed:
β
Medical diagnosis (catch all diseases)
β
Fraud detection (catch all fraud)
β
Safety systems (catch all hazards)
β
Search engines (show all relevant results)
β Use low threshold (0.2-0.4)
BALANCED (F1) needed:
β
General classification
β
Customer churn prediction
β
Quality control
β
Document classification
β Use medium threshold (0.4-0.6)
βοΈ Multi-class Averaging
Micro-averaging:
- Pool all TP, FP, FN together
- Calculate one Precision/Recall
- Weights by class frequency
- Good for imbalanced datasets
Macro-averaging:
- Calculate Precision/Recall per class
- Average them (equal weight)
- Treats all classes equally
- Good for balanced importance
Weighted-averaging:
- Calculate Precision/Recall per class
- Weight by class frequency
- Balances frequency and importance
- Most common choice
Example (3 classes):
Class A: Precision=0.9, Recall=0.8 (100 samples)
Class B: Precision=0.7, Recall=0.9 (50 samples)
Class C: Precision=0.6, Recall=0.7 (10 samples)
Macro: (0.9+0.7+0.6)/3 = 0.73
Weighted: (0.9Γ100 + 0.7Γ50 + 0.6Γ10)/160 = 0.81
π» Simplified Concept (minimal code)
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report
class PrecisionRecallDemo:
def basic_example(self):
"""Basic Precision & Recall calculation"""
# True labels (actual)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
# Predictions
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0])
# Calculate metrics
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"F1-Score: {f1:.3f}")
# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print(f"\nConfusion Matrix:\n{cm}")
def threshold_adjustment(self):
"""Adjust threshold to trade Precision vs Recall"""
# Predicted probabilities
y_probs = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.7, 0.1])
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
thresholds = [0.9, 0.7, 0.5, 0.3]
print("Threshold Precision Recall F1")
for thresh in thresholds:
y_pred = (y_probs >= thresh).astype(int)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(f"{thresh} {prec:.3f} {rec:.3f} {f1:.3f}")
def imbalanced_dataset(self):
"""Precision/Recall on imbalanced data"""
# 95% class 0, 5% class 1
y_true = np.array([0]*950 + [1]*50)
# Naive model predicting all 0
y_pred_naive = np.array([0]*1000)
# Better model
y_pred_better = np.array([0]*920 + [1]*80)
print("Naive model (predict all 0):")
print(f"Accuracy: {(y_pred_naive == y_true).mean():.3f}")
print(f"Recall: {recall_score(y_true, y_pred_naive):.3f}")
print("\nBetter model:")
print(f"Accuracy: {(y_pred_better == y_true).mean():.3f}")
prec = precision_score(y_true, y_pred_better)
rec = recall_score(y_true, y_pred_better)
print(f"Precision: {prec:.3f}")
print(f"Recall: {rec:.3f}")
print(f"F1-Score: {f1_score(y_true, y_pred_better):.3f}")
def multi_class_example(self):
"""Multi-class Precision & Recall"""
y_true = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 0, 1, 2])
# Per-class metrics
print("Micro-average:")
print(f"Precision: {precision_score(y_true, y_pred, average='micro'):.3f}")
print("\nMacro-average:")
print(f"Precision: {precision_score(y_true, y_pred, average='macro'):.3f}")
print("\nWeighted-average:")
print(f"Precision: {precision_score(y_true, y_pred, average='weighted'):.3f}")
print("\nDetailed report:")
print(classification_report(y_true, y_pred))
# Run examples
demo = PrecisionRecallDemo()
demo.basic_example()
demo.threshold_adjustment()
demo.imbalanced_dataset()
demo.multi_class_example()
The key concept: Precision asks "when I say yes, am I usually right?" Recall asks "of all the actual yeses, how many did I find?" You can't maximize both! It's a business decision based on cost of false positives vs false negatives! π―
π Summary
Precision & Recall = complementary metrics for classification evaluation! Precision = quality of positive predictions (TP / predicted positives). Recall = quantity of positives found (TP / actual positives). Trade-off: can't maximize both. F1-Score = harmonic mean balancing both. Context matters: cancer detection needs high recall, spam filter needs high precision. Threshold tuning on GTX 1080 Ti: adjust based on business needs! βοΈπ
π― Conclusion
Precision & Recall are fundamental for understanding classifier performance beyond simple accuracy. They reveal what kind of mistakes your model makes. The trade-off between them forces you to make business decisions: is a false positive worse than a false negative? Medical diagnosis: prioritize recall (catch all diseases). Spam filter: prioritize precision (don't lose important emails). F1-Score provides a single metric when you need balance, but always look at both individually! On GTX 1080 Ti, training with threshold tuning takes same time but gives you control over the trade-off. Remember: context is everything! πβοΈ
β Questions & Answers
Q: My model has 95% accuracy but performs terribly. What's wrong? A: Class imbalance problem! Accuracy is misleading when classes are imbalanced. Example: 95% class A, 5% class B. A dumb model predicting "always A" gets 95% accuracy but 0% recall for class B! Always check Precision & Recall for each class. Use F1-Score or confusion matrix to see the full picture. On imbalanced data, consider weighted loss, oversampling minority class, or focal loss!
Q: How do I choose the right threshold for my binary classifier? A: Business decision! Steps: (1) Plot Precision-Recall curve at different thresholds, (2) Identify cost of false positive vs false negative (example: false negative in cancer = death, false positive = extra test), (3) Choose threshold that minimizes business cost. Rule of thumb: medical/safety = low threshold (high recall), spam/security = high threshold (high precision), general = 0.5 (balanced). Test on validation set with GTX 1080 Ti to see trade-offs!
Q: F1-Score, Precision, or Recall - which should I optimize? A: Depends on your problem! Optimize: (1) Recall if false negatives are catastrophic (cancer, fraud, safety), (2) Precision if false positives are costly (spam, legal decisions, expensive actions), (3) F1-Score if both matter equally (general classification, unknown costs). Never optimize accuracy on imbalanced data! For multi-objective, use weighted F-beta score: FΞ² = (1+Ξ²Β²) Γ (PrecisionΓRecall) / (Ξ²Β²ΓPrecision + Recall), where Ξ²>1 favors recall, Ξ²<1 favors precision!
π€ Did You Know?
Precision & Recall originated from information retrieval in the 1950s-60s when researchers were building search systems for scientific papers! The problem: given a query like "machine learning", which documents should we return? Recall measured "did we find all relevant documents?" while Precision measured "are returned documents actually relevant?" The trade-off was brutal: return all documents = 100% recall but terrible precision. Return only one perfect match = 100% precision but terrible recall! Fun fact: the F1-Score (harmonic mean) was chosen over arithmetic mean because it punishes imbalanced metrics harder. Example: Precision=1.0, Recall=0.1 gives Arithmetic=0.55 but F1=0.18 (more realistic)! The modern deep learning boom rediscovered these metrics in the 2010s when accuracy failed on imbalanced datasets like medical imaging (99% healthy, 1% cancer). A classifier predicting "always healthy" got 99% accuracy but was useless! Precision/Recall revealed the truth. Today, every major ML competition (Kaggle, ImageNet, COCO) uses F1 or mAP (mean Average Precision). Even GPT-4 evaluation uses Precision/Recall for specific tasks like entity extraction! The metrics that started with library catalogs now evaluate billion-parameter models! ππ§ β‘
ThΓ©o CHARLET
IT Systems & Networks Student - AI/ML Specialization
Creator of AG-BPE (Attention-Guided Byte-Pair Encoding)
π LinkedIn: https://www.linkedin.com/in/thΓ©o-charlet
π Seeking internship opportunities
π Website : https://rdtvlokip.fr