🎯 Precision & Recall — The twin metrics that never agree! ⚖️🔍

Community Article Published January 8, 2026

πŸ“– Definition

Precision & Recall = the two best enemies of machine learning evaluation! Precision asks "of all the samples I predicted positive, how many were actually positive?" Recall asks "of all the actual positives, how many did I find?" In practice you can rarely maximize both at once: high precision = few false alarms, high recall = catch everything (but more false alarms)!

Principle:

  • Precision = TP / (TP + FP) = quality of positive predictions
  • Recall = TP / (TP + FN) = quantity of positives found
  • Trade-off: improve one, the other drops
  • F1-Score = harmonic mean: (2 × Precision × Recall) / (Precision + Recall)
  • Context matters: spam filter ≠ cancer detection! 🧠
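The formulas above can be sketched in a few lines of plain Python (the TP/FP/FN counts below are made-up illustration values, not from a real model):

```python
# Hypothetical confusion-matrix counts (illustration only)
tp, fp, fn = 80, 20, 10

precision = tp / (tp + fp)   # of all positive predictions, how many were right?
recall = tp / (tp + fn)      # of all actual positives, how many did we find?
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Precision: {precision:.3f}")  # 0.800
print(f"Recall:    {recall:.3f}")     # 0.889
print(f"F1-Score:  {f1:.3f}")         # 0.842
```

Note the harmonic mean: if either metric collapses toward 0, F1 collapses with it, unlike an arithmetic mean.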

⚑ Advantages / Disadvantages / Limitations

✅ Advantages

  • Class-specific: shows performance per class (unlike accuracy)
  • Handles imbalance: works with 99% class A, 1% class B
  • Interpretable: clear business meaning
  • Complementary: together tell complete story
  • Threshold control: adjust based on business needs

❌ Disadvantages

  • Trade-off curse: pushing one up usually pulls the other down
  • Threshold dependent: changing threshold changes metrics
  • No negative class info: ignores true negatives (TN)
  • Confusing: which one to prioritize?
  • Single metric needed: F1 tries to combine but loses nuance

⚠️ Limitations

  • Binary focus: designed for binary classification
  • Aggregation issues: micro/macro/weighted averaging for multi-class
  • Context required: 90% precision good or bad? Depends!
  • Not comparable: precision=0.9 from two different models or datasets ≠ the same performance
  • Ignores confidence: predicting 51% vs 99% = same

πŸ› οΈ Practical Tutorial: My Real Case

πŸ“Š Setup

  • Model: ResNet-18 binary classifier (Cat vs Dog)
  • Dataset: 10k images (5k cats, 5k dogs) - balanced
  • Hardware: GTX 1080 Ti 11GB (batch size 128)
  • Task: Classify images as cat (positive) or dog (negative)
  • Epochs: 50

πŸ“ˆ Results Obtained

Balanced Dataset (GTX 1080 Ti, ResNet-18):

Threshold = 0.5 (default):
- Accuracy: 96.5%
- Precision: 0.965 (96.5% of "cat" predictions are correct)
- Recall: 0.967 (found 96.7% of actual cats)
- F1-Score: 0.966
- Status: Balanced, excellent ✅

Threshold = 0.9 (high confidence only):
- Accuracy: 93.2%
- Precision: 0.988 (98.8% of "cat" predictions are correct!)
- Recall: 0.878 (only found 87.8% of cats)
- F1-Score: 0.930
- Status: Few false alarms, but missed cats ⚠️

Threshold = 0.2 (aggressive):
- Accuracy: 91.8%
- Precision: 0.920 (92% of "cat" predictions correct)
- Recall: 0.996 (found 99.6% of cats!)
- F1-Score: 0.956
- Status: Catch all cats, but false alarms 🚨

Imbalanced Dataset (95% dogs, 5% cats):
No threshold adjustment:
- Accuracy: 95.8% (misleading!)
- Precision: 0.612 (lots of false alarms)
- Recall: 0.872 (finds most cats)
- F1-Score: 0.720
- Problem: Accuracy lies! ❌

With threshold adjustment (0.3):
- Accuracy: 92.1%
- Precision: 0.785
- Recall: 0.924
- F1-Score: 0.849
- Better balance! ✅

🧪 Real-world Testing on GTX 1080 Ti

Medical Diagnosis (Cancer Detection):
- Dataset: 1000 patients (950 healthy, 50 cancer)
- Model: DenseNet-121
- Batch size 64: 9.2GB VRAM used

Model A (high threshold 0.9):
- Precision: 0.95 (95% of cancer predictions correct)
- Recall: 0.68 (only found 68% of cancers)
- Missed: 16 cancer cases! ❌ UNACCEPTABLE

Model B (low threshold 0.3):
- Precision: 0.72 (72% of cancer predictions correct)
- Recall: 0.96 (found 96% of cancers!)
- False alarms: ~19 healthy patients flagged as cancer (48 TP at precision 0.72 ≈ 67 flagged)
- Better: a false alarm beats a missed cancer ✅

Spam Detection (Email Filter):
- Dataset: 50k emails (45k normal, 5k spam)
- Model: BERT-base
- Batch size 32: 10.1GB VRAM used

High Precision (threshold 0.8):
- Precision: 0.98 (98% flagged emails are spam)
- Recall: 0.71 (caught 71% of spam)
- Result: Few false positives (good emails safe) ✅

High Recall (threshold 0.2):
- Precision: 0.82 (82% flagged emails are spam)
- Recall: 0.95 (caught 95% of spam!)
- Result: Lots of good emails in spam folder ❌

Face Recognition (Security System):
- Dataset: 10k faces (1k authorized, 9k unauthorized)
- Model: FaceNet
- Batch size 128: 8.7GB VRAM

Balanced (threshold 0.5):
- Precision: 0.89
- Recall: 0.87
- F1-Score: 0.88
- Training time: 2h on GTX 1080 Ti ✅

Verdict: 🎯 PRECISION & RECALL = BUSINESS DECISION, NOT JUST METRICS


💡 Concrete Examples

Confusion Matrix explained

                Predicted
                Cat    Dog
Actual  Cat     850    50   ← 900 actual cats
        Dog     30     870  ← 900 actual dogs

True Positives (TP) = 850 (correctly predicted cats)
False Positives (FP) = 30 (dogs predicted as cats)
False Negatives (FN) = 50 (cats predicted as dogs)
True Negatives (TN) = 870 (correctly predicted dogs)

Precision = TP / (TP + FP) = 850 / (850 + 30) = 0.966
"Of 880 cat predictions, 850 were actually cats"

Recall = TP / (TP + FN) = 850 / (850 + 50) = 0.944
"Of 900 actual cats, we found 850"

F1-Score = 2 × (0.966 × 0.944) / (0.966 + 0.944) = 0.955
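Assuming scikit-learn is available, the worked example above can be checked by rebuilding the same 850/50/30/870 matrix from synthetic label arrays:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# 900 actual cats (label 1) followed by 900 actual dogs (label 0)
y_true = np.array([1] * 900 + [0] * 900)
# Cats: 850 predicted cat, 50 predicted dog; Dogs: 30 predicted cat, 870 predicted dog
y_pred = np.array([1] * 850 + [0] * 50 + [1] * 30 + [0] * 870)

print(confusion_matrix(y_true, y_pred, labels=[1, 0]))
print(f"Precision: {precision_score(y_true, y_pred):.3f}")  # 850/880 -> 0.966
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")     # 850/900 -> 0.944
print(f"F1-Score:  {f1_score(y_true, y_pred):.3f}")         # 0.955
```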

Real-world scenarios

Cancer Detection πŸ₯

Priority: HIGH RECALL (catch all cancers!)
Acceptable: Low precision (false alarms okay)

Why?
- Missing cancer = patient dies ❌❌❌
- False alarm = extra test (inconvenient but safe) ✅

Target: Recall > 0.95, Precision > 0.70
Threshold: Low (0.2-0.3) to catch everything

Spam Filter 📧

Priority: HIGH PRECISION (don't lose important emails!)
Acceptable: Low recall (some spam gets through)

Why?
- False positive = lose important email ❌
- False negative = spam in inbox (annoying but not critical) ✅

Target: Precision > 0.95, Recall > 0.70
Threshold: High (0.7-0.8) to avoid false alarms

Fraud Detection 💳

Priority: BALANCED (both matter!)
Trade-off: Block real transactions vs let fraud through

Why?
- False positive = angry customer ❌
- False negative = money lost ❌

Target: F1-Score > 0.85 (balance both)
Threshold: Medium (0.4-0.6) with human review

Face Recognition (Unlock Phone) 📱

Priority: HIGH PRECISION (don't let strangers in!)
Acceptable: Low recall (owner tries again)

Why?
- False positive = stranger unlocks phone ❌❌❌
- False negative = owner retries (inconvenient) ✅

Target: Precision > 0.99, Recall > 0.80
Threshold: Very high (0.9+) for security

Precision-Recall Trade-off

Imagine: Binary classifier with confidence scores

Threshold = 0.9 (very strict):
→ Only predicts "positive" when >90% sure
→ Few predictions = High Precision (rarely wrong)
→ Misses many positives = Low Recall (catches few)

Threshold = 0.5 (balanced):
→ Predicts "positive" when >50% sure
→ Moderate Precision and Recall

Threshold = 0.1 (very loose):
→ Predicts "positive" when >10% sure
→ Many predictions = Low Precision (lots of false alarms)
→ Catches most positives = High Recall (misses few)

Visual:
Threshold  Precision  Recall  F1
0.9        0.95       0.70    0.81
0.7        0.90       0.82    0.86
0.5        0.85       0.88    0.87 ← Best F1
0.3        0.75       0.94    0.84
0.1        0.60       0.98    0.75
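A table like the one above can be generated automatically with scikit-learn's `precision_recall_curve`, which sweeps every distinct score as a threshold (the scores and labels below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical model scores and true labels (illustration only)
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_scores = np.array([0.95, 0.85, 0.75, 0.55, 0.35, 0.65, 0.45, 0.25, 0.15, 0.05])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# precision/recall have one extra entry (the trivial recall=0 endpoint)
print("Threshold  Precision  Recall  F1")
for p, r, t in zip(precision, recall, thresholds):
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    print(f"{t:>9.2f}  {p:>9.3f}  {r:>6.3f}  {f1:.3f}")
```

Picking the row with the best F1, or the lowest threshold that still meets a recall target, turns the trade-off into a one-line decision.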

📋 Cheat Sheet: Precision & Recall

πŸ” Essential Formulas

True Positives (TP): Correctly predicted positive
False Positives (FP): Incorrectly predicted positive
False Negatives (FN): Missed actual positives
True Negatives (TN): Correctly predicted negative

Precision = TP / (TP + FP)
"Of all positive predictions, what % were correct?"

Recall = TP / (TP + FN)
"Of all actual positives, what % did we find?"

F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
"Harmonic mean balancing both"

Accuracy = (TP + TN) / (TP + FP + FN + TN)
"Overall correctness (misleading with imbalance!)"

πŸ› οΈ When to prioritize what

HIGH PRECISION needed:
✅ Spam filter (don't lose important emails)
✅ Security alerts (reduce false alarms)
✅ Product recommendations (don't annoy users)
✅ Ad targeting (don't waste money on the wrong audience)
→ Use high threshold (0.7-0.9)

HIGH RECALL needed:
✅ Medical diagnosis (catch all diseases)
✅ Fraud detection (catch all fraud)
✅ Safety systems (catch all hazards)
✅ Search engines (show all relevant results)
→ Use low threshold (0.2-0.4)

BALANCED (F1) needed:
✅ General classification
✅ Customer churn prediction
✅ Quality control
✅ Document classification
→ Use medium threshold (0.4-0.6)

βš™οΈ Multi-class Averaging

Micro-averaging:
- Pool all TP, FP, FN together
- Calculate one Precision/Recall
- Weights by class frequency
- Good for imbalanced datasets

Macro-averaging:
- Calculate Precision/Recall per class
- Average them (equal weight)
- Treats all classes equally
- Good for balanced importance

Weighted-averaging:
- Calculate Precision/Recall per class
- Weight by class frequency
- Balances frequency and importance
- Most common choice

Example (3 classes):
Class A: Precision=0.9, Recall=0.8 (100 samples)
Class B: Precision=0.7, Recall=0.9 (50 samples)
Class C: Precision=0.6, Recall=0.7 (10 samples)

Macro: (0.9 + 0.7 + 0.6) / 3 ≈ 0.73
Weighted: (0.9×100 + 0.7×50 + 0.6×10) / 160 = 131/160 ≈ 0.82
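Those two averages are easy to reproduce from the per-class numbers (the same hypothetical values as above):

```python
# Per-class precision and sample counts from the example above
precisions = [0.9, 0.7, 0.6]
supports = [100, 50, 10]

macro = sum(precisions) / len(precisions)  # every class counts equally
weighted = sum(p * n for p, n in zip(precisions, supports)) / sum(supports)

print(f"Macro:    {macro:.2f}")     # 0.73
print(f"Weighted: {weighted:.2f}")  # 0.82
```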

💻 Simplified Concept (minimal code)

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report

class PrecisionRecallDemo:
    def basic_example(self):
        """Basic Precision & Recall calculation"""
        
        # True labels (actual)
        y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
        
        # Predictions
        y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0])
        
        # Calculate metrics
        precision = precision_score(y_true, y_pred)
        recall = recall_score(y_true, y_pred)
        f1 = f1_score(y_true, y_pred)
        
        print(f"Precision: {precision:.3f}")
        print(f"Recall: {recall:.3f}")
        print(f"F1-Score: {f1:.3f}")
        
        # Confusion matrix
        cm = confusion_matrix(y_true, y_pred)
        print(f"\nConfusion Matrix:\n{cm}")
    
    def threshold_adjustment(self):
        """Adjust threshold to trade Precision vs Recall"""
        
        # Predicted probabilities
        y_probs = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.7, 0.1])
        y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
        
        thresholds = [0.9, 0.7, 0.5, 0.3]
        
        print("Threshold  Precision  Recall  F1")
        for thresh in thresholds:
            y_pred = (y_probs >= thresh).astype(int)
            
            prec = precision_score(y_true, y_pred)
            rec = recall_score(y_true, y_pred)
            f1 = f1_score(y_true, y_pred)
            
            print(f"{thresh:<9}  {prec:<9.3f}  {rec:<6.3f}  {f1:.3f}")
    
    def imbalanced_dataset(self):
        """Precision/Recall on imbalanced data"""
        
        # 95% class 0, 5% class 1
        y_true = np.array([0]*950 + [1]*50)
        
        # Naive model predicting all 0
        y_pred_naive = np.array([0]*1000)
        
        # Better model
        y_pred_better = np.array([0]*920 + [1]*80)
        
        print("Naive model (predict all 0):")
        print(f"Accuracy: {(y_pred_naive == y_true).mean():.3f}")
        print(f"Recall: {recall_score(y_true, y_pred_naive):.3f}")
        
        print("\nBetter model:")
        print(f"Accuracy: {(y_pred_better == y_true).mean():.3f}")
        prec = precision_score(y_true, y_pred_better)
        rec = recall_score(y_true, y_pred_better)
        print(f"Precision: {prec:.3f}")
        print(f"Recall: {rec:.3f}")
        print(f"F1-Score: {f1_score(y_true, y_pred_better):.3f}")
    
    def multi_class_example(self):
        """Multi-class Precision & Recall"""
        
        y_true = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2])
        y_pred = np.array([0, 1, 1, 1, 2, 0, 0, 1, 2])
        
        # Per-class metrics
        print("Micro-average:")
        print(f"Precision: {precision_score(y_true, y_pred, average='micro'):.3f}")
        
        print("\nMacro-average:")
        print(f"Precision: {precision_score(y_true, y_pred, average='macro'):.3f}")
        
        print("\nWeighted-average:")
        print(f"Precision: {precision_score(y_true, y_pred, average='weighted'):.3f}")
        
        print("\nDetailed report:")
        print(classification_report(y_true, y_pred))

# Run examples
demo = PrecisionRecallDemo()
demo.basic_example()
demo.threshold_adjustment()
demo.imbalanced_dataset()
demo.multi_class_example()

The key concept: Precision asks "when I say yes, am I usually right?" Recall asks "of all the actual yeses, how many did I find?" You can't maximize both! It's a business decision based on cost of false positives vs false negatives! 🎯


πŸ“ Summary

Precision & Recall = complementary metrics for classification evaluation! Precision = quality of positive predictions (TP / predicted positives). Recall = quantity of positives found (TP / actual positives). Trade-off: improving one usually lowers the other. F1-Score = harmonic mean balancing both. Context matters: cancer detection needs high recall, spam filter needs high precision. Threshold tuning: adjust based on business needs! ⚖️🔍


🎯 Conclusion

Precision & Recall are fundamental for understanding classifier performance beyond simple accuracy. They reveal what kind of mistakes your model makes. The trade-off between them forces you to make business decisions: is a false positive worse than a false negative? Medical diagnosis: prioritize recall (catch all diseases). Spam filter: prioritize precision (don't lose important emails). F1-Score provides a single metric when you need balance, but always look at both individually! Threshold tuning costs no extra training time (it is applied at inference) and gives you control over the trade-off. Remember: context is everything! 🏆⚖️


❓ Questions & Answers

Q: My model has 95% accuracy but performs terribly. What's wrong? A: Class imbalance problem! Accuracy is misleading when classes are imbalanced. Example: 95% class A, 5% class B. A dumb model predicting "always A" gets 95% accuracy but 0% recall for class B! Always check Precision & Recall for each class. Use F1-Score or confusion matrix to see the full picture. On imbalanced data, consider weighted loss, oversampling minority class, or focal loss!

Q: How do I choose the right threshold for my binary classifier? A: Business decision! Steps: (1) Plot the Precision-Recall curve at different thresholds, (2) Identify the cost of a false positive vs a false negative (example: false negative in cancer = death, false positive = extra test), (3) Choose the threshold that minimizes business cost. Rule of thumb: medical/safety = low threshold (high recall), spam/security = high threshold (high precision), general = 0.5 (balanced). Test on a validation set to see the trade-offs!

Q: F1-Score, Precision, or Recall - which should I optimize? A: Depends on your problem! Optimize: (1) Recall if false negatives are catastrophic (cancer, fraud, safety), (2) Precision if false positives are costly (spam, legal decisions, expensive actions), (3) F1-Score if both matter equally (general classification, unknown costs). Never optimize accuracy on imbalanced data! For multi-objective, use the weighted F-beta score: Fβ = (1+β²) × (Precision × Recall) / (β² × Precision + Recall), where β>1 favors recall, β<1 favors precision!
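The F-beta formula from that answer is implemented in scikit-learn as `fbeta_score`; the labels below are made up, chosen so recall (0.75) is higher than precision (0.60):

```python
import numpy as np
from sklearn.metrics import fbeta_score

# Hypothetical labels: TP=3, FP=2, FN=1 -> precision 0.60, recall 0.75
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 1, 0])

f2 = fbeta_score(y_true, y_pred, beta=2)     # beta > 1 leans toward recall
f05 = fbeta_score(y_true, y_pred, beta=0.5)  # beta < 1 leans toward precision
print(f"F2:   {f2:.3f}")   # 0.714 (rewards the higher recall)
print(f"F0.5: {f05:.3f}")  # 0.625 (punishes the lower precision)
```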


🤓 Did You Know?

Precision & Recall originated from information retrieval in the 1950s-60s when researchers were building search systems for scientific papers! The problem: given a query like "machine learning", which documents should we return? Recall measured "did we find all relevant documents?" while Precision measured "are returned documents actually relevant?" The trade-off was brutal: return all documents = 100% recall but terrible precision. Return only one perfect match = 100% precision but terrible recall! Fun fact: the F1-Score (harmonic mean) was chosen over arithmetic mean because it punishes imbalanced metrics harder. Example: Precision=1.0, Recall=0.1 gives Arithmetic=0.55 but F1=0.18 (more realistic)! The modern deep learning boom rediscovered these metrics in the 2010s when accuracy failed on imbalanced datasets like medical imaging (99% healthy, 1% cancer). A classifier predicting "always healthy" got 99% accuracy but was useless! Precision/Recall revealed the truth. Today, every major ML competition (Kaggle, ImageNet, COCO) uses F1 or mAP (mean Average Precision). Even GPT-4 evaluation uses Precision/Recall for specific tasks like entity extraction! The metrics that started with library catalogs now evaluate billion-parameter models! 📚🧠⚡


ThΓ©o CHARLET

IT Systems & Networks Student - AI/ML Specialization

Creator of AG-BPE (Attention-Guided Byte-Pair Encoding)

🔗 LinkedIn: https://www.linkedin.com/in/théo-charlet

🚀 Seeking internship opportunities

🔗 Website: https://rdtvlokip.fr
