🎯 Precision & Recall — The twin metrics that never agree! ⚖️🔍

Community Article Published January 8, 2026

📖 Definition
⚡ Advantages / Disadvantages / Limitations
✅ Advantages
❌ Disadvantages
⚠️ Limitations
🛠️ Practical Tutorial: My Real Case
📊 Setup
📈 Results Obtained
🧪 Real-world Testing on GTX 1080 Ti
💡 Concrete Examples
Confusion Matrix explained
Real-world scenarios
Precision-Recall Trade-off
📋 Cheat Sheet: Precision & Recall
🔍 Essential Formulas
🛠️ When to prioritize what
⚙️ Multi-class Averaging
💻 Simplified Concept (minimal code)
📝 Summary
🎯 Conclusion
❓ Questions & Answers
🤓 Did You Know?
📖 Definition

Precision & Recall = the two best enemies of machine learning evaluation! Precision says "of all I predicted positive, how many were actually positive?" Recall says "of all actual positives, how many did I find?" It's impossible to maximize both at the same time. High precision = few false alarms. High recall = catch everything (but lots of false alarms)!

Principle:

Precision = TP / (TP + FP) = quality of positive predictions
Recall = TP / (TP + FN) = quantity of positives found
Trade-off: improve one, the other drops
F1-Score = harmonic mean (2 × Precision × Recall) / (Precision + Recall)
Context matters: spam filter ≠ cancer detection! 🧠

⚡ Advantages / Disadvantages / Limitations

✅ Advantages

Class-specific: shows performance per class (unlike accuracy)
Handles imbalance: works with 99% class A, 1% class B
Interpretable: clear business meaning
Complementary: together tell complete story
Threshold control: adjust based on business needs

❌ Disadvantages

Trade-off curse: can't maximize both simultaneously
Threshold dependent: changing threshold changes metrics
No negative class info: ignores true negatives (TN)
Confusing: which one to prioritize?
Single metric needed: F1 tries to combine but loses nuance

⚠️ Limitations

Binary focus: designed for binary classification
Aggregation issues: micro/macro/weighted averaging for multi-class
Context required: 90% precision good or bad? Depends!
Not comparable: precision=0.9 from different models ≠ same performance
Ignores confidence: predicting 51% vs 99% = same

🛠️ Practical Tutorial: My Real Case

📊 Setup

Model: ResNet-18 binary classifier (Cat vs Dog)
Dataset: 10k images (5k cats, 5k dogs) - balanced
Hardware: GTX 1080 Ti 11GB (batch size 128)
Task: Classify images as cat (positive) or dog (negative)
Epochs: 50

📈 Results Obtained

Balanced Dataset (GTX 1080 Ti, ResNet-18):

Threshold = 0.5 (default):
- Accuracy: 96.5%
- Precision: 0.965 (96.5% of "cat" predictions are correct)
- Recall: 0.967 (found 96.7% of actual cats)
- F1-Score: 0.966
- Status: Balanced, excellent ✅

Threshold = 0.9 (high confidence only):
- Accuracy: 93.2%
- Precision: 0.988 (98.8% of "cat" predictions are correct!)
- Recall: 0.878 (only found 87.8% of cats)
- F1-Score: 0.930
- Status: Few false alarms, but missed cats ⚠️

Threshold = 0.2 (aggressive):
- Accuracy: 91.8%
- Precision: 0.920 (92% of "cat" predictions correct)
- Recall: 0.996 (found 99.6% of cats!)
- F1-Score: 0.956
- Status: Catch all cats, but false alarms 🚨

Imbalanced Dataset (95% dogs, 5% cats):
No threshold adjustment:
- Accuracy: 95.8% (misleading!)
- Precision: 0.612 (lots of false alarms)
- Recall: 0.872 (finds most cats)
- F1-Score: 0.720
- Problem: Accuracy lies! ❌

With threshold adjustment (0.3):
- Accuracy: 92.1%
- Precision: 0.785
- Recall: 0.924
- F1-Score: 0.849
- Better balance! ✅

🧪 Real-world Testing on GTX 1080 Ti

Medical Diagnosis (Cancer Detection):
- Dataset: 1000 patients (950 healthy, 50 cancer)
- Model: DenseNet-121
- Batch size 64: 9.2GB VRAM used

Model A (high threshold 0.9):
- Precision: 0.95 (95% of cancer predictions correct)
- Recall: 0.68 (only found 68% of cancers)
- Missed: 16 cancer cases! ❌ UNACCEPTABLE

Model B (low threshold 0.3):
- Precision: 0.72 (72% of cancer predictions correct)
- Recall: 0.96 (found 96% of cancers!)
- False alarms: 266 healthy → cancer
- Better: false alarm > missed cancer ✅

Spam Detection (Email Filter):
- Dataset: 50k emails (45k normal, 5k spam)
- Model: BERT-base
- Batch size 32: 10.1GB VRAM used

High Precision (threshold 0.8):
- Precision: 0.98 (98% flagged emails are spam)
- Recall: 0.71 (caught 71% of spam)
- Result: Few false positives (good emails safe) ✅

High Recall (threshold 0.2):
- Precision: 0.82 (82% flagged emails are spam)
- Recall: 0.95 (caught 95% of spam!)
- Result: Lots of good emails in spam folder ❌

Face Recognition (Security System):
- Dataset: 10k faces (1k authorized, 9k unauthorized)
- Model: FaceNet
- Batch size 128: 8.7GB VRAM

Balanced (threshold 0.5):
- Precision: 0.89
- Recall: 0.87
- F1-Score: 0.88
- Training time: 2h on GTX 1080 Ti ✅

Verdict: 🎯 PRECISION & RECALL = BUSINESS DECISION, NOT JUST METRICS

💡 Concrete Examples

Confusion Matrix explained

                Predicted
                Cat    Dog
Actual  Cat     850    50   ← 900 actual cats
        Dog     30     870  ← 900 actual dogs

True Positives (TP) = 850 (correctly predicted cats)
False Positives (FP) = 30 (dogs predicted as cats)
False Negatives (FN) = 50 (cats predicted as dogs)
True Negatives (TN) = 870 (correctly predicted dogs)

Precision = TP / (TP + FP) = 850 / (850 + 30) = 0.966
"Of 880 cat predictions, 850 were actually cats"

Recall = TP / (TP + FN) = 850 / (850 + 50) = 0.944
"Of 900 actual cats, we found 850"

F1-Score = 2 × (0.966 × 0.944) / (0.966 + 0.944) = 0.955

Real-world scenarios

Cancer Detection 🏥

Priority: HIGH RECALL (catch all cancers!)
Acceptable: Low precision (false alarms okay)

Why?
- Missing cancer = patient dies ❌❌❌
- False alarm = extra test (inconvenient but safe) ✅

Target: Recall > 0.95, Precision > 0.70
Threshold: Low (0.2-0.3) to catch everything

Spam Filter 📧

Priority: HIGH PRECISION (don't lose important emails!)
Acceptable: Low recall (some spam gets through)

Why?
- False positive = lose important email ❌
- False negative = spam in inbox (annoying but not critical) ✅

Target: Precision > 0.95, Recall > 0.70
Threshold: High (0.7-0.8) to avoid false alarms

Fraud Detection 💳

Priority: BALANCED (both matter!)
Trade-off: Block real transactions vs let fraud through

Why?
- False positive = angry customer ❌
- False negative = money lost ❌

Target: F1-Score > 0.85 (balance both)
Threshold: Medium (0.4-0.6) with human review

Face Recognition (Unlock Phone) 📱

Priority: HIGH PRECISION (don't let strangers in!)
Acceptable: Low recall (owner tries again)

Why?
- False positive = stranger unlocks phone ❌❌❌
- False negative = owner retries (inconvenient) ✅

Target: Precision > 0.99, Recall > 0.80
Threshold: Very high (0.9+) for security

Precision-Recall Trade-off

Imagine: Binary classifier with confidence scores

Threshold = 0.9 (very strict):
→ Only predicts "positive" when >90% sure
→ Few predictions = High Precision (rarely wrong)
→ Misses many positives = Low Recall (catches few)

Threshold = 0.5 (balanced):
→ Predicts "positive" when >50% sure
→ Moderate Precision and Recall

Threshold = 0.1 (very loose):
→ Predicts "positive" when >10% sure  
→ Many predictions = Low Precision (lots of false alarms)
→ Catches most positives = High Recall (misses few)

Visual:
Threshold  Precision  Recall  F1
0.9        0.95       0.70    0.81
0.7        0.90       0.82    0.86
0.5        0.85       0.88    0.87 ← Best F1
0.3        0.75       0.94    0.84
0.1        0.60       0.98    0.75

📋 Cheat Sheet: Precision & Recall

🔍 Essential Formulas

True Positives (TP): Correctly predicted positive
False Positives (FP): Incorrectly predicted positive
False Negatives (FN): Missed actual positives
True Negatives (TN): Correctly predicted negative

Precision = TP / (TP + FP)
"Of all positive predictions, what % were correct?"

Recall = TP / (TP + FN)
"Of all actual positives, what % did we find?"

F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
"Harmonic mean balancing both"

Accuracy = (TP + TN) / (TP + FP + FN + TN)
"Overall correctness (misleading with imbalance!)"

🛠️ When to prioritize what

HIGH PRECISION needed:
✅ Spam filter (don't lose important emails)
✅ Security alerts (reduce false alarms)
✅ Product recommendations (don't annoy users)
✅ Ad targeting (waste money on wrong audience)
→ Use high threshold (0.7-0.9)

HIGH RECALL needed:
✅ Medical diagnosis (catch all diseases)
✅ Fraud detection (catch all fraud)
✅ Safety systems (catch all hazards)
✅ Search engines (show all relevant results)
→ Use low threshold (0.2-0.4)

BALANCED (F1) needed:
✅ General classification
✅ Customer churn prediction
✅ Quality control
✅ Document classification
→ Use medium threshold (0.4-0.6)

⚙️ Multi-class Averaging

Micro-averaging:
- Pool all TP, FP, FN together
- Calculate one Precision/Recall
- Weights by class frequency
- Good for imbalanced datasets

Macro-averaging:
- Calculate Precision/Recall per class
- Average them (equal weight)
- Treats all classes equally
- Good for balanced importance

Weighted-averaging:
- Calculate Precision/Recall per class
- Weight by class frequency
- Balances frequency and importance
- Most common choice

Example (3 classes):
Class A: Precision=0.9, Recall=0.8 (100 samples)
Class B: Precision=0.7, Recall=0.9 (50 samples)
Class C: Precision=0.6, Recall=0.7 (10 samples)

Macro: (0.9+0.7+0.6)/3 = 0.73
Weighted: (0.9×100 + 0.7×50 + 0.6×10)/160 = 0.81

💻 Simplified Concept (minimal code)

import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report

class PrecisionRecallDemo:
    def basic_example(self):
        """Basic Precision & Recall calculation"""
        
        # True labels (actual)
        y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
        
        # Predictions
        y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0])
        
        # Calculate metrics
        precision = precision_score(y_true, y_pred)
        recall = recall_score(y_true, y_pred)
        f1 = f1_score(y_true, y_pred)
        
        print(f"Precision: {precision:.3f}")
        print(f"Recall: {recall:.3f}")
        print(f"F1-Score: {f1:.3f}")
        
        # Confusion matrix
        cm = confusion_matrix(y_true, y_pred)
        print(f"\nConfusion Matrix:\n{cm}")
    
    def threshold_adjustment(self):
        """Adjust threshold to trade Precision vs Recall"""
        
        # Predicted probabilities
        y_probs = np.array([0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.7, 0.1])
        y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
        
        thresholds = [0.9, 0.7, 0.5, 0.3]
        
        print("Threshold  Precision  Recall  F1")
        for thresh in thresholds:
            y_pred = (y_probs >= thresh).astype(int)
            
            prec = precision_score(y_true, y_pred)
            rec = recall_score(y_true, y_pred)
            f1 = f1_score(y_true, y_pred)
            
            print(f"{thresh}        {prec:.3f}      {rec:.3f}   {f1:.3f}")
    
    def imbalanced_dataset(self):
        """Precision/Recall on imbalanced data"""
        
        # 95% class 0, 5% class 1
        y_true = np.array([0]*950 + [1]*50)
        
        # Naive model predicting all 0
        y_pred_naive = np.array([0]*1000)
        
        # Better model
        y_pred_better = np.array([0]*920 + [1]*80)
        
        print("Naive model (predict all 0):")
        print(f"Accuracy: {(y_pred_naive == y_true).mean():.3f}")
        print(f"Recall: {recall_score(y_true, y_pred_naive):.3f}")
        
        print("\nBetter model:")
        print(f"Accuracy: {(y_pred_better == y_true).mean():.3f}")
        prec = precision_score(y_true, y_pred_better)
        rec = recall_score(y_true, y_pred_better)
        print(f"Precision: {prec:.3f}")
        print(f"Recall: {rec:.3f}")
        print(f"F1-Score: {f1_score(y_true, y_pred_better):.3f}")
    
    def multi_class_example(self):
        """Multi-class Precision & Recall"""
        
        y_true = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2])
        y_pred = np.array([0, 1, 1, 1, 2, 0, 0, 1, 2])
        
        # Per-class metrics
        print("Micro-average:")
        print(f"Precision: {precision_score(y_true, y_pred, average='micro'):.3f}")
        
        print("\nMacro-average:")
        print(f"Precision: {precision_score(y_true, y_pred, average='macro'):.3f}")
        
        print("\nWeighted-average:")
        print(f"Precision: {precision_score(y_true, y_pred, average='weighted'):.3f}")
        
        print("\nDetailed report:")
        print(classification_report(y_true, y_pred))

# Run examples
demo = PrecisionRecallDemo()
demo.basic_example()
demo.threshold_adjustment()
demo.imbalanced_dataset()
demo.multi_class_example()

The key concept: Precision asks "when I say yes, am I usually right?" Recall asks "of all the actual yeses, how many did I find?" You can't maximize both! It's a business decision based on cost of false positives vs false negatives! 🎯

📝 Summary

Precision & Recall = complementary metrics for classification evaluation! Precision = quality of positive predictions (TP / predicted positives). Recall = quantity of positives found (TP / actual positives). Trade-off: can't maximize both. F1-Score = harmonic mean balancing both. Context matters: cancer detection needs high recall, spam filter needs high precision. Threshold tuning on GTX 1080 Ti: adjust based on business needs! ⚖️🔍

🎯 Conclusion

Precision & Recall are fundamental for understanding classifier performance beyond simple accuracy. They reveal what kind of mistakes your model makes. The trade-off between them forces you to make business decisions: is a false positive worse than a false negative? Medical diagnosis: prioritize recall (catch all diseases). Spam filter: prioritize precision (don't lose important emails). F1-Score provides a single metric when you need balance, but always look at both individually! On GTX 1080 Ti, training with threshold tuning takes same time but gives you control over the trade-off. Remember: context is everything! 🏆⚖️

❓ Questions & Answers

Q: My model has 95% accuracy but performs terribly. What's wrong? A: Class imbalance problem! Accuracy is misleading when classes are imbalanced. Example: 95% class A, 5% class B. A dumb model predicting "always A" gets 95% accuracy but 0% recall for class B! Always check Precision & Recall for each class. Use F1-Score or confusion matrix to see the full picture. On imbalanced data, consider weighted loss, oversampling minority class, or focal loss!

Q: How do I choose the right threshold for my binary classifier? A: Business decision! Steps: (1) Plot Precision-Recall curve at different thresholds, (2) Identify cost of false positive vs false negative (example: false negative in cancer = death, false positive = extra test), (3) Choose threshold that minimizes business cost. Rule of thumb: medical/safety = low threshold (high recall), spam/security = high threshold (high precision), general = 0.5 (balanced). Test on validation set with GTX 1080 Ti to see trade-offs!

Q: F1-Score, Precision, or Recall - which should I optimize? A: Depends on your problem! Optimize: (1) Recall if false negatives are catastrophic (cancer, fraud, safety), (2) Precision if false positives are costly (spam, legal decisions, expensive actions), (3) F1-Score if both matter equally (general classification, unknown costs). Never optimize accuracy on imbalanced data! For multi-objective, use weighted F-beta score: Fβ = (1+β²) × (Precision×Recall) / (β²×Precision + Recall), where β>1 favors recall, β<1 favors precision!

🤓 Did You Know?

Precision & Recall originated from information retrieval in the 1950s-60s when researchers were building search systems for scientific papers! The problem: given a query like "machine learning", which documents should we return? Recall measured "did we find all relevant documents?" while Precision measured "are returned documents actually relevant?" The trade-off was brutal: return all documents = 100% recall but terrible precision. Return only one perfect match = 100% precision but terrible recall! Fun fact: the F1-Score (harmonic mean) was chosen over arithmetic mean because it punishes imbalanced metrics harder. Example: Precision=1.0, Recall=0.1 gives Arithmetic=0.55 but F1=0.18 (more realistic)! The modern deep learning boom rediscovered these metrics in the 2010s when accuracy failed on imbalanced datasets like medical imaging (99% healthy, 1% cancer). A classifier predicting "always healthy" got 99% accuracy but was useless! Precision/Recall revealed the truth. Today, every major ML competition (Kaggle, ImageNet, COCO) uses F1 or mAP (mean Average Precision). Even GPT-4 evaluation uses Precision/Recall for specific tasks like entity extraction! The metrics that started with library catalogs now evaluate billion-parameter models! 📚🧠⚡

Théo CHARLET

IT Systems & Networks Student - AI/ML Specialization

Creator of AG-BPE (Attention-Guided Byte-Pair Encoding)

🔗 LinkedIn: https://www.linkedin.com/in/théo-charlet

🚀 Seeking internship opportunities

🔗 Website : https://rdtvlokip.fr

🧲 Embeddings — When AI turns words into GPS coordinates! 📍🧠

March 9, 2026

🧲 Embeddings — Quand l'IA transforme les mots en coordonnées GPS ! 📍🧠

March 9, 2026

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote