# Eye MedSigLIP Linear Probe

## Summary
This repository provides an eye (conjunctiva) anemia classifier built on a frozen MedSigLIP vision encoder with a lightweight linear probe (logistic regression converted to a torch linear head). The model outputs a triage score and a binary prediction (Anemia vs Non-Anemia).
This work follows Google's recommended data-efficient workflow: freeze the vision encoder, extract embeddings, and train a lightweight linear classifier on top. The linear probe is trained with scikit-learn `LogisticRegression` using the `saga` solver (data-efficient and scalable), then converted into a torch linear head for deployment. The training code also handles image decoding failures (Pillow with an OpenCV fallback), exposes optional preprocessing toggles, and supports threshold calibration for high-recall deployment.
## Model Artifacts

The repository includes the full deployable bundle:

- `artifacts/vision_model/` (MedSigLIP vision encoder weights + config)
- `artifacts/linear_head.pt` (linear probe head)
- `artifacts/scaler.joblib` (mean/std for embedding standardization)
- `artifacts/config.json` (threshold + preprocessing flags)
- `artifacts/optimized/` (color constancy utilities)

Default deployment threshold (from `artifacts/config.json`): 0.31.
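To make the bundle's role concrete, here is a minimal sketch of applying the deployed pieces to a single embedding. The function name, toy dimensions, and in-memory parameters are illustrative assumptions; in practice the head weights, scaler statistics, and threshold come from the artifact files listed above.

```python
import numpy as np
import torch

def triage(embed, mean, std, head, threshold=0.31):
    """Standardize an embedding, apply the linear head, and threshold the
    sigmoid score. Sketch only; artifact loading is elided."""
    x = (np.asarray(embed) - mean) / std          # scaler.joblib statistics
    logit = head(torch.from_numpy(x).float())     # linear_head.pt forward pass
    score = torch.sigmoid(logit).item()           # triage score in [0, 1]
    return score, ("Anemia" if score >= threshold else "Non-Anemia")

# Toy usage with a 4-dim stand-in embedding (the real hidden size differs)
head = torch.nn.Linear(4, 1)
score, label = triage(np.ones(4), np.zeros(4), np.ones(4), head)
```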
## Dataset (Counts Only)

Primary dataset folder used in this project:
`Dataset/dataset anemia/` with class folders `Anemia` and `Non-Anemia`.
### Full Cleaned Dataset (Image-Level)
- Train: 686 images (Anemia: 280, Non-Anemia: 406)
- Test: 172 images (Anemia: 82, Non-Anemia: 90)
- Image decoding: 157 train and 30 test images required the OpenCV fallback (opencv_rescued)
### Full-Dataset Linear Probe (Latest Run)
- Train/Val/Test counts: 9283 / 1160 / 1162
- Class balance:
- Train: Anemia 4730, Non-Anemia 4553
- Val: Anemia 591, Non-Anemia 569
- Test: Anemia 592, Non-Anemia 570
### Segmented-Only Optimized Split (CP-AnemiC + segmented)
- Train/Val/Test: 800 / 100 / 100
- Class balance:
- Train: Anemia 400, Non-Anemia 400
- Val: Anemia 50, Non-Anemia 50
- Test: Anemia 50, Non-Anemia 50
## Preprocessing
- Resize to 448x448 using letterbox padding (no stretching) unless stated otherwise.
- RGB conversion and normalization to [-1, 1].
- OpenCV fallback for decoding certain PNGs.
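The steps above can be sketched as follows, assuming Pillow and NumPy (the OpenCV fallback is imported lazily; the function names are illustrative, not the repository's actual API):

```python
import numpy as np
from PIL import Image

def load_rgb(path: str) -> Image.Image:
    """Decode with Pillow; fall back to OpenCV for PNGs Pillow rejects."""
    try:
        return Image.open(path).convert("RGB")
    except Exception:
        import cv2  # fallback decoder
        bgr = cv2.imread(path, cv2.IMREAD_COLOR)
        if bgr is None:
            raise ValueError(f"could not decode {path}")
        return Image.fromarray(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))

def letterbox(img: Image.Image, size: int = 448) -> Image.Image:
    """Resize the long side to `size` and pad the rest (no stretching)."""
    w, h = img.size
    scale = size / max(w, h)
    nw, nh = max(1, round(w * scale)), max(1, round(h * scale))
    canvas = Image.new("RGB", (size, size))
    canvas.paste(img.resize((nw, nh), Image.BILINEAR),
                 ((size - nw) // 2, (size - nh) // 2))
    return canvas

def to_model_range(img: Image.Image) -> np.ndarray:
    """Map uint8 RGB to float32 in [-1, 1]."""
    return np.asarray(img, dtype=np.float32) / 127.5 - 1.0
```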
## Training Code Notes

- Embedding extraction: frozen MedSigLIP vision encoder (no fine-tuning).
- Classifier: scikit-learn `LogisticRegression` with `solver="saga"`.
- Standardization: embedding mean/std saved to `scaler.joblib`.
- Deployment conversion: learned coefficients are converted to a torch `Linear` head and saved as `linear_head.pt`.
Core Training Code (Key Steps)
# 1) Extract frozen MedSigLIP embeddings
vision_model.eval()
with torch.no_grad():
outputs = vision_model(pixel_values=tensor)
embeds = outputs.pooler_output
embeds = embeds / embeds.norm(p=2, dim=-1, keepdim=True)
# 2) Standardize embeddings (mean/std from train set)
x_std = (x - scaler["mean"]) / scaler["std"]
# 3) Train linear probe (logistic regression, saga)
model = LogisticRegression(
solver="saga",
max_iter=5000,
C=best_c,
n_jobs=-1,
)
model.fit(x_train_std, y_train)
# 4) Convert to torch Linear head for deployment
linear_head = torch.nn.Linear(vision_model.config.hidden_size, 1, bias=True)
linear_head.weight.data.copy_(torch.from_numpy(model.coef_))
linear_head.bias.data.copy_(torch.from_numpy(model.intercept_))
torch.save(linear_head.state_dict(), "linear_head.pt")
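The deployment threshold is then calibrated on a validation split against a recall target (see the benchmark sections below). A minimal sketch of that calibration, with an assumed function name:

```python
import numpy as np

def recall_target_threshold(y_val, scores, target=0.90):
    """Return the highest score threshold whose validation recall still
    meets the target, trading some precision for guaranteed sensitivity."""
    y_val, scores = np.asarray(y_val), np.asarray(scores)
    best = 0.0
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (y_val == 1))
        fn = np.sum(~pred & (y_val == 1))
        if tp / (tp + fn) >= target and t > best:
            best = t
    return float(best)
```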
## Benchmarks and Experiments

All metrics below are reported on held-out test sets.

### Comparison Table (Test Metrics)
| Experiment | Accuracy | Precision | Recall | F1 | ROC-AUC | Confusion Matrix | Threshold |
|---|---|---|---|---|---|---|---|
| Zero-shot baseline (subject-level) | 0.487654 | 0.490446 | 0.962500 | 0.649789 | 0.562348 | [[2, 80], [3, 77]] | n/a |
| Linear probe (before OpenCV fallback) | 0.558333 | 0.500000 | 0.962264 | 0.658065 | 0.692763 | [[16, 51], [2, 51]] | n/a |
| Linear probe (after OpenCV fallback) | 0.598765 | 0.556391 | 0.925000 | 0.694836 | 0.652896 | [[23, 59], [6, 74]] | n/a |
| Linear probe (full cleaned dataset) | 0.715116 | 0.714286 | 0.670732 | 0.691824 | 0.783740 | [[68, 22], [27, 55]] | n/a |
| Linear probe (TF resize + SAGA) | 0.686047 | 0.694444 | 0.609756 | 0.649351 | 0.812195 | [[68, 22], [32, 50]] | 0.1475936 (ROC best) |
| Linear probe (Auto-PIL + Torch) | 0.726744 | 0.733333 | 0.670732 | 0.700637 | 0.784688 | [[70, 20], [27, 55]] | 0.5850275 (ROC best) |
| Optimized segmented split (recall target 0.90) | 0.710000 | 0.652174 | 0.900000 | 0.756303 | 0.836800 | [[26, 24], [5, 45]] | 0.05 |
| Full-dataset linear probe (latest, recall target 0.90) | 0.823580 | 0.775249 | 0.920608 | 0.841699 | 0.912441 | [[412, 158], [47, 545]] | 0.31 |
### Zero-Shot Baseline (Official-Style Prompting, No Training)
- Examples evaluated: 162 (opencv_rescued 42)
- Accuracy: 0.487654
- Precision: 0.490446
- Recall: 0.962500
- F1: 0.649789
- ROC-AUC: 0.562348
- Confusion Matrix: [[2, 80], [3, 77]]
### Linear Probe (Before OpenCV Fallback)
- Train/Test usable: 555 / 120 (skipped 145 / 42)
- Accuracy: 0.558333
- Precision: 0.500000
- Recall: 0.962264
- F1: 0.658065
- ROC-AUC: 0.692763
- Confusion Matrix: [[16, 51], [2, 51]]
### Linear Probe (After OpenCV Fallback)
- Train/Test: 700 / 162 (opencv_rescued 145 / 42, skipped 0)
- Accuracy: 0.598765
- Precision: 0.556391
- Recall: 0.925000
- F1: 0.694836
- ROC-AUC: 0.652896
- Confusion Matrix: [[23, 59], [6, 74]]
### Linear Probe (Full Cleaned Dataset)
- Train/Test: 686 / 172 (opencv_rescued 157 / 30)
- Accuracy: 0.715116
- Precision: 0.714286
- Recall: 0.670732
- F1: 0.691824
- ROC-AUC: 0.783740
- Confusion Matrix: [[68, 22], [27, 55]]
- Parameters: batch_size=16, epochs=400, lr=0.01, weight_decay=0.0001
### Linear Probe (TF Resize + SAGA)
- Train/Test: 686 / 172
- Accuracy: 0.686047
- Precision: 0.694444
- Recall: 0.609756
- F1: 0.649351
- ROC-AUC: 0.812195
- Confusion Matrix: [[68, 22], [32, 50]]
- ROC best threshold (Youden J): 0.1475936 (TPR 0.829268, FPR 0.333333)
### Linear Probe (Auto-PIL + Torch)
- Train/Test: 686 / 172
- Accuracy: 0.726744
- Precision: 0.733333
- Recall: 0.670732
- F1: 0.700637
- ROC-AUC: 0.784688
- Confusion Matrix: [[70, 20], [27, 55]]
- ROC best threshold (Youden J): 0.5850275 (TPR 0.670732, FPR 0.200000)
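The "ROC best threshold (Youden J)" values above pick the operating point maximizing TPR - FPR over candidate thresholds. A small sketch with illustrative names:

```python
import numpy as np

def youden_j_threshold(y_true, scores):
    """Return (threshold, J) maximizing Youden's J = TPR - FPR."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos = y_true == 1
    best_t, best_j = 0.5, -1.0
    for t in np.unique(scores):
        pred = scores >= t
        j = np.mean(pred[pos]) - np.mean(pred[~pos])  # TPR - FPR
        if j > best_j:
            best_t, best_j = t, j
    return float(best_t), float(best_j)
```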
### Optimized Split + Grid Search + Recall Calibration (Segmented-Only)
- Best C (grid): 3
- Val AUC: 0.836400
- Test AUC: 0.836800
- Recall target: 0.90
- Threshold (val-calibrated): 0.05
- Metrics at threshold:
- Accuracy 0.71
- Precision 0.652174
- Recall 0.90
- F1 0.756303
- Confusion Matrix [[26, 24], [5, 45]]
- Best-F1 threshold (val) applied to test:
- Threshold 0.23
- Accuracy 0.76
- Precision 0.716667
- Recall 0.86
- F1 0.781818
- Confusion Matrix [[33, 17], [7, 43]]
- Recall-target sweep best by F1 at target 0.91:
- Threshold 0.04
- Accuracy 0.72
- Precision 0.657143
- Recall 0.92
- F1 0.766667
- Confusion Matrix [[26, 24], [4, 46]]
- ROC best threshold (Youden J): 0.505267 (TPR 0.78, FPR 0.18)
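The "Best C (grid)" values reported in these runs come from a regularization sweep selected by validation AUC. A hedged sketch of such a sweep (the grid values and function name are assumptions, not the repository's exact search):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def select_c(x_tr, y_tr, x_val, y_val, grid=(0.01, 0.1, 1, 3, 10)):
    """Fit a saga logistic regression per C and keep the best val AUC."""
    best_c, best_auc = None, -1.0
    for c in grid:
        clf = LogisticRegression(solver="saga", max_iter=5000, C=c)
        clf.fit(x_tr, y_tr)
        auc = roc_auc_score(y_val, clf.predict_proba(x_val)[:, 1])
        if auc > best_auc:
            best_c, best_auc = c, auc
    return best_c, best_auc
```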
### Full-Dataset Linear Probe (Latest Run)
- Best C (grid): 1
- Val AUC: 0.909260
- Test AUC: 0.912441
- Recall target: 0.90
- Threshold (val-calibrated): 0.31
- Metrics at threshold:
- Accuracy 0.823580
- Precision 0.775249
- Recall 0.920608
- F1 0.841699
- Confusion Matrix [[412, 158], [47, 545]]
- Best-F1 threshold (val) applied to test:
- Threshold 0.44
- Accuracy 0.840792
- Precision 0.823529
- Recall 0.875000
- F1 0.848485
- Confusion Matrix [[459, 111], [74, 518]]
- Recall-target sweep best by F1 at target 0.90:
- Threshold 0.31
- Accuracy 0.812931
- Precision 0.769452
- Recall 0.903553
- F1 0.831128
- Confusion Matrix [[409, 160], [57, 534]]
- ROC best threshold (Youden J): 0.594325 (TPR 0.795262, FPR 0.114236)
## Comparison Summary
- Zero-shot baseline shows very high recall but poor specificity and lower overall accuracy and AUC.
- Linear probe improves accuracy and AUC consistently once OpenCV fallback and cleaned datasets are used.
- Full-dataset training provides the strongest overall performance and better calibrated operating points.
## Final Decision
The deployed eye model uses the full-dataset linear probe with letterbox resize and recall-target calibration. The deployment threshold is 0.31, which balances high recall with improved precision and overall accuracy.
## Limitations
- For research and triage only; not for clinical diagnosis.
- Performance depends on dataset distribution and capture conditions.
- Conjunctiva imaging conditions may vary in real-world settings.
## Contact
Model author: Sidharth (Hugging Face: Sidharth1743).