Eye MedSigLIP Linear Probe

Summary

This repository provides an eye (conjunctiva) anemia classifier built on a frozen MedSigLIP vision encoder with a lightweight linear probe (logistic regression converted to a torch linear head). The model outputs a triage score and a binary prediction (Anemia vs Non-Anemia).

This work follows Google’s recommended data‑efficient workflow: freeze the vision encoder, extract embeddings, and train a lightweight linear classifier on top. The linear probe is trained with scikit‑learn LogisticRegression using the saga solver (data‑efficient and scalable), then converted into a torch linear head for deployment. The training code also handles image decoding failures (Pillow -> OpenCV fallback), exposes optional preprocessing toggles, and supports threshold calibration for high‑recall deployment.

Model Artifacts

The repository includes the full deployable bundle:

  • artifacts/vision_model/ (MedSigLIP vision encoder weights + config)
  • artifacts/linear_head.pt (linear probe head)
  • artifacts/scaler.joblib (mean/std for embedding standardization)
  • artifacts/config.json (threshold + preprocessing flags)
  • artifacts/optimized/ (color constancy utilities)

Default deployment threshold (from artifacts/config.json): 0.31.
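At inference time the artifacts above combine roughly as follows. This is a minimal sketch, not the repository's serving code: the `triage_score` helper, the randomly initialized head, and the 1152‑dim embedding size (assumed from MedSigLIP) all stand in for the real artifacts loaded from `linear_head.pt`, `scaler.joblib`, and `config.json`.

```python
import numpy as np
import torch

def triage_score(embedding, mean, std, head, threshold=0.31):
    """Standardize a MedSigLIP embedding, apply the linear head,
    and map the logit to a probability via sigmoid."""
    x = (embedding - mean) / std
    with torch.no_grad():
        logit = head(torch.from_numpy(x).float().unsqueeze(0))
    prob = torch.sigmoid(logit).item()
    return prob, prob >= threshold  # (triage score, binary prediction)

# Demo with a random embedding; 1152 is the assumed MedSigLIP hidden size.
dim = 1152
head = torch.nn.Linear(dim, 1, bias=True)  # stands in for linear_head.pt
embedding = np.random.rand(dim).astype(np.float32)
mean, std = np.zeros(dim, dtype=np.float32), np.ones(dim, dtype=np.float32)
prob, is_anemia = triage_score(embedding, mean, std, head)
```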

Dataset (Counts Only)

Primary dataset folder used in this project:

  • Dataset/dataset anemia/ with class folders Anemia and Non-Anemia.

Full Cleaned Dataset (Image-Level)

  • Train: 686 images (Anemia: 280, Non-Anemia: 406)
  • Test: 172 images (Anemia: 82, Non-Anemia: 90)
  • Image decoding: 157 train and 30 test images required the OpenCV fallback (opencv_rescued).

Full-Dataset Linear Probe (Latest Run)

  • Train/Val/Test counts: 9283 / 1160 / 1162
  • Class balance:
    • Train: Anemia 4730, Non-Anemia 4553
    • Val: Anemia 591, Non-Anemia 569
    • Test: Anemia 592, Non-Anemia 570

Segmented-Only Optimized Split (CP-AnemiC + segmented)

  • Train/Val/Test: 800 / 100 / 100
  • Class balance:
    • Train: Anemia 400, Non-Anemia 400
    • Val: Anemia 50, Non-Anemia 50
    • Test: Anemia 50, Non-Anemia 50

Preprocessing

  • Resize to 448x448 using letterbox padding (no stretching) unless stated otherwise.
  • RGB conversion and normalization to [-1, 1].
  • OpenCV fallback for decoding certain PNGs.
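The letterbox resize and [-1, 1] normalization can be sketched as follows. This is a Pillow-based approximation; the black padding color and bilinear resampling are assumptions, not taken from the repository's code.

```python
import numpy as np
from PIL import Image

def letterbox_448(img, size=448):
    """Aspect-preserving resize (no stretching), pad to size x size,
    then normalize RGB pixels to [-1, 1]."""
    img = img.convert("RGB")
    scale = size / max(img.size)
    new_w = max(1, round(img.width * scale))
    new_h = max(1, round(img.height * scale))
    resized = img.resize((new_w, new_h), Image.BILINEAR)
    canvas = Image.new("RGB", (size, size), (0, 0, 0))  # assumed black padding
    canvas.paste(resized, ((size - new_w) // 2, (size - new_h) // 2))
    return np.asarray(canvas, dtype=np.float32) / 127.5 - 1.0

arr = letterbox_448(Image.new("RGB", (640, 360), (200, 80, 80)))
```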

Training Code Notes

  • Embedding extraction: frozen MedSigLIP vision encoder (no fine‑tuning).
  • Classifier: scikit‑learn LogisticRegression with solver="saga".
  • Standardization: embedding mean/std saved to scaler.joblib.
  • Deployment conversion: learned coefficients are converted to a torch Linear head and saved as linear_head.pt.

Core Training Code (Key Steps)

```python
# 1) Extract frozen MedSigLIP embeddings
vision_model.eval()
with torch.no_grad():
    outputs = vision_model(pixel_values=tensor)
    embeds = outputs.pooler_output
    embeds = embeds / embeds.norm(p=2, dim=-1, keepdim=True)

# 2) Standardize embeddings (mean/std from train set)
x_std = (x - scaler["mean"]) / scaler["std"]

# 3) Train linear probe (logistic regression, saga)
model = LogisticRegression(
    solver="saga",
    max_iter=5000,
    C=best_c,
    n_jobs=-1,
)
model.fit(x_train_std, y_train)

# 4) Convert to torch Linear head for deployment
linear_head = torch.nn.Linear(vision_model.config.hidden_size, 1, bias=True)
linear_head.weight.data.copy_(torch.from_numpy(model.coef_))
linear_head.bias.data.copy_(torch.from_numpy(model.intercept_))
torch.save(linear_head.state_dict(), "linear_head.pt")
```
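A quick sanity check that step 4's conversion preserves the classifier: the torch head's sigmoid output should match the sklearn-style logistic probability. The coefficients below are hypothetical stand-ins for `model.coef_` / `model.intercept_`, and a 16-dim embedding is used for brevity.

```python
import numpy as np
import torch

# Hypothetical stand-ins for model.coef_ / model.intercept_ from sklearn.
rng = np.random.default_rng(0)
coef = rng.normal(size=(1, 16))
intercept = rng.normal(size=(1,))

head = torch.nn.Linear(16, 1, bias=True)
head.weight.data.copy_(torch.from_numpy(coef))   # copy_ casts float64 -> float32
head.bias.data.copy_(torch.from_numpy(intercept))

x = rng.normal(size=(4, 16)).astype(np.float32)
p_sklearn_style = 1.0 / (1.0 + np.exp(-(x @ coef.T + intercept)))
with torch.no_grad():
    p_torch = torch.sigmoid(head(torch.from_numpy(x))).numpy()
# p_torch should agree with p_sklearn_style up to float32 rounding.
```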

Benchmarks and Experiments

All metrics below are reported on held-out test sets.

Comparison Table (Test Metrics)

| Experiment | Accuracy | Precision | Recall | F1 | ROC-AUC | Confusion Matrix | Threshold |
|---|---|---|---|---|---|---|---|
| Zero-shot baseline (subject-level) | 0.487654 | 0.490446 | 0.962500 | 0.649789 | 0.562348 | [[2, 80], [3, 77]] | n/a |
| Linear probe (before OpenCV fallback) | 0.558333 | 0.500000 | 0.962264 | 0.658065 | 0.692763 | [[16, 51], [2, 51]] | n/a |
| Linear probe (after OpenCV fallback) | 0.598765 | 0.556391 | 0.925000 | 0.694836 | 0.652896 | [[23, 59], [6, 74]] | n/a |
| Linear probe (full cleaned dataset) | 0.715116 | 0.714286 | 0.670732 | 0.691824 | 0.783740 | [[68, 22], [27, 55]] | n/a |
| Linear probe (TF resize + SAGA) | 0.686047 | 0.694444 | 0.609756 | 0.649351 | 0.812195 | [[68, 22], [32, 50]] | 0.1475936 (ROC best) |
| Linear probe (Auto-PIL + Torch) | 0.726744 | 0.733333 | 0.670732 | 0.700637 | 0.784688 | [[70, 20], [27, 55]] | 0.5850275 (ROC best) |
| Optimized segmented split (recall target 0.90) | 0.710000 | 0.652174 | 0.900000 | 0.756303 | 0.836800 | [[26, 24], [5, 45]] | 0.05 |
| Full-dataset linear probe (latest, recall target 0.90) | 0.823580 | 0.775249 | 0.920608 | 0.841699 | 0.912441 | [[412, 158], [47, 545]] | 0.31 |

Zero-Shot Baseline (Official-Style Prompting, No Training)

  • Examples evaluated: 162 (opencv_rescued 42)
  • Accuracy: 0.487654
  • Precision: 0.490446
  • Recall: 0.962500
  • F1: 0.649789
  • ROC-AUC: 0.562348
  • Confusion Matrix: [[2, 80], [3, 77]]

Linear Probe (Before OpenCV Fallback)

  • Train/Test usable: 555 / 120 (skipped 145 / 42)
  • Accuracy: 0.558333
  • Precision: 0.500000
  • Recall: 0.962264
  • F1: 0.658065
  • ROC-AUC: 0.692763
  • Confusion Matrix: [[16, 51], [2, 51]]

Linear Probe (After OpenCV Fallback)

  • Train/Test: 700 / 162 (opencv_rescued 145 / 42, skipped 0)
  • Accuracy: 0.598765
  • Precision: 0.556391
  • Recall: 0.925000
  • F1: 0.694836
  • ROC-AUC: 0.652896
  • Confusion Matrix: [[23, 59], [6, 74]]

Linear Probe (Full Cleaned Dataset)

  • Train/Test: 686 / 172 (opencv_rescued 157 / 30)
  • Accuracy: 0.715116
  • Precision: 0.714286
  • Recall: 0.670732
  • F1: 0.691824
  • ROC-AUC: 0.783740
  • Confusion Matrix: [[68, 22], [27, 55]]
  • Parameters: batch_size=16, epochs=400, lr=0.01, weight_decay=0.0001

Linear Probe (TF Resize + SAGA)

  • Train/Test: 686 / 172
  • Accuracy: 0.686047
  • Precision: 0.694444
  • Recall: 0.609756
  • F1: 0.649351
  • ROC-AUC: 0.812195
  • Confusion Matrix: [[68, 22], [32, 50]]
  • ROC best threshold (Youden J): 0.1475936 (TPR 0.829268, FPR 0.333333)

Linear Probe (Auto-PIL + Torch)

  • Train/Test: 686 / 172
  • Accuracy: 0.726744
  • Precision: 0.733333
  • Recall: 0.670732
  • F1: 0.700637
  • ROC-AUC: 0.784688
  • Confusion Matrix: [[70, 20], [27, 55]]
  • ROC best threshold (Youden J): 0.5850275 (TPR 0.670732, FPR 0.200000)

Optimized Split + Grid Search + Recall Calibration (Segmented-Only)

  • Best C (grid): 3
  • Val AUC: 0.836400
  • Test AUC: 0.836800
  • Recall target: 0.90
  • Threshold (val-calibrated): 0.05
  • Metrics at threshold:
    • Accuracy 0.71
    • Precision 0.652174
    • Recall 0.90
    • F1 0.756303
    • Confusion Matrix [[26, 24], [5, 45]]
  • Best-F1 threshold (val) applied to test:
    • Threshold 0.23
    • Accuracy 0.76
    • Precision 0.716667
    • Recall 0.86
    • F1 0.781818
    • Confusion Matrix [[33, 17], [7, 43]]
  • Recall-target sweep best by F1 at target 0.91:
    • Threshold 0.04
    • Accuracy 0.72
    • Precision 0.657143
    • Recall 0.92
    • F1 0.766667
    • Confusion Matrix [[26, 24], [4, 46]]
  • ROC best threshold (Youden J): 0.505267 (TPR 0.78, FPR 0.18)
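One plausible way to implement the recall-target calibration described above: sweep thresholds on the validation set and keep the highest one whose recall still meets the target (a higher threshold yields better precision at the same recall). The sweep grid and tie-breaking here are assumptions, not the repository's exact procedure.

```python
import numpy as np

def calibrate_threshold(val_probs, val_labels, recall_target=0.90):
    """Keep the highest threshold whose validation recall meets the target."""
    best = 0.0
    for t in np.arange(0.01, 1.0, 0.01):
        preds = val_probs >= t
        recall = preds[val_labels == 1].mean()  # fraction of positives caught
        if recall >= recall_target:
            best = float(t)
    return best

# Synthetic validation scores: positives skew high, negatives skew low.
rng = np.random.default_rng(0)
labels = np.repeat([1, 0], 100)
probs = np.concatenate([rng.uniform(0.4, 1.0, 100), rng.uniform(0.0, 0.6, 100)])
t = calibrate_threshold(probs, labels)
```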

Full-Dataset Linear Probe (Latest Run)

  • Best C (grid): 1
  • Val AUC: 0.909260
  • Test AUC: 0.912441
  • Recall target: 0.90
  • Threshold (val-calibrated): 0.31
  • Metrics at threshold:
    • Accuracy 0.823580
    • Precision 0.775249
    • Recall 0.920608
    • F1 0.841699
    • Confusion Matrix [[412, 158], [47, 545]]
  • Best-F1 threshold (val) applied to test:
    • Threshold 0.44
    • Accuracy 0.840792
    • Precision 0.823529
    • Recall 0.875000
    • F1 0.848485
    • Confusion Matrix [[459, 111], [74, 518]]
  • Recall-target sweep best by F1 at target 0.90:
    • Threshold 0.31
    • Accuracy 0.812931
    • Precision 0.769452
    • Recall 0.903553
    • F1 0.831128
    • Confusion Matrix [[409, 160], [57, 534]]
  • ROC best threshold (Youden J): 0.594325 (TPR 0.795262, FPR 0.114236)
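The ROC-best thresholds quoted in these runs maximize Youden's J = TPR − FPR. A minimal version of that selection, demonstrated on synthetic well-separated scores (not the project's data):

```python
import numpy as np

def youden_threshold(probs, labels):
    """Return the score threshold maximizing Youden's J = TPR - FPR."""
    pos = labels == 1
    best_t, best_j = 0.5, -1.0
    for t in np.unique(probs):
        preds = probs >= t
        j = preds[pos].mean() - preds[~pos].mean()  # TPR - FPR
        if j > best_j:
            best_j, best_t = j, float(t)
    return best_t

# Well-separated synthetic scores: the best threshold splits the classes.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 50)
probs = np.concatenate([rng.uniform(0.0, 0.4, 50), rng.uniform(0.6, 1.0, 50)])
t_star = youden_threshold(probs, labels)
```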

Comparison Summary

  • Zero-shot baseline shows very high recall but poor specificity and lower overall accuracy and AUC.
  • Linear probe improves accuracy and AUC consistently once OpenCV fallback and cleaned datasets are used.
  • Full-dataset training provides the strongest overall performance and better calibrated operating points.

Final Decision

The deployed eye model uses the full-dataset linear probe with letterbox resize and recall-target calibration. The deployment threshold is 0.31, which balances high recall with improved precision and overall accuracy.

Limitations

  • For research and triage only; not for clinical diagnosis.
  • Performance depends on dataset distribution and capture conditions.
  • Conjunctiva imaging conditions may vary in real-world settings.

Contact

Model author: Sidharth (Hugging Face: Sidharth1743).
