embeddinggemma-mimic-infonce

A 300M-parameter sentence embedding model fine-tuned from google/embeddinggemma-300m on temporal note pairs from MIMIC-III using a pure InfoNCE temporal contrastive objective. This is the temporal-only baseline within the project; a hierarchical-loss extension is available at gaspard-loeillot/embeddinggemma-mimic-hierarchical.

This model was produced as a CS 4701 Practicum in AI project at Cornell University (Spring 2026). It is a research artifact; it is not approved for any clinical use.

TL;DR

| Metric | OpenAI text-embedding-3-small | OpenAI text-embedding-3-large | EmbeddingGemma-300m (vanilla) | InfoNCE fine-tuned (this model) | Hierarchical fine-tuned |
|---|---|---|---|---|---|
| Top-1 note recall | 0.31% | 0.31% | 0.35% | 0.84% | 1.20% |
| Top-5 note recall | 5.14% | 5.45% | 5.99% | 47.17% | 67.13% |
| Top-10 note recall | 6.44% | 6.99% | 7.57% | 66.68% | 84.44% |
| Diagnosis macro-AUROC (top-25 ICD-9) | 0.895 | 0.905 | 0.897 | 0.945 | 0.947 |
| Silhouette by ICD chapter (cosine, k=5000) | -0.054 | -0.045 | -0.053 | -0.057 | -0.066 |
| Silhouette by note category | +0.016 | +0.043 | -0.017 | -0.089 | -0.098 |
| Silhouette delta (cat - chap) | +0.070 | +0.089 | +0.036 | -0.032 | -0.032 |

The bottom three rows are the most informative: every baseline organizes its embedding space more strongly by note category (style) than by ICD chapter (clinical content). After contrastive fine-tuning, the sign of this delta flips, and the embeddings organize more strongly by clinical content than by stylistic structure. On note recall this model achieves 47.17% top-5, close to the 65% Radical Health baseline and roughly 8x the vanilla EmbeddingGemma starting point (5.99%), using only the temporal positive signal (no hierarchical labels). The hierarchical extension closes the remaining gap.
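
To make the silhouette rows concrete, here is a minimal sketch of how such numbers can be computed with scikit-learn. The variable names (texts, icd_chapters, note_categories) are hypothetical, the table's k=5000 is read here as a 5,000-note evaluation sample, and the project's actual evaluation code (shared across all five models) is described in the companion card:

import numpy as np
from sklearn.metrics import silhouette_score
from sentence_transformers import SentenceTransformer

# Assumed inputs: `texts`, `icd_chapters`, `note_categories` are parallel lists
# over a pool of MIMIC-III notes and their labels.
model = SentenceTransformer("gaspard-loeillot/embeddinggemma-mimic-infonce")

rng = np.random.default_rng(0)
idx = rng.choice(len(texts), size=min(5000, len(texts)), replace=False)
emb = model.encode([texts[i] for i in idx], normalize_embeddings=True, convert_to_numpy=True)

# Silhouette under cosine distance for each labeling scheme, then the table's delta row.
sil_chapter = silhouette_score(emb, [icd_chapters[i] for i in idx], metric="cosine")
sil_category = silhouette_score(emb, [note_categories[i] for i in idx], metric="cosine")
print(sil_category - sil_chapter)  # positive: clustered by style; negative: by clinical content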

Intended use

  • Patient-record search and retrieval over clinical notes.
  • Patient-similarity / cohort discovery.
  • Clinical-trajectory analysis.
  • A drop-in replacement for general-purpose embedding APIs in research RAG pipelines on EHR-like text.

Out-of-scope use

  • Any clinical decision support, diagnostic, or therapeutic application.
  • Identifying or re-identifying patients.
  • Use on data outside the MIMIC-III DUA without independent ethics approval.

Training details

Base model. google/embeddinggemma-300m (300M parameters; 768-dimensional output via the SentenceTransformer pooling + dense pipeline).

Corpus. MIMIC-III v1.4 NOTEEVENTS, restricted to the 10,000-patient Kaggle subset, then further sub-sampled to 500 patients (23,657 temporal note pairs) for training compute. The 500-patient subset reflects the team's realized GPU budget and is a known limitation of the released model.
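
The note pairs are consecutive notes from the same patient, ordered by chart time. Below is a minimal pandas sketch of how such pairs can be built from NOTEEVENTS; column names follow the MIMIC-III v1.4 schema, but the exact filtering, time field, and 500-patient sub-sampling used for this release may differ:

import pandas as pd

# NOTEEVENTS columns used here: SUBJECT_ID, CHARTTIME, ISERROR, TEXT.
notes = pd.read_csv("NOTEEVENTS.csv", usecols=["SUBJECT_ID", "CHARTTIME", "ISERROR", "TEXT"])
notes = notes[notes["ISERROR"].isna()]        # drop notes flagged as entered in error
notes = notes.dropna(subset=["CHARTTIME"])    # keep notes with a usable timestamp
notes["CHARTTIME"] = pd.to_datetime(notes["CHARTTIME"])
notes = notes.sort_values(["SUBJECT_ID", "CHARTTIME"])

pairs = []
for _, group in notes.groupby("SUBJECT_ID"):
    texts = group["TEXT"].tolist()
    # Anchor = the note at time t, positive = the same patient's next note at time t+1.
    pairs.extend(zip(texts[:-1], texts[1:]))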

Loss. Standard InfoNCE temporal contrastive loss with in-batch negatives. For a batch of B (anchor, positive) note pairs:

logits[i, j] = (anchor_i · positive_j) / temperature
loss         = cross_entropy(logits, labels=arange(B))

The anchor is a patient note at time t, the positive is the same patient's note at time t+1, and negatives are the other B-1 positives in the batch (notes from different patients). This is the same temporal contrastive setup used by Radical Health AI in their MIMIC-III work.
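
A minimal PyTorch sketch of this objective, assuming anchor and positive are batches of already-encoded embeddings of shape (B, d); this illustrates the loss itself, not the project's training code:

import torch
import torch.nn.functional as F

def infonce_loss(anchor: torch.Tensor, positive: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # Cosine similarity between every anchor and every positive in the batch.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.T / temperature  # shape (B, B)
    # Row i should score highest in column i; the other B-1 columns act as in-batch negatives.
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)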

Hyperparameters. A sketch of how these settings can be wired together with sentence-transformers follows the list.

  • temperature = 0.07
  • batch size 32, AdamW with lr = 2e-5, cosine LR schedule, gradient clipping at L2 norm 1.0
  • max sequence length 256 (CPU-fallback constraint; see Compute notes)
  • 5 training epochs
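
One possible wiring of these hyperparameters with the sentence-transformers fit API is sketched below. MultipleNegativesRankingLoss is the library's in-batch-negatives InfoNCE, with its scale parameter acting as the inverse temperature; the warmup length is an assumption (the card does not specify it), and the project's actual training script may differ:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("google/embeddinggemma-300m")
model.max_seq_length = 256  # CPU-fallback constraint described in the Compute notes

# `pairs` is a list of (anchor, positive) note strings, e.g. from the corpus sketch above.
train_examples = [InputExample(texts=[anchor, positive]) for anchor, positive in pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)

# In-batch-negatives InfoNCE; scale = 1 / temperature = 1 / 0.07.
loss = losses.MultipleNegativesRankingLoss(model, scale=1.0 / 0.07)

model.fit(
    train_objectives=[(train_loader, loss)],
    epochs=5,
    optimizer_params={"lr": 2e-5},  # AdamW is the default optimizer
    scheduler="warmupcosine",
    warmup_steps=100,               # assumption; not stated on the card
    max_grad_norm=1.0,
)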

Compute notes. Training was performed on Apple MPS / CPU and Google Colab T4. Apple MPS could not fit the full 512-token training graph for EmbeddingGemma-300m, so the team fell back to CPU at max_length=256. This constraint is reflected in the absolute recall numbers and is a clear target for further work.

Usage

from sentence_transformers import SentenceTransformer

m = SentenceTransformer("gaspard-loeillot/embeddinggemma-mimic-infonce")
embeddings = m.encode(["clinical note text..."], normalize_embeddings=True)
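
Because training used a 256-token cap, it may be worth matching that truncation at encode time if the saved configuration does not already enforce it:

m.max_seq_length = 256  # match the training-time truncation (see Compute notes)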

For retrieval at scale, use FAISS:

import faiss

# `corpus_notes` is a list of note strings and `query` is a query string; `m` is the model loaded above.
corpus_emb = m.encode(corpus_notes, normalize_embeddings=True, convert_to_numpy=True).astype("float32")

# Inner product over L2-normalized embeddings is cosine similarity.
index = faiss.IndexFlatIP(corpus_emb.shape[1])
index.add(corpus_emb)

query_emb = m.encode([query], normalize_embeddings=True, convert_to_numpy=True).astype("float32")
sims, idx = index.search(query_emb, k=10)

Evaluation methodology

See the hierarchical companion model card for shared methodology details. All five evaluated models share identical inputs, splits, and evaluation code.

Limitations

  1. Trained on 500 patients, not 10,000+. This is the realized compute budget, not the final design intent.
  2. 256-token context cap. Long notes are truncated.
  3. Within-patient retrieval gap to hierarchical variant. This model achieves 47% top-5 recall vs. 67% for the hierarchical extension. If retrieval is the primary downstream use, the hierarchical model is preferred.
  4. Demographic and institutional skew. MIMIC-III is a single ICU at a single tertiary care center over 2001-2012. Generalization outside this distribution is not validated.
  5. Not certified for any clinical use.

Citation

@misc{shvartsman_lin_loeillot_2026_embeddinggemma_mimic_infonce,
  author       = {Shvartsman, Benjamin and Lin, Timothy and Loeillot, Gaspard},
  title        = {EmbeddingGemma fine-tuned on MIMIC-III with InfoNCE temporal
                  contrastive learning},
  year         = {2026},
  howpublished = {Cornell CS 4701 Practicum in AI project},
  note         = {\url{https://huggingface.co/gaspard-loeillot/embeddinggemma-mimic-infonce}}
}

If you use the base model, please also cite EmbeddingGemma:

@article{embedding_gemma_2025,
  title={EmbeddingGemma: Powerful and Lightweight Text Representations},
  author={Schechter Vera, Henrique and others},
  publisher={Google DeepMind},
  year={2025},
  url={https://arxiv.org/abs/2509.20354}
}

And, if you fine-tune on similar data, the underlying clinical resource:

Johnson AEW, Pollard TJ, Shen L, Lehman LH, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG. MIMIC-III, a freely accessible critical care database. Scientific Data, 2016. doi:10.1038/sdata.2016.35

Acknowledgements

This project replicates and extends the contrastive fine-tuning approach described by Radical Health AI in "Training a model that understands your notes 7x better than OpenAI" (2025). All errors are our own.
