embeddinggemma-mimic-hierarchical
A 300M-parameter sentence embedding model that produces representations of clinical notes
that prioritize what kind of patient state a note describes over how the note is
written. Fine-tuned from google/embeddinggemma-300m
on temporal note pairs from MIMIC-III using a hierarchical contrastive objective with
ICD-9 chapter soft targets (HiMulCon-style).
This model was produced as a CS 4701 Practicum in AI project at Cornell University (Spring 2026). It is a research artifact; it is not approved for any clinical use.
TL;DR
| Metric | OpenAI text-embedding-3-small | OpenAI text-embedding-3-large | EmbeddingGemma-300m (vanilla) | InfoNCE fine-tuned | Hierarchical fine-tuned (this model) |
|---|---|---|---|---|---|
| Top-1 note recall | 0.31% | 0.31% | 0.35% | 0.84% | 1.20% |
| Top-5 note recall | 5.14% | 5.45% | 5.99% | 47.17% | 67.13% |
| Top-10 note recall | 6.44% | 6.99% | 7.57% | 66.68% | 84.44% |
| Diagnosis macro-AUROC (top-25 ICD-9) | 0.895 | 0.905 | 0.897 | 0.945 | 0.947 |
| Silhouette by ICD chapter (cosine, k=5000) | -0.054 | -0.045 | -0.053 | -0.057 | -0.066 |
| Silhouette by note category | +0.016 | +0.043 | -0.017 | -0.089 | -0.098 |
| Silhouette delta (cat - chap) | +0.070 | +0.089 | +0.036 | -0.032 | -0.032 |
The bottom three rows are the most informative: every baseline organizes its embedding space more strongly by note category (style) than by ICD chapter (clinical content). After contrastive fine-tuning, the sign of this delta flips: embeddings now organize themselves more strongly by clinical content than by stylistic structure.
Intended use
- Patient-record search and retrieval over clinical notes (FAISS / IndexFlatIP).
- Patient-similarity / cohort discovery from note-level or patient-aggregated embeddings (see the aggregation sketch after this list).
- Clinical-trajectory analysis (per-patient embedding sequences over time).
- A drop-in replacement for general-purpose embedding APIs in research RAG pipelines on EHR-like text.
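The patient-aggregation step mentioned above is not prescribed by the model; a minimal sketch, assuming simple mean-pooling of note embeddings per patient (the `notes` and `patient_ids` lists below are hypothetical placeholders), could look like:

```python
import numpy as np
from collections import defaultdict
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("gaspard-loeillot/embeddinggemma-mimic-hierarchical")

# Hypothetical parallel lists: one clinical note and one patient ID per entry.
notes = ["Sepsis 2/2 UTI, on broad spectrum abx.", "CHF exacerbation, diuresed overnight."]
patient_ids = ["P001", "P002"]

note_emb = model.encode(notes, normalize_embeddings=True, convert_to_numpy=True)

# Mean-pool each patient's note embeddings, then re-normalize so dot products
# between patient vectors are cosine similarities.
grouped = defaultdict(list)
for pid, emb in zip(patient_ids, note_emb):
    grouped[pid].append(emb)
patient_emb = {pid: np.mean(v, axis=0) for pid, v in grouped.items()}
patient_emb = {pid: v / np.linalg.norm(v) for pid, v in patient_emb.items()}
```

Mean-pooling is only one reasonable choice; time-weighted or last-k aggregation are equally plausible for trajectory-style analyses.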
Out-of-scope use
- Any clinical decision support, diagnostic, or therapeutic application.
- Identifying or re-identifying patients.
- Use on data outside the MIMIC-III DUA without independent ethics approval.
Training details
Base model. google/embeddinggemma-300m (300M parameters; 768-dimensional output via the
SentenceTransformer pooling + dense pipeline).
Corpus. MIMIC-III v1.4 NOTEEVENTS, restricted to the 10,000-patient Kaggle subset, then further sub-sampled to 500 patients (23,657 temporal note pairs) for training compute. The 500-patient subset reflects the team's realized GPU budget and is a known limitation of the released model.
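The exact pairing rule is not spelled out in this card. Purely as an illustration (an assumption, not the team's recipe), one plausible construction pairs each note with the same patient's next note ordered by CHARTTIME, using the MIMIC-III NOTEEVENTS schema:

```python
import pandas as pd

# Illustrative only: pair each note with the same patient's next note in time.
# This is an assumed construction, not the released model's exact pipeline.
notes = pd.read_csv("NOTEEVENTS.csv", usecols=["SUBJECT_ID", "CHARTTIME", "TEXT"])
notes = notes.sort_values(["SUBJECT_ID", "CHARTTIME"])

pairs = []
for _, group in notes.groupby("SUBJECT_ID"):
    texts = group["TEXT"].tolist()
    pairs.extend(zip(texts[:-1], texts[1:]))  # (anchor, temporal positive)
```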
Loss. Hierarchical contrastive loss with soft targets derived from ICD-9 chapter
overlap. For a batch of B (anchor, positive) note pairs and chapter-level structured
labels C_1, ..., C_B:
- diagonal targets are 1.0 (true temporal positive),
- off-diagonal targets are T[i, j] = `chapter_weight` · |C_i ∩ C_j| / |C_i|, with each row then normalized to a distribution,
- the loss is the cross-entropy of softmax(anchor_i · positive_j^T / `temperature`) against T.
This is a HiMulCon-style hierarchy-aware extension of InfoNCE; see Zhang et al. ("Use All The Labels: A Hierarchical Multi-Label Contrastive Learning Framework", arXiv:2204.13207) for the conceptual reference.
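A minimal PyTorch sketch of this objective, written from the description above (the function name, tensor layout, and the Python-set representation of chapter labels are assumptions, not the team's training code):

```python
import torch
import torch.nn.functional as F

def hierarchical_contrastive_loss(anchor_emb, positive_emb, chapter_sets,
                                  chapter_weight=0.3, temperature=0.07):
    """Soft-target InfoNCE with ICD-9 chapter overlap (HiMulCon-style sketch).

    anchor_emb, positive_emb: (B, D) L2-normalized embeddings.
    chapter_sets: list of B sets of ICD-9 chapter labels, one per pair.
    """
    B = anchor_emb.size(0)
    logits = anchor_emb @ positive_emb.T / temperature  # (B, B) scaled cosine sims

    # Soft targets: 1.0 on the diagonal, chapter_weight * |C_i ∩ C_j| / |C_i| elsewhere.
    targets = torch.zeros(B, B, device=anchor_emb.device)
    for i in range(B):
        for j in range(B):
            if i == j:
                targets[i, j] = 1.0
            elif chapter_sets[i]:
                overlap = len(chapter_sets[i] & chapter_sets[j])
                targets[i, j] = chapter_weight * overlap / len(chapter_sets[i])
    targets = targets / targets.sum(dim=1, keepdim=True)  # row-normalize to a distribution

    # Cross-entropy of the row-wise softmax against the soft targets.
    log_probs = F.log_softmax(logits, dim=1)
    return -(targets * log_probs).sum(dim=1).mean()
```

Setting `chapter_weight = 0` collapses the targets to the identity and recovers plain InfoNCE, which is the objective of the companion model listed below.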
Hyperparameters.
- `chapter_weight = 0.3`
- `temperature = 0.07`
- batch size 32
- AdamW with `lr = 2e-5`, cosine LR schedule, gradient clipping at L2 norm 1.0
- max sequence length 256 (CPU-fallback constraint; see Compute notes)
- 10 training epochs
Compute notes. Training was performed on Apple MPS / CPU and Google Colab T4. Apple
MPS could not fit the full 512-token training graph for EmbeddingGemma-300m; the team
fell back to CPU at max_length=256. These constraints are reflected in the absolute recall
numbers and should be the first target of any further work.
Usage
```python
from sentence_transformers import SentenceTransformer

m = SentenceTransformer("gaspard-loeillot/embeddinggemma-mimic-hierarchical")

note_a = "Patient with sepsis 2/2 UTI, started on broad spectrum abx, lactate trending down."
note_b = "Septic shock secondary to urinary source, vasopressors weaned, lactate now 1.2."

embeddings = m.encode([note_a, note_b], normalize_embeddings=True)
similarity = embeddings @ embeddings.T
print(similarity[0, 1])  # cosine similarity
```
For retrieval at scale, use FAISS:
```python
import faiss, numpy as np

# corpus_notes is a list of note strings; query is a single query string.
corpus_emb = m.encode(corpus_notes, normalize_embeddings=True, convert_to_numpy=True).astype("float32")
index = faiss.IndexFlatIP(corpus_emb.shape[1])  # inner product == cosine on normalized vectors
index.add(corpus_emb)

query_emb = m.encode([query], normalize_embeddings=True, convert_to_numpy=True).astype("float32")
sims, idx = index.search(query_emb, k=10)
```
Evaluation methodology
All metrics are computed on the same 23,657 anchor/positive embedding pairs produced from the 500-patient MIMIC-III subset. All five evaluated models share identical inputs and identical evaluation code (no per-model split tuning).
- Note recall: for each anchor at row i, rank all positive embeddings by cosine similarity to the anchor; report the fraction of anchors whose true positive (also at row i) appears among the top-k most similar positives (see the sketch after this list).
- Diagnosis macro-AUROC: train a OneVsRest logistic regression (sklearn defaults, `C=1.0`, `max_iter=1000`) on frozen anchor embeddings to predict the top-25 most frequent ICD-9 diagnosis codes; report the macro-averaged AUROC on a held-out 20% test split (random, `random_state=42`).
- Silhouette by chapter / category: silhouette score on cosine distance over a fixed 5000-row sample of anchor embeddings, using either (a) the primary ICD-9 chapter of the anchor's admission, or (b) the `NOTEEVENTS.CATEGORY` field of the anchor note, as the partition labels.
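A minimal NumPy sketch of the note-recall metric as described above (the function name is an assumption; both inputs are the L2-normalized anchor and positive embedding matrices with matching row order):

```python
import numpy as np

def top_k_note_recall(anchor_emb, positive_emb, k=10):
    """Fraction of anchors whose true positive (same row index) is among
    the k positives most cosine-similar to that anchor."""
    sims = anchor_emb @ positive_emb.T                 # (N, N) cosine similarities
    top_k = np.argsort(-sims, axis=1)[:, :k]           # indices of the k nearest positives
    hits = (top_k == np.arange(len(sims))[:, None]).any(axis=1)
    return float(hits.mean())
```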
Limitations
- Trained on 500 patients, not 10,000+. This is the realized compute budget, not the final design intent. Absolute recall numbers should be expected to improve with more training data.
- 256-token context cap. Long notes are truncated, which loses information for discharge summaries in particular.
- Negative absolute silhouettes. Even after fine-tuning, the embedding space does not produce clean ICD-chapter-aligned clusters in 2D UMAP; chapter structure is encoded but at finer scale than UMAP can resolve. The sign-flipped silhouette delta is the headline finding (the model is more content-aligned than style-aligned), but absolute geometric separability remains modest.
- Demographic and institutional skew. MIMIC-III covers ICU admissions at a single tertiary care center (Beth Israel Deaconess Medical Center) over 2001-2012. Generalization outside this distribution is not validated.
- Not certified for any clinical use. This is a research artifact.
Companion model
A temporal-only variant (no hierarchy) is available at
gaspard-loeillot/embeddinggemma-mimic-infonce.
Comparing the two isolates the effect of the hierarchical soft targets versus pure
InfoNCE.
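A quick way to run that comparison (the note texts below are made up; only the two model IDs come from this card):

```python
from sentence_transformers import SentenceTransformer

# Score the same (made-up) note pair under both fine-tuned variants.
variants = {
    "hierarchical": "gaspard-loeillot/embeddinggemma-mimic-hierarchical",
    "infonce": "gaspard-loeillot/embeddinggemma-mimic-infonce",
}
notes = [
    "Acute on chronic systolic heart failure exacerbation, diuresed with IV furosemide.",
    "CHF exacerbation improving after diuresis, net negative two liters overnight.",
]
for name, model_id in variants.items():
    emb = SentenceTransformer(model_id).encode(notes, normalize_embeddings=True)
    print(name, float(emb[0] @ emb[1]))  # cosine similarity under each variant
```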
Citation
```bibtex
@misc{shvartsman_lin_loeillot_2026_embeddinggemma_mimic_hierarchical,
  author       = {Shvartsman, Benjamin and Lin, Timothy and Loeillot, Gaspard},
  title        = {EmbeddingGemma fine-tuned on MIMIC-III with hierarchical contrastive learning},
  year         = {2026},
  howpublished = {Cornell CS 4701 Practicum in AI project},
  note         = {\url{https://huggingface.co/gaspard-loeillot/embeddinggemma-mimic-hierarchical}}
}
```
If you use the base model, please also cite EmbeddingGemma:
```bibtex
@article{embedding_gemma_2025,
  title     = {EmbeddingGemma: Powerful and Lightweight Text Representations},
  author    = {Schechter Vera, Henrique and others},
  publisher = {Google DeepMind},
  year      = {2025},
  url       = {https://arxiv.org/abs/2509.20354}
}
```
And, if you fine-tune on similar data, the underlying clinical resource:
Johnson AEW, Pollard TJ, Shen L, Lehman LH, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG. MIMIC-III, a freely accessible critical care database. Scientific Data, 2016. doi:10.1038/sdata.2016.35
Acknowledgements
This project replicates and extends the contrastive fine-tuning approach described by Radical Health AI in "Training a model that understands your notes 7x better than OpenAI" (2025). All errors are our own.