embeddinggemma-mimic-hierarchical

A 300M-parameter sentence embedding model that produces representations of clinical notes that prioritize what kind of patient state a note describes over how the note is written. Fine-tuned from google/embeddinggemma-300m on temporal note pairs from MIMIC-III using a hierarchical contrastive objective with ICD-9 chapter soft targets (HiMulCon-style).

This model was produced as a CS 4701 Practicum in AI project at Cornell University (Spring 2026). It is a research artifact; it is not approved for any clinical use.

TL;DR

| Metric | OpenAI text-embedding-3-small | OpenAI text-embedding-3-large | EmbeddingGemma-300m (vanilla) | InfoNCE fine-tuned | Hierarchical fine-tuned (this model) |
|---|---|---|---|---|---|
| Top-1 note recall | 0.31% | 0.31% | 0.35% | 0.84% | 1.20% |
| Top-5 note recall | 5.14% | 5.45% | 5.99% | 47.17% | 67.13% |
| Top-10 note recall | 6.44% | 6.99% | 7.57% | 66.68% | 84.44% |
| Diagnosis macro-AUROC (top-25 ICD-9) | 0.895 | 0.905 | 0.897 | 0.945 | 0.947 |
| Silhouette by ICD chapter (cosine, 5,000-note sample) | -0.054 | -0.045 | -0.053 | -0.057 | -0.066 |
| Silhouette by note category (same sample) | +0.016 | +0.043 | -0.017 | -0.089 | -0.098 |
| Silhouette delta (category - chapter) | +0.070 | +0.089 | +0.036 | -0.032 | -0.032 |

The bottom three rows are the most informative: every baseline organizes its embedding space more strongly by note category (style) than by ICD chapter (clinical content). After contrastive fine-tuning, the sign of this delta flips: embeddings now organize themselves more strongly by clinical content than by stylistic structure.

Intended use

  • Patient-record search and retrieval over clinical notes (FAISS / IndexFlatIP).
  • Patient-similarity / cohort discovery from note-level or patient-aggregated embeddings (a simple aggregation sketch follows this list).
  • Clinical-trajectory analysis (per-patient embedding sequences over time).
  • A drop-in replacement for general-purpose embedding APIs in research RAG pipelines on EHR-like text.
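
As one illustration of the cohort-discovery bullet above, a simple aggregation (a sketch only, not a prescribed recipe; function and variable names are ours) is to mean-pool each patient's note embeddings and re-normalize:

import numpy as np
from collections import defaultdict

def aggregate_by_patient(note_embs, patient_ids):
    # note_embs: (N, D) normalized note embeddings; patient_ids: N labels.
    buckets = defaultdict(list)
    for emb, pid in zip(note_embs, patient_ids):
        buckets[pid].append(emb)
    patient_emb = {}
    for pid, vecs in buckets.items():
        mean = np.mean(vecs, axis=0)
        patient_emb[pid] = mean / np.linalg.norm(mean)  # keep cosine comparisons meaningful
    return patient_emb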

Out-of-scope use

  • Any clinical decision support, diagnostic, or therapeutic application.
  • Identifying or re-identifying patients.
  • Use on data outside the MIMIC-III DUA without independent ethics approval.

Training details

Base model. google/embeddinggemma-300m (300M parameters; 768-dimensional output via the SentenceTransformer pooling + dense pipeline).

Corpus. MIMIC-III v1.4 NOTEEVENTS, restricted to the 10,000-patient Kaggle subset, then further sub-sampled to 500 patients (23,657 temporal note pairs) for training compute. The 500-patient subset reflects the team's realized GPU budget and is a known limitation of the released model.
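
One plausible pair construction (a hypothetical sketch only; the exact rule used for training may differ) pairs each note with the same patient's next note in chart order. SUBJECT_ID, CHARTDATE, and TEXT are real NOTEEVENTS columns; the function name is ours:

import pandas as pd

def temporal_pairs(noteevents: pd.DataFrame):
    # Hypothetical: pair each note with the same patient's next note by time.
    df = noteevents.sort_values(["SUBJECT_ID", "CHARTDATE"])
    pairs = []
    for _, grp in df.groupby("SUBJECT_ID"):
        texts = grp["TEXT"].tolist()
        pairs.extend(zip(texts[:-1], texts[1:]))  # (anchor, positive) pairs
    return pairs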

Loss. Hierarchical contrastive loss with soft targets derived from ICD-9 chapter overlap. For a batch of B (anchor, positive) note pairs and chapter-level structured labels C_1, ..., C_B:

  • diagonal targets are 1.0 (true temporal positive),
  • off-diagonal target T[i,j] is chapter_weight * |C_i ∩ C_j| / |C_i|, then row-normalized to a distribution,
  • the loss is cross-entropy of softmax(anchor_i · positive_j^T / temperature) against T.

This is a HiMulCon-style hierarchy-aware extension of InfoNCE; see Zhang et al. ("Use All The Labels: A Hierarchical Multi-Label Contrastive Learning Framework", arXiv:2204.13207) for the conceptual reference.
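
For concreteness, here is a minimal PyTorch sketch of the target construction and loss described above (illustrative naming; the actual training code may differ in details such as the handling of empty chapter sets):

import torch
import torch.nn.functional as F

def hierarchical_contrastive_loss(anchor, positive, chapter_sets,
                                  chapter_weight=0.3, temperature=0.07):
    # anchor, positive: (B, D) L2-normalized embeddings; row i is a true pair.
    # chapter_sets: list of B sets of ICD-9 chapter labels.
    B = anchor.size(0)
    T = torch.zeros(B, B, device=anchor.device)
    for i in range(B):
        for j in range(B):
            if i == j:
                T[i, j] = 1.0  # diagonal: true temporal positive
            elif chapter_sets[i]:
                overlap = len(chapter_sets[i] & chapter_sets[j])
                T[i, j] = chapter_weight * overlap / len(chapter_sets[i])
    T = T / T.sum(dim=1, keepdim=True)  # row-normalize to a distribution
    logits = anchor @ positive.T / temperature
    # Cross-entropy of softmax(logits) against the soft targets T.
    return -(T * F.log_softmax(logits, dim=1)).sum(dim=1).mean()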

Hyperparameters.

  • chapter_weight = 0.3
  • temperature = 0.07
  • batch size 32, AdamW with lr = 2e-5, cosine LR schedule, gradient clipping at L2 norm 1.0
  • max sequence length 256 (CPU-fallback constraint; see Compute notes)
  • 10 training epochs

Compute notes. Training was performed on Apple MPS / CPU and Google Colab T4. Apple MPS could not fit the full 512-token training graph for EmbeddingGemma-300m, so the team fell back to CPU at max_length=256. This constraint is reflected in the absolute recall numbers and should be addressed in any further work.

Usage

from sentence_transformers import SentenceTransformer

m = SentenceTransformer("gaspard-loeillot/embeddinggemma-mimic-hierarchical")
note_a = "Patient with sepsis 2/2 UTI, started on broad spectrum abx, lactate trending down."
note_b = "Septic shock secondary to urinary source, vasopressors weaned, lactate now 1.2."
embeddings = m.encode([note_a, note_b], normalize_embeddings=True)
similarity = embeddings @ embeddings.T
print(similarity[0, 1])  # cosine similarity

For retrieval at scale, use FAISS:

import faiss

# corpus_notes: list[str] of clinical notes to index.
corpus_emb = m.encode(corpus_notes, normalize_embeddings=True, convert_to_numpy=True).astype("float32")
index = faiss.IndexFlatIP(corpus_emb.shape[1])  # inner product = cosine on normalized vectors
index.add(corpus_emb)

query = "septic shock from urinary source"  # any free-text query
query_emb = m.encode([query], normalize_embeddings=True, convert_to_numpy=True).astype("float32")
sims, idx = index.search(query_emb, k=10)  # scores and row indices of the 10 nearest notes

Evaluation methodology

All metrics are computed on the same 23,657 anchor/positive embedding pairs produced from the 500-patient MIMIC-III subset. All five evaluated models share identical inputs and identical evaluation code (no per-model split tuning).

  • Note recall: for each anchor at row i, rank all positive embeddings by cosine similarity to the anchor; report the fraction of anchors whose true positive (also at row i) appears in the top-k most similar positives. A sketch of this computation follows the list.
  • Diagnosis macro-AUROC: train OneVsRest logistic regression (sklearn defaults, C=1.0, max_iter=1000) on frozen anchor embeddings to predict the top-25 most frequent ICD-9 diagnosis codes; report the macro-averaged AUROC on a held-out 20% test split (random, random_state=42).
  • Silhouette by chapter / category: silhouette score on cosine distance over a fixed 5000-row sample of anchor embeddings, using either (a) the primary ICD-9 chapter of the anchor's admission, or (b) the NOTEEVENTS.CATEGORY field of the anchor note, as the partition labels.
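
For reference, the recall and silhouette computations can be sketched as below; emb_sample, chapter_labels, and category_labels are illustrative names for the fixed 5,000-row sample and its two label columns:

import numpy as np
from sklearn.metrics import silhouette_score

def top_k_recall(anchors, positives, k=10):
    # anchors, positives: (N, D) L2-normalized arrays; row i is a true pair.
    sims = anchors @ positives.T                         # (N, N) cosine matrix
    diag = sims[np.arange(len(sims)), np.arange(len(sims))]
    ranks = (sims > diag[:, None]).sum(axis=1)           # strictly-more-similar count
    return float((ranks < k).mean())

# Silhouette by partition label on the fixed anchor-embedding sample.
sil_chapter = silhouette_score(emb_sample, chapter_labels, metric="cosine")
sil_category = silhouette_score(emb_sample, category_labels, metric="cosine")
delta = sil_category - sil_chapter  # negative = more content- than style-organized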

Limitations

  1. Trained on 500 patients, not 10,000+. This is the realized compute budget, not the final design intent. Absolute recall numbers should be expected to improve with more training data.
  2. 256-token context cap. Long notes are truncated, which loses information for discharge summaries in particular.
  3. Negative absolute silhouettes. Even after fine-tuning, the embedding space does not form cleanly separated ICD-chapter-aligned clusters, and 2D UMAP projections do not show them either; chapter structure appears to be encoded at a finer scale than these global measures resolve. The sign-flipped silhouette delta is the headline finding (the model is more content-aligned than style-aligned), but absolute geometric separability remains modest.
  4. Demographic and institutional skew. MIMIC-III is a single ICU at a single tertiary care center over 2001-2012. Generalization outside this distribution is not validated.
  5. Not certified for any clinical use. This is a research artifact.

Companion model

A temporal-only variant (no hierarchy) is available at gaspard-loeillot/embeddinggemma-mimic-infonce. Comparing the two isolates the effect of the hierarchical soft targets versus pure InfoNCE.

Citation

@misc{shvartsman_lin_loeillot_2026_embeddinggemma_mimic_hierarchical,
  author       = {Shvartsman, Benjamin and Lin, Timothy and Loeillot, Gaspard},
  title        = {EmbeddingGemma fine-tuned on MIMIC-III with hierarchical
                  contrastive learning},
  year         = {2026},
  howpublished = {Cornell CS 4701 Practicum in AI project},
  note         = {\url{https://huggingface.co/gaspard-loeillot/embeddinggemma-mimic-hierarchical}}
}

If you use the base model, please also cite EmbeddingGemma:

@article{embedding_gemma_2025,
  author       = {Schechter Vera, Henrique and others},
  title        = {EmbeddingGemma: Powerful and Lightweight Text Representations},
  journal      = {arXiv preprint arXiv:2509.20354},
  year         = {2025},
  url          = {https://arxiv.org/abs/2509.20354}
}

And, if you fine-tune on similar data, the underlying clinical resource:

Johnson AEW, Pollard TJ, Shen L, Lehman LH, Feng M, Ghassemi M, Moody B, Szolovits P, Celi LA, Mark RG. MIMIC-III, a freely accessible critical care database. Scientific Data, 2016. doi:10.1038/sdata.2016.35

Acknowledgements

This project replicates and extends the contrastive fine-tuning approach described by Radical Health AI in "Training a model that understands your notes 7x better than OpenAI" (2025). All errors are our own.
