hadamelino committed (verified)
Commit 92f14ba · Parent: 7e0f79c

Initial release: CGM-JEPA + X-CGM-JEPA + retrained baselines

README.md CHANGED
@@ -1,3 +1,224 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ language:
+ - en
+ library_name: pytorch
+ pipeline_tag: feature-extraction
+ tags:
+ - cgm
+ - continuous-glucose-monitor
+ - self-supervised-learning
+ - jepa
+ - time-series
+ - masked-prediction
+ - biosignal
+ - healthcare
+ - pretrained-encoder
+ ---
+
+ # CGM-JEPA Pretrained Encoders
+
+ Frozen self-supervised encoder weights from the paper *CGM-JEPA: Learning Consistent Continuous Glucose Monitor Representations via Predictive Self-Supervised Pretraining*. The repo contains the **exact checkpoints used to produce Tables 1–8 of the paper** for both the paper's main contributions (CGM-JEPA, X-CGM-JEPA) and the two re-pretrained baselines (GluFormer, TS2Vec).
+
+ > Companion repos: pretraining dataset [`CRUISEResearchGroup/CGM-JEPA-Pretraining`](https://huggingface.co/datasets/CRUISEResearchGroup/CGM-JEPA-Pretraining), labeled splits [`CRUISEResearchGroup/CGM-JEPA-Downstream`](https://huggingface.co/datasets/CRUISEResearchGroup/CGM-JEPA-Downstream), code [github.com/cruiseresearchgroup/CGM-JEPA](https://github.com/cruiseresearchgroup/CGM-JEPA).
+
+ > **MOMENT and Mantis are not redistributed here.** Those baselines are loaded directly from their upstream HF repos (`AutonLab/MOMENT-1-{small,large}`, `paris-noah/Mantis-8M`) by the eval pipeline.
+
+ ## Quick start
+
+ ```bash
+ huggingface-cli download CRUISEResearchGroup/CGM-JEPA --local-dir Output
+ ```
+
+ Then from the [code repository](https://github.com/cruiseresearchgroup/CGM-JEPA):
+
+ ```bash
+ # Reproduce paper Tables 1–6
+ python scripts/run_all_eval.py
+ ```
+
+ The downstream eval will load all four checkpoints automatically from the subdirectories below.
+
+ ## Layout
+
+ ```
+ .
+ ├── cgm_jepa/
+ │   └── cgm_jepa.pt        # 3.8 MB — paper main contribution
+ ├── x_cgm_jepa/
+ │   └── x_cgm_jepa.pt      # 3.8 MB — paper main contribution
+ └── baselines/
+     ├── gluformer.pt       # 1.1 MB — baseline, re-pretrained on the open CGM corpus
+     └── ts2vec.pkl         # 1.2 MB — baseline, re-pretrained on the open CGM corpus
+ ```
+
+ Each `.pt` file contains `{"encoder": state_dict}`. `ts2vec.pkl` is a full pickled `TS2Vec` model object (per the upstream library's convention), not a state dict. Architecture hyperparameters needed to instantiate each model are listed in the [Architectures](#architectures) section below.
+
+ ### Important note on the baselines
+
+ `gluformer.pt` and `ts2vec.pkl` are **not** vendored from upstream releases of those methods. They were **re-pretrained on the same open CGM corpus and with the same compute budget as CGM-JEPA / X-CGM-JEPA** (Stanford + Colas, 101 epochs, batch 128, lr 1e-4, seed 43), so the paper's comparison isolates the pretraining objective rather than mixing in corpus or compute differences. Use these checkpoints when reproducing paper numbers; for other settings, prefer the original authors' releases.
+
+ ## Architectures
+
+ ### `cgm_jepa/cgm_jepa.pt` and `x_cgm_jepa/x_cgm_jepa.pt`
+
+ Both use the same `models.encoder.Encoder` class with identical hyperparameters; only the pretraining objective differs. At downstream / inference time only the temporal encoder is used, so the two checkpoints are drop-in interchangeable.
+
+ | Field | Value |
+ |---|---|
+ | `patch_size` | 12 |
+ | `encoder_kernel_size` | 3 |
+ | `encoder_embed_dim` | 96 |
+ | `encoder_embed_bias` | `True` |
+ | `encoder_nhead` | 6 |
+ | `encoder_num_layers` | 3 |
+ | `encoder_dropout` | 0.0 |
+
+ Input: a tensor of shape `(B, num_patches, patch_size)` (raw glucose values, z-scored).
+ Output: per-patch embedding of shape `(B, num_patches, embed_dim)`. Pool with `.mean(dim=1)` for a single embedding per sample.
+
+ X-CGM-JEPA adds a second pretraining branch that predicts Glucodensity image patches; only the temporal encoder is loaded at inference.
+
+ ### `baselines/gluformer.pt`
+
+ `models.gluformer.GluFormer`:
+
+ | Field | Value |
+ |---|---|
+ | `vocab_size` | **278** (older checkpoint — pre-dates the current 280-bin default in the code) |
+ | `embed_dim` | 96 |
+ | `nhead` | 6 |
+ | `num_layers` | 3 |
+ | `dim_feedforward` | 192 |
+ | `max_seq_length` | 25000 |
+ | `dropout` | 0.0 |
+ | `pad_token` | 278 (= `vocab_size`) |
+
+ Input: a tensor of integer bin indices in `[0, vocab_size)` (raw glucose discretized into the 40–320 mg/dL range with width `(320 − 40) / vocab_size`). The downstream pipeline detaches GluFormer's output head and uses only the encoder embedding.
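+
+ As an illustration of that binning (a hypothetical helper for clarity, not code from the repo):
+
+ ```python
+ import torch
+
+ def glucose_to_bins(glucose_mgdl: torch.Tensor, vocab_size: int = 278) -> torch.Tensor:
+     """Map raw glucose (mg/dL) to integer bin indices in [0, vocab_size)."""
+     bin_width = (320.0 - 40.0) / vocab_size           # uniform bins over 40-320 mg/dL
+     idx = ((glucose_mgdl - 40.0) / bin_width).long()  # floor to a bin index
+     return idx.clamp(0, vocab_size - 1)               # clip out-of-range readings
+ ```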
+
+ ### `baselines/ts2vec.pkl`
+
+ `models.ts2vec.TS2Vec` (loaded via `eval/baseline_utils/ts2vec_utils.py:load_pretrained_ts2vec`):
+
+ | Field | Value |
+ |---|---|
+ | `input_dims` | 1 |
+ | `output_dims` | 96 |
+ | `hidden_dims` | 64 |
+ | `depth` | 10 |
+
+ Saved as a Python pickle of the full model object, matching the upstream `ts2vec` library convention.
+
+ ## Loading examples
+
+ ### From the CGM-JEPA code repository (recommended)
+
+ `config/model_configs.py` already looks for these checkpoints under `Output/cgm_jepa/`, `Output/x_cgm_jepa/`, and `Output/baselines/`. Place the downloaded files there (the `huggingface-cli download` command above does this automatically) and the eval pipeline picks them up.
+
+ ### Standalone PyTorch — CGM-JEPA / X-CGM-JEPA
+
+ ```python
+ import torch
+ from models.encoder import Encoder
+
+ encoder = Encoder(
+     dim_in=12,          # patch_size
+     kernel_size=3,      # encoder_kernel_size
+     embed_dim=96,       # encoder_embed_dim
+     embed_bias=True,    # encoder_embed_bias
+     nhead=6,            # encoder_nhead
+     num_layers=3,       # encoder_num_layers
+     jepa=False,         # disable JEPA-specific heads for inference
+ )
+ encoder.load_state_dict(
+     torch.load("Output/cgm_jepa/cgm_jepa.pt", map_location="cpu")["encoder"],
+     strict=False,
+ )
+ encoder.eval()
+ # For X-CGM-JEPA, swap the path to "Output/x_cgm_jepa/x_cgm_jepa.pt".
+ ```
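+
+ A forward pass then follows the shapes in the Architectures section (a minimal sketch; that `encoder(x)` takes a bare tensor is an assumption about the `Encoder` interface):
+
+ ```python
+ # One batch of 24-hour windows: 288 timesteps = 24 patches of 12 steps each.
+ x = torch.randn(8, 24, 12)       # (B, num_patches, patch_size), z-scored glucose
+ with torch.no_grad():
+     patch_emb = encoder(x)       # (8, 24, 96) per-patch embeddings
+ emb = patch_emb.mean(dim=1)      # (8, 96): one pooled embedding per sample
+ ```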
+
+ ### Standalone PyTorch — GluFormer
+
+ ```python
+ import torch
+ import torch.nn as nn
+ from models.gluformer.gluformer import GluFormer
+
+ vocab_size = 278
+ gluformer = GluFormer(
+     vocab_size=vocab_size,
+     embed_dim=96,
+     nhead=6,
+     num_layers=3,
+     dim_feedforward=192,
+     max_seq_length=25000,
+     dropout=0.0,
+     pad_token=vocab_size,
+ )
+ gluformer.load_state_dict(
+     torch.load("Output/baselines/gluformer.pt", map_location="cpu")["encoder"]
+ )
+ gluformer.output_head = nn.Identity()  # discard the LM head for embedding extraction
+ gluformer.eval()
+ ```
+
+ ### Standalone PyTorch — TS2Vec
+
+ ```python
+ from eval.baseline_utils.ts2vec_utils import load_pretrained_ts2vec
+
+ ts2vec = load_pretrained_ts2vec(
+     checkpoint_path="Output/baselines/ts2vec.pkl",
+     device="cpu",
+     input_dims=1,
+     output_dims=96,
+     hidden_dims=64,
+     depth=10,
+ )
+ ```
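+
+ Assuming the returned object exposes the upstream `ts2vec` library's `encode` method, one embedding per window can then be extracted like so (a sketch, not repo code):
+
+ ```python
+ import numpy as np
+
+ windows = np.random.randn(16, 288, 1).astype(np.float32)     # (n, timesteps, input_dims)
+ # encoding_window="full_series" pools each window to a single vector (upstream API).
+ emb = ts2vec.encode(windows, encoding_window="full_series")  # (16, 96)
+ ```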
+
+ ## Pretraining
+
+ All four encoders were pretrained on the [CGM-JEPA pretraining corpus](https://huggingface.co/datasets/CRUISEResearchGroup/CGM-JEPA-Pretraining) under identical conditions:
+
+ | Setting | Value |
+ |---|---|
+ | Corpus | 228 subjects (22 Stanford + 206 Colas), 389,365 readings at 5-min sampling |
+ | Window length | 288 timesteps (24 hours) |
+ | Masking ratio | 0.25 |
+ | Epochs | 101 |
+ | Batch size | 128 |
+ | Learning rate | 1e-4 |
+ | Random seed | 43 |
+
+ See [`config/config_pretrain.py`](https://github.com/cruiseresearchgroup/CGM-JEPA/blob/main/config/config_pretrain.py) for the full configuration.
+
+ ## Intended use
+
+ - **Frozen feature extraction** from raw CGM windows (24-hour, 5-min sampled, 288 timesteps).
+ - **Linear-probe or shallow-classifier downstream evaluation**, especially the IR / β-cell dysfunction tasks in the paper (a probe sketch follows this list).
+ - **Comparison baseline** for new CGM representation methods, with identical pretraining conditions across all four encoders shipped here.
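+
+ For example, a linear probe on frozen pooled embeddings could look like this (hypothetical scikit-learn sketch; `encoder` is the loaded model from the example above, and the data here are placeholders):
+
+ ```python
+ import numpy as np
+ import torch
+ from sklearn.linear_model import LogisticRegression
+
+ X = torch.randn(256, 24, 12)                # placeholder patched CGM windows
+ y = np.random.randint(0, 2, size=256)       # placeholder binary task labels
+
+ with torch.no_grad():
+     feats = encoder(X).mean(dim=1).numpy()  # frozen (256, 96) features
+
+ probe = LogisticRegression(max_iter=1000).fit(feats, y)
+ print(probe.score(feats, y))                # probe accuracy on the training split
+ ```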
+
+ ## Limitations
+
+ - **Window size**: encoders were trained on 288-timestep (24-hour) windows. Behavior at substantially longer or shorter sequence lengths has not been validated.
+ - **Sensor distribution shift**: the underlying pretraining corpus is from Dexcom-class sensors. Behavior on other devices is unverified.
+ - **Not clinical**: outputs are continuous embeddings, not clinical decisions. Use only as a feature extractor in research contexts.
+ - **Baseline scope**: the GluFormer and TS2Vec checkpoints reproduce each method's published architecture but were re-pretrained on our corpus; they are not vendored from the original authors' releases. For non-paper use cases, prefer the upstream checkpoints.
+
+ ## License & attribution
+
+ Released under the **MIT license**. When using these weights, please cite:
+
+ 1. Our paper (citation TBD; see code repo).
+ 2. The two upstream pretraining datasets — Metwally et al. 2025 (*Nature Biomedical Engineering*) and Colas et al. 2019 (*PLOS ONE*).
+ 3. The original baseline papers when using `gluformer.pt` or `ts2vec.pkl`.
+
+ ## Citation
+
+ > _Citation block to be filled once the CGM-JEPA paper has a stable venue / arXiv link._
+
+ ## Code repository
+
+ [github.com/cruiseresearchgroup/CGM-JEPA](https://github.com/cruiseresearchgroup/CGM-JEPA)
baselines/gluformer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2aaae7c1c7e538bb8fa07b56766166e66da25d27091a8f54d4b1c81d64ade12d
+ size 1126455
baselines/ts2vec.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:44aac30803cfd98f978bb2c298a296a3f8bc8a2e413e950074c3f2f121ab2c38
+ size 1215118
cgm_jepa/cgm_jepa.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:942570ac4a855b7ffe77d1f37f0f8806b01da712ab13b4f5aa25bfbbd0e93f22
+ size 4026488
x_cgm_jepa/x_cgm_jepa.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4c412930a736e8ccf71b20a15904da68b604e88b29e54798c93c4c1d1334f08c
+ size 4026658