---
license: mit
language:
- en
library_name: pytorch
pipeline_tag: feature-extraction
tags:
- cgm
- continuous-glucose-monitor
- self-supervised-learning
- jepa
- time-series
- masked-prediction
- biosignal
- healthcare
- pretrained-encoder
---
# CGM-JEPA Pretrained Encoders
Frozen self-supervised encoder weights from the paper *CGM-JEPA: Learning Consistent Continuous Glucose Monitor Representations via Predictive Self-Supervised Pretraining*. The repo contains the **exact checkpoints used to produce Tables 1–8 of the paper**, covering both the main contributions (CGM-JEPA, X-CGM-JEPA) and the two re-pretrained baselines (GluFormer, TS2Vec).
> Companion repos: pretraining dataset [`CRUISEResearchGroup/CGM-JEPA-Pretraining`](https://huggingface.co/datasets/CRUISEResearchGroup/CGM-JEPA-Pretraining), labeled splits [`CRUISEResearchGroup/CGM-JEPA-Downstream`](https://huggingface.co/datasets/CRUISEResearchGroup/CGM-JEPA-Downstream), code [github.com/cruiseresearchgroup/CGM-JEPA](https://github.com/cruiseresearchgroup/CGM-JEPA).
> **MOMENT and Mantis are not redistributed here.** Those baselines are loaded directly from their upstream HF repos (`AutonLab/MOMENT-1-{small,large}`, `paris-noah/Mantis-8M`) by the eval pipeline.
## Quick start
```bash
huggingface-cli download CRUISEResearchGroup/CGM-JEPA --local-dir Output
```
Then from the [code repository](https://github.com/cruiseresearchgroup/CGM-JEPA):
```bash
# Reproduce paper Tables 1–6
python scripts/run_all_eval.py
```
The downstream eval will load all four checkpoints automatically from the subdirectories below.
## Layout
```
.
├── cgm_jepa/
│   ├── model.safetensors
│   └── config.json
├── x_cgm_jepa/
│   ├── model.safetensors
│   └── config.json
└── baselines/
    ├── gluformer.pt
    └── ts2vec.pkl
```
`cgm_jepa/` and `x_cgm_jepa/` use the standard `PyTorchModelHubMixin` layout — `model.safetensors` for weights, `config.json` for architecture hyperparameters — so they load via the standard `from_pretrained` one-liner (see [Loading examples](#loading-examples)).
`baselines/gluformer.pt` is `{"encoder": state_dict}` and `baselines/ts2vec.pkl` is a full pickled `TS2Vec` model object (per the upstream library's convention). Their architectures are documented in the [Architectures](#architectures) section.
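If you want to sanity-check that format after the download, here is a minimal sketch, assuming the `Output/` local-dir layout from the quick start (`ts2vec.pkl` needs the `TS2Vec` class importable to unpickle, so inspect it from inside the code repo):
```python
import torch

# Peek at the GluFormer checkpoint described above.
ckpt = torch.load("Output/baselines/gluformer.pt", map_location="cpu")
print(list(ckpt.keys()))       # expected: ['encoder']
print(type(ckpt["encoder"]))   # a plain state_dict (mapping of parameter names to tensors)
```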
### Important note on the baselines
`gluformer.pt` and `ts2vec.pkl` are **not** vendored from upstream releases of those methods. They were **re-pretrained on the same open CGM corpus and compute budget as CGM-JEPA / X-CGM-JEPA** (Stanford + Colas, 101 epochs, batch 128, lr 1e-4, seed 43) so that the comparison in the paper isolates the pretraining objective rather than mixing in corpus or compute differences. Use these checkpoints when reproducing paper numbers; for other settings, prefer the original authors' releases.
## Architectures
### `cgm_jepa/` and `x_cgm_jepa/`
Both use the same `models.encoder.Encoder` class with identical hyperparameters; only the pretraining objective differs. At downstream / inference time only the temporal encoder is used, so the two checkpoints are drop-in interchangeable.
| Field | Value |
|---|---|
| `patch_size` | 12 |
| `encoder_kernel_size` | 3 |
| `encoder_embed_dim` | 96 |
| `encoder_embed_bias` | `True` |
| `encoder_nhead` | 6 |
| `encoder_num_layers` | 3 |
| `encoder_dropout` | 0.0 |
Input: a tensor of shape `(B, num_patches, patch_size)` (raw glucose values, z-scored).
Output: per-patch embedding of shape `(B, num_patches, embed_dim)`. Pool with `.mean(dim=1)` for a single embedding per sample.
X-CGM-JEPA adds a second pretraining branch that predicts Glucodensity image patches; only the temporal encoder is loaded at inference.
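For orientation, a minimal sketch of how a raw window maps onto that input shape, assuming non-overlapping patches of `patch_size = 12` over a 288-timestep window (the actual patching helper lives in the code repo):
```python
import torch

# Hypothetical batch of raw CGM windows: 288 five-minute readings (24 h), already z-scored.
x = torch.randn(4, 288)                             # (B, T)

patch_size = 12
patches = x.reshape(x.shape[0], -1, patch_size)     # (B, num_patches, patch_size) = (4, 24, 12)
```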
### `baselines/gluformer.pt`
`models.gluformer.GluFormer`:
| Field | Value |
|---|---|
| `vocab_size` | 278 |
| `embed_dim` | 96 |
| `nhead` | 6 |
| `num_layers` | 3 |
| `dim_feedforward` | 192 |
| `max_seq_length` | 25000 |
| `dropout` | 0.0 |
| `pad_token` | 278 (= `vocab_size`) |
Input: a tensor of integer bin indices in `[0, vocab_size)` (raw glucose over the 40–320 mg/dL range, discretized into bins of width `(320 − 40) / vocab_size` mg/dL). The downstream pipeline detaches GluFormer's output head and uses only the encoder embedding.
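A minimal sketch of that discretization, assuming values are clamped to the valid bin range at the edges (the pipeline's exact edge handling may differ):
```python
import torch

vocab_size = 278
bin_width = (320.0 - 40.0) / vocab_size             # ≈ 1.007 mg/dL per bin

def glucose_to_bins(glucose_mg_dl: torch.Tensor) -> torch.Tensor:
    # Map raw glucose (mg/dL) to integer bin indices in [0, vocab_size).
    idx = torch.floor((glucose_mg_dl - 40.0) / bin_width).long()
    return idx.clamp(0, vocab_size - 1)
```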
### `baselines/ts2vec.pkl`
`models.ts2vec.TS2Vec` (loaded via `eval/baseline_utils/ts2vec_utils.py:load_pretrained_ts2vec`):
| Field | Value |
|---|---|
| `input_dims` | 1 |
| `output_dims` | 96 |
| `hidden_dims` | 64 |
| `depth` | 10 |
Saved as a Python pickle of the full model object, matching the upstream `ts2vec` library convention.
## Loading examples
### CGM-JEPA / X-CGM-JEPA — `from_pretrained` one-liner
`Encoder` is a `PyTorchModelHubMixin` subclass, so the architecture hyperparameters and weights load in a single call directly from this repo:
```python
from models.encoder import Encoder
encoder = Encoder.from_pretrained("CRUISEResearchGroup/CGM-JEPA", subfolder="cgm_jepa")
encoder.eval()
# X-CGM-JEPA: same call, different subfolder
encoder_x = Encoder.from_pretrained("CRUISEResearchGroup/CGM-JEPA", subfolder="x_cgm_jepa")
```
`config.json` in each subfolder mirrors `Encoder.__init__`'s keyword arguments (the mixin records them automatically), so no architecture wiring is needed on the user side.
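A minimal usage sketch once the encoder is loaded, assuming its forward pass consumes the patched tensor described under [Architectures](#architectures):
```python
import torch

with torch.no_grad():
    patches = torch.randn(4, 24, 12)      # (B, num_patches, patch_size), z-scored glucose
    per_patch = encoder(patches)          # (B, num_patches, 96)
    pooled = per_patch.mean(dim=1)        # (B, 96): one embedding per 24-hour window
```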
### From the CGM-JEPA code repository
`config/model_configs.py` looks for these checkpoints under `Output/cgm_jepa/`, `Output/x_cgm_jepa/`, and `Output/baselines/`. The `huggingface-cli download CRUISEResearchGroup/CGM-JEPA --local-dir Output` flow above produces exactly that structure, so the eval pipeline picks them up automatically.
### Standalone PyTorch — GluFormer
```python
import torch
import torch.nn as nn
from models.gluformer.gluformer import GluFormer
vocab_size = 278
gluformer = GluFormer(
    vocab_size=vocab_size,
    embed_dim=96,
    nhead=6,
    num_layers=3,
    dim_feedforward=192,
    max_seq_length=25000,
    dropout=0.0,
    pad_token=vocab_size,
)
gluformer.load_state_dict(
    torch.load("Output/baselines/gluformer.pt", map_location="cpu")["encoder"]
)
gluformer.output_head = nn.Identity() # discard the LM head for embedding extraction
gluformer.eval()
```
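A hedged usage sketch, assuming the forward pass accepts a `(B, T)` tensor of bin indices (see the discretization note under [Architectures](#architectures)):
```python
with torch.no_grad():
    bins = torch.randint(0, vocab_size, (2, 288))   # hypothetical pre-binned 24-hour windows
    emb = gluformer(bins)                           # encoder embeddings (head replaced by Identity)
```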
### Standalone PyTorch — TS2Vec
```python
from eval.baseline_utils.ts2vec_utils import load_pretrained_ts2vec
ts2vec = load_pretrained_ts2vec(
    checkpoint_path="Output/baselines/ts2vec.pkl",
    device="cpu",
    input_dims=1,
    output_dims=96,
    hidden_dims=64,
    depth=10,
)
```
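The upstream `ts2vec` library exposes an `encode` method; a hedged usage sketch assuming this wrapper keeps that convention (inputs are `(n_samples, n_timestamps, n_features)` arrays):
```python
import numpy as np

windows = np.random.randn(4, 288, 1).astype(np.float32)        # hypothetical z-scored windows
reps = ts2vec.encode(windows, encoding_window="full_series")    # expected shape: (4, 96)
```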
## Pretraining
All four encoders were pretrained on the [CGM-JEPA pretraining corpus](https://huggingface.co/datasets/CRUISEResearchGroup/CGM-JEPA-Pretraining) under identical conditions:
| Setting | Value |
|---|---|
| Corpus | 228 subjects (22 Stanford + 206 Colas), 389,365 readings at 5-min sampling |
| Window length | 288 timesteps (24 hours) |
| Masking ratio | 0.25 |
| Epochs | 101 |
| Batch size | 128 |
| Learning rate | 1e-4 |
| Random seed | 43 |
See [`config/config_pretrain.py`](https://github.com/cruiseresearchgroup/CGM-JEPA/blob/main/config/config_pretrain.py) for the full configuration.
## Intended use
- **Frozen feature extraction** from raw CGM windows (24-hour, 5-min sampled, 288 timesteps).
- **Linear-probe or shallow-classifier downstream evaluation**, especially the IR / β-cell dysfunction tasks in the paper.
- **Comparison baseline** for new CGM representation methods, with identical pretraining conditions across all four encoders shipped here.
## License & attribution
Released under the **MIT license**. When using these weights, please cite:
1. Our paper (citation TBD; see code repo).
2. The two upstream pretraining datasets — Metwally et al. 2025 (*Nature Biomedical Engineering*) and Colas et al. 2019 (*PLOS ONE*).
3. The original baseline papers when using `gluformer.pt` or `ts2vec.pkl`.
## Citation
> _Citation block to be filled once the CGM-JEPA paper has a stable venue / arXiv link._
## Code repository
[github.com/cruiseresearchgroup/CGM-JEPA](https://github.com/cruiseresearchgroup/CGM-JEPA)