hadamelino committed (verified)
Commit 92f14ba · Parent: 7e0f79c

Initial release: CGM-JEPA + X-CGM-JEPA + retrained baselines

README.md CHANGED
@@ -1,3 +1,224 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ language:
+ - en
+ library_name: pytorch
+ pipeline_tag: feature-extraction
+ tags:
+ - cgm
+ - continuous-glucose-monitor
+ - self-supervised-learning
+ - jepa
+ - time-series
+ - masked-prediction
+ - biosignal
+ - healthcare
+ - pretrained-encoder
+ ---
+
+ # CGM-JEPA Pretrained Encoders
+
+ Frozen self-supervised encoder weights from the paper *CGM-JEPA: Learning Consistent Continuous Glucose Monitor Representations via Predictive Self-Supervised Pretraining*. The repo contains the **exact checkpoints used to produce Tables 1–8 of the paper** for both the paper's main contributions (CGM-JEPA, X-CGM-JEPA) and the two re-pretrained baselines (GluFormer, TS2Vec).
+
+ > Companion repos: pretraining dataset [`CRUISEResearchGroup/CGM-JEPA-Pretraining`](https://huggingface.co/datasets/CRUISEResearchGroup/CGM-JEPA-Pretraining), labeled splits [`CRUISEResearchGroup/CGM-JEPA-Downstream`](https://huggingface.co/datasets/CRUISEResearchGroup/CGM-JEPA-Downstream), code [github.com/cruiseresearchgroup/CGM-JEPA](https://github.com/cruiseresearchgroup/CGM-JEPA).
+
+ > **MOMENT and Mantis are not redistributed here.** Those baselines are loaded directly from their upstream HF repos (`AutonLab/MOMENT-1-{small,large}`, `paris-noah/Mantis-8M`) by the eval pipeline.
+
+ ## Quick start
+
+ ```bash
+ huggingface-cli download CRUISEResearchGroup/CGM-JEPA --local-dir Output
+ ```
+
+ Then from the [code repository](https://github.com/cruiseresearchgroup/CGM-JEPA):
+
+ ```bash
+ # Reproduce paper Tables 1–6
+ python scripts/run_all_eval.py
+ ```
+
+ The downstream eval will load all four checkpoints automatically from the subdirectories below.
+
+ ## Layout
+
+ ```
+ .
+ ├── cgm_jepa/
+ │   └── cgm_jepa.pt        # 3.8 MB — paper main contribution
+ ├── x_cgm_jepa/
+ │   └── x_cgm_jepa.pt      # 3.8 MB — paper main contribution
+ └── baselines/
+     ├── gluformer.pt       # 1.1 MB — baseline, re-pretrained on the open CGM corpus
+     └── ts2vec.pkl         # 1.2 MB — baseline, re-pretrained on the open CGM corpus
+ ```
+
+ Each `.pt` file contains `{"encoder": state_dict}`. `ts2vec.pkl` is a full pickled `TS2Vec` model object (per the upstream library's convention), not a state dict. Architecture hyperparameters needed to instantiate each model are listed in the [Architectures](#architectures) section below.
+
+ ### Important note on the baselines
+
+ `gluformer.pt` and `ts2vec.pkl` are **not** vendored from upstream releases of those methods. They were **re-pretrained on the same open CGM corpus and with the same compute budget as CGM-JEPA / X-CGM-JEPA** (Stanford + Colas, 101 epochs, batch 128, lr 1e-4, seed 43), so the paper's comparison isolates the pretraining objective rather than mixing in corpus or compute differences. Use these checkpoints when reproducing paper numbers; for other settings, prefer the original authors' releases.
+
+ ## Architectures
+
+ ### `cgm_jepa/cgm_jepa.pt` and `x_cgm_jepa/x_cgm_jepa.pt`
+
+ Both use the same `models.encoder.Encoder` class with identical hyperparameters; only the pretraining objective differs. At downstream / inference time only the temporal encoder is used, so the two checkpoints are drop-in interchangeable.
+
+ | Field | Value |
+ |---|---|
+ | `patch_size` | 12 |
+ | `encoder_kernel_size` | 3 |
+ | `encoder_embed_dim` | 96 |
+ | `encoder_embed_bias` | `True` |
+ | `encoder_nhead` | 6 |
+ | `encoder_num_layers` | 3 |
+ | `encoder_dropout` | 0.0 |
+
+ Input: a tensor of shape `(B, num_patches, patch_size)` (raw glucose values, z-scored).
+ Output: per-patch embedding of shape `(B, num_patches, embed_dim)`. Pool with `.mean(dim=1)` for a single embedding per sample.
+
+ X-CGM-JEPA adds a second pretraining branch that predicts Glucodensity image patches; only the temporal encoder is loaded at inference.
+
+ ### `baselines/gluformer.pt`
+
+ `models.gluformer.GluFormer`:
+
+ | Field | Value |
+ |---|---|
+ | `vocab_size` | **278** (older checkpoint — pre-dates the current 280-bin default in the code) |
+ | `embed_dim` | 96 |
+ | `nhead` | 6 |
+ | `num_layers` | 3 |
+ | `dim_feedforward` | 192 |
+ | `max_seq_length` | 25000 |
+ | `dropout` | 0.0 |
+ | `pad_token` | 278 (= `vocab_size`) |
+
+ Input: a tensor of integer bin indices in `[0, vocab_size)` (raw glucose discretized into the 40–320 mg/dL range with width `(320 − 40) / vocab_size`). The downstream pipeline detaches GluFormer's output head and uses only the encoder embedding.
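+
+ As an illustration of that binning (a hypothetical helper for clarity, not code from the repo):
+
+ ```python
+ import torch
+
+ def glucose_to_bins(glucose_mgdl: torch.Tensor, vocab_size: int = 278) -> torch.Tensor:
+     """Map raw glucose (mg/dL) to integer bin indices in [0, vocab_size)."""
+     bin_width = (320.0 - 40.0) / vocab_size           # uniform bins over 40-320 mg/dL
+     idx = ((glucose_mgdl - 40.0) / bin_width).long()  # floor to a bin index
+     return idx.clamp(0, vocab_size - 1)               # clip out-of-range readings
+ ```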
+
+ ### `baselines/ts2vec.pkl`
+
+ `models.ts2vec.TS2Vec` (loaded via `eval/baseline_utils/ts2vec_utils.py:load_pretrained_ts2vec`):
+
+ | Field | Value |
+ |---|---|
+ | `input_dims` | 1 |
+ | `output_dims` | 96 |
+ | `hidden_dims` | 64 |
+ | `depth` | 10 |
+
+ Saved as a Python pickle of the full model object, matching the upstream `ts2vec` library convention.
+
+ ## Loading examples
+
+ ### From the CGM-JEPA code repository (recommended)
+
+ `config/model_configs.py` already looks for these checkpoints under `Output/cgm_jepa/`, `Output/x_cgm_jepa/`, and `Output/baselines/`. Place the downloaded files there (the `huggingface-cli download` command above does this automatically) and the eval pipeline picks them up.
+
+ ### Standalone PyTorch — CGM-JEPA / X-CGM-JEPA
+
+ ```python
+ import torch
+ from models.encoder import Encoder
+
+ encoder = Encoder(
+     dim_in=12,          # patch_size
+     kernel_size=3,      # encoder_kernel_size
+     embed_dim=96,       # encoder_embed_dim
+     embed_bias=True,    # encoder_embed_bias
+     nhead=6,            # encoder_nhead
+     num_layers=3,       # encoder_num_layers
+     jepa=False,         # disable JEPA-specific heads for inference
+ )
+ encoder.load_state_dict(
+     torch.load("Output/cgm_jepa/cgm_jepa.pt", map_location="cpu")["encoder"],
+     strict=False,
+ )
+ encoder.eval()
+ # For X-CGM-JEPA, swap the path to "Output/x_cgm_jepa/x_cgm_jepa.pt".
+ ```
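+
+ A forward pass then follows the shapes in the Architectures section (a minimal sketch; that `encoder(x)` takes a bare tensor is an assumption about the `Encoder` interface):
+
+ ```python
+ # One batch of 24-hour windows: 288 timesteps = 24 patches of 12 steps each.
+ x = torch.randn(8, 24, 12)       # (B, num_patches, patch_size), z-scored glucose
+ with torch.no_grad():
+     patch_emb = encoder(x)       # (8, 24, 96) per-patch embeddings
+ emb = patch_emb.mean(dim=1)      # (8, 96): one pooled embedding per sample
+ ```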
+
+ ### Standalone PyTorch — GluFormer
+
+ ```python
+ import torch
+ import torch.nn as nn
+ from models.gluformer.gluformer import GluFormer
+
+ vocab_size = 278
+ gluformer = GluFormer(
+     vocab_size=vocab_size,
+     embed_dim=96,
+     nhead=6,
+     num_layers=3,
+     dim_feedforward=192,
+     max_seq_length=25000,
+     dropout=0.0,
+     pad_token=vocab_size,
+ )
+ gluformer.load_state_dict(
+     torch.load("Output/baselines/gluformer.pt", map_location="cpu")["encoder"]
+ )
+ gluformer.output_head = nn.Identity()  # discard the LM head for embedding extraction
+ gluformer.eval()
+ ```
+
+ ### Standalone PyTorch — TS2Vec
+
+ ```python
+ from eval.baseline_utils.ts2vec_utils import load_pretrained_ts2vec
+
+ ts2vec = load_pretrained_ts2vec(
+     checkpoint_path="Output/baselines/ts2vec.pkl",
+     device="cpu",
+     input_dims=1,
+     output_dims=96,
+     hidden_dims=64,
+     depth=10,
+ )
+ ```
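+
+ Assuming the returned object exposes the upstream `ts2vec` library's `encode` method, one embedding per window can then be extracted like so (a sketch, not repo code):
+
+ ```python
+ import numpy as np
+
+ windows = np.random.randn(16, 288, 1).astype(np.float32)     # (n, timesteps, input_dims)
+ # encoding_window="full_series" pools each window to a single vector (upstream API).
+ emb = ts2vec.encode(windows, encoding_window="full_series")  # (16, 96)
+ ```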
+
+ ## Pretraining
+
+ All four encoders were pretrained on the [CGM-JEPA pretraining corpus](https://huggingface.co/datasets/CRUISEResearchGroup/CGM-JEPA-Pretraining) under identical conditions:
+
+ | Setting | Value |
+ |---|---|
+ | Corpus | 228 subjects (22 Stanford + 206 Colas), 389,365 readings at 5-min sampling |
+ | Window length | 288 timesteps (24 hours) |
+ | Masking ratio | 0.25 |
+ | Epochs | 101 |
+ | Batch size | 128 |
+ | Learning rate | 1e-4 |
+ | Random seed | 43 |
+
+ See [`config/config_pretrain.py`](https://github.com/cruiseresearchgroup/CGM-JEPA/blob/main/config/config_pretrain.py) for the full configuration.
+
+ ## Intended use
+
+ - **Frozen feature extraction** from raw CGM windows (24-hour, 5-min sampled, 288 timesteps).
+ - **Linear-probe or shallow-classifier downstream evaluation**, especially the IR / β-cell dysfunction tasks in the paper (a probe sketch follows this list).
+ - **Comparison baseline** for new CGM representation methods, with identical pretraining conditions across all four encoders shipped here.
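+
+ For example, a linear probe on frozen pooled embeddings could look like this (hypothetical scikit-learn sketch; `encoder` is the loaded model from the example above, and the data here are placeholders):
+
+ ```python
+ import numpy as np
+ import torch
+ from sklearn.linear_model import LogisticRegression
+
+ X = torch.randn(256, 24, 12)                # placeholder patched CGM windows
+ y = np.random.randint(0, 2, size=256)       # placeholder binary task labels
+
+ with torch.no_grad():
+     feats = encoder(X).mean(dim=1).numpy()  # frozen (256, 96) features
+
+ probe = LogisticRegression(max_iter=1000).fit(feats, y)
+ print(probe.score(feats, y))                # probe accuracy on the training split
+ ```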
+
+ ## Limitations
+
+ - **Window size**: encoders were trained on 288-timestep (24-hour) windows. Behavior at substantially longer or shorter sequence lengths has not been validated.
+ - **Sensor distribution shift**: the underlying pretraining corpus is from Dexcom-class sensors. Behavior on other devices is unverified.
+ - **Not clinical**: outputs are continuous embeddings, not clinical decisions. Use only as a feature extractor in research contexts.
+ - **Baseline scope**: the GluFormer and TS2Vec checkpoints reproduce each method's published architecture but were re-pretrained on our corpus; they are not vendored from the original authors' releases. For non-paper use cases, prefer the upstream checkpoints.
+
+ ## License & attribution
+
+ Released under the **MIT license**. When using these weights, please cite:
+
+ 1. Our paper (citation TBD; see code repo).
+ 2. The two upstream pretraining datasets — Metwally et al. 2025 (*Nature Biomedical Engineering*) and Colas et al. 2019 (*PLOS ONE*).
+ 3. The original baseline papers when using `gluformer.pt` or `ts2vec.pkl`.
+
+ ## Citation
+
+ > _Citation block to be filled once the CGM-JEPA paper has a stable venue / arXiv link._
+
+ ## Code repository
+
+ [github.com/cruiseresearchgroup/CGM-JEPA](https://github.com/cruiseresearchgroup/CGM-JEPA)
baselines/gluformer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2aaae7c1c7e538bb8fa07b56766166e66da25d27091a8f54d4b1c81d64ade12d
+ size 1126455
baselines/ts2vec.pkl ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:44aac30803cfd98f978bb2c298a296a3f8bc8a2e413e950074c3f2f121ab2c38
+ size 1215118
cgm_jepa/cgm_jepa.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:942570ac4a855b7ffe77d1f37f0f8806b01da712ab13b4f5aa25bfbbd0e93f22
+ size 4026488
x_cgm_jepa/x_cgm_jepa.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4c412930a736e8ccf71b20a15904da68b604e88b29e54798c93c4c1d1334f08c
+ size 4026658