rntc committed on
Commit 41d13ec · verified · 1 Parent(s): b9e291a

Update model card: add model-index, datasets, training details, related models

Files changed (1)
  1. README.md +99 -12
README.md CHANGED
@@ -9,16 +9,94 @@ tags:
9
  - encoder
10
  - modernbert
11
  - fill-mask
12
  base_model:
13
  - answerdotai/ModernBERT-base
14
  pipeline_tag: fill-mask
15
  widget:
16
  - text: "The patient was diagnosed with [MASK] and started on antibiotics."
17
  - text: "Mitochondria is the powerhouse of the [MASK]."
18
  ---
19
 
20
  # cpt-en-base
21
22
 
23
  ## Table of Contents
24
 
@@ -31,9 +109,9 @@ widget:
31
 
32
  ## Model Summary
33
 
34
- cpt-en-base is an English biomedical encoder built by continued pretraining of [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) using a **CLM detour** recipe. Instead of standard MLM continued pretraining, we temporarily switch to causal language modeling (CLM) before returning to MLM. This produces lasting representational changes in early transformer layers that improve downstream biomedical performance.
35
 
36
- cpt-en-base achieves **78.0% average F1** across 11 English biomedical benchmarks (5 Clinical + 6 BigBIO), the highest balanced score across both task families.
37
 
38
  | | |
39
  |---|---|
@@ -94,7 +172,7 @@ outputs = model(**inputs)
94
  # outputs.last_hidden_state: [batch, seq_len, 768]
95
  ```
96
 
97
- **Note:** cpt-en-base does not use token type IDs. You can omit the `token_type_ids` parameter.
98
 
99
  ## Training
100
 
@@ -105,7 +183,7 @@ outputs = model(**inputs)
105
  | PubMed | 60% | Biomedical abstracts |
106
  | Med-Inst | 20% | Medical instructions |
107
  | MIMIC | 20% | Clinical notes |
108
- | **Total** | **50B tokens** | Single epoch |
109
 
110
  ### Methodology
111
 
@@ -114,7 +192,7 @@ cpt-en-base is trained in two phases, initialized from [ModernBERT-base](https:/
114
  * **Phase 1 — CLM detour (50B tokens):** The bidirectional attention mask is replaced with a causal mask, and the model is trained with next-token prediction. This dense training signal (100% of positions) deeply modifies early transformer layers for domain adaptation.
115
  * **Phase 2 — MLM decay (5B tokens):** Bidirectional attention is restored, and the model is trained with masked language modeling at 15% masking. The learning rate decays from peak to 10% following a 1-sqrt schedule.
116
 
117
- Both phases use the same data mix. Training used AdamW (lr=2e-4, beta1=0.9, beta2=0.98), bf16 mixed precision, global batch size of 384 sequences (~3.1M tokens), on 4x H100 GPUs with [Composer](https://github.com/mosaicml/composer).
118
 
119
  ### Why a CLM Detour?
120
 
@@ -129,7 +207,7 @@ English biomedical benchmark results (11 tasks, 5 seeds per model):
129
  | Model | Ctx | ChemProt | Phenotype | COS | Social Hist. | DEID | **Avg** |
130
  |-------|-----|----------|-----------|-----|-------------|------|---------|
131
  | **cpt-en-base** | 8192 | 90.1 | **61.9** | **95.2** | 54.2 | **83.2** | **76.9** |
132
- | BioClinical-ModernBERT | 8192 | 90.0 | 60.7 | 94.8 | **56.0** | 81.8 | 76.7 |
133
  | PubMedBERT | 512 | **90.2** | 52.0 | 95.0 | 48.7 | 80.4 | 73.3 |
134
  | ModernBERT-base | 8192 | 89.5 | 48.4 | 94.0 | 53.1 | 78.3 | 72.7 |
135
 
@@ -138,7 +216,7 @@ English biomedical benchmark results (11 tasks, 5 seeds per model):
138
  | Model | Ctx | AnatEM | BC5CDR | JNLPBA | NCBI | GAD | HoC | **Avg** |
139
  |-------|-----|--------|--------|--------|------|-----|-----|---------|
140
  | **cpt-en-base** | 8192 | 81.0 | **89.1** | 74.5 | 80.1 | 78.8 | **70.0** | **78.9** |
141
- | BioClinical-ModernBERT | 8192 | 79.2 | 88.7 | 74.8 | 78.7 | 75.8 | 67.0 | 77.4 |
142
  | PubMedBERT | 512 | **83.3** | 89.7 | **74.9** | **82.1** | **79.3** | 71.0 | 80.1 |
143
  | ModernBERT-base | 8192 | 77.2 | 87.9 | 74.3 | 77.7 | 76.8 | 66.6 | 76.8 |
144
 
@@ -147,7 +225,7 @@ English biomedical benchmark results (11 tasks, 5 seeds per model):
147
  | Model | Clinical | BigBIO | **Overall** |
148
  |-------|----------|--------|-------------|
149
  | **cpt-en-base** | **76.9** | **78.9** | **78.0** |
150
- | BioClinical-ModernBERT | 76.7 | 77.4 | 77.0 |
151
  | PubMedBERT | 73.3 | 80.1 | 77.0 |
152
  | ModernBERT-base | 72.7 | 76.8 | 74.9 |
153
 
@@ -163,9 +241,18 @@ This model is designed for English biomedical and clinical NLP tasks:
163
 
164
  The 8,192-token context is important for long clinical documents (discharge summaries, pathology reports) that are truncated by 512-token models.
165
 
166
  ## Limitations
167
 
168
- - Trained on English biomedical text; not suitable for other languages without further adaptation. See [cpt-fr-base](https://huggingface.co/almanach/cpt-fr-base-base) for French.
169
  - Encoder model: produces contextualized representations, does not generate text.
170
  - Clinical text may contain sensitive patterns; users are responsible for compliance with applicable regulations (HIPAA, etc.).
171
  - The English CLM-MLM improvement (+0.3pp at Base scale) is smaller than in French (+2.9pp) and not statistically significant at Base scale (binomial p=0.27). The practical benefit is clearest at Large scale (+0.8pp) and on long-context tasks.
@@ -178,13 +265,13 @@ Apache 2.0
178
 
179
  ```bibtex
180
  @inproceedings{anonymous2026clm,
181
- title={Under review},
182
  author={Anonymous},
183
- booktitle={Under review},
184
  year={2026}
185
  }
186
  ```
187
 
188
  ## Acknowledgments
189
 
190
-
 
9
  - encoder
10
  - modernbert
11
  - fill-mask
12
+ datasets:
13
+ - rntc/biomed-enriched
14
  base_model:
15
  - answerdotai/ModernBERT-base
16
  pipeline_tag: fill-mask
17
  widget:
18
  - text: "The patient was diagnosed with [MASK] and started on antibiotics."
19
  - text: "Mitochondria is the powerhouse of the [MASK]."
20
+ model-index:
21
+ - name: cpt-en-base
22
+ results:
23
+ - task:
24
+ type: token-classification
25
+ name: NER
26
+ dataset:
27
+ name: AnatEM
28
+ type: bigbio/anatem
29
+ metrics:
30
+ - type: f1
31
+ value: 81.0
32
+ - task:
33
+ type: token-classification
34
+ name: NER
35
+ dataset:
36
+ name: BC5CDR
37
+ type: bigbio/bc5cdr
38
+ metrics:
39
+ - type: f1
40
+ value: 89.1
41
+ - task:
42
+ type: token-classification
43
+ name: NER
44
+ dataset:
45
+ name: JNLPBA
46
+ type: bigbio/jnlpba
47
+ metrics:
48
+ - type: f1
49
+ value: 74.5
50
+ - task:
51
+ type: token-classification
52
+ name: NER
53
+ dataset:
54
+ name: NCBI Disease
55
+ type: bigbio/ncbi_disease
56
+ metrics:
57
+ - type: f1
58
+ value: 80.1
59
+ - task:
60
+ type: text-classification
61
+ name: Text Classification
62
+ dataset:
63
+ name: GAD
64
+ type: bigbio/gad
65
+ metrics:
66
+ - type: f1
67
+ value: 78.8
68
+ - task:
69
+ type: text-classification
70
+ name: Text Classification
71
+ dataset:
72
+ name: HoC
73
+ type: bigbio/hallmarks_of_cancer
74
+ metrics:
75
+ - type: f1
76
+ value: 70.0
77
+ - task:
78
+ type: text-classification
79
+ name: Text Classification
80
+ dataset:
81
+ name: ChemProt
82
+ type: bigbio/chemprot
83
+ metrics:
84
+ - type: f1
85
+ value: 90.1
86
+ - task:
87
+ type: text-classification
88
+ name: Text Classification
89
+ dataset:
90
+ name: DEID
91
+ type: n2c2/2006-deid
92
+ metrics:
93
+ - type: f1
94
+ value: 83.2
95
  ---
96
 
97
  # cpt-en-base
98
 
99
+ *cpt-en is available in two sizes: [base](https://huggingface.co/rntc/cpt-en-base) (149M parameters) and [large](https://huggingface.co/rntc/cpt-en-large) (396M parameters). Our code will be released upon publication.*
100
 
101
  ## Table of Contents
102
 
 
109
 
110
  ## Model Summary
111
 
112
+ cpt-en is an English biomedical encoder built by continued pretraining of [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) using a **CLM detour** recipe. Instead of standard MLM continued pretraining, we temporarily switch to causal language modeling (CLM) before returning to MLM. This produces lasting representational changes in early transformer layers that improve downstream biomedical performance.
113
 
114
+ cpt-en achieves **78.0% average F1** across 11 English biomedical benchmarks (5 Clinical + 6 BigBIO), the highest balanced score across both task families.
115
 
116
  | | |
117
  |---|---|
 
172
  # outputs.last_hidden_state: [batch, seq_len, 768]
173
  ```
174
 
175
+ **Note:** cpt-en does not use token type IDs. You can omit the `token_type_ids` parameter.
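
For a quick smoke test, the widget examples in the metadata can be reproduced with the `fill-mask` pipeline. This is a minimal sketch, assuming a recent `transformers` release that ships the ModernBERT architecture:

```python
from transformers import pipeline

# Minimal fill-mask sketch; no token_type_ids are needed or produced.
fill = pipeline("fill-mask", model="rntc/cpt-en-base")

for pred in fill("The patient was diagnosed with [MASK] and started on antibiotics."):
    print(pred["token_str"], round(pred["score"], 3))
```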
176
 
177
  ## Training
178
 
 
183
  | PubMed | 60% | Biomedical abstracts |
184
  | Med-Inst | 20% | Medical instructions |
185
  | MIMIC | 20% | Clinical notes |
186
+ | **Total** | **50B tokens** | |
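
As an illustration only, the 60/20/20 ratio above maps onto weighted interleaving with the `datasets` library; the dataset ids, files, and splits below are placeholders rather than the exact training corpora (MIMIC in particular requires credentialed access):

```python
from datasets import load_dataset, interleave_datasets

# Placeholder corpora; substitute your own prepared splits.
pubmed  = load_dataset("rntc/biomed-enriched", split="train", streaming=True)
medinst = load_dataset("json", data_files="med_inst.jsonl", split="train", streaming=True)
mimic   = load_dataset("json", data_files="mimic_notes.jsonl", split="train", streaming=True)

# 60% PubMed, 20% Med-Inst, 20% MIMIC, matching the table above.
mix = interleave_datasets([pubmed, medinst, mimic], probabilities=[0.6, 0.2, 0.2], seed=42)
```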
187
 
188
  ### Methodology
189
 
 
192
  * **Phase 1 — CLM detour (50B tokens):** The bidirectional attention mask is replaced with a causal mask, and the model is trained with next-token prediction. This dense training signal (100% of positions) deeply modifies early transformer layers for domain adaptation.
193
  * **Phase 2 — MLM decay (5B tokens):** Bidirectional attention is restored, and the model is trained with masked language modeling at 15% masking. The learning rate decays from peak to 10% following a 1-sqrt schedule.
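
Architecturally, what changes between the two phases is the attention mask. A conceptual sketch of that swap (not the released training code, which the card says will follow publication):

```python
import torch

def attention_mask(seq_len: int, causal: bool) -> torch.Tensor:
    """True = position j is visible from position i."""
    mask = torch.ones(seq_len, seq_len, dtype=torch.bool)
    return torch.tril(mask) if causal else mask

print(attention_mask(4, causal=True))   # Phase 1: lower-triangular, next-token prediction
print(attention_mask(4, causal=False))  # Phase 2: full bidirectional attention, MLM
```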
194
 
195
+ Both phases use the same data mix (55B tokens total). Training used AdamW (lr=2e-4, beta1=0.9, beta2=0.98), bf16 mixed precision, global batch size of 384 sequences (~3.1M tokens), on H100 80GB GPUs with [Composer](https://github.com/mosaicml/composer). Total training time: ~5 GPU-hours.
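
For reference, the 1-sqrt decay of Phase 2 can be written as a multiplier on the peak learning rate. This sketches the shape implied by the "peak to 10%" description; Composer's own scheduler is the authoritative implementation:

```python
import math

def one_sqrt(step: int, total_steps: int, alpha_f: float = 0.1) -> float:
    """Multiplier: 1.0 at step 0, alpha_f at total_steps, following 1 - sqrt(progress)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - (1.0 - alpha_f) * math.sqrt(progress)

peak_lr = 2e-4  # AdamW peak learning rate from the paragraph above
for step in (0, 500, 1000):
    print(step, peak_lr * one_sqrt(step, total_steps=1000))
```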
196
 
197
  ### Why a CLM Detour?
198
 
 
207
  | Model | Ctx | ChemProt | Phenotype | COS | Social Hist. | DEID | **Avg** |
208
  |-------|-----|----------|-----------|-----|-------------|------|---------|
209
  | **cpt-en-base** | 8192 | 90.1 | **61.9** | **95.2** | 54.2 | **83.2** | **76.9** |
210
+ | BioClinical-ModernBERT-base | 8192 | 90.0 | 60.7 | 94.8 | **56.0** | 81.8 | 76.7 |
211
  | PubMedBERT | 512 | **90.2** | 52.0 | 95.0 | 48.7 | 80.4 | 73.3 |
212
  | ModernBERT-base | 8192 | 89.5 | 48.4 | 94.0 | 53.1 | 78.3 | 72.7 |
213
 
 
216
  | Model | Ctx | AnatEM | BC5CDR | JNLPBA | NCBI | GAD | HoC | **Avg** |
217
  |-------|-----|--------|--------|--------|------|-----|-----|---------|
218
  | **cpt-en-base** | 8192 | 81.0 | **89.1** | 74.5 | 80.1 | 78.8 | **70.0** | **78.9** |
219
+ | BioClinical-ModernBERT-base | 8192 | 79.2 | 88.7 | 74.8 | 78.7 | 75.8 | 67.0 | 77.4 |
220
  | PubMedBERT | 512 | **83.3** | 89.7 | **74.9** | **82.1** | **79.3** | 71.0 | 80.1 |
221
  | ModernBERT-base | 8192 | 77.2 | 87.9 | 74.3 | 77.7 | 76.8 | 66.6 | 76.8 |
222
 
 
225
  | Model | Clinical | BigBIO | **Overall** |
226
  |-------|----------|--------|-------------|
227
  | **cpt-en-base** | **76.9** | **78.9** | **78.0** |
228
+ | BioClinical-ModernBERT-base | 76.7 | 77.4 | 77.0 |
229
  | PubMedBERT | 73.3 | 80.1 | 77.0 |
230
  | ModernBERT-base | 72.7 | 76.8 | 74.9 |
231
 
 
241
 
242
  The 8,192-token context is important for long clinical documents (discharge summaries, pathology reports) that are truncated by 512-token models.
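
In practice this means a whole note can often be encoded in a single forward pass. A minimal sketch mirroring the Usage snippet earlier in the card (the file path is a placeholder):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("rntc/cpt-en-base")
model = AutoModel.from_pretrained("rntc/cpt-en-base")

# Placeholder path to a long clinical document (e.g. a discharge summary).
long_note = open("discharge_summary.txt").read()

inputs = tokenizer(long_note, return_tensors="pt", truncation=True, max_length=8192)
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # [1, n_tokens (<= 8192), 768]
```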
243
 
244
+ ## Related Models
245
+
246
+ | Model | Language | Parameters |
247
+ |-------|----------|------------|
248
+ | [cpt-en-base](https://huggingface.co/rntc/cpt-en-base) | English | 149M |
249
+ | [cpt-en-large](https://huggingface.co/rntc/cpt-en-large) | English | 396M |
250
+ | [cpt-fr-base](https://huggingface.co/rntc/cpt-fr-base) | French | 150M |
251
+ | [cpt-fr-large](https://huggingface.co/rntc/cpt-fr-large) | French | 350M |
252
+
253
  ## Limitations
254
 
255
+ - Trained on English biomedical text; not suitable for other languages without further adaptation. See [cpt-fr](https://huggingface.co/rntc/cpt-fr-base) for French.
256
  - Encoder model: produces contextualized representations, does not generate text.
257
  - Clinical text may contain sensitive patterns; users are responsible for compliance with applicable regulations (HIPAA, etc.).
258
  - The English CLM-MLM improvement (+0.3pp at Base scale) is smaller than in French (+2.9pp) and not statistically significant at Base scale (binomial p=0.27). The practical benefit is clearest at Large scale (+0.8pp) and on long-context tasks.
 
265
 
266
  ```bibtex
267
  @inproceedings{anonymous2026clm,
268
+ title={A Causal Language Modeling Detour Improves Encoder Continued Pretraining},
269
  author={Anonymous},
270
+ booktitle={Proceedings of COLM},
271
  year={2026}
272
  }
273
  ```
274
 
275
  ## Acknowledgments
276
 
277
+ This work was performed using HPC resources.