rntc committed on
Commit 03d7c3d · verified · 1 Parent(s): 6e7958f

Update README: state-of-the-art biomedical encoder release

Files changed (1)
  1. README.md +23 -23
README.md CHANGED
@@ -10,7 +10,7 @@ tags:
10
  - modernbert
11
  - fill-mask
12
  datasets:
13
- - rntc/biomed-enriched
14
  base_model:
15
  - answerdotai/ModernBERT-large
16
  pipeline_tag: fill-mask
@@ -18,7 +18,7 @@ widget:
18
  - text: "The patient was diagnosed with [MASK] and started on antibiotics."
19
  - text: "Mitochondria is the powerhouse of the [MASK]."
20
  model-index:
21
- - name: cpt-en-large
22
  results:
23
  - task:
24
  type: token-classification
@@ -94,9 +94,9 @@ model-index:
94
  value: 84.2
95
  ---
96
 
97
- # cpt-en-large
98
 
99
- *cpt-en is available in two sizes: [base](https://huggingface.co/rntc/cpt-en-base) (149M parameters) and [large](https://huggingface.co/rntc/cpt-en-large) (396M parameters). Our code will be released upon publication.*
100
 
101
  ## Table of Contents
102
 
@@ -109,9 +109,9 @@ model-index:
109
 
110
  ## Model Summary
111
 
112
- cpt-en-large is the Large variant of our English biomedical encoder, built by continued pretraining of [ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large) using a **CLM detour** recipe. Instead of standard MLM continued pretraining, we temporarily switch to causal language modeling (CLM) before returning to MLM.
113
 
114
- cpt-en-large achieves **78.7% average F1** across 11 English biomedical benchmarks, the highest overall score, outperforming both the MLM baseline (+0.8pp, 7/11 task wins) and all other models.
115
 
116
  | | |
117
  |---|---|
@@ -143,7 +143,7 @@ pip install flash-attn
143
  ```python
144
  from transformers import AutoTokenizer, AutoModelForMaskedLM
145
 
146
- model_id = "rntc/cpt-en-large"
147
  tokenizer = AutoTokenizer.from_pretrained(model_id)
148
  model = AutoModelForMaskedLM.from_pretrained(model_id)
149
 
@@ -162,7 +162,7 @@ print("Predicted token:", predicted_token)
162
  ```python
163
  from transformers import AutoTokenizer, AutoModel
164
 
165
- model_id = "rntc/cpt-en-large"
166
  tokenizer = AutoTokenizer.from_pretrained(model_id)
167
  model = AutoModel.from_pretrained(model_id)
168
 
@@ -172,7 +172,7 @@ outputs = model(**inputs)
172
  # outputs.last_hidden_state: [batch, seq_len, 1024]
173
  ```
174
 
175
- **Note:** cpt-en does not use token type IDs. You can omit the `token_type_ids` parameter.
176
 
177
  ## Training
178
 
@@ -187,7 +187,7 @@ outputs = model(**inputs)
187
 
188
  ### Methodology
189
 
190
- cpt-en-large is trained in two phases, initialized from [ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large):
191
 
192
  * **Phase 1 — CLM detour (50B tokens):** The bidirectional attention mask is replaced with a causal mask, and the model is trained with next-token prediction. This dense training signal (100% of positions) deeply modifies early transformer layers for domain adaptation.
193
  * **Phase 2 — MLM decay (5B tokens):** Bidirectional attention is restored, and the model is trained with masked language modeling at 15% masking. The learning rate decays from peak to 10% following a 1-sqrt schedule.
@@ -206,7 +206,7 @@ English biomedical benchmark results (11 tasks, 5 seeds per model):
206
 
207
  | Model | Ctx | ChemProt | Phenotype | COS | Social Hist. | DEID | **Avg** |
208
  |-------|-----|----------|-----------|-----|-------------|------|---------|
209
- | **cpt-en-large** | 8192 | 90.4 | 61.3 | 94.7 | **56.5** | **84.2** | **77.4** |
210
  | MLM baseline Large (ours) | 8192 | **90.5** | 61.0 | 94.9 | 55.0 | 82.3 | 76.7 |
211
  | BioClinical-ModernBERT-base | 8192 | 90.0 | 60.7 | 94.8 | 56.0 | 81.8 | 76.7 |
212
  | PubMedBERT | 512 | 90.2 | 52.0 | **95.0** | 48.7 | 80.4 | 73.3 |
@@ -215,7 +215,7 @@ English biomedical benchmark results (11 tasks, 5 seeds per model):
215
 
216
  | Model | Ctx | AnatEM | BC5CDR | JNLPBA | NCBI | GAD | HoC | **Avg** |
217
  |-------|-----|--------|--------|--------|------|-----|-----|---------|
218
- | **cpt-en-large** | 8192 | **83.2** | **89.8** | 75.3 | 81.7 | **79.7** | 69.3 | **79.8** |
219
  | MLM baseline Large (ours) | 8192 | 82.0 | 89.4 | **75.5** | **81.8** | 76.4 | 67.8 | 78.8 |
220
  | BioClinical-ModernBERT-base | 8192 | 79.2 | 88.7 | 74.8 | 78.7 | 75.8 | 67.0 | 77.4 |
221
  | PubMedBERT | 512 | 83.3 | 89.7 | 74.9 | 82.1 | 79.3 | **71.0** | 80.1 |
@@ -224,13 +224,13 @@ English biomedical benchmark results (11 tasks, 5 seeds per model):
224
 
225
  | Model | Clinical | BigBIO | **Overall** |
226
  |-------|----------|--------|-------------|
227
- | **cpt-en-large** | **77.4** | **79.8** | **78.7** |
228
  | MLM baseline Large (ours) | 76.7 | 78.8 | 77.9 |
229
- | cpt-en-base | 76.9 | 78.9 | 78.0 |
230
  | BioClinical-ModernBERT-base | 76.7 | 77.4 | 77.0 |
231
  | PubMedBERT | 73.3 | 80.1 | 77.0 |
232
 
233
- cpt-en-large achieves the highest overall score (78.7%), with the CLM benefit widening at Large scale (+0.8pp vs +0.3pp for Base). The model sets new state-of-the-art on DEID (84.2%), AnatEM (83.2%), and GAD (79.7%).
234
 
235
  ## Intended Use
236
 
@@ -246,14 +246,14 @@ The 8,192-token context is important for long clinical documents. The Large size
246
 
247
  | Model | Language | Parameters |
248
  |-------|----------|------------|
249
- | [cpt-en-base](https://huggingface.co/rntc/cpt-en-base) | English | 149M |
250
- | [cpt-en-large](https://huggingface.co/rntc/cpt-en-large) | English | 396M |
251
- | [cpt-fr-base](https://huggingface.co/rntc/cpt-fr-base) | French | 150M |
252
- | [cpt-fr-large](https://huggingface.co/rntc/cpt-fr-large) | French | 350M |
253
 
254
  ## Limitations
255
 
256
- - Trained on English biomedical text; not suitable for other languages without further adaptation. See [cpt-fr](https://huggingface.co/rntc/cpt-fr-base) for French.
257
  - Encoder model: produces contextualized representations, does not generate text.
258
  - Clinical text may contain sensitive patterns; users are responsible for compliance with applicable regulations (HIPAA, etc.).
259
  - Training data includes MIMIC clinical notes, which are de-identified but derived from real patient records.
@@ -265,9 +265,9 @@ Apache 2.0
265
  ## Citation
266
 
267
  ```bibtex
268
- @inproceedings{anonymous2026clm,
269
  title={A Causal Language Modeling Detour Improves Encoder Continued Pretraining},
270
- author={Anonymous},
271
  booktitle={Proceedings of COLM},
272
  year={2026}
273
  }
@@ -275,4 +275,4 @@ Apache 2.0
275
 
276
  ## Acknowledgments
277
 
278
- This work was performed using HPC resources.
 
10
  - modernbert
11
  - fill-mask
12
  datasets:
13
+ - almanach/Biomed-Enriched
14
  base_model:
15
  - answerdotai/ModernBERT-large
16
  pipeline_tag: fill-mask
 
18
  - text: "The patient was diagnosed with [MASK] and started on antibiotics."
19
  - text: "Mitochondria is the powerhouse of the [MASK]."
20
  model-index:
21
+ - name: ModernBERT-bio-large
22
  results:
23
  - task:
24
  type: token-classification
 
94
  value: 84.2
95
  ---
96
 
97
+ # ModernBERT-bio-large
98
 
99
+ *ModernBERT-bio is available in two sizes: [base](https://huggingface.co/almanach/ModernBERT-bio-base) (149M parameters) and [large](https://huggingface.co/almanach/ModernBERT-bio-large) (396M parameters). Our code is available in our [GitHub repository](https://github.com/Rian-T/colm2026-clm-detour).*
100
 
101
  ## Table of Contents
102
 
 
109
 
110
  ## Model Summary
111
 
112
+ ModernBERT-bio-large is the Large variant of our English biomedical encoder, built by continued pretraining of [ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large) using a **CLM detour** recipe. Instead of standard MLM continued pretraining, we temporarily switch to causal language modeling (CLM) before returning to MLM.
113
 
114
+ ModernBERT-bio-large achieves **78.7% average F1** across 11 English biomedical benchmarks, the highest overall score, outperforming both the MLM baseline (+0.8pp, 7/11 task wins) and all other models.
115
 
116
  | | |
117
  |---|---|
 
143
  ```python
144
  from transformers import AutoTokenizer, AutoModelForMaskedLM
145
 
146
+ model_id = "almanach/ModernBERT-bio-large"
147
  tokenizer = AutoTokenizer.from_pretrained(model_id)
148
  model = AutoModelForMaskedLM.from_pretrained(model_id)
149
 
 
162
  ```python
163
  from transformers import AutoTokenizer, AutoModel
164
 
165
+ model_id = "almanach/ModernBERT-bio-large"
166
  tokenizer = AutoTokenizer.from_pretrained(model_id)
167
  model = AutoModel.from_pretrained(model_id)
168
 
 
172
  # outputs.last_hidden_state: [batch, seq_len, 1024]
173
  ```
174
 
175
+ **Note:** ModernBERT-bio does not use token type IDs. You can omit the `token_type_ids` parameter.
176
 
177
  ## Training
178
 
 
187
 
188
  ### Methodology
189
 
190
+ ModernBERT-bio-large is trained in two phases, initialized from [ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large):
191
 
192
  * **Phase 1 — CLM detour (50B tokens):** The bidirectional attention mask is replaced with a causal mask, and the model is trained with next-token prediction. This dense training signal (100% of positions) deeply modifies early transformer layers for domain adaptation.
193
  * **Phase 2 — MLM decay (5B tokens):** Bidirectional attention is restored, and the model is trained with masked language modeling at 15% masking. The learning rate decays from peak to 10% following a 1-sqrt schedule.
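
The two phases above differ in their attention mask and, in Phase 2, in how the learning rate is annealed. The snippet below is a minimal sketch of both ideas, not the authors' training code. `attention_mask` and `one_minus_sqrt_lr` are hypothetical helper names, and the exact parameterization of the 1-sqrt decay (scale = 1 - (1 - 0.1) * sqrt(progress)) is an assumption about how "decays from peak to 10%" is implemented.

```python
import math


def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len boolean mask (True = position j is visible to i).

    causal=True  -> Phase 1 style: token i only sees tokens j <= i.
    causal=False -> Phase 2 style: every token sees every token.
    """
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]


def one_minus_sqrt_lr(step, total_steps, peak_lr, final_fraction=0.1):
    """Assumed 1-sqrt decay: starts at peak_lr, ends at final_fraction * peak_lr."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    scale = 1.0 - (1.0 - final_fraction) * math.sqrt(progress)
    return peak_lr * scale


if __name__ == "__main__":
    # Causal mask for 3 tokens: lower-triangular visibility.
    for row in attention_mask(3, causal=True):
        print(row)
    # Illustrative peak learning rate; the real value is not stated in this README.
    for step in (0, 25, 100):
        print(step, one_minus_sqrt_lr(step, total_steps=100, peak_lr=1.0))
```

Under this assumed form the rate drops quickly at first (about 55% of peak after a quarter of the decay phase) and then flattens toward the 10% floor.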
 
206
 
207
  | Model | Ctx | ChemProt | Phenotype | COS | Social Hist. | DEID | **Avg** |
208
  |-------|-----|----------|-----------|-----|-------------|------|---------|
209
+ | **ModernBERT-bio-large** | 8192 | 90.4 | 61.3 | 94.7 | **56.5** | **84.2** | **77.4** |
210
  | MLM baseline Large (ours) | 8192 | **90.5** | 61.0 | 94.9 | 55.0 | 82.3 | 76.7 |
211
  | BioClinical-ModernBERT-base | 8192 | 90.0 | 60.7 | 94.8 | 56.0 | 81.8 | 76.7 |
212
  | PubMedBERT | 512 | 90.2 | 52.0 | **95.0** | 48.7 | 80.4 | 73.3 |
 
215
 
216
  | Model | Ctx | AnatEM | BC5CDR | JNLPBA | NCBI | GAD | HoC | **Avg** |
217
  |-------|-----|--------|--------|--------|------|-----|-----|---------|
218
+ | **ModernBERT-bio-large** | 8192 | **83.2** | **89.8** | 75.3 | 81.7 | **79.7** | 69.3 | **79.8** |
219
  | MLM baseline Large (ours) | 8192 | 82.0 | 89.4 | **75.5** | **81.8** | 76.4 | 67.8 | 78.8 |
220
  | BioClinical-ModernBERT-base | 8192 | 79.2 | 88.7 | 74.8 | 78.7 | 75.8 | 67.0 | 77.4 |
221
  | PubMedBERT | 512 | 83.3 | 89.7 | 74.9 | 82.1 | 79.3 | **71.0** | 80.1 |
 
224
 
225
  | Model | Clinical | BigBIO | **Overall** |
226
  |-------|----------|--------|-------------|
227
+ | **ModernBERT-bio-large** | **77.4** | **79.8** | **78.7** |
228
  | MLM baseline Large (ours) | 76.7 | 78.8 | 77.9 |
229
+ | ModernBERT-bio-base | 76.9 | 78.9 | 78.0 |
230
  | BioClinical-ModernBERT-base | 76.7 | 77.4 | 77.0 |
231
  | PubMedBERT | 73.3 | 80.1 | 77.0 |
232
 
233
+ ModernBERT-bio-large achieves the highest overall score (78.7%), with the CLM benefit widening at Large scale (+0.8pp vs +0.3pp for Base). The model sets new state-of-the-art on DEID (84.2%), BC5CDR (89.8%), GAD (79.7%), and Social History (56.5%).
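
As a sanity check on the summary numbers, the overall score can be recomputed from the per-task rows in the tables above. The sketch below assumes the overall score is the unweighted mean over all 11 tasks (5 clinical, 6 BigBIO); the README does not state the aggregation, but this assumption reproduces the reported values. Variable names are illustrative.

```python
from statistics import mean

# Task order: 5 clinical tasks followed by 6 BigBIO tasks, as in the tables.
TASKS = ["ChemProt", "Phenotype", "COS", "Social Hist.", "DEID",
         "AnatEM", "BC5CDR", "JNLPBA", "NCBI", "GAD", "HoC"]

# Per-task F1 copied from the results tables.
bio_large = [90.4, 61.3, 94.7, 56.5, 84.2, 83.2, 89.8, 75.3, 81.7, 79.7, 69.3]
mlm_baseline = [90.5, 61.0, 94.9, 55.0, 82.3, 82.0, 89.4, 75.5, 81.8, 76.4, 67.8]

# Unweighted mean over the 11 tasks, rounded to one decimal like the tables.
overall_bio = round(mean(bio_large), 1)
overall_mlm = round(mean(mlm_baseline), 1)

# Tasks where the CLM-detour model strictly beats the MLM baseline.
wins = [task for task, a, b in zip(TASKS, bio_large, mlm_baseline) if a > b]

print(overall_bio)   # 78.7
print(overall_mlm)   # 77.9
print(len(wins))     # 7
```

The seven wins are Phenotype, Social Hist., DEID, AnatEM, BC5CDR, GAD, and HoC, which matches the "7/11 task wins" figure in the Model Summary.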
234
 
235
  ## Intended Use
236
 
 
246
 
247
  | Model | Language | Parameters |
248
  |-------|----------|------------|
249
+ | [ModernBERT-bio-base](https://huggingface.co/almanach/ModernBERT-bio-base) | English | 149M |
250
+ | [ModernBERT-bio-large](https://huggingface.co/almanach/ModernBERT-bio-large) | English | 396M |
251
+ | [ModernCamemBERT-bio-base](https://huggingface.co/almanach/ModernCamemBERT-bio-base) | French | 150M |
252
+ | [ModernCamemBERT-bio-large](https://huggingface.co/almanach/ModernCamemBERT-bio-large) | French | 350M |
253
 
254
  ## Limitations
255
 
256
+ - Trained on English biomedical text; not suitable for other languages without further adaptation. See [ModernCamemBERT-bio](https://huggingface.co/almanach/ModernCamemBERT-bio-base) for French.
257
  - Encoder model: produces contextualized representations, does not generate text.
258
  - Clinical text may contain sensitive patterns; users are responsible for compliance with applicable regulations (HIPAA, etc.).
259
  - Training data includes MIMIC clinical notes, which are de-identified but derived from real patient records.
 
265
  ## Citation
266
 
267
  ```bibtex
268
+ @inproceedings{touchent2026clm,
269
  title={A Causal Language Modeling Detour Improves Encoder Continued Pretraining},
270
+ author={Touchent, Rian and de la Clergerie, {\'E}ric},
271
  booktitle={Proceedings of COLM},
272
  year={2026}
273
  }
 
275
 
276
  ## Acknowledgments
277
 
278
+ This work was performed using HPC resources from GENCI-IDRIS (Grant 2024-AD011015883).