rntc committed on
Commit 84d8e5d · verified · 1 Parent(s): 41d13ec

Update README: state-of-the-art biomedical encoder release

Files changed (1)
  1. README.md +22 -22
README.md CHANGED
@@ -10,7 +10,7 @@ tags:
10
  - modernbert
11
  - fill-mask
12
  datasets:
13
- - rntc/biomed-enriched
14
  base_model:
15
  - answerdotai/ModernBERT-base
16
  pipeline_tag: fill-mask
@@ -18,7 +18,7 @@ widget:
18
  - text: "The patient was diagnosed with [MASK] and started on antibiotics."
19
  - text: "Mitochondria is the powerhouse of the [MASK]."
20
  model-index:
21
- - name: cpt-en-base
22
  results:
23
  - task:
24
  type: token-classification
@@ -94,9 +94,9 @@ model-index:
94
  value: 83.2
95
  ---
96
 
97
- # cpt-en-base
98
 
99
- *cpt-en is available in two sizes: [base](https://huggingface.co/rntc/cpt-en-base) (149M parameters) and [large](https://huggingface.co/rntc/cpt-en-large) (396M parameters). Our code will be released upon publication.*
100
 
101
  ## Table of Contents
102
 
@@ -109,9 +109,9 @@ model-index:
109
 
110
  ## Model Summary
111
 
112
- cpt-en is an English biomedical encoder built by continued pretraining of [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) using a **CLM detour** recipe. Instead of standard MLM continued pretraining, we temporarily switch to causal language modeling (CLM) before returning to MLM. This produces lasting representational changes in early transformer layers that improve downstream biomedical performance.
113
 
114
- cpt-en achieves **78.0% average F1** across 11 English biomedical benchmarks (5 Clinical + 6 BigBIO), the highest balanced score across both task families.
115
 
116
  | | |
117
  |---|---|
@@ -143,7 +143,7 @@ pip install flash-attn
143
  ```python
144
  from transformers import AutoTokenizer, AutoModelForMaskedLM
145
 
146
- model_id = "rntc/cpt-en-base"
147
  tokenizer = AutoTokenizer.from_pretrained(model_id)
148
  model = AutoModelForMaskedLM.from_pretrained(model_id)
149
 
@@ -162,7 +162,7 @@ print("Predicted token:", predicted_token)
162
  ```python
163
  from transformers import AutoTokenizer, AutoModel
164
 
165
- model_id = "rntc/cpt-en-base"
166
  tokenizer = AutoTokenizer.from_pretrained(model_id)
167
  model = AutoModel.from_pretrained(model_id)
168
 
@@ -172,7 +172,7 @@ outputs = model(**inputs)
172
  # outputs.last_hidden_state: [batch, seq_len, 768]
173
  ```
174
 
175
- **Note:** cpt-en does not use token type IDs. You can omit the `token_type_ids` parameter.
176
 
177
  ## Training
178
 
@@ -187,7 +187,7 @@ outputs = model(**inputs)
187
 
188
  ### Methodology
189
 
190
- cpt-en-base is trained in two phases, initialized from [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base):
191
 
192
  * **Phase 1 — CLM detour (50B tokens):** The bidirectional attention mask is replaced with a causal mask, and the model is trained with next-token prediction. This dense training signal (100% of positions) deeply modifies early transformer layers for domain adaptation.
193
  * **Phase 2 — MLM decay (5B tokens):** Bidirectional attention is restored, and the model is trained with masked language modeling at 15% masking. The learning rate decays from peak to 10% following a 1-sqrt schedule.
@@ -206,7 +206,7 @@ English biomedical benchmark results (11 tasks, 5 seeds per model):
206
 
207
  | Model | Ctx | ChemProt | Phenotype | COS | Social Hist. | DEID | **Avg** |
208
  |-------|-----|----------|-----------|-----|-------------|------|---------|
209
- | **cpt-en-base** | 8192 | 90.1 | **61.9** | **95.2** | 54.2 | **83.2** | **76.9** |
210
  | BioClinical-ModernBERT-base | 8192 | 90.0 | 60.7 | 94.8 | **56.0** | 81.8 | 76.7 |
211
  | PubMedBERT | 512 | **90.2** | 52.0 | 95.0 | 48.7 | 80.4 | 73.3 |
212
  | ModernBERT-base | 8192 | 89.5 | 48.4 | 94.0 | 53.1 | 78.3 | 72.7 |
@@ -215,7 +215,7 @@ English biomedical benchmark results (11 tasks, 5 seeds per model):
215
 
216
  | Model | Ctx | AnatEM | BC5CDR | JNLPBA | NCBI | GAD | HoC | **Avg** |
217
  |-------|-----|--------|--------|--------|------|-----|-----|---------|
218
- | **cpt-en-base** | 8192 | 81.0 | **89.1** | 74.5 | 80.1 | 78.8 | **70.0** | **78.9** |
219
  | BioClinical-ModernBERT-base | 8192 | 79.2 | 88.7 | 74.8 | 78.7 | 75.8 | 67.0 | 77.4 |
220
  | PubMedBERT | 512 | **83.3** | 89.7 | **74.9** | **82.1** | **79.3** | 71.0 | 80.1 |
221
  | ModernBERT-base | 8192 | 77.2 | 87.9 | 74.3 | 77.7 | 76.8 | 66.6 | 76.8 |
@@ -224,12 +224,12 @@ English biomedical benchmark results (11 tasks, 5 seeds per model):
224
 
225
  | Model | Clinical | BigBIO | **Overall** |
226
  |-------|----------|--------|-------------|
227
- | **cpt-en-base** | **76.9** | **78.9** | **78.0** |
228
  | BioClinical-ModernBERT-base | 76.7 | 77.4 | 77.0 |
229
  | PubMedBERT | 73.3 | 80.1 | 77.0 |
230
  | ModernBERT-base | 72.7 | 76.8 | 74.9 |
231
 
232
- cpt-en-base achieves the highest balanced score (78.0%) across both Clinical and BigBIO task families. PubMedBERT scores higher on short-context BigBIO NER tasks but falls behind on long-context tasks (Phenotype: 52.0% vs 61.9%).
233
 
234
  ## Intended Use
235
 
@@ -245,14 +245,14 @@ The 8,192-token context is important for long clinical documents (discharge summ
245
 
246
  | Model | Language | Parameters |
247
  |-------|----------|------------|
248
- | [cpt-en-base](https://huggingface.co/rntc/cpt-en-base) | English | 149M |
249
- | [cpt-en-large](https://huggingface.co/rntc/cpt-en-large) | English | 396M |
250
- | [cpt-fr-base](https://huggingface.co/rntc/cpt-fr-base) | French | 150M |
251
- | [cpt-fr-large](https://huggingface.co/rntc/cpt-fr-large) | French | 350M |
252
 
253
  ## Limitations
254
 
255
- - Trained on English biomedical text; not suitable for other languages without further adaptation. See [cpt-fr](https://huggingface.co/rntc/cpt-fr-base) for French.
256
  - Encoder model: produces contextualized representations, does not generate text.
257
  - Clinical text may contain sensitive patterns; users are responsible for compliance with applicable regulations (HIPAA, etc.).
258
  - The English CLM-MLM improvement (+0.3pp at Base scale) is smaller than in French (+2.9pp) and not statistically significant at Base scale (binomial p=0.27). The practical benefit is clearest at Large scale (+0.8pp) and on long-context tasks.
@@ -264,9 +264,9 @@ Apache 2.0
264
  ## Citation
265
 
266
  ```bibtex
267
- @inproceedings{anonymous2026clm,
268
  title={A Causal Language Modeling Detour Improves Encoder Continued Pretraining},
269
- author={Anonymous},
270
  booktitle={Proceedings of COLM},
271
  year={2026}
272
  }
@@ -274,4 +274,4 @@ Apache 2.0
274
 
275
  ## Acknowledgments
276
 
277
- This work was performed using HPC resources.
 
10
  - modernbert
11
  - fill-mask
12
  datasets:
13
+ - almanach/Biomed-Enriched
14
  base_model:
15
  - answerdotai/ModernBERT-base
16
  pipeline_tag: fill-mask
 
18
  - text: "The patient was diagnosed with [MASK] and started on antibiotics."
19
  - text: "Mitochondria is the powerhouse of the [MASK]."
20
  model-index:
21
+ - name: ModernBERT-bio-base
22
  results:
23
  - task:
24
  type: token-classification
 
94
  value: 83.2
95
  ---
96
 
97
+ # ModernBERT-bio-base
98
 
99
+ *ModernBERT-bio is available in two sizes: [base](https://huggingface.co/almanach/ModernBERT-bio-base) (149M parameters) and [large](https://huggingface.co/almanach/ModernBERT-bio-large) (396M parameters). Our code is available in our [GitHub repository](https://github.com/Rian-T/colm2026-clm-detour).*
100
 
101
  ## Table of Contents
102
 
 
109
 
110
  ## Model Summary
111
 
112
+ ModernBERT-bio is an English biomedical encoder built by continued pretraining of [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) using a **CLM detour** recipe. Instead of standard MLM continued pretraining, we temporarily switch to causal language modeling (CLM) before returning to MLM. This produces lasting representational changes in early transformer layers that improve downstream biomedical performance.
113
 
114
+ ModernBERT-bio achieves **78.0% average F1** across 11 English biomedical benchmarks (5 Clinical + 6 BigBIO), the highest balanced score across both task families.
115
 
116
  | | |
117
  |---|---|
 
143
  ```python
144
  from transformers import AutoTokenizer, AutoModelForMaskedLM
145
 
146
+ model_id = "almanach/ModernBERT-bio-base"
147
  tokenizer = AutoTokenizer.from_pretrained(model_id)
148
  model = AutoModelForMaskedLM.from_pretrained(model_id)
149
 
 
162
  ```python
163
  from transformers import AutoTokenizer, AutoModel
164
 
165
+ model_id = "almanach/ModernBERT-bio-base"
166
  tokenizer = AutoTokenizer.from_pretrained(model_id)
167
  model = AutoModel.from_pretrained(model_id)
168
 
 
172
  # outputs.last_hidden_state: [batch, seq_len, 768]
173
  ```
174
 
175
+ **Note:** ModernBERT-bio does not use token type IDs. You can omit the `token_type_ids` parameter.
176
 
177
  ## Training
178
 
 
187
 
188
  ### Methodology
189
 
190
+ ModernBERT-bio-base is trained in two phases, initialized from [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base):
191
 
192
  * **Phase 1 — CLM detour (50B tokens):** The bidirectional attention mask is replaced with a causal mask, and the model is trained with next-token prediction. This dense training signal (100% of positions) deeply modifies early transformer layers for domain adaptation.
193
  * **Phase 2 — MLM decay (5B tokens):** Bidirectional attention is restored, and the model is trained with masked language modeling at 15% masking. The learning rate decays from peak to 10% following a 1-sqrt schedule.
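
The Phase 2 learning-rate decay described above can be sketched as follows. This is a minimal illustration, not the authors' training code: the function name `one_minus_sqrt_lr`, the `final_ratio` parameter, the example peak rate of `1e-3`, and the exact formula `peak * (1 - (1 - final_ratio) * sqrt(progress))` are assumptions, chosen so that the rate starts at the peak and ends at 10% of it.

```python
import math


def one_minus_sqrt_lr(step, total_steps, peak_lr, final_ratio=0.10):
    """Return the learning rate at `step` under an assumed 1-sqrt decay.

    The rate starts at `peak_lr` (step 0) and reaches
    `final_ratio * peak_lr` at `total_steps`.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    # The square root makes the rate drop quickly at first, then flatten.
    return peak_lr * (1.0 - (1.0 - final_ratio) * math.sqrt(progress))


if __name__ == "__main__":
    # Hypothetical peak rate and step count, for illustration only.
    for step in (0, 250, 500, 1000):
        print(step, one_minus_sqrt_lr(step, total_steps=1000, peak_lr=1e-3))
```

Under this assumed form, a quarter of the way through the decay phase the rate has already fallen to 55% of its peak, which is the front-loaded shape a square-root schedule gives.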
 
206
 
207
  | Model | Ctx | ChemProt | Phenotype | COS | Social Hist. | DEID | **Avg** |
208
  |-------|-----|----------|-----------|-----|-------------|------|---------|
209
+ | **ModernBERT-bio-base** | 8192 | 90.1 | **61.9** | **95.2** | 54.2 | **83.2** | **76.9** |
210
  | BioClinical-ModernBERT-base | 8192 | 90.0 | 60.7 | 94.8 | **56.0** | 81.8 | 76.7 |
211
  | PubMedBERT | 512 | **90.2** | 52.0 | 95.0 | 48.7 | 80.4 | 73.3 |
212
  | ModernBERT-base | 8192 | 89.5 | 48.4 | 94.0 | 53.1 | 78.3 | 72.7 |
 
215
 
216
  | Model | Ctx | AnatEM | BC5CDR | JNLPBA | NCBI | GAD | HoC | **Avg** |
217
  |-------|-----|--------|--------|--------|------|-----|-----|---------|
218
+ | **ModernBERT-bio-base** | 8192 | 81.0 | **89.1** | 74.5 | 80.1 | 78.8 | **70.0** | **78.9** |
219
  | BioClinical-ModernBERT-base | 8192 | 79.2 | 88.7 | 74.8 | 78.7 | 75.8 | 67.0 | 77.4 |
220
  | PubMedBERT | 512 | **83.3** | 89.7 | **74.9** | **82.1** | **79.3** | 71.0 | 80.1 |
221
  | ModernBERT-base | 8192 | 77.2 | 87.9 | 74.3 | 77.7 | 76.8 | 66.6 | 76.8 |
 
224
 
225
  | Model | Clinical | BigBIO | **Overall** |
226
  |-------|----------|--------|-------------|
227
+ | **ModernBERT-bio-base** | **76.9** | **78.9** | **78.0** |
228
  | BioClinical-ModernBERT-base | 76.7 | 77.4 | 77.0 |
229
  | PubMedBERT | 73.3 | 80.1 | 77.0 |
230
  | ModernBERT-base | 72.7 | 76.8 | 74.9 |
231
 
232
+ ModernBERT-bio-base achieves the highest balanced score (78.0%) across both Clinical and BigBIO task families. PubMedBERT scores higher on short-context BigBIO NER tasks but falls behind on long-context tasks (Phenotype: 52.0% vs 61.9%).
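
The 78.0% overall figure can be checked against the per-task scores in the tables above. The sketch below assumes that "Overall" is the plain mean over all 11 tasks (not the mean of the two family averages). The README does not state this explicitly, but that reading reproduces the reported numbers. The variable and function names are illustrative.

```python
# Per-task scores for ModernBERT-bio-base, copied from the tables above.
clinical = {"ChemProt": 90.1, "Phenotype": 61.9, "COS": 95.2,
            "Social Hist.": 54.2, "DEID": 83.2}
bigbio = {"AnatEM": 81.0, "BC5CDR": 89.1, "JNLPBA": 74.5,
          "NCBI": 80.1, "GAD": 78.8, "HoC": 70.0}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


clinical_avg = mean(clinical.values())
bigbio_avg = mean(bigbio.values())
# Assumed definition: every task counts equally, regardless of family.
overall = mean(list(clinical.values()) + list(bigbio.values()))

print(round(clinical_avg, 1), round(bigbio_avg, 1), round(overall, 1))
```

Averaging the two family means instead would weight each Clinical task slightly more than each BigBIO task, because the Clinical family has five tasks and BigBIO has six.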
233
 
234
  ## Intended Use
235
 
 
245
 
246
  | Model | Language | Parameters |
247
  |-------|----------|------------|
248
+ | [ModernBERT-bio-base](https://huggingface.co/almanach/ModernBERT-bio-base) | English | 149M |
249
+ | [ModernBERT-bio-large](https://huggingface.co/almanach/ModernBERT-bio-large) | English | 396M |
250
+ | [ModernCamemBERT-bio-base](https://huggingface.co/almanach/ModernCamemBERT-bio-base) | French | 150M |
251
+ | [ModernCamemBERT-bio-large](https://huggingface.co/almanach/ModernCamemBERT-bio-large) | French | 350M |
252
 
253
  ## Limitations
254
 
255
+ - Trained on English biomedical text; not suitable for other languages without further adaptation. See [ModernCamemBERT-bio](https://huggingface.co/almanach/ModernCamemBERT-bio-base) for French.
256
  - Encoder model: produces contextualized representations, does not generate text.
257
  - Clinical text may contain sensitive patterns; users are responsible for compliance with applicable regulations (HIPAA, etc.).
258
  - The English CLM-MLM improvement (+0.3pp at Base scale) is smaller than in French (+2.9pp) and not statistically significant at Base scale (binomial p=0.27). The practical benefit is clearest at Large scale (+0.8pp) and on long-context tasks.
 
264
  ## Citation
265
 
266
  ```bibtex
267
+ @inproceedings{touchent2026clm,
268
  title={A Causal Language Modeling Detour Improves Encoder Continued Pretraining},
269
+ author={Touchent, Rian and de la Clergerie, {\'E}ric},
270
  booktitle={Proceedings of COLM},
271
  year={2026}
272
  }
 
274
 
275
  ## Acknowledgments
276
 
277
+ This work was performed using HPC resources from GENCI-IDRIS (Grant 2024-AD011015883).