Update README: state-of-the-art biomedical encoder release
README.md
CHANGED
@@ -10,7 +10,7 @@ tags:
 - modernbert
 - fill-mask
 datasets:
--
 base_model:
 - answerdotai/ModernBERT-large
 pipeline_tag: fill-mask
@@ -18,7 +18,7 @@ widget:
 - text: "The patient was diagnosed with [MASK] and started on antibiotics."
 - text: "Mitochondria is the powerhouse of the [MASK]."
 model-index:
-- name:
 results:
 - task:
 type: token-classification
@@ -94,9 +94,9 @@ model-index:
 value: 84.2
 ---
 
-#
 
-*
 
 ## Table of Contents
 
@@ -109,9 +109,9 @@ model-index:
 
 ## Model Summary
 
-
 
-
 
 | | |
 |---|---|
@@ -143,7 +143,7 @@ pip install flash-attn
 ```python
 from transformers import AutoTokenizer, AutoModelForMaskedLM
 
-model_id = "
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 model = AutoModelForMaskedLM.from_pretrained(model_id)
 
@@ -162,7 +162,7 @@ print("Predicted token:", predicted_token)
 ```python
 from transformers import AutoTokenizer, AutoModel
 
-model_id = "
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 model = AutoModel.from_pretrained(model_id)
 
@@ -172,7 +172,7 @@ outputs = model(**inputs)
 # outputs.last_hidden_state: [batch, seq_len, 1024]
 ```
 
-**Note:**
 
 ## Training
 
@@ -187,7 +187,7 @@ outputs = model(**inputs)
 
 ### Methodology
 
-
 
 * **Phase 1 — CLM detour (50B tokens):** The bidirectional attention mask is replaced with a causal mask, and the model is trained with next-token prediction. This dense training signal (100% of positions) deeply modifies early transformer layers for domain adaptation.
 * **Phase 2 — MLM decay (5B tokens):** Bidirectional attention is restored, and the model is trained with masked language modeling at 15% masking. The learning rate decays from peak to 10% following a 1-sqrt schedule.
@@ -206,7 +206,7 @@ English biomedical benchmark results (11 tasks, 5 seeds per model):
 
 | Model | Ctx | ChemProt | Phenotype | COS | Social Hist. | DEID | **Avg** |
 |-------|-----|----------|-----------|-----|-------------|------|---------|
-| **
 | MLM baseline Large (ours) | 8192 | **90.5** | 61.0 | 94.9 | 55.0 | 82.3 | 76.7 |
 | BioClinical-ModernBERT-base | 8192 | 90.0 | 60.7 | 94.8 | 56.0 | 81.8 | 76.7 |
 | PubMedBERT | 512 | 90.2 | 52.0 | **95.0** | 48.7 | 80.4 | 73.3 |
@@ -215,7 +215,7 @@ English biomedical benchmark results (11 tasks, 5 seeds per model):
 
 | Model | Ctx | AnatEM | BC5CDR | JNLPBA | NCBI | GAD | HoC | **Avg** |
 |-------|-----|--------|--------|--------|------|-----|-----|---------|
-| **
 | MLM baseline Large (ours) | 8192 | 82.0 | 89.4 | **75.5** | **81.8** | 76.4 | 67.8 | 78.8 |
 | BioClinical-ModernBERT-base | 8192 | 79.2 | 88.7 | 74.8 | 78.7 | 75.8 | 67.0 | 77.4 |
 | PubMedBERT | 512 | 83.3 | 89.7 | 74.9 | 82.1 | 79.3 | **71.0** | 80.1 |
@@ -224,13 +224,13 @@ English biomedical benchmark results (11 tasks, 5 seeds per model):
 
 | Model | Clinical | BigBIO | **Overall** |
 |-------|----------|--------|-------------|
-| **
 | MLM baseline Large (ours) | 76.7 | 78.8 | 77.9 |
-|
 | BioClinical-ModernBERT-base | 76.7 | 77.4 | 77.0 |
 | PubMedBERT | 73.3 | 80.1 | 77.0 |
 
-
 
 ## Intended Use
 
@@ -246,14 +246,14 @@ The 8,192-token context is important for long clinical documents. The Large size
 
 | Model | Language | Parameters |
 |-------|----------|------------|
-| [
-| [
-| [
-| [
 
 ## Limitations
 
-- Trained on English biomedical text; not suitable for other languages without further adaptation. See [
 - Encoder model: produces contextualized representations, does not generate text.
 - Clinical text may contain sensitive patterns; users are responsible for compliance with applicable regulations (HIPAA, etc.).
 - Training data includes MIMIC clinical notes, which are de-identified but derived from real patient records.
@@ -265,9 +265,9 @@ Apache 2.0
 ## Citation
 
 ```bibtex
-@inproceedings{
 title={A Causal Language Modeling Detour Improves Encoder Continued Pretraining},
-author={
 booktitle={Proceedings of COLM},
 year={2026}
 }
@@ -275,4 +275,4 @@ Apache 2.0
 
 ## Acknowledgments
 
-This work was performed using HPC resources.
@@ -10,7 +10,7 @@ tags:
 - modernbert
 - fill-mask
 datasets:
+- almanach/Biomed-Enriched
 base_model:
 - answerdotai/ModernBERT-large
 pipeline_tag: fill-mask
@@ -18,7 +18,7 @@ widget:
 - text: "The patient was diagnosed with [MASK] and started on antibiotics."
 - text: "Mitochondria is the powerhouse of the [MASK]."
 model-index:
+- name: ModernBERT-bio-large
 results:
 - task:
 type: token-classification
@@ -94,9 +94,9 @@ model-index:
 value: 84.2
 ---
 
+# ModernBERT-bio-large
 
+*ModernBERT-bio is available in two sizes: [base](https://huggingface.co/almanach/ModernBERT-bio-base) (149M parameters) and [large](https://huggingface.co/almanach/ModernBERT-bio-large) (396M parameters). Our code is available in our [GitHub repository](https://github.com/Rian-T/colm2026-clm-detour).*
 
 ## Table of Contents
 
@@ -109,9 +109,9 @@ model-index:
 
 ## Model Summary
 
+ModernBERT-bio-large is the Large variant of our English biomedical encoder, built by continued pretraining of [ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large) using a **CLM detour** recipe. Instead of standard MLM continued pretraining, we temporarily switch to causal language modeling (CLM) before returning to MLM.
 
+ModernBERT-bio-large achieves **78.7% average F1** across 11 English biomedical benchmarks, the highest overall score, outperforming both the MLM baseline (+0.8pp, 7/11 task wins) and all other models.
 
 | | |
 |---|---|
@@ -143,7 +143,7 @@ pip install flash-attn
 ```python
 from transformers import AutoTokenizer, AutoModelForMaskedLM
 
+model_id = "almanach/ModernBERT-bio-large"
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 model = AutoModelForMaskedLM.from_pretrained(model_id)
 
@@ -162,7 +162,7 @@ print("Predicted token:", predicted_token)
 ```python
 from transformers import AutoTokenizer, AutoModel
 
+model_id = "almanach/ModernBERT-bio-large"
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 model = AutoModel.from_pretrained(model_id)
 
@@ -172,7 +172,7 @@ outputs = model(**inputs)
 # outputs.last_hidden_state: [batch, seq_len, 1024]
 ```
 
+**Note:** ModernBERT-bio does not use token type IDs. You can omit the `token_type_ids` parameter.
 
 ## Training
 
@@ -187,7 +187,7 @@ outputs = model(**inputs)
 
 ### Methodology
 
+ModernBERT-bio-large is trained in two phases, initialized from [ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large):
 
 * **Phase 1 — CLM detour (50B tokens):** The bidirectional attention mask is replaced with a causal mask, and the model is trained with next-token prediction. This dense training signal (100% of positions) deeply modifies early transformer layers for domain adaptation.
 * **Phase 2 — MLM decay (5B tokens):** Bidirectional attention is restored, and the model is trained with masked language modeling at 15% masking. The learning rate decays from peak to 10% following a 1-sqrt schedule.
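The two phases in the Methodology hunk can be illustrated with a minimal sketch. It is not the authors' training code: the helper names `attention_mask` and `one_minus_sqrt_lr`, the peak learning rate, and the step counts are hypothetical, and the exact form of the "1-sqrt" decay (linear in the square root of training progress, ending at 10% of peak) is an assumption based on the README's one-line description.

```python
import math


def attention_mask(seq_len, causal):
    """Return mask[i][j] = True when position i may attend to position j."""
    return [[(j <= i) or not causal for j in range(seq_len)]
            for i in range(seq_len)]


def one_minus_sqrt_lr(step, total_steps, peak_lr, final_ratio=0.1):
    """Assumed 1-sqrt decay: peak_lr at step 0, final_ratio * peak_lr at the end."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return peak_lr * (1.0 - (1.0 - final_ratio) * math.sqrt(progress))


# Phase 1 (CLM): lower-triangular mask; every position has a next-token target.
clm_mask = attention_mask(4, causal=True)
# Phase 2 (MLM): full bidirectional mask; only ~15% of positions carry a target.
mlm_mask = attention_mask(4, causal=False)

peak = 1e-4  # hypothetical peak learning rate, not taken from the README
schedule = [one_minus_sqrt_lr(s, 1000, peak) for s in (0, 250, 1000)]
print(schedule)
```

Under this assumed form the learning rate falls quickly at first and flattens toward the end of the decay phase, reaching one tenth of the peak at the final step.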
@@ -206,7 +206,7 @@ English biomedical benchmark results (11 tasks, 5 seeds per model):
 
 | Model | Ctx | ChemProt | Phenotype | COS | Social Hist. | DEID | **Avg** |
 |-------|-----|----------|-----------|-----|-------------|------|---------|
+| **ModernBERT-bio-large** | 8192 | 90.4 | 61.3 | 94.7 | **56.5** | **84.2** | **77.4** |
 | MLM baseline Large (ours) | 8192 | **90.5** | 61.0 | 94.9 | 55.0 | 82.3 | 76.7 |
 | BioClinical-ModernBERT-base | 8192 | 90.0 | 60.7 | 94.8 | 56.0 | 81.8 | 76.7 |
 | PubMedBERT | 512 | 90.2 | 52.0 | **95.0** | 48.7 | 80.4 | 73.3 |
@@ -215,7 +215,7 @@ English biomedical benchmark results (11 tasks, 5 seeds per model):
 
 | Model | Ctx | AnatEM | BC5CDR | JNLPBA | NCBI | GAD | HoC | **Avg** |
 |-------|-----|--------|--------|--------|------|-----|-----|---------|
+| **ModernBERT-bio-large** | 8192 | **83.2** | **89.8** | 75.3 | 81.7 | **79.7** | 69.3 | **79.8** |
 | MLM baseline Large (ours) | 8192 | 82.0 | 89.4 | **75.5** | **81.8** | 76.4 | 67.8 | 78.8 |
 | BioClinical-ModernBERT-base | 8192 | 79.2 | 88.7 | 74.8 | 78.7 | 75.8 | 67.0 | 77.4 |
 | PubMedBERT | 512 | 83.3 | 89.7 | 74.9 | 82.1 | 79.3 | **71.0** | 80.1 |
@@ -224,13 +224,13 @@ English biomedical benchmark results (11 tasks, 5 seeds per model):
 
 | Model | Clinical | BigBIO | **Overall** |
 |-------|----------|--------|-------------|
+| **ModernBERT-bio-large** | **77.4** | **79.8** | **78.7** |
 | MLM baseline Large (ours) | 76.7 | 78.8 | 77.9 |
+| ModernBERT-bio-base | 76.9 | 78.9 | 78.0 |
 | BioClinical-ModernBERT-base | 76.7 | 77.4 | 77.0 |
 | PubMedBERT | 73.3 | 80.1 | 77.0 |
 
+ModernBERT-bio-large achieves the highest overall score (78.7%), with the CLM benefit widening at Large scale (+0.8pp vs +0.3pp for Base). The model sets new state-of-the-art on DEID (84.2%), BC5CDR (89.8%), GAD (79.7%), and Social History (56.5%).
 
 ## Intended Use
 
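As a quick consistency check on the result tables in this diff, the sketch below recomputes the group averages, the overall score, and the "7/11 task wins" figure from the rounded per-task scores. It assumes the averages are unweighted means over tasks, which the README does not state explicitly; the variable names are illustrative only, and averages recomputed from rounded inputs can differ from the published ones in the last digit.

```python
# Rounded F1 scores copied from the tables: (ModernBERT-bio-large, MLM baseline Large).
clinical = {
    "ChemProt": (90.4, 90.5), "Phenotype": (61.3, 61.0), "COS": (94.7, 94.9),
    "Social Hist.": (56.5, 55.0), "DEID": (84.2, 82.3),
}
bigbio = {
    "AnatEM": (83.2, 82.0), "BC5CDR": (89.8, 89.4), "JNLPBA": (75.3, 75.5),
    "NCBI": (81.7, 81.8), "GAD": (79.7, 76.4), "HoC": (69.3, 67.8),
}


def mean(values):
    """Unweighted arithmetic mean (assumed averaging scheme)."""
    values = list(values)
    return sum(values) / len(values)


all_tasks = {**clinical, **bigbio}
clinical_avg = round(mean(ours for ours, _ in clinical.values()), 1)
bigbio_avg = round(mean(ours for ours, _ in bigbio.values()), 1)
overall_avg = round(mean(ours for ours, _ in all_tasks.values()), 1)
baseline_overall = round(mean(base for _, base in all_tasks.values()), 1)
# Count tasks where the CLM-detour model strictly beats the MLM baseline.
wins = sum(ours > base for ours, base in all_tasks.values())

print(clinical_avg, bigbio_avg, overall_avg, baseline_overall, wins)
```

With these inputs the recomputed values agree with the table rows for ModernBERT-bio-large and the MLM baseline, and with the +0.8pp gap quoted in the summary.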
@@ -246,14 +246,14 @@ The 8,192-token context is important for long clinical documents. The Large size
 
 | Model | Language | Parameters |
 |-------|----------|------------|
+| [ModernBERT-bio-base](https://huggingface.co/almanach/ModernBERT-bio-base) | English | 149M |
+| [ModernBERT-bio-large](https://huggingface.co/almanach/ModernBERT-bio-large) | English | 396M |
+| [ModernCamemBERT-bio-base](https://huggingface.co/almanach/ModernCamemBERT-bio-base) | French | 150M |
+| [ModernCamemBERT-bio-large](https://huggingface.co/almanach/ModernCamemBERT-bio-large) | French | 350M |
 
 ## Limitations
 
+- Trained on English biomedical text; not suitable for other languages without further adaptation. See [ModernCamemBERT-bio](https://huggingface.co/almanach/ModernCamemBERT-bio-base) for French.
 - Encoder model: produces contextualized representations, does not generate text.
 - Clinical text may contain sensitive patterns; users are responsible for compliance with applicable regulations (HIPAA, etc.).
 - Training data includes MIMIC clinical notes, which are de-identified but derived from real patient records.
@@ -265,9 +265,9 @@ Apache 2.0
 ## Citation
 
 ```bibtex
+@inproceedings{touchent2026clm,
 title={A Causal Language Modeling Detour Improves Encoder Continued Pretraining},
+author={Touchent, Rian and de la Clergerie, {\'E}ric},
 booktitle={Proceedings of COLM},
 year={2026}
 }
@@ -275,4 +275,4 @@ Apache 2.0
 
 ## Acknowledgments
 
+This work was performed using HPC resources from GENCI-IDRIS (Grant 2024-AD011015883).