almanach/ModernBERT-bio-large

@@ -10,7 +10,7 @@ tags:
 - modernbert
 - fill-mask
 datasets:
-- rntc/biomed-enriched
 base_model:
 - answerdotai/ModernBERT-large
 pipeline_tag: fill-mask
@@ -18,7 +18,7 @@ widget:
 - text: "The patient was diagnosed with [MASK] and started on antibiotics."
 - text: "Mitochondria is the powerhouse of the [MASK]."
 model-index:
-- name: cpt-en-large
   results:
   - task:
       type: token-classification
@@ -94,9 +94,9 @@ model-index:
       value: 84.2
 ---
-# cpt-en-large
-*cpt-en is available in two sizes: [base](https://huggingface.co/rntc/cpt-en-base) (149M parameters) and [large](https://huggingface.co/rntc/cpt-en-large) (396M parameters). Our code will be released upon publication.*
 ## Table of Contents
@@ -109,9 +109,9 @@ model-index:
 ## Model Summary
-cpt-en-large is the Large variant of our English biomedical encoder, built by continued pretraining of [ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large) using a **CLM detour** recipe. Instead of standard MLM continued pretraining, we temporarily switch to causal language modeling (CLM) before returning to MLM.
-cpt-en-large achieves **78.7% average F1** across 11 English biomedical benchmarks, the highest overall score, outperforming both the MLM baseline (+0.8pp, 7/11 task wins) and all other models.
 | | |
 |---|---|
@@ -143,7 +143,7 @@ pip install flash-attn
 ```python
 from transformers import AutoTokenizer, AutoModelForMaskedLM
-model_id = "rntc/cpt-en-large"
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 model = AutoModelForMaskedLM.from_pretrained(model_id)
@@ -162,7 +162,7 @@ print("Predicted token:", predicted_token)
 ```python
 from transformers import AutoTokenizer, AutoModel
-model_id = "rntc/cpt-en-large"
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 model = AutoModel.from_pretrained(model_id)
@@ -172,7 +172,7 @@ outputs = model(**inputs)
 # outputs.last_hidden_state: [batch, seq_len, 1024]
 ```
-**Note:** cpt-en does not use token type IDs. You can omit the `token_type_ids` parameter.
 ## Training
@@ -187,7 +187,7 @@ outputs = model(**inputs)
 ### Methodology
-cpt-en-large is trained in two phases, initialized from [ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large):
 * **Phase 1 — CLM detour (50B tokens):** The bidirectional attention mask is replaced with a causal mask, and the model is trained with next-token prediction. This dense training signal (100% of positions) deeply modifies early transformer layers for domain adaptation.
 * **Phase 2 — MLM decay (5B tokens):** Bidirectional attention is restored, and the model is trained with masked language modeling at 15% masking. The learning rate decays from peak to 10% following a 1-sqrt schedule.
@@ -206,7 +206,7 @@ English biomedical benchmark results (11 tasks, 5 seeds per model):
 | Model | Ctx | ChemProt | Phenotype | COS | Social Hist. | DEID | **Avg** |
 |-------|-----|----------|-----------|-----|-------------|------|---------|
-| **cpt-en-large** | 8192 | 90.4 | 61.3 | 94.7 | **56.5** | **84.2** | **77.4** |
 | MLM baseline Large (ours) | 8192 | **90.5** | 61.0 | 94.9 | 55.0 | 82.3 | 76.7 |
 | BioClinical-ModernBERT-base | 8192 | 90.0 | 60.7 | 94.8 | 56.0 | 81.8 | 76.7 |
 | PubMedBERT | 512 | 90.2 | 52.0 | **95.0** | 48.7 | 80.4 | 73.3 |
@@ -215,7 +215,7 @@ English biomedical benchmark results (11 tasks, 5 seeds per model):
 | Model | Ctx | AnatEM | BC5CDR | JNLPBA | NCBI | GAD | HoC | **Avg** |
 |-------|-----|--------|--------|--------|------|-----|-----|---------|
-| **cpt-en-large** | 8192 | **83.2** | **89.8** | 75.3 | 81.7 | **79.7** | 69.3 | **79.8** |
 | MLM baseline Large (ours) | 8192 | 82.0 | 89.4 | **75.5** | **81.8** | 76.4 | 67.8 | 78.8 |
 | BioClinical-ModernBERT-base | 8192 | 79.2 | 88.7 | 74.8 | 78.7 | 75.8 | 67.0 | 77.4 |
 | PubMedBERT | 512 | 83.3 | 89.7 | 74.9 | 82.1 | 79.3 | **71.0** | 80.1 |
@@ -224,13 +224,13 @@ English biomedical benchmark results (11 tasks, 5 seeds per model):
 | Model | Clinical | BigBIO | **Overall** |
 |-------|----------|--------|-------------|
-| **cpt-en-large** | **77.4** | **79.8** | **78.7** |
 | MLM baseline Large (ours) | 76.7 | 78.8 | 77.9 |
-| cpt-en-base | 76.9 | 78.9 | 78.0 |
 | BioClinical-ModernBERT-base | 76.7 | 77.4 | 77.0 |
 | PubMedBERT | 73.3 | 80.1 | 77.0 |
-cpt-en-large achieves the highest overall score (78.7%), with the CLM benefit widening at Large scale (+0.8pp vs +0.3pp for Base). The model sets new state-of-the-art on DEID (84.2%), AnatEM (83.2%), and GAD (79.7%).
 ## Intended Use
@@ -246,14 +246,14 @@ The 8,192-token context is important for long clinical documents. The Large size
 | Model | Language | Parameters |
 |-------|----------|------------|
-| [cpt-en-base](https://huggingface.co/rntc/cpt-en-base) | English | 149M |
-| [cpt-en-large](https://huggingface.co/rntc/cpt-en-large) | English | 396M |
-| [cpt-fr-base](https://huggingface.co/rntc/cpt-fr-base) | French | 150M |
-| [cpt-fr-large](https://huggingface.co/rntc/cpt-fr-large) | French | 350M |
 ## Limitations
-- Trained on English biomedical text; not suitable for other languages without further adaptation. See [cpt-fr](https://huggingface.co/rntc/cpt-fr-base) for French.
 - Encoder model: produces contextualized representations, does not generate text.
 - Clinical text may contain sensitive patterns; users are responsible for compliance with applicable regulations (HIPAA, etc.).
 - Training data includes MIMIC clinical notes, which are de-identified but derived from real patient records.
@@ -265,9 +265,9 @@ Apache 2.0
 ## Citation
 ```bibtex
-@inproceedings{anonymous2026clm,
   title={A Causal Language Modeling Detour Improves Encoder Continued Pretraining},
-  author={Anonymous},
   booktitle={Proceedings of COLM},
   year={2026}
 }
@@ -275,4 +275,4 @@ Apache 2.0
 ## Acknowledgments
-This work was performed using HPC resources.

 - modernbert
 - fill-mask
 datasets:
+- almanach/Biomed-Enriched
 base_model:
 - answerdotai/ModernBERT-large
 pipeline_tag: fill-mask
 - text: "The patient was diagnosed with [MASK] and started on antibiotics."
 - text: "Mitochondria is the powerhouse of the [MASK]."
 model-index:
+- name: ModernBERT-bio-large
   results:
   - task:
       type: token-classification
       value: 84.2
 ---
+# ModernBERT-bio-large
+*ModernBERT-bio is available in two sizes: [base](https://huggingface.co/almanach/ModernBERT-bio-base) (149M parameters) and [large](https://huggingface.co/almanach/ModernBERT-bio-large) (396M parameters). Our code is available in our [GitHub repository](https://github.com/Rian-T/colm2026-clm-detour).*
 ## Table of Contents
 ## Model Summary
+ModernBERT-bio-large is the Large variant of our English biomedical encoder, built by continued pretraining of [ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large) using a **CLM detour** recipe. Instead of standard MLM continued pretraining, we temporarily switch to causal language modeling (CLM) before returning to MLM.
+ModernBERT-bio-large achieves **78.7% average F1** across 11 English biomedical benchmarks, the highest overall score, outperforming both the MLM baseline (+0.8pp, 7/11 task wins) and all other models.
 | | |
 |---|---|
 ```python
 from transformers import AutoTokenizer, AutoModelForMaskedLM
+model_id = "almanach/ModernBERT-bio-large"
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 model = AutoModelForMaskedLM.from_pretrained(model_id)
 ```python
 from transformers import AutoTokenizer, AutoModel
+model_id = "almanach/ModernBERT-bio-large"
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 model = AutoModel.from_pretrained(model_id)
 # outputs.last_hidden_state: [batch, seq_len, 1024]
 ```
+**Note:** ModernBERT-bio does not use token type IDs. You can omit the `token_type_ids` parameter.
 ## Training
 ### Methodology
+ModernBERT-bio-large is trained in two phases, initialized from [ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large):
 * **Phase 1 — CLM detour (50B tokens):** The bidirectional attention mask is replaced with a causal mask, and the model is trained with next-token prediction. This dense training signal (100% of positions) deeply modifies early transformer layers for domain adaptation.
 * **Phase 2 — MLM decay (5B tokens):** Bidirectional attention is restored, and the model is trained with masked language modeling at 15% masking. The learning rate decays from peak to 10% following a 1-sqrt schedule.
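Phase 2 above describes a learning rate that decays from its peak to 10% of peak on a 1-sqrt schedule. The sketch below is one plausible reading of that shape, not the authors' training code. The function name `one_minus_sqrt_lr`, the `peak_lr` value, and the step counts are hypothetical, and the way the 10% floor is blended into the curve is an assumption.

```python
import math


def one_minus_sqrt_lr(step, total_steps, peak_lr, final_ratio=0.1):
    """Learning rate at `step` of a decay phase lasting `total_steps`.

    Assumed form: lr = peak_lr * (1 - (1 - final_ratio) * sqrt(step / total_steps)),
    which starts at peak_lr and ends at final_ratio * peak_lr.
    """
    # Clamp progress to [0, 1] so out-of-range steps stay on the curve's endpoints.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return peak_lr * (1.0 - (1.0 - final_ratio) * math.sqrt(progress))


if __name__ == "__main__":
    peak_lr = 1e-4       # hypothetical peak value, not taken from the model card
    total_steps = 1000   # hypothetical length of the decay phase
    for step in (0, 250, 500, 1000):
        print(step, one_minus_sqrt_lr(step, total_steps, peak_lr))
```

The square root makes the rate fall quickly at first and then flatten out as it approaches the floor.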
 | Model | Ctx | ChemProt | Phenotype | COS | Social Hist. | DEID | **Avg** |
 |-------|-----|----------|-----------|-----|-------------|------|---------|
+| **ModernBERT-bio-large** | 8192 | 90.4 | 61.3 | 94.7 | **56.5** | **84.2** | **77.4** |
 | MLM baseline Large (ours) | 8192 | **90.5** | 61.0 | 94.9 | 55.0 | 82.3 | 76.7 |
 | BioClinical-ModernBERT-base | 8192 | 90.0 | 60.7 | 94.8 | 56.0 | 81.8 | 76.7 |
 | PubMedBERT | 512 | 90.2 | 52.0 | **95.0** | 48.7 | 80.4 | 73.3 |
 | Model | Ctx | AnatEM | BC5CDR | JNLPBA | NCBI | GAD | HoC | **Avg** |
 |-------|-----|--------|--------|--------|------|-----|-----|---------|
+| **ModernBERT-bio-large** | 8192 | **83.2** | **89.8** | 75.3 | 81.7 | **79.7** | 69.3 | **79.8** |
 | MLM baseline Large (ours) | 8192 | 82.0 | 89.4 | **75.5** | **81.8** | 76.4 | 67.8 | 78.8 |
 | BioClinical-ModernBERT-base | 8192 | 79.2 | 88.7 | 74.8 | 78.7 | 75.8 | 67.0 | 77.4 |
 | PubMedBERT | 512 | 83.3 | 89.7 | 74.9 | 82.1 | 79.3 | **71.0** | 80.1 |
 | Model | Clinical | BigBIO | **Overall** |
 |-------|----------|--------|-------------|
+| **ModernBERT-bio-large** | **77.4** | **79.8** | **78.7** |
 | MLM baseline Large (ours) | 76.7 | 78.8 | 77.9 |
+| ModernBERT-bio-base | 76.9 | 78.9 | 78.0 |
 | BioClinical-ModernBERT-base | 76.7 | 77.4 | 77.0 |
 | PubMedBERT | 73.3 | 80.1 | 77.0 |
+ModernBERT-bio-large achieves the highest overall score (78.7%), with the CLM benefit widening at Large scale (+0.8pp vs +0.3pp for Base). The model sets new state-of-the-art on DEID (84.2%), BC5CDR (89.8%), GAD (79.7%), and Social History (56.5%).
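The averages quoted above can be re-derived from the per-task rows for ModernBERT-bio-large and the MLM baseline Large. The following is a small check, assuming the overall score is the unweighted mean over all 11 tasks. That is an inference from the numbers, since the card does not state the formula, and the variable names are illustrative.

```python
# Per-task F1 copied from the results tables in this model card.
clinical = {"ChemProt": 90.4, "Phenotype": 61.3, "COS": 94.7,
            "Social Hist.": 56.5, "DEID": 84.2}
bigbio = {"AnatEM": 83.2, "BC5CDR": 89.8, "JNLPBA": 75.3,
          "NCBI": 81.7, "GAD": 79.7, "HoC": 69.3}
baseline = {"ChemProt": 90.5, "Phenotype": 61.0, "COS": 94.9,
            "Social Hist.": 55.0, "DEID": 82.3, "AnatEM": 82.0,
            "BC5CDR": 89.4, "JNLPBA": 75.5, "NCBI": 81.8,
            "GAD": 76.4, "HoC": 67.8}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


model = {**clinical, **bigbio}          # all 11 tasks for ModernBERT-bio-large
clinical_avg = mean(clinical.values())  # average over the 5 clinical tasks
bigbio_avg = mean(bigbio.values())      # average over the 6 BigBIO tasks
overall_avg = mean(model.values())      # assumed: unweighted mean over 11 tasks
baseline_avg = mean(baseline.values())

# Tasks where ModernBERT-bio-large scores strictly higher than the MLM baseline.
wins = sum(model[task] > baseline[task] for task in model)

print(round(clinical_avg, 1), round(bigbio_avg, 1), round(overall_avg, 1))
print(round(baseline_avg, 1), wins)
```

With these rows the check reproduces 77.4, 79.8, and 78.7 for ModernBERT-bio-large, 77.9 for the baseline, and 7 task wins. Averaging the two group averages instead would give 78.6, which is why the 11-task mean is assumed here.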
 ## Intended Use
 | Model | Language | Parameters |
 |-------|----------|------------|
+| [ModernBERT-bio-base](https://huggingface.co/almanach/ModernBERT-bio-base) | English | 149M |
+| [ModernBERT-bio-large](https://huggingface.co/almanach/ModernBERT-bio-large) | English | 396M |
+| [ModernCamemBERT-bio-base](https://huggingface.co/almanach/ModernCamemBERT-bio-base) | French | 150M |
+| [ModernCamemBERT-bio-large](https://huggingface.co/almanach/ModernCamemBERT-bio-large) | French | 350M |
 ## Limitations
+- Trained on English biomedical text; not suitable for other languages without further adaptation. See [ModernCamemBERT-bio](https://huggingface.co/almanach/ModernCamemBERT-bio-base) for French.
 - Encoder model: produces contextualized representations, does not generate text.
 - Clinical text may contain sensitive patterns; users are responsible for compliance with applicable regulations (HIPAA, etc.).
 - Training data includes MIMIC clinical notes, which are de-identified but derived from real patient records.
 ## Citation
 ```bibtex
+@inproceedings{touchent2026clm,
   title={A Causal Language Modeling Detour Improves Encoder Continued Pretraining},
+  author={Touchent, Rian and de la Clergerie, {\'E}ric},
   booktitle={Proceedings of COLM},
   year={2026}
 }
 ## Acknowledgments
+This work was performed using HPC resources from GENCI-IDRIS (Grant 2024-AD011015883).