Fixes: citation, grant, training time, eval table bolds, terminology cleanup
README.md CHANGED
@@ -189,14 +189,14 @@ outputs = model(**inputs)
 
 ModernBERT-bio-base is trained in two phases, initialized from [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base):
 
-* **Phase 1 …
-* **Phase 2 …
+* **Phase 1 (CLM detour, 50B tokens):** The bidirectional attention mask is replaced with a causal mask, and the model is trained with next-token prediction. This dense training signal (100% of positions) deeply modifies early transformer layers for domain adaptation.
+* **Phase 2 (MLM decay, 5B tokens):** Bidirectional attention is restored, and the model is trained with masked language modeling at 15% masking. The learning rate decays from peak to 10% following a 1-sqrt schedule.
 
-Both phases use the same data mix (55B tokens total). Training used AdamW (lr=2e-4, beta1=0.9, beta2=0.98), bf16 mixed precision, global batch size of 384 sequences (~3.1M tokens), on 4× H100 80GB GPUs with [Composer](https://github.com/mosaicml/composer).
+Both phases use the same data mix (55B tokens total). Training used AdamW (lr=2e-4, beta1=0.9, beta2=0.98), bf16 mixed precision, global batch size of 384 sequences (~3.1M tokens), on 4× H100 80GB GPUs with [Composer](https://github.com/mosaicml/composer).
 
 ### Why a CLM Detour?
 
-CLM supervises every token position, producing dense gradient updates that deeply modify early transformer layers (layers 0-7). These changes persist through the MLM decay phase …
+CLM supervises every token position, producing dense gradient updates that deeply modify early transformer layers (layers 0-7). These changes persist through the MLM decay phase, even when the decay matches the CLM phase in length. We provide causal evidence through freeze interventions showing that early-layer modification is both necessary and sufficient for the CLM benefit (double dissociation). See our paper for the full mechanistic analysis.
 
 ## Evaluation
 
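The training paragraph in the hunk above pins the recipe to concrete numbers. As a minimal sketch of that optimizer configuration, assuming plain `torch.optim.AdamW` and `transformers` rather than the Composer setup the card actually describes (constants like `GLOBAL_BATCH_SEQUENCES` are illustrative names, not the card's code):

```python
# Minimal sketch of the stated optimizer settings; the real runs used
# MosaicML Composer, so treat this as an illustration, not the recipe.
import torch
from transformers import AutoModelForMaskedLM

# Continued pretraining is initialized from the ModernBERT-base checkpoint.
model = AutoModelForMaskedLM.from_pretrained(
    "answerdotai/ModernBERT-base",
    torch_dtype=torch.bfloat16,  # the card specifies bf16 mixed precision
)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,            # peak learning rate from the card
    betas=(0.9, 0.98),  # beta1, beta2 from the card
)

# Global batch: 384 sequences x 8,192 tokens = 3,145,728 tokens per step,
# which is where the "~3.1M tokens" figure comes from.
GLOBAL_BATCH_SEQUENCES = 384
MAX_SEQ_LEN = 8192
```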
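Phase 2's decay ("from peak to 10% following a 1-sqrt schedule") has a simple closed form. A sketch assuming one common parameterization of the 1-sqrt cooldown, scaled so the final rate is 10% of the peak; the exact variant is not spelled out in the diff:

```python
# 1-sqrt decay sketch: lr(t) = peak * (1 - 0.9 * sqrt(t / T)).
# Assumed parameterization; it starts at the peak and ends at 10% of it.
import math

def one_sqrt_lr(step: int, total_steps: int,
                peak_lr: float = 2e-4, final_frac: float = 0.10) -> float:
    """Learning rate after `step` of `total_steps` decay steps."""
    frac = min(step / total_steps, 1.0)
    return peak_lr * (1.0 - (1.0 - final_frac) * math.sqrt(frac))

assert abs(one_sqrt_lr(0, 1000) - 2e-4) < 1e-12     # starts at peak
assert abs(one_sqrt_lr(1000, 1000) - 2e-5) < 1e-12  # ends at 10% of peak
```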
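The freeze interventions cited under "Why a CLM Detour?" amount to excluding early layers from gradient updates. A hedged sketch of freezing layers 0-7: the `layers.<i>.` parameter-name pattern is an assumption about ModernBERT's module naming in `transformers`, not something the card states:

```python
# Freeze layers 0-7 by parameter name. The "layers.<i>." naming pattern
# is an assumption about the ModernBERT implementation in transformers.
import re
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

EARLY_LAYERS = set(range(8))  # layers 0-7, per the paragraph above
layer_index = re.compile(r"\blayers\.(\d+)\.")

for name, param in model.named_parameters():
    match = layer_index.search(name)
    if match and int(match.group(1)) in EARLY_LAYERS:
        param.requires_grad = False  # frozen: no gradient updates

frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
print(f"Frozen parameters: {frozen:,}")
```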
@@ -255,7 +255,7 @@ The 8,192-token context is important for long clinical documents (discharge summ…
 - Trained on English biomedical text; not suitable for other languages without further adaptation. See [ModernCamemBERT-bio](https://huggingface.co/almanach/ModernCamemBERT-bio-base) for French.
 - Encoder model: produces contextualized representations, does not generate text.
 - Clinical text may contain sensitive patterns; users are responsible for compliance with applicable regulations (HIPAA, etc.).
-- The English CLM-MLM improvement (+0.3pp at Base scale) is smaller than in French (+2.…
+- The English CLM-MLM improvement (+0.3pp at Base scale) is smaller than in French (+2.8pp) and not statistically significant at Base scale (binomial p=0.27). The practical benefit is clearest at Large scale (+0.8pp) and on long-context tasks.
 
 ## License
 
@@ -264,14 +264,14 @@ Apache 2.0
 ## Citation
 
 ```bibtex
-@…
+@article{touchent2026clmdetour,
 title={A Causal Language Modeling Detour Improves Encoder Continued Pretraining},
 author={Touchent, Rian and de la Clergerie, {\'E}ric},
-…
-…
+year={2026},
+journal={arXiv preprint}
 }
 ```
 
 ## Acknowledgments
 
-This work was performed using HPC resources from GENCI-IDRIS (Grant 2024-…
+This work was performed using HPC resources from GENCI-IDRIS (Grant 2024-AD011014393R2).