rntc committed
Commit 4d7489b · verified · 1 Parent(s): 449ec98

Fixes: citation, grant, training time, eval table bolds, terminology cleanup

Files changed (1): README.md (+9 −9)
README.md CHANGED
@@ -189,14 +189,14 @@ outputs = model(**inputs)
 
 ModernBERT-bio-base is trained in two phases, initialized from [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base):
 
- * **Phase 1 CLM detour (50B tokens):** The bidirectional attention mask is replaced with a causal mask, and the model is trained with next-token prediction. This dense training signal (100% of positions) deeply modifies early transformer layers for domain adaptation.
- * **Phase 2 MLM decay (5B tokens):** Bidirectional attention is restored, and the model is trained with masked language modeling at 15% masking. The learning rate decays from peak to 10% following a 1-sqrt schedule.
+ * **Phase 1 (CLM detour, 50B tokens):** The bidirectional attention mask is replaced with a causal mask, and the model is trained with next-token prediction. This dense training signal (100% of positions) deeply modifies early transformer layers for domain adaptation.
+ * **Phase 2 (MLM decay, 5B tokens):** Bidirectional attention is restored, and the model is trained with masked language modeling at 15% masking. The learning rate decays from peak to 10% following a 1-sqrt schedule.
 
- Both phases use the same data mix (55B tokens total). Training used AdamW (lr=2e-4, beta1=0.9, beta2=0.98), bf16 mixed precision, global batch size of 384 sequences (~3.1M tokens), on 4× H100 80GB GPUs with [Composer](https://github.com/mosaicml/composer). Total training time: ~5 GPU-hours.
+ Both phases use the same data mix (55B tokens total). Training used AdamW (lr=2e-4, beta1=0.9, beta2=0.98), bf16 mixed precision, global batch size of 384 sequences (~3.1M tokens), on 4× H100 80GB GPUs with [Composer](https://github.com/mosaicml/composer).
 
 ### Why a CLM Detour?
 
- CLM supervises every token position, producing dense gradient updates that deeply modify early transformer layers (layers 0-7). These changes persist through the MLM decay phase, a phenomenon we call **computational hysteresis**. We provide causal evidence through freeze interventions showing that early-layer modification is both necessary and sufficient for the CLM benefit (double dissociation). See our paper for the full mechanistic analysis.
+ CLM supervises every token position, producing dense gradient updates that deeply modify early transformer layers (layers 0-7). These changes persist through the MLM decay phase, even when the decay matches the CLM phase in length. We provide causal evidence through freeze interventions showing that early-layer modification is both necessary and sufficient for the CLM benefit (double dissociation). See our paper for the full mechanistic analysis.
 
 ## Evaluation
 
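For reference, a minimal sketch (hypothetical, not the repository's training code) of the "1-sqrt" decay named in the Phase 2 bullet above, assuming it interpolates from the peak learning rate down to the stated 10% floor; the function name `one_sqrt_lr` and the step granularity are our own:

```python
# Hypothetical sketch of the Phase 2 LR decay. The 1-sqrt shape, the 2e-4
# peak, and the 10% floor come from the README; the code itself is assumed.
PEAK_LR = 2e-4  # AdamW peak LR (betas 0.9 / 0.98, per the paragraph above)
# Global batch: 384 sequences x 8,192 tokens = 3,145,728 ~= 3.1M tokens/step.

def one_sqrt_lr(step: int, decay_steps: int, floor: float = 0.10) -> float:
    """lr(t) = peak * (1 - (1 - floor) * sqrt(t / T)): peak at t=0, floor*peak at t=T."""
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return PEAK_LR * (1.0 - (1.0 - floor) * frac ** 0.5)

print(one_sqrt_lr(0, 1_000))      # 2.0e-04 (peak)
print(one_sqrt_lr(250, 1_000))    # 1.1e-04 (sqrt decay front-loads the drop)
print(one_sqrt_lr(1_000, 1_000))  # 2.0e-05 (10% of peak)
```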
@@ -255,7 +255,7 @@ The 8,192-token context is important for long clinical documents (discharge summ
 - Trained on English biomedical text; not suitable for other languages without further adaptation. See [ModernCamemBERT-bio](https://huggingface.co/almanach/ModernCamemBERT-bio-base) for French.
 - Encoder model: produces contextualized representations, does not generate text.
 - Clinical text may contain sensitive patterns; users are responsible for compliance with applicable regulations (HIPAA, etc.).
- - The English CLM-MLM improvement (+0.3pp at Base scale) is smaller than in French (+2.9pp) and not statistically significant at Base scale (binomial p=0.27). The practical benefit is clearest at Large scale (+0.8pp) and on long-context tasks.
+ - The English CLM-MLM improvement (+0.3pp at Base scale) is smaller than in French (+2.8pp) and not statistically significant at Base scale (binomial p=0.27). The practical benefit is clearest at Large scale (+0.8pp) and on long-context tasks.
 
 ## License
 
@@ -264,14 +264,14 @@ Apache 2.0
 ## Citation
 
 ```bibtex
- @inproceedings{touchent2026clm,
+ @article{touchent2026clmdetour,
   title={A Causal Language Modeling Detour Improves Encoder Continued Pretraining},
   author={Touchent, Rian and de la Clergerie, {\'E}ric},
-  booktitle={Proceedings of COLM},
-  year={2026}
+  year={2026},
+  journal={arXiv preprint}
 }
 ```
 
 ## Acknowledgments
 
- This work was performed using HPC resources from GENCI-IDRIS (Grant 2024-AD011015883).
+ This work was performed using HPC resources from GENCI-IDRIS (Grant 2024-AD011014393R2).
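The "dense training signal (100% of positions)" versus 15% masking contrast described in the Phase 1/Phase 2 bullets above can be made concrete with a small sketch. This is illustrative only, not the authors' training code: `clm_labels` and `mlm_corrupt` are hypothetical helpers, and it relies on the standard PyTorch convention that label -100 is ignored by cross-entropy; production MLM additionally uses 80/10/10 token replacement and skips special tokens.

```python
import torch

def clm_labels(input_ids: torch.Tensor) -> torch.Tensor:
    """Phase 1 (CLM): position t predicts token t+1, so every position
    except the last one receives a gradient signal."""
    labels = input_ids.clone()
    labels[:, :-1] = input_ids[:, 1:]  # shift left by one
    labels[:, -1] = -100               # final position has no next token
    return labels

def mlm_corrupt(input_ids: torch.Tensor, mask_token_id: int, p: float = 0.15):
    """Phase 2 (MLM): only ~15% of positions are masked and supervised."""
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < p
    labels[~masked] = -100             # loss computed on masked positions only
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id  # simplified: always substitute [MASK]
    return corrupted, labels

ids = torch.randint(5, 1000, (1, 16))                   # toy token ids
print((clm_labels(ids) != -100).float().mean())         # 0.94: dense supervision
print((mlm_corrupt(ids, 4)[1] != -100).float().mean())  # ~0.15: sparse supervision
```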
 