rntc committed
Commit 4d7489b · verified · 1 Parent(s): 449ec98

Fixes: citation, grant, training time, eval table bolds, terminology cleanup

Files changed (1): README.md (+9 −9)
README.md CHANGED
@@ -189,14 +189,14 @@ outputs = model(**inputs)
 
 ModernBERT-bio-base is trained in two phases, initialized from [ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base):
 
- * **Phase 1 CLM detour (50B tokens):** The bidirectional attention mask is replaced with a causal mask, and the model is trained with next-token prediction. This dense training signal (100% of positions) deeply modifies early transformer layers for domain adaptation.
- * **Phase 2 MLM decay (5B tokens):** Bidirectional attention is restored, and the model is trained with masked language modeling at 15% masking. The learning rate decays from peak to 10% following a 1-sqrt schedule.
+ * **Phase 1 (CLM detour, 50B tokens):** The bidirectional attention mask is replaced with a causal mask, and the model is trained with next-token prediction. This dense training signal (100% of positions) deeply modifies early transformer layers for domain adaptation.
+ * **Phase 2 (MLM decay, 5B tokens):** Bidirectional attention is restored, and the model is trained with masked language modeling at 15% masking. The learning rate decays from peak to 10% following a 1-sqrt schedule.
 
- Both phases use the same data mix (55B tokens total). Training used AdamW (lr=2e-4, beta1=0.9, beta2=0.98), bf16 mixed precision, global batch size of 384 sequences (~3.1M tokens), on 4× H100 80GB GPUs with [Composer](https://github.com/mosaicml/composer). Total training time: ~5 GPU-hours.
+ Both phases use the same data mix (55B tokens total). Training used AdamW (lr=2e-4, beta1=0.9, beta2=0.98), bf16 mixed precision, global batch size of 384 sequences (~3.1M tokens), on 4× H100 80GB GPUs with [Composer](https://github.com/mosaicml/composer).
 
 ### Why a CLM Detour?
 
- CLM supervises every token position, producing dense gradient updates that deeply modify early transformer layers (layers 0-7). These changes persist through the MLM decay phase, a phenomenon we call **computational hysteresis**. We provide causal evidence through freeze interventions showing that early-layer modification is both necessary and sufficient for the CLM benefit (double dissociation). See our paper for the full mechanistic analysis.
+ CLM supervises every token position, producing dense gradient updates that deeply modify early transformer layers (layers 0-7). These changes persist through the MLM decay phase, even when the decay matches the CLM phase in length. We provide causal evidence through freeze interventions showing that early-layer modification is both necessary and sufficient for the CLM benefit (double dissociation). See our paper for the full mechanistic analysis.
 
 ## Evaluation
 
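For reference, a minimal sketch (hypothetical, not the repository's training code) of the "1-sqrt" decay named in the Phase 2 bullet above, assuming it interpolates from the peak learning rate down to the stated 10% floor; the function name `one_sqrt_lr` and the step granularity are our own:

```python
# Hypothetical sketch of the Phase 2 LR decay. The 1-sqrt shape, the 2e-4
# peak, and the 10% floor come from the README; the code itself is assumed.
PEAK_LR = 2e-4  # AdamW peak LR (betas 0.9 / 0.98, per the paragraph above)
# Global batch: 384 sequences x 8,192 tokens = 3,145,728 ~= 3.1M tokens/step.

def one_sqrt_lr(step: int, decay_steps: int, floor: float = 0.10) -> float:
    """lr(t) = peak * (1 - (1 - floor) * sqrt(t / T)): peak at t=0, floor*peak at t=T."""
    frac = min(max(step / decay_steps, 0.0), 1.0)
    return PEAK_LR * (1.0 - (1.0 - floor) * frac ** 0.5)

print(one_sqrt_lr(0, 1_000))      # 2.0e-04 (peak)
print(one_sqrt_lr(250, 1_000))    # 1.1e-04 (sqrt decay front-loads the drop)
print(one_sqrt_lr(1_000, 1_000))  # 2.0e-05 (10% of peak)
```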
@@ -255,7 +255,7 @@ The 8,192-token context is important for long clinical documents (discharge summ
 - Trained on English biomedical text; not suitable for other languages without further adaptation. See [ModernCamemBERT-bio](https://huggingface.co/almanach/ModernCamemBERT-bio-base) for French.
 - Encoder model: produces contextualized representations, does not generate text.
 - Clinical text may contain sensitive patterns; users are responsible for compliance with applicable regulations (HIPAA, etc.).
- - The English CLM-MLM improvement (+0.3pp at Base scale) is smaller than in French (+2.9pp) and not statistically significant at Base scale (binomial p=0.27). The practical benefit is clearest at Large scale (+0.8pp) and on long-context tasks.
+ - The English CLM-MLM improvement (+0.3pp at Base scale) is smaller than in French (+2.8pp) and not statistically significant at Base scale (binomial p=0.27). The practical benefit is clearest at Large scale (+0.8pp) and on long-context tasks.
 
 ## License
 
@@ -264,14 +264,14 @@ Apache 2.0
 ## Citation
 
 ```bibtex
- @inproceedings{touchent2026clm,
+ @article{touchent2026clmdetour,
   title={A Causal Language Modeling Detour Improves Encoder Continued Pretraining},
   author={Touchent, Rian and de la Clergerie, {\'E}ric},
-  booktitle={Proceedings of COLM},
-  year={2026}
+  year={2026},
+  journal={arXiv preprint}
 }
 ```
 
 ## Acknowledgments
 
- This work was performed using HPC resources from GENCI-IDRIS (Grant 2024-AD011015883).
+ This work was performed using HPC resources from GENCI-IDRIS (Grant 2024-AD011014393R2).
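The "dense training signal (100% of positions)" versus 15% masking contrast described in the Phase 1/Phase 2 bullets above can be made concrete with a small sketch. This is illustrative only, not the authors' training code: `clm_labels` and `mlm_corrupt` are hypothetical helpers, and it relies on the standard PyTorch convention that label -100 is ignored by cross-entropy; production MLM additionally uses 80/10/10 token replacement and skips special tokens.

```python
import torch

def clm_labels(input_ids: torch.Tensor) -> torch.Tensor:
    """Phase 1 (CLM): position t predicts token t+1, so every position
    except the last one receives a gradient signal."""
    labels = input_ids.clone()
    labels[:, :-1] = input_ids[:, 1:]  # shift left by one
    labels[:, -1] = -100               # final position has no next token
    return labels

def mlm_corrupt(input_ids: torch.Tensor, mask_token_id: int, p: float = 0.15):
    """Phase 2 (MLM): only ~15% of positions are masked and supervised."""
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < p
    labels[~masked] = -100             # loss computed on masked positions only
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id  # simplified: always substitute [MASK]
    return corrupted, labels

ids = torch.randint(5, 1000, (1, 16))                   # toy token ids
print((clm_labels(ids) != -100).float().mean())         # 0.94: dense supervision
print((mlm_corrupt(ids, 4)[1] != -100).float().mean())  # ~0.15: sparse supervision
```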
 