rntc committed on
Commit e53f971 · verified · 1 Parent(s): 38eee8c

Update README: state-of-the-art biomedical encoder release

Files changed (1)
  1. README.md +23 -23
README.md CHANGED
@@ -18,7 +18,7 @@ widget:
  - text: "Les patients atteints de <mask> présentent un risque accru de complications cardiovasculaires."
  - text: "Le traitement par <mask> a montré une amélioration significative des symptômes."
  model-index:
- - name: cpt-fr-base
  results:
  - task:
  type: text-classification
@@ -82,7 +82,7 @@ model-index:
  type: emea
  metrics:
  - type: f1
- value: 65.9
  - task:
  type: token-classification
  name: NER
@@ -91,12 +91,12 @@ model-index:
  type: medline
  metrics:
  - type: f1
- value: 58.2
  ---
 
- # cpt-fr-base
 
- *cpt-fr is available in two sizes: [base](https://huggingface.co/rntc/cpt-fr-base) (150M parameters) and [large](https://huggingface.co/rntc/cpt-fr-large) (350M parameters). Our code will be released upon publication.*
 
  ## Table of Contents
 
@@ -109,7 +109,7 @@ model-index:
 
  ## Model Summary
 
- cpt-fr is a French biomedical encoder built by continued pretraining of [ModernCamemBERT](https://huggingface.co/almanach/moderncamembert-base) using a **CLM detour** recipe. Instead of standard MLM continued pretraining, we temporarily switch to causal language modeling (CLM) before returning to MLM. This produces lasting representational changes in early transformer layers that improve downstream biomedical performance by +2.9pp on average across 8 French biomedical tasks.
 
  The model uses the ModernBERT architecture with FlashAttention, rotary positional embeddings (RoPE), alternating local/global attention, and unpadding, supporting **8,192-token context** — critical for long clinical documents that exceed the 512-token limit of previous French biomedical models.
 
@@ -143,7 +143,7 @@ pip install flash-attn
  ```python
  from transformers import AutoTokenizer, AutoModelForMaskedLM
 
- model_id = "rntc/cpt-fr-base"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForMaskedLM.from_pretrained(model_id)
 
@@ -162,7 +162,7 @@ print("Predicted token:", predicted_token)
  ```python
  from transformers import AutoTokenizer, AutoModel
 
- model_id = "rntc/cpt-fr-base"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModel.from_pretrained(model_id)
 
@@ -172,7 +172,7 @@ outputs = model(**inputs)
  # outputs.last_hidden_state: [batch, seq_len, 768]
  ```
 
- **Note:** cpt-fr does not use token type IDs. You can omit the `token_type_ids` parameter.
 
  ## Training
 
@@ -188,7 +188,7 @@ outputs = model(**inputs)
 
  ### Methodology
 
- cpt-fr is trained in two phases, initialized from [ModernCamemBERT](https://huggingface.co/almanach/moderncamembert-base):
 
  * **Phase 1 — CLM detour (10B tokens):** The bidirectional attention mask is replaced with a causal mask, and the model is trained with next-token prediction. This dense training signal (100% of positions) deeply modifies early transformer layers for domain adaptation.
  * **Phase 2 — MLM decay (1B tokens):** Bidirectional attention is restored, and the model is trained with masked language modeling at 15% masking. The learning rate decays from peak to 10% following a 1-sqrt schedule.
@@ -205,13 +205,13 @@ French biomedical benchmark results (8 tasks, 9 seeds per model, macro-averaged
 
  | Model | Ctx | FrACCO-30 | FrACCO-100 | CANTEMIST | DISTEMIST | MedDialog | DiaMed | EMEA | Medline | **Avg** |
  |-------|-----|-----------|------------|-----------|-----------|-----------|--------|------|---------|---------|
- | **cpt-fr-base** | 8192 | **74.8** | **60.1** | **71.0** | **25.5** | 63.6 | **67.4** | 65.9 | 58.2 | **60.8** |
- | MLM baseline (ours) | 8192 | 69.9 | 56.8 | 64.9 | 23.5 | 62.5 | 63.4 | 65.4 | 56.8 | 57.9 |
- | ModernCamemBERT | 8192 | 70.1 | 55.3 | 63.3 | 20.2 | 60.6 | 56.4 | 63.4 | 55.3 | 55.6 |
- | DrBERT | 512 | 53.0 | 35.6 | 37.9 | 21.4 | 63.6 | 57.0 | **68.0** | **62.3** | 49.9 |
- | CamemBERT-bio | 512 | 41.9 | 20.1 | 12.8 | 9.6 | 38.6 | 47.7 | 61.6 | 56.6 | 36.1 |
 
- cpt-fr-base outperforms the matched MLM baseline on all 8 tasks (+2.9pp, binomial p=0.004).
 
  ## Intended Use
 
@@ -227,10 +227,10 @@ The 8,192-token context is critical for long clinical documents (discharge summa
 
  | Model | Language | Parameters |
  |-------|----------|------------|
- | [cpt-en-base](https://huggingface.co/rntc/cpt-en-base) | English | 149M |
- | [cpt-en-large](https://huggingface.co/rntc/cpt-en-large) | English | 396M |
- | [cpt-fr-base](https://huggingface.co/rntc/cpt-fr-base) | French | 150M |
- | [cpt-fr-large](https://huggingface.co/rntc/cpt-fr-large) | French | 350M |
 
  ## Limitations
 
@@ -245,9 +245,9 @@ Apache 2.0
  ## Citation
 
  ```bibtex
- @inproceedings{anonymous2026clm,
  title={A Causal Language Modeling Detour Improves Encoder Continued Pretraining},
- author={Anonymous},
  booktitle={Proceedings of COLM},
  year={2026}
  }
@@ -255,4 +255,4 @@ Apache 2.0
 
  ## Acknowledgments
 
- This work was performed using HPC resources.
 
@@ -18,7 +18,7 @@ widget:
  - text: "Les patients atteints de <mask> présentent un risque accru de complications cardiovasculaires."
  - text: "Le traitement par <mask> a montré une amélioration significative des symptômes."
  model-index:
+ - name: ModernCamemBERT-bio-base
  results:
  - task:
  type: text-classification
@@ -82,7 +82,7 @@ model-index:
  type: emea
  metrics:
  - type: f1
+ value: 68.6
  - task:
  type: token-classification
  name: NER
@@ -91,12 +91,12 @@ model-index:
  type: medline
  metrics:
  - type: f1
+ value: 61.9
  ---
 
+ # ModernCamemBERT-bio-base
 
+ *ModernCamemBERT-bio is available in two sizes: [base](https://huggingface.co/almanach/ModernCamemBERT-bio-base) (150M parameters) and [large](https://huggingface.co/almanach/ModernCamemBERT-bio-large) (350M parameters). Our code is available in our [GitHub repository](https://github.com/Rian-T/colm2026-clm-detour).*
 
  ## Table of Contents
 
@@ -109,7 +109,7 @@ model-index:
 
  ## Model Summary
 
+ ModernCamemBERT-bio is a French biomedical encoder built by continued pretraining of [ModernCamemBERT](https://huggingface.co/almanach/moderncamembert-base) using a **CLM detour** recipe. Instead of standard MLM continued pretraining, we temporarily switch to causal language modeling (CLM) before returning to MLM. This produces lasting representational changes in early transformer layers that improve downstream biomedical performance by +2.8pp on average across 8 French biomedical tasks.
 
  The model uses the ModernBERT architecture with FlashAttention, rotary positional embeddings (RoPE), alternating local/global attention, and unpadding, supporting **8,192-token context** — critical for long clinical documents that exceed the 512-token limit of previous French biomedical models.
 
@@ -143,7 +143,7 @@ pip install flash-attn
  ```python
  from transformers import AutoTokenizer, AutoModelForMaskedLM
 
+ model_id = "almanach/ModernCamemBERT-bio-base"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModelForMaskedLM.from_pretrained(model_id)
 
@@ -162,7 +162,7 @@ print("Predicted token:", predicted_token)
  ```python
  from transformers import AutoTokenizer, AutoModel
 
+ model_id = "almanach/ModernCamemBERT-bio-base"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  model = AutoModel.from_pretrained(model_id)
 
@@ -172,7 +172,7 @@ outputs = model(**inputs)
  # outputs.last_hidden_state: [batch, seq_len, 768]
  ```
 
+ **Note:** ModernCamemBERT-bio does not use token type IDs. You can omit the `token_type_ids` parameter.
 
  ## Training
 
@@ -188,7 +188,7 @@ outputs = model(**inputs)
 
  ### Methodology
 
+ ModernCamemBERT-bio is trained in two phases, initialized from [ModernCamemBERT](https://huggingface.co/almanach/moderncamembert-base):
 
  * **Phase 1 — CLM detour (10B tokens):** The bidirectional attention mask is replaced with a causal mask, and the model is trained with next-token prediction. This dense training signal (100% of positions) deeply modifies early transformer layers for domain adaptation.
  * **Phase 2 — MLM decay (1B tokens):** Bidirectional attention is restored, and the model is trained with masked language modeling at 15% masking. The learning rate decays from peak to 10% following a 1-sqrt schedule.
 
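As a rough illustration of the Phase 2 learning-rate decay described above, the sketch below assumes the common 1-sqrt form: the rate falls from its peak in proportion to the square root of training progress, rescaled so that it ends at 10% of peak. The function name `one_minus_sqrt_lr`, the peak value `1e-3`, and the step count are hypothetical. The README does not state the exact formula or hyperparameters, so this is not necessarily the schedule the authors implemented.

```python
import math


def one_minus_sqrt_lr(step: int, total_steps: int, peak_lr: float,
                      final_fraction: float = 0.1) -> float:
    """Assumed 1-sqrt decay: peak_lr at step 0, final_fraction * peak_lr at the end."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp progress to [0, 1] so steps past the end stay at the floor.
    progress = min(max(step / total_steps, 0.0), 1.0)
    # 1 - sqrt(progress) goes from 1 to 0; rescale it to end at final_fraction.
    scale = 1.0 - (1.0 - final_fraction) * math.sqrt(progress)
    return peak_lr * scale


if __name__ == "__main__":
    total = 1000  # hypothetical number of decay steps
    for step in (0, 250, 500, 1000):
        print(step, one_minus_sqrt_lr(step, total, peak_lr=1e-3))
```

Under this assumed form the rate drops quickly at first (to 55% of peak after a quarter of the decay phase) and flattens toward the 10% floor.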
@@ -205,13 +205,13 @@ French biomedical benchmark results (8 tasks, 9 seeds per model, macro-averaged
 
  | Model | Ctx | FrACCO-30 | FrACCO-100 | CANTEMIST | DISTEMIST | MedDialog | DiaMed | EMEA | Medline | **Avg** |
  |-------|-----|-----------|------------|-----------|-----------|-----------|--------|------|---------|---------|
+ | **ModernCamemBERT-bio-base** | 8192 | **74.8** | **60.1** | **71.0** | **25.5** | 63.6 | **67.4** | 68.6 | 61.9 | **61.6** |
+ | MLM baseline (ours) | 8192 | 69.9 | 56.8 | 64.9 | 23.5 | 62.5 | 63.4 | 68.5 | 61.4 | 58.9 |
+ | ModernCamemBERT | 8192 | 70.1 | 55.3 | 63.3 | 20.2 | 60.6 | 56.4 | 68.0 | 59.7 | 56.7 |
+ | DrBERT | 512 | 53.0 | 35.6 | 37.9 | 21.4 | 63.6 | 57.0 | **69.6** | **62.8** | 50.1 |
+ | CamemBERT-bio | 512 | 41.9 | 20.1 | 12.8 | 9.6 | 38.6 | 47.7 | **70.8** | **65.2** | 38.3 |
 
+ ModernCamemBERT-bio-base outperforms the matched MLM baseline on all 8 tasks (+2.8pp, binomial p=0.004).
 
  ## Intended Use
 
 
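The averages and the p-value quoted in the updated results above can be checked directly from the table. The sketch below recomputes the macro-average of the eight per-task scores for ModernCamemBERT-bio-base and the MLM baseline, then runs an exact binomial (sign) test for the number of tasks won. Reading p=0.004 as a one-sided test with a 0.5 null win probability is an assumption, since the README does not say which variant was used. The helper names are illustrative.

```python
from math import comb

# Per-task scores copied from the updated results table.
tasks = ["FrACCO-30", "FrACCO-100", "CANTEMIST", "DISTEMIST",
         "MedDialog", "DiaMed", "EMEA", "Medline"]
bio_base = [74.8, 60.1, 71.0, 25.5, 63.6, 67.4, 68.6, 61.9]
mlm_baseline = [69.9, 56.8, 64.9, 23.5, 62.5, 63.4, 68.5, 61.4]


def mean(values):
    """Unweighted mean over tasks (macro-average)."""
    return sum(values) / len(values)


def sign_test_one_sided(wins: int, n: int) -> float:
    """P(X >= wins) for X ~ Binomial(n, 0.5): one-sided exact sign test."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


wins = sum(a > b for a, b in zip(bio_base, mlm_baseline))
gap = mean(bio_base) - mean(mlm_baseline)
p_value = sign_test_one_sided(wins, len(tasks))

print(f"avg bio-base: {mean(bio_base):.1f}")
print(f"avg MLM baseline: {mean(mlm_baseline):.1f}")
print(f"gap: {gap:+.2f} pp, wins: {wins}/{len(tasks)}, one-sided p = {p_value:.4f}")
```

With these inputs the averages come out at 61.6 and 58.9, the gap at 2.75 pp (reported as +2.8pp), and the one-sided p-value at 1/256 ≈ 0.0039, consistent with the stated p=0.004.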
@@ -227,10 +227,10 @@ The 8,192-token context is critical for long clinical documents (discharge summa
 
  | Model | Language | Parameters |
  |-------|----------|------------|
+ | [ModernBERT-bio-base](https://huggingface.co/almanach/ModernBERT-bio-base) | English | 149M |
+ | [ModernBERT-bio-large](https://huggingface.co/almanach/ModernBERT-bio-large) | English | 396M |
+ | [ModernCamemBERT-bio-base](https://huggingface.co/almanach/ModernCamemBERT-bio-base) | French | 150M |
+ | [ModernCamemBERT-bio-large](https://huggingface.co/almanach/ModernCamemBERT-bio-large) | French | 350M |
 
  ## Limitations
 
@@ -245,9 +245,9 @@ Apache 2.0
  ## Citation
 
  ```bibtex
+ @inproceedings{touchent2026clm,
  title={A Causal Language Modeling Detour Improves Encoder Continued Pretraining},
+ author={Touchent, Rian and de la Clergerie, {\'E}ric},
  booktitle={Proceedings of COLM},
  year={2026}
  }
@@ -255,4 +255,4 @@ Apache 2.0
 
  ## Acknowledgments
 
+ This work was performed using HPC resources from GENCI-IDRIS (Grant 2024-AD011015883). We thank the ALMAnaCH team at Inria for the ModernCamemBERT base checkpoint.