rntc committed (verified) · Commit 38eee8c · Parent: e091b09

Update model card: add model-index, datasets, training details, related models

Files changed (1): README.md (+92 −7)

README.md CHANGED
@@ -17,10 +17,86 @@ pipeline_tag: fill-mask
widget:
- text: "Les patients atteints de <mask> présentent un risque accru de complications cardiovasculaires."
- text: "Le traitement par <mask> a montré une amélioration significative des symptômes."
+ model-index:
+ - name: cpt-fr-base
+   results:
+   - task:
+       type: text-classification
+       name: Text Classification
+     dataset:
+       name: FrACCO-30
+       type: rntc/fracco
+     metrics:
+     - type: f1
+       value: 74.8
+   - task:
+       type: text-classification
+       name: Text Classification
+     dataset:
+       name: FrACCO-100
+       type: rntc/fracco
+     metrics:
+     - type: f1
+       value: 60.1
+   - task:
+       type: text-classification
+       name: Text Classification
+     dataset:
+       name: CANTEMIST
+       type: cantemist
+     metrics:
+     - type: f1
+       value: 71.0
+   - task:
+       type: text-classification
+       name: Text Classification
+     dataset:
+       name: DISTEMIST
+       type: distemist
+     metrics:
+     - type: f1
+       value: 25.5
+   - task:
+       type: text-classification
+       name: Text Classification
+     dataset:
+       name: MedDialog
+       type: meddialog
+     metrics:
+     - type: f1
+       value: 63.6
+   - task:
+       type: text-classification
+       name: Text Classification
+     dataset:
+       name: DiaMed
+       type: diamed
+     metrics:
+     - type: f1
+       value: 67.4
+   - task:
+       type: token-classification
+       name: NER
+     dataset:
+       name: EMEA
+       type: emea
+     metrics:
+     - type: f1
+       value: 65.9
+   - task:
+       type: token-classification
+       name: NER
+     dataset:
+       name: Medline
+       type: medline
+     metrics:
+     - type: f1
+       value: 58.2
---
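The two widget sentences retained above double as a quick local smoke test. A minimal sketch with the `transformers` fill-mask pipeline, assuming the Hub id `rntc/cpt-fr-base` (the id the Related Models table below uses):

```python
from transformers import pipeline

# Fill-mask smoke test using the first widget sentence from the front matter.
# Assumes the Hub id rntc/cpt-fr-base from the Related Models table below.
fill = pipeline("fill-mask", model="rntc/cpt-fr-base")

predictions = fill(
    "Les patients atteints de <mask> présentent un risque accru "
    "de complications cardiovasculaires."
)
for p in predictions:
    print(f"{p['token_str']!r}  score={p['score']:.3f}")
```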

# cpt-fr-base

+ *cpt-fr is available in two sizes: [base](https://huggingface.co/rntc/cpt-fr-base) (150M parameters) and [large](https://huggingface.co/rntc/cpt-fr-large) (350M parameters). Our code will be released upon publication.*

## Table of Contents
 
@@ -33,7 +109,7 @@ widget:

## Model Summary

- cpt-fr-base is a French biomedical encoder built by continued pretraining of [ModernCamemBERT](https://huggingface.co/almanach/moderncamembert-base) using a **CLM detour** recipe. Instead of standard MLM continued pretraining, we temporarily switch to causal language modeling (CLM) before returning to MLM. This produces lasting representational changes in early transformer layers that improve downstream biomedical performance by +2.9pp on average across 8 French biomedical tasks.
+ cpt-fr is a French biomedical encoder built by continued pretraining of [ModernCamemBERT](https://huggingface.co/almanach/moderncamembert-base) using a **CLM detour** recipe. Instead of standard MLM continued pretraining, we temporarily switch to causal language modeling (CLM) before returning to MLM. This produces lasting representational changes in early transformer layers that improve downstream biomedical performance by +2.9pp on average across 8 French biomedical tasks.

The model uses the ModernBERT architecture with FlashAttention, rotary positional embeddings (RoPE), alternating local/global attention, and unpadding, supporting **8,192-token context** — critical for long clinical documents that exceed the 512-token limit of previous French biomedical models.
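The 8,192-token budget is straightforward to exercise. A self-contained sketch, assuming the same Hub id; the clinical note string is a placeholder:

```python
from transformers import AutoModel, AutoTokenizer

# Long-document encoding sketch; the 8,192-token budget comes from the
# ModernBERT architecture described above. The input text is a placeholder.
tokenizer = AutoTokenizer.from_pretrained("rntc/cpt-fr-base")
model = AutoModel.from_pretrained("rntc/cpt-fr-base")

discharge_summary = "Compte rendu d'hospitalisation : ..."  # a long clinical note
inputs = tokenizer(
    discharge_summary,
    truncation=True,
    max_length=8192,  # would be 512 with older French biomedical encoders
    return_tensors="pt",
)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # [1, seq_len, 768]
```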
 
@@ -96,7 +172,7 @@ outputs = model(**inputs)
# outputs.last_hidden_state: [batch, seq_len, 768]
```

- **Note:** cpt-fr-base does not use token type IDs. You can omit the `token_type_ids` parameter.
+ **Note:** cpt-fr does not use token type IDs. You can omit the `token_type_ids` parameter.
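A quick check of the note above: with this checkpoint the encoding is expected to carry only `input_ids` and `attention_mask`. A sketch, assuming the same Hub id:

```python
from transformers import AutoTokenizer

# The model ignores segment embeddings, so no token_type_ids are needed.
tokenizer = AutoTokenizer.from_pretrained("rntc/cpt-fr-base")
enc = tokenizer("Le traitement par aspirine a montré une amélioration.")
print(list(enc.keys()))  # expected: ['input_ids', 'attention_mask']
```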
 
## Training
 
@@ -112,12 +188,12 @@ outputs = model(**inputs)

### Methodology

- cpt-fr-base is trained in two phases, initialized from [ModernCamemBERT](https://huggingface.co/almanach/moderncamembert-base):
+ cpt-fr is trained in two phases, initialized from [ModernCamemBERT](https://huggingface.co/almanach/moderncamembert-base):

* **Phase 1 — CLM detour (10B tokens):** The bidirectional attention mask is replaced with a causal mask, and the model is trained with next-token prediction. This dense training signal (100% of positions) deeply modifies early transformer layers for domain adaptation.
* **Phase 2 — MLM decay (1B tokens):** Bidirectional attention is restored, and the model is trained with masked language modeling at 15% masking. The learning rate decays from peak to 10% following a 1-sqrt schedule.

- Both phases use the same data mix. Training used AdamW (lr=2e-4, beta1=0.9, beta2=0.98), bf16 mixed precision, global batch size of 384 sequences (~3.1M tokens), on 4x H100 GPUs with [Composer](https://github.com/mosaicml/composer).
+ Both phases use the same data mix (11B tokens total). Training used AdamW (lr=2e-4, beta1=0.9, beta2=0.98), bf16 mixed precision, global batch size of 384 sequences (~3.1M tokens), on H100 80GB GPUs with [Composer](https://github.com/mosaicml/composer). Total training time: ~5 GPU-hours.
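The training code is unreleased, so the following is illustration only, not the authors' implementation: the heart of Phase 1 is swapping the encoder's bidirectional attention mask for a lower-triangular causal one, which is what makes every position a prediction target. (As a sanity check on the batch figure above: 384 sequences of 8,192 tokens is 384 × 8,192 ≈ 3.15M tokens, matching the quoted ~3.1M.)

```python
import torch

# Toy illustration of the Phase 1 mask switch (not the released training code).
seq_len = 6

# MLM phases: every token attends to every other token.
bidirectional_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)

# CLM detour: lower-triangular mask; each token sees only itself and the past,
# so all positions (100%) produce a next-token prediction loss.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask.int())
```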
 
### Why a CLM Detour?
 
@@ -147,6 +223,15 @@ This model is designed for French biomedical and clinical NLP tasks:

The 8,192-token context is critical for long clinical documents (discharge summaries, oncology reports) that are truncated by 512-token models.
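For the classification benchmarks in the new model-index (FrACCO, DiaMed, etc.), fine-tuning starts from a freshly initialized head on top of this encoder. A minimal sketch; the 30-label head and the example sentence are illustrative assumptions, not taken from the card:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Starting point for a classification benchmark such as FrACCO-30.
# num_labels=30 is an assumption for illustration; the head is untrained
# until fine-tuned on task data.
tokenizer = AutoTokenizer.from_pretrained("rntc/cpt-fr-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "rntc/cpt-fr-base", num_labels=30
)

enc = tokenizer(
    "Patiente admise pour suspicion d'embolie pulmonaire.",
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(**enc).logits
print(logits.shape)  # torch.Size([1, 30])
```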
 
+ ## Related Models
+
+ | Model | Language | Parameters |
+ |-------|----------|------------|
+ | [cpt-en-base](https://huggingface.co/rntc/cpt-en-base) | English | 149M |
+ | [cpt-en-large](https://huggingface.co/rntc/cpt-en-large) | English | 396M |
+ | [cpt-fr-base](https://huggingface.co/rntc/cpt-fr-base) | French | 150M |
+ | [cpt-fr-large](https://huggingface.co/rntc/cpt-fr-large) | French | 350M |
+
## Limitations

- Trained on French biomedical text; not suitable for other languages without further adaptation.
@@ -161,13 +246,13 @@ Apache 2.0

```bibtex
@inproceedings{anonymous2026clm,
-   title={Under review},
+   title={A Causal Language Modeling Detour Improves Encoder Continued Pretraining},
    author={Anonymous},
-   booktitle={Under review},
+   booktitle={Proceedings of COLM},
    year={2026}
}
```

## Acknowledgments

-
+ This work was performed using HPC resources.