boods committed on
Commit 8ebe089 · verified · 1 Parent(s): 5065e74

Create README.md

Files changed (1)
  1. README.md +393 -0
README.md ADDED
@@ -0,0 +1,393 @@
1
+ ---
2
+ language:
3
+ - fr
4
+ license: apache-2.0
5
+ library_name: transformers
6
+ tags:
7
+ - medical
8
+ - french
9
+ - question-answering
10
+ - lora
11
+ - peft
12
+ - qlora
13
+ - domain-adaptation
14
+ - clinical-nlp
15
+ - french-medical
16
+ - extractive-qa
17
+ - abstractive-qa
18
+ - multiple-choice-qa
19
+ base_model: Qwen/Qwen3-14B
20
+ pipeline_tag: text-generation
21
+ metrics:
22
+ - accuracy
23
+ - f1
24
+ inference: true
25
+ datasets:
26
+ - HealthDataHub/PARCOMED_research_only
27
+ ---
28
+
29
+ # EnMed-Unified — French Medical LLM (Multi-Task)
30
+
31
+ > **Headline system of the EnMed family.**
32
+ > A Qwen3-14B decoder adapted for French medical question answering through
33
+ > domain-adaptive continual pre-training (DAPT) on a large French health corpus,
34
+ > followed by **multi-task LoRA fine-tuning** across three QA formats simultaneously.
35
+ >
36
+ > Phase 1 evaluation establishes **4 statistically significant wins** over the
37
+ > un-adapted Qwen3-14B-vanilla baseline (BH-corrected, *q* = 0.05) with
38
+ > **zero significant losses** across nine independent *(task × shot)* evaluation cells.
39
+
40
+ ---
41
+
42
+ ## Model Family Overview
43
+
44
+ The **EnMed** family consists of five variants, all built on Qwen3-14B:
45
+
46
+ | Model | Adapter | Description |
47
+ |---|---|---|
48
+ | **EnMed-Unified** ⭐ | DAPT + Mixed LoRA | **Headline system.** Multi-task adapter trained jointly on all three QA tasks. Best deployment choice — never significantly worse than the base model on any task/shot combination. |
49
+ | EnMed-DAPT | DAPT only | Domain-adapted backbone, no task-specific LoRA. Statistically indistinguishable from Qwen3-14B-vanilla — confirms DAPT does not cause catastrophic forgetting. |
50
+ | EnMed-MCQA | DAPT + MCQA LoRA | Specialised for French medical multiple-choice QA. Safe specialist: 2 significant wins on its home task, zero losses. |
51
+ | EnMed-ExtQA | DAPT + ExtQA LoRA | Specialised for clinical span extraction. Gains on MCQA and 0-shot ExtQA but degrades abstractive QA. |
52
+ | EnMed-AbsQA | DAPT + AbsQA LoRA | Specialised for abstractive generation. Paradoxically degrades its home task under LLM-as-judge scoring while improving MCQA. See Limitations. |
53
+
54
+ ---
55
+
56
+ ## Intended Uses
57
+
58
+ ### Supported tasks
59
+
60
+ - **French Medical Multiple-Choice QA** — select the best answer from 4–5 candidates (e.g., medical licensing exam questions from FrenchMedMCQA / DrBenchmark)
61
+ - **French Clinical Extractive QA** — identify and return verbatim answer spans from French clinical case narratives (CAS corpus format)
62
+ - **French Medical Abstractive QA** — generate free-form answers to open-ended French medical questions (MediQAl format)
63
+
64
+ ### Out-of-scope uses
65
+
66
+ - ⚠️ **Clinical decision support / patient-facing deployment** — this is a **research prototype**. It has **not** been validated for real clinical use. Do not use outputs to guide patient care.
67
+ - **English-only medical QA** — the DAPT stage targets French; English capability may have drifted from the base model.
68
+ - **Languages other than French** — not evaluated.
69
+ - **NER, summarisation, or classification** — not part of the training or evaluation protocol.
70
+
71
+ ---
72
+
73
+ ## Quick Start
74
+
75
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "brice-eloundou/EnMed-Unified"  # replace with your actual HF repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# ── Multiple-Choice QA ───────────────────────────────────────────────────────
prompt = """Tu es un expert médical francophone. Réponds à la question suivante
en choisissant la meilleure réponse parmi les options proposées.

Question: Quelle est la principale cause d'insuffisance rénale aiguë en réanimation ?
A) Glomérulonéphrite aiguë
B) Nécrose tubulaire aiguë ischémique
C) Pyélonéphrite aiguë
D) Lithiase urinaire

Réponse:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Greedy decoding: `temperature` has no effect when do_sample=False, so it is omitted.
    out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
105
+
106
+ ### Log-probability decoding (recommended for MCQA)
107
+
108
+ For evaluation and benchmarking, score each option under teacher forcing and
109
+ select the highest-scoring option; this matches the evaluation protocol used
110
+ in the paper and avoids format-compliance failures.
111
+
112
```python
import torch
import torch.nn.functional as F

def score_option(model, tokenizer, prefix, option_text):
    """Sum of log-probabilities of the option tokens given the prefix (teacher forcing)."""
    # A leading space keeps the option from fusing with the last prefix token,
    # so the prefix typically tokenises the same alone and inside the full text.
    text = prefix + " " + option_text
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    prefix_len = tokenizer(prefix, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        # Logits at position i predict token i + 1, hence the shift by one.
        logits = model(**enc).logits[0, prefix_len - 1:-1]
    option_ids = enc["input_ids"][0, prefix_len:]
    lp = F.log_softmax(logits.float(), dim=-1)
    # Pick the log-probability of each actual option token and sum them.
    return lp.gather(1, option_ids.unsqueeze(1)).sum().item()

options = {"A": "Glomérulonéphrite aiguë",
           "B": "Nécrose tubulaire aiguë ischémique",
           "C": "Pyélonéphrite aiguë",
           "D": "Lithiase urinaire"}
scores = {k: score_option(model, tokenizer, prefix=prompt, option_text=v)
          for k, v in options.items()}
print("Predicted:", max(scores, key=scores.get))
```
133
+
134
+ ---
135
+
136
+ ## Training Details
137
+
138
+ ### Base model
139
+
140
+ [Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B) — instruction-tuned release.
141
+
142
+ ### Stage 1 — Domain-Adaptive Continual Pre-training (DAPT)
143
+
144
+ The backbone undergoes continual pre-training on the **French health corpus**
145
+ introduced by Mannion et al. (2026), a large openly licensed collection of French
146
+ clinical and biomedical text. This stage uses no task supervision; it exposes the
147
+ model to French medical vocabulary and discourse without committing to a downstream
148
+ task format.
149
+
150
+ ### Stage 2 — Multi-Task LoRA Fine-tuning
151
+
152
+ A single LoRA adapter is trained jointly on all three downstream QA tasks,
153
+ with task identifiers embedded in the prompt. This design prevents the
154
+ length/style register over-fitting that degrades single-task adapters under
155
+ LLM-as-judge evaluation (see Limitations).
156
+
157
+ | Hyperparameter | Value |
158
+ |---|---|
159
+ | LoRA rank *r* | 16 |
160
+ | LoRA scaling α | 32 |
161
+ | LoRA dropout | 0.05 |
162
+ | Target modules | Attention + MLP projection matrices |
163
+ | Quantisation | 4-bit NormalFloat (QLoRA / `bitsandbytes`) |
164
+ | Optimiser | AdamW (paged) |
165
+ | LR schedule | Cosine with linear warmup (3 % of steps) |
166
+ | Peak learning rate | 2 × 10⁻⁴ |
167
+ | Effective batch size | 16 (gradient accumulation) |
168
+ | Hardware | 1 × NVIDIA A100 80 GB |
169
+ | Framework | [Unsloth](https://github.com/unslothai/unsloth) + [HuggingFace PEFT](https://github.com/huggingface/peft) |
170
+
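The schedule row in the table (cosine with linear warmup over 3 % of steps, peak 2 × 10⁻⁴) can be written as a function of the optimiser step. This is a minimal sketch, not the training code: the name `lr_at`, the total of 1000 steps, and the decay floor of zero are assumptions made for illustration.

```python
import math

PEAK_LR = 2e-4          # peak learning rate, from the hyperparameter table
WARMUP_FRACTION = 0.03  # warmup share of total steps, from the hyperparameter table

def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at a given optimiser step (0-based)."""
    warmup_steps = max(1, round(WARMUP_FRACTION * total_steps))
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak.
        return PEAK_LR * step / warmup_steps
    # Cosine decay from the peak down to 0 (the floor of 0 is an assumption).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000  # hypothetical number of optimiser steps
for s in (0, 15, 30, 515, 1000):
    print(s, f"{lr_at(s, total):.2e}")
```

With these assumptions the rate climbs linearly for the first 30 steps, reaches the peak at step 30, and falls to half the peak midway through the decay phase.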
171
+ ---
172
+
173
+ ## Evaluation
174
+
175
+ All eight systems were evaluated on three French medical QA tasks under
176
+ 0-shot, 3-shot, and 5-shot prompting — a 3 × 3 grid of nine independent
177
+ *(task, shot)* cells. Item-level paired *t*-tests were conducted per cell
178
+ against Qwen3-14B-vanilla, with Benjamini–Hochberg FDR control (*q* = 0.05)
179
+ and a Bonferroni bound reported alongside.
180
+
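The Benjamini–Hochberg step-up procedure named above can be sketched in a few lines. This is an illustration rather than the evaluation code behind the reported results; the function name `benjamini_hochberg` and the nine p-values are invented for the example.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the null is rejected at FDR level q."""
    m = len(p_values)
    # Indices sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k such that p_(k) <= (k / m) * q.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff:
            rejected[idx] = True
    return rejected

# Hypothetical p-values for nine (task, shot) cells, NOT results from the paper.
example_p = [0.001, 0.004, 0.012, 0.020, 0.060, 0.150, 0.300, 0.450, 0.800]
print(benjamini_hochberg(example_p))
```

Because the procedure is step-up, a p-value that misses its own rank threshold is still rejected when a larger p-value further down the ranking meets its threshold.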
181
+ | Task | Dataset | *N* (test) | Primary metric |
182
+ |---|---|---|---|
183
+ | Multiple-choice QA (MCQA) | FrenchMedMCQA / DrBenchmark | 622 | Accuracy |
184
+ | Extractive QA (ExtQA) | CAS clinical cases | 207 | Token-level F₁ |
185
+ | Abstractive QA (AbsQA) | MediQAl | 247–248 | LLM-as-judge 1–5 (Gemma) |
186
+
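Token-level F₁, the ExtQA metric in the table, is commonly computed as the harmonic mean of precision and recall over the bag of tokens shared by prediction and reference. The sketch below assumes lowercasing and whitespace tokenisation; the exact normalisation used in the evaluation is not stated here, and `token_f1` is a hypothetical helper name.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-tokens F1 between a predicted span and a reference span."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("une nécrose tubulaire aiguë", "nécrose tubulaire aiguë ischémique"))
```

In the example, three of the four tokens overlap on each side, so precision and recall are both 0.75.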
187
+ ---
188
+
189
+ ### Raw scores across all models and shot counts
190
+
191
+ ![Raw scores per model per shot count across MCQA (accuracy), ExtQA (token-F1) and AbsQA (LLM-as-judge). The dotted line marks Qwen3-14B-vanilla 0-shot performance.](figures/fig01_raw_bars.png)
192
+
193
+ *The dotted line marks the Qwen3-14B-vanilla 0-shot reference. EnMed variants
194
+ consistently sit above or on the reference for MCQA and ExtQA; the AbsQA panel
195
+ reveals the EnMed-AbsQA collapse discussed in Limitations.*
196
+
197
+ ---
198
+
199
+ ### Per-task means (averaged over 0 / 3 / 5-shot)
200
+
201
+ | Model | MCQA acc. ↑ | ExtQA F₁ ↑ | AbsQA judge ↑ |
202
+ |---|---|---|---|
203
+ | **EnMed-Unified** ⭐ | **0.575** | **0.529** | 3.195 |
204
+ | EnMed-MCQA | 0.569 | 0.507 | **3.242** |
205
+ | EnMed-ExtQA | 0.572 | **0.533** | 3.082 |
206
+ | EnMed-DAPT | 0.546 | 0.504 | 3.242 |
207
+ | EnMed-AbsQA | **0.582** | 0.506 | 2.997 |
208
+ | Qwen3-14B-vanilla *(reference)* | 0.548 | 0.502 | 3.240 |
209
+ | Qwen3-8B | 0.466 | 0.511 | 3.144 |
210
+ | Mistral-7B-Instruct-v0.3 | 0.277 | 0.445 | 2.926 |
211
+
212
+ ![Per-task means ± 1 std across the three shot counts. Hatched bar = Qwen3-14B-vanilla reference; red dashed line = its mean. Descriptive only.](figures/fig05_per_task_mean_std.png)
213
+
214
+ ---
215
+
216
+ ### Global descriptive ranking (normalised, 9 cells)
217
+
218
+ ![Global descriptive ranking: mean normalised score across the 9 (task, shot) cells ± 1 std. The dashed line marks the Qwen3-14B-vanilla mean of 0.537. EnMed-Unified leads with mean 0.551 and the smallest standard deviation.](figures/fig06_global_mean_std.png)
219
+
220
+ | Model | Mean | Std |
221
+ |---|---|---|
222
+ | **EnMed-Unified** | **0.551** | **0.026** |
223
+ | EnMed-MCQA | 0.545 | 0.035 |
224
+ | EnMed-ExtQA | 0.542 | 0.028 |
225
+ | EnMed-DAPT | 0.537 | 0.034 |
226
+ | Qwen3-14B-vanilla | 0.537 | 0.034 |
227
+ | EnMed-AbsQA | 0.529 | 0.043 |
228
+ | Qwen3-8B | 0.505 | 0.041 |
229
+ | Mistral-7B-Instruct-v0.3 | 0.401 | 0.103 |
230
+
231
+ *This ranking is descriptive only — normalisation across incomparable metric scales
232
+ does not constitute a significance test.*
233
+
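The global means in this table can be reproduced from the per-task means if accuracy and F₁ are kept as they are and the 1–5 judge score is mapped to [0, 1] as (s − 1) / 4. That mapping is inferred from the numbers, not stated in the source; the sketch below checks it and agrees with the table to within rounding.

```python
# Per-task means copied from the "Per-task means" table:
# (MCQA accuracy, ExtQA token-F1, AbsQA judge score on a 1-5 scale).
per_task = {
    "EnMed-Unified": (0.575, 0.529, 3.195),
    "EnMed-MCQA": (0.569, 0.507, 3.242),
    "EnMed-ExtQA": (0.572, 0.533, 3.082),
    "EnMed-DAPT": (0.546, 0.504, 3.242),
    "EnMed-AbsQA": (0.582, 0.506, 2.997),
    "Qwen3-14B-vanilla": (0.548, 0.502, 3.240),
    "Qwen3-8B": (0.466, 0.511, 3.144),
    "Mistral-7B-Instruct-v0.3": (0.277, 0.445, 2.926),
}

def global_mean(mcqa: float, extqa: float, judge: float) -> float:
    # Assumed normalisation: accuracy and F1 already lie in [0, 1];
    # the judge score is rescaled linearly from [1, 5] to [0, 1].
    judge01 = (judge - 1.0) / 4.0
    return (mcqa + extqa + judge01) / 3.0

for name, scores in per_task.items():
    print(f"{name:26s} {global_mean(*scores):.3f}")
```

Averaging the three per-task means equals averaging the nine cells only because the rescaling is linear and each task contributes three shot settings.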
234
+ ---
235
+
236
+ ### Normalised scores across all 9 (task × shot) cells
237
+
238
+ ![Normalised scores across the 9 (task, shot) cells. Each cell is rescaled so that the worst-performing system maps to 0 and the best to 1. Rows sorted by descending global mean.](figures/fig02_normalized_heatmap.png)
239
+
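The per-cell rescaling described in the heatmap caption (worst system maps to 0, best to 1) is ordinary min-max normalisation. A minimal sketch follows; `min_max_normalise` and the three example scores are hypothetical.

```python
def min_max_normalise(cell_scores: dict) -> dict:
    """Rescale the scores of one (task, shot) cell so worst -> 0 and best -> 1."""
    lo = min(cell_scores.values())
    hi = max(cell_scores.values())
    if hi == lo:
        # All systems tie; there is no spread to rescale.
        return {name: 0.0 for name in cell_scores}
    return {name: (s - lo) / (hi - lo) for name, s in cell_scores.items()}

# Hypothetical scores for a single cell, not taken from the figures.
example_cell = {"system_a": 0.28, "system_b": 0.47, "system_c": 0.58}
normalised = min_max_normalise(example_cell)
print(normalised)
```

Note that this rescaling exaggerates small gaps in cells where all systems score close together, which is one reason the heatmap is descriptive only.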
240
+ ---
241
+
242
+ ### Per-cell deltas versus Qwen3-14B-vanilla
243
+
244
+ ![Per-cell delta of each EnMed candidate against Qwen3-14B-vanilla. Positive (red) = candidate outperforms reference. Three panels: MCQA accuracy, ExtQA token-F1, AbsQA LLM-as-judge.](figures/fig03_delta_heatmaps.png)
245
+
246
+ ---
247
+
248
+ ### Item-level paired t-tests with 95 % confidence intervals
249
+
250
+ ![Item-level paired t-tests against Qwen3-14B-vanilla. Each bar is the mean delta ± 95% CI computed from N=622 (MCQA), N=207 (ExtQA), N≈248 (AbsQA) paired observations. Stars: * p<0.05, ** p<0.01, *** p<0.001. Inferential figure.](figures/fig07_item_level_ttest.png)
251
+
252
+ *Positive bars mean the EnMed variant outperforms the reference; negative bars
253
+ mean the opposite. Only starred bars represent statistically significant differences.*
254
+
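The quantity behind each bar is the mean of per-item score differences together with its standard error. The sketch below uses only the standard library and a normal approximation (z = 1.96) for the 95 % interval, which is reasonable for the hundreds of items per cell reported here but not for the toy data shown; the helper name and the example vectors are hypothetical.

```python
import math
import statistics

def paired_delta_ci(candidate, reference, z=1.96):
    """Mean paired difference, approximate 95% CI, and paired t statistic."""
    deltas = [c - r for c, r in zip(candidate, reference)]
    n = len(deltas)
    mean = statistics.fmean(deltas)
    # Standard error of the mean difference (sample standard deviation / sqrt(n)).
    se = statistics.stdev(deltas) / math.sqrt(n)
    t_stat = mean / se if se > 0 else float("nan")
    return mean, (mean - z * se, mean + z * se), t_stat

# Hypothetical per-item correctness (1 = correct), NOT data from the paper.
candidate = [1, 1, 0, 1, 1, 0, 1, 1]
reference = [1, 0, 0, 1, 0, 0, 1, 1]
mean, ci, t_stat = paired_delta_ci(candidate, reference)
print(f"mean delta = {mean:.3f}, CI = ({ci[0]:.3f}, {ci[1]:.3f}), t = {t_stat:.3f}")
```

A p-value would additionally require the Student t distribution with n − 1 degrees of freedom, which is left out to keep the sketch dependency-free.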
255
+ ---
256
+
257
+ ### Significance heatmap — per-cell annotated deltas
258
+
259
+ ![Per-cell signed delta of each EnMed candidate against Qwen3-14B-vanilla annotated with paired-t significance (* p<0.05, ** p<0.01, *** p<0.001; ns otherwise). Reading a row gives the per-system win/loss record.](figures/fig08_sig_heatmap.png)
260
+
261
+ ---
262
+
263
+ ### Statistical significance record vs. Qwen3-14B-vanilla
264
+
265
+ *(9 independent item-level paired t-tests; α = 0.05; BH-corrected wins marked)*
266
+
267
+ | Model | Sig. wins / 9 | Sig. losses / 9 | Verdict |
268
+ |---|---|---|---|
269
+ | **EnMed-Unified** ⭐ | **4** ✅ BH-robust | **0** | Significantly better on MCQA-0, MCQA-3, ExtQA-0, ExtQA-3; never worse |
270
+ | EnMed-MCQA | 2 | 0 | Safe MCQA specialist |
271
+ | EnMed-ExtQA | 3 | 3 | Mixed: wins MCQA + ExtQA-0, loses all AbsQA cells |
272
+ | EnMed-AbsQA | 3 | 3 | Mixed: wins all MCQA, loses all AbsQA |
273
+ | EnMed-DAPT | 0 | 0 | Indistinguishable from reference — confirms DAPT safety |
274
+
275
+ ![Significance record across all 9 (task, shot) cells per system: dark green = sig. wins, light green = numeric wins, light red = numeric losses, dark red = sig. losses. Dotted line = 4.5-cell majority threshold.](figures/fig10_sig_summary.png)
276
+
277
+ ---
278
+
279
+ ### Best model at every (task × shot) cell
280
+
281
+ ![Best-performing system at every (task, shot) cell. Each cell is coloured by system identity and labelled with the winning raw score. No single model wins all 9 cells.](figures/fig11_best_per_cell.png)
282
+
283
+ *No single system wins all nine cells: EnMed-AbsQA leads MCQA, EnMed-ExtQA leads
284
+ 0- and 5-shot ExtQA, and AbsQA cells split across EnMed-DAPT, Qwen3-14B-vanilla
285
+ and EnMed-MCQA. EnMed-Unified does not lead any single cell but is never the worst.*
286
+
287
+ ---
288
+
289
+ ### Critical Difference diagrams — rank analysis per shot count
290
+
291
+ Average rank across the three tasks (lower = better). Critical difference CD = 6.06.
292
+
293
+ ![Critical Difference diagram, 0-shot. Average rank of each system across 3 tasks. CD=6.06. EnMed-Unified and EnMed-ExtQA are tied best-ranked at 3.00; Mistral-7B is worst at 7.67.](figures/cd_0shot.png)
294
+
295
+ ![Critical Difference diagram, 3-shot. EnMed-Unified leads at 2.83; Mistral-7B is worst at 8.00. CD=6.06.](figures/cd_3shot.png)
296
+
297
+ ![Critical Difference diagram, 5-shot. EnMed-MCQA leads at 2.33; EnMed-Unified second at 3.00. Mistral-7B worst at 8.00. CD=6.06.](figures/cd_5shot.png)
298
+
299
+ *The CD (6.06) exceeds the observed rank spread, so these diagrams are descriptive
300
+ consensus rankings — they corroborate but do not independently prove the item-level
301
+ findings above.*
302
+
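The reported CD of 6.06 matches the Nemenyi critical difference, CD = q_α · sqrt(k(k + 1) / (6N)), for k = 8 systems ranked over N = 3 tasks at α = 0.05. That the diagrams use the Nemenyi test is an inference from this match; the critical value q₀.₀₅ = 3.031 for k = 8 is taken from the standard table for the two-tailed Nemenyi test.

```python
import math

def nemenyi_cd(k: int, n: int, q_alpha: float) -> float:
    """Critical difference in average rank for k systems compared over n datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))

Q_ALPHA_K8 = 3.031  # tabulated critical value for k = 8 at alpha = 0.05
cd = nemenyi_cd(k=8, n=3, q_alpha=Q_ALPHA_K8)
print(round(cd, 2))  # 6.06
```

With only three tasks the square-root term equals 2, so the CD is twice the critical value and larger than any rank gap the eight systems could plausibly show.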
303
+ ---
304
+
305
+ ## Limitations
306
+
307
+ **Multiplicity.** Benjamini–Hochberg correction at *q* = 0.05 confirms EnMed-Unified's
308
+ four headline wins. Weaker cells (e.g., ExtQA-3, MCQA-5) do not survive correction
309
+ and should be treated as suggestive.
310
+
311
+ **Distributional assumptions.** Paired *t*-tests assume approximately normal per-item
312
+ differences, which may not hold for binary MCQA outcomes or ordinal 1–5 judge scores.
313
+ A fully ordinal-aware treatment remains future work.
314
+
315
+ **Single-judge evaluation.** AbsQA scores were generated by a single Gemma-family
316
+ LLM-as-judge. Single-judge evaluations are susceptible to judge-specific biases; a
317
+ predominantly English-trained judge may under-reward answers correct under French
318
+ clinical conventions. Judge diversity and order-invariance checks have not been
319
+ conducted.
320
+
321
+ **Task-specific adapter paradox.** EnMed-AbsQA improves MCQA while significantly
322
+ degrading its own home task (AbsQA) under LLM-as-judge scoring, and EnMed-ExtQA
323
+ likewise loses every AbsQA cell. We attribute this to over-fitting to a length/style
324
+ register the judge penalises. Multi-task training (EnMed-Unified) mitigates this.
325
+
326
+ **Phase 2 not yet released.** This is the Phase 1 model. The full cross-lingual
327
+ continual pre-training pipeline (English biomedical → French medical transfer)
328
+ will be released as EnMed-Phase2.
329
+
330
+ **⚠️ Not for clinical deployment.** This model has not been clinically validated.
331
+ Do not use it for patient-facing applications or clinical decision support.
332
+
333
+ ---
334
+
335
+ ## Citation
336
+
337
+ The associated paper has been **submitted** to Springer Lecture Notes in Computer
338
+ Science (LNCS) and is currently **under review**. If you use EnMed-Unified or any
339
+ member of the EnMed family, please cite the preprint version:
340
+
341
+ ```bibtex
342
+ @unpublished{abodoeloundou2025enmed,
343
+ title = {Cross-Lingual Domain Adaptation and Multi-Task Fine-Tuning
344
+ for High-Fidelity Medical Language Models},
345
+ author = {Abodo Eloundou, Brice Donald and Malykh, Valentin},
346
+ note = {Submitted to Springer Lecture Notes in Computer Science (LNCS).
347
+ Under review. ITMO University / MTS Web Services,
348
+ Saint Petersburg, Russia},
349
+ year = {2025}
350
+ }
351
+ ```
352
+
353
+ *This entry will be updated to a full `@inproceedings` citation upon acceptance.*
354
+
355
+ If you use the French health pre-training corpus, please also cite:
356
+
357
+ ```bibtex
358
+ @article{mannion2026biomedical,
359
+ title = {Is biomedical specialization still worth it?
360
+ Insights from domain-adaptive language modelling
361
+ with a new French health corpus},
362
+ author = {Mannion, A. and Macaire, C. and Violle, A. and
363
+ Ohayon, S. and Tannier, X. and Schwab, D. and others},
364
+ journal = {arXiv preprint arXiv:2604.06903},
365
+ year = {2026}
366
+ }
367
+ ```
368
+
369
+ ---
370
+
371
+ ## Acknowledgements
372
+
373
+ Research conducted at **ITMO University**, Saint Petersburg, Russia and
374
+ **MTS Web Services**, Saint Petersburg, Russia.
375
+
376
+ **Authors:**
377
+ - **Brice Donald Abodo Eloundou** — ITMO University &nbsp;|&nbsp; ORCID: [0009-0009-1845-5867](https://orcid.org/0009-0009-1845-5867)
378
+ - **Valentin Malykh** — MTS Web Services / ITMO University
379
+
380
+ Evaluation benchmarks: DrBenchmark (Labrak et al., 2024), FrenchMedMCQA
381
+ (Labrak et al., 2022), MediQAl (Bazoge, 2025), CAS corpus (Grabar et al., 2020).
382
+
383
+ ---
384
+
385
+ ## License
386
+
387
+ Released under **Apache 2.0**, consistent with the Qwen3-14B base model license.
388
+ The pre-training corpus license follows Mannion et al. (2026); users are responsible
389
+ for compliance with that corpus's terms.
390
+
391
+ > **Clinical use warning:** This model is a research artefact. Any use in clinical
392
+ > or patient-facing settings requires independent clinical validation and regulatory
393
+ > approval in the applicable jurisdiction.