---
language:
- fr
license: apache-2.0
library_name: transformers
tags:
- medical
- french
- question-answering
- lora
- peft
- qlora
- domain-adaptation
- clinical-nlp
- french-medical
- extractive-qa
- abstractive-qa
- multiple-choice-qa
base_model: Qwen/Qwen3-14B
pipeline_tag: text-generation
metrics:
- accuracy
- f1
inference: true
datasets:
- HealthDataHub/PARCOMED_research_only
---

# EnMed-Unified — French Medical LLM (Multi-Task)

> **Headline system of the EnMed family.**
> A Qwen3-14B decoder adapted for French medical question answering through
> domain-adaptive continual pre-training (DAPT) on a large French health corpus,
> followed by **multi-task LoRA fine-tuning** across three QA formats simultaneously.
>
> Phase 1 evaluation establishes **4 statistically significant wins** over the
> un-adapted Qwen3-14B-vanilla baseline (BH-corrected, *q* = 0.05) with
> **zero significant losses** across nine independent *(task × shot)* evaluation cells.

---

## Model Family Overview

The **EnMed** family consists of five variants, all built on Qwen3-14B:

| Model | Adapter | Description |
|---|---|---|
| **EnMed-Unified** ⭐ | DAPT + Mixed LoRA | **Headline system.** Multi-task adapter trained jointly on all three QA tasks. Best deployment choice — never significantly worse than the base model on any task/shot combination. |
| EnMed-DAPT | DAPT only | Domain-adapted backbone, no task-specific LoRA. Statistically indistinguishable from Qwen3-14B-vanilla on all nine cells, which is consistent with DAPT not causing catastrophic forgetting on these tasks. |
| EnMed-MCQA | DAPT + MCQA LoRA | Specialised for French medical multiple-choice QA. Safe specialist: 2 significant wins on its home task, zero losses. |
| EnMed-ExtQA | DAPT + ExtQA LoRA | Specialised for clinical span extraction. Gains on MCQA and 0-shot ExtQA but degrades abstractive QA. |
| EnMed-AbsQA | DAPT + AbsQA LoRA | Specialised for abstractive generation. Paradoxically degrades its home task under LLM-as-judge scoring while improving MCQA. See Limitations. |

---

## Intended Uses

### Supported tasks

- **French Medical Multiple-Choice QA** — select the best answer from 4–5 candidates (e.g., medical licensing exam questions from FrenchMedMCQA / DrBenchmark)
- **French Clinical Extractive QA** — identify and return verbatim answer spans from French clinical case narratives (CAS corpus format)
- **French Medical Abstractive QA** — generate free-form answers to open-ended French medical questions (MediQAl format)

### Out-of-scope uses

- ⚠️ **Clinical decision support / patient-facing deployment** — this is a **research prototype**. It has **not** been validated for real clinical use. Do not use outputs to guide patient care.
- **English-only medical QA** — the DAPT stage targets French; English capability may have drifted from the base model.
- **Languages other than French** — not evaluated.
- **NER, summarisation, or classification** — not part of the training or evaluation protocol.

---

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "brice-eloundou/EnMed-Unified"   # replace with your actual HF repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# ── Multiple-Choice QA ───────────────────────────────────────────────────────
prompt = """Tu es un expert médical francophone. Réponds à la question suivante
en choisissant la meilleure réponse parmi les options proposées.

Question: Quelle est la principale cause d'insuffisance rénale aiguë en réanimation ?
A) Glomérulonéphrite aiguë
B) Nécrose tubulaire aiguë ischémique
C) Pyélonéphrite aiguë
D) Lithiase urinaire

Réponse:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Greedy decoding: temperature has no effect when do_sample=False, so it is omitted.
    out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

### Log-probability decoding (recommended for MCQA)

For evaluation and benchmarking, score each option under teacher forcing and
select the option whose tokens have the highest summed log-probability. This
matches the evaluation protocol used in the paper and avoids format-compliance
failures, because the model never has to generate a well-formed answer itself.

```python
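import torch
import torch.nn.functional as F

def score_option(model, tokenizer, prefix, option_text):
    """Return the summed log-probability of option_text given prefix."""
    enc = tokenizer(prefix + option_text, return_tensors="pt").to(model.device)
    # Assumes the prefix tokenises identically on its own and inside the
    # joined text, so prefix_len marks the first option token.
    prefix_len = tokenizer(prefix, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        # Logits at position t predict token t + 1, hence the one-step shift.
        logits = model(**enc).logits[0, prefix_len - 1:-1]
    option_ids = enc["input_ids"][0, prefix_len:]
    log_probs = F.log_softmax(logits.float(), dim=-1)
    # Pick the log-probability of each actual option token and sum them.
    return log_probs.gather(1, option_ids.unsqueeze(1)).sum().item()

# `model`, `tokenizer` and `prompt` come from the Quick Start snippet above.
options = {"A": "Glomérulonéphrite aiguë",
           "B": "Nécrose tubulaire aiguë ischémique",
           "C": "Pyélonéphrite aiguë",
           "D": "Lithiase urinaire"}
scores = {k: score_option(model, tokenizer, prefix=prompt, option_text=v)
          for k, v in options.items()}
print("Predicted:", max(scores, key=scores.get))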
```

---

## Training Details

### Base model

[Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B) — instruction-tuned release.

### Stage 1 — Domain-Adaptive Continual Pre-training (DAPT)

The backbone undergoes continual pre-training on the **French health corpus**
introduced by Mannion et al. (2026), a large openly licensed collection of French
clinical and biomedical text. This stage uses no task supervision; it exposes the
model to French medical vocabulary and discourse without committing to a downstream
task format.

### Stage 2 — Multi-Task LoRA Fine-tuning

A single LoRA adapter is trained jointly on all three downstream QA tasks,
with task identifiers embedded in the prompt. This design prevents the
length/style register over-fitting that degrades single-task adapters under
LLM-as-judge evaluation (see Limitations).
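
As a rough illustration of what "task identifiers embedded in the prompt" can look like, the sketch below prefixes each example with a task tag and shuffles the three tasks into one training stream. The tag format, the French instructions, and the helper names `build_prompt` and `mix_tasks` are hypothetical; this card does not document the exact templates used in training.

```python
import random

# Hypothetical task tags and instructions (not the templates used in training).
TASK_INSTRUCTIONS = {
    "mcqa": "Choisis la meilleure réponse parmi les options proposées.",
    "extqa": "Recopie mot pour mot le passage du texte qui répond à la question.",
    "absqa": "Rédige une réponse libre et concise à la question.",
}

def build_prompt(task, question, context=""):
    """Prefix the prompt with a task identifier so one adapter can serve all formats."""
    if task not in TASK_INSTRUCTIONS:
        raise ValueError(f"unknown task: {task}")
    parts = [f"[TASK: {task.upper()}]", TASK_INSTRUCTIONS[task]]
    if context:
        parts.append(f"Contexte: {context}")
    parts.append(f"Question: {question}")
    parts.append("Réponse:")
    return "\n".join(parts)

def mix_tasks(examples_by_task, seed=0):
    """Shuffle examples from all tasks into a single (task, example) stream."""
    mixed = [(task, ex) for task, exs in examples_by_task.items() for ex in exs]
    random.Random(seed).shuffle(mixed)
    return mixed

print(build_prompt("extqa", "Quel traitement a été instauré ?",
                   context="Le patient a reçu de l'amoxicilline."))
```

Mixing at the example level, rather than training on one task after another, is one simple way to keep the adapter from settling into a single answer length or style.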

| Hyperparameter | Value |
|---|---|
| LoRA rank *r* | 16 |
| LoRA scaling α | 32 |
| LoRA dropout | 0.05 |
| Target modules | Attention + MLP projection matrices |
| Quantisation | 4-bit NormalFloat (QLoRA / `bitsandbytes`) |
| Optimiser | AdamW (paged) |
| LR schedule | Cosine with linear warmup (3 % of steps) |
| Peak learning rate | 2 × 10⁻⁴ |
| Effective batch size | 16 (gradient accumulation) |
| Hardware | 1 × NVIDIA A100 80 GB |
| Framework | [Unsloth](https://github.com/unslothai/unsloth) + [HuggingFace PEFT](https://github.com/huggingface/peft) |
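
To make the learning-rate row concrete, here is a minimal sketch of a cosine schedule with 3 % linear warmup and a peak of 2 × 10⁻⁴. The total step count, the decay floor of zero, and the function name `lr_at` are assumptions for illustration; the training framework's own scheduler may differ in details such as the first-step value.

```python
import math

PEAK_LR = 2e-4          # peak learning rate from the table
WARMUP_FRACTION = 0.03  # linear warmup over 3 % of steps

def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_fraction=WARMUP_FRACTION):
    """Learning rate at a 0-indexed optimiser step: linear warmup, then cosine decay to 0."""
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Ramp linearly so that the last warmup step reaches the peak.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical run length of 1,000 optimiser steps.
for step in (0, 29, 30, 500, 999):
    print(step, f"{lr_at(step, 1000):.2e}")
```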

---

## Evaluation

All eight systems were evaluated on three French medical QA tasks under
0-shot, 3-shot, and 5-shot prompting, giving a 3 × 3 grid of nine independent
*(task, shot)* cells. Item-level paired *t*-tests against Qwen3-14B-vanilla were
run in each cell. The resulting *p*-values were corrected with Benjamini–Hochberg
FDR control (*q* = 0.05), and the Bonferroni bound is reported alongside.
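
For readers unfamiliar with the two corrections, the sketch below applies the Benjamini–Hochberg step-up procedure and the Bonferroni bound to nine *p*-values. The *p*-values are made up for illustration and are not the paper's results; the function names are likewise illustrative.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the null is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-indexed rank k with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

def bonferroni(p_values, alpha=0.05):
    """Return True where p <= alpha / m (family-wise error control)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Illustrative p-values for nine (task, shot) cells; NOT taken from the paper.
ILLUSTRATIVE_P = [0.001, 0.004, 0.012, 0.020, 0.030, 0.20, 0.45, 0.60, 0.90]

print("BH rejections:        ", sum(benjamini_hochberg(ILLUSTRATIVE_P)))
print("Bonferroni rejections:", sum(bonferroni(ILLUSTRATIVE_P)))
```

With these illustrative inputs, Benjamini–Hochberg keeps more cells than Bonferroni, which is the usual trade-off between controlling the false discovery rate and controlling the family-wise error rate.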

| Task | Dataset | *N* (test) | Primary metric |
|---|---|---|---|
| Multiple-choice QA (MCQA) | FrenchMedMCQA / DrBenchmark | 622 | Accuracy |
| Extractive QA (ExtQA) | CAS clinical cases | 207 | Token-level F₁ |
| Abstractive QA (AbsQA) | MediQAl | 247–248 | LLM-as-judge 1–5 (Gemma) |
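
Token-level F₁ for extractive QA is commonly computed SQuAD-style, as the harmonic mean of token precision and recall between the predicted and gold spans. The sketch below assumes lowercasing and whitespace tokenisation; the exact normalisation used in the paper's evaluation is not stated here and may differ.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Bag-of-tokens F1 between a predicted span and a reference span."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# The prediction contains the gold span plus one extra token.
print(token_f1("douleur thoracique aiguë", "douleur thoracique"))
```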

---

### Raw scores across all models and shot counts

![Raw scores per model per shot count across MCQA (accuracy), ExtQA (token-F1) and AbsQA (LLM-as-judge). The dotted line marks Qwen3-14B-vanilla 0-shot performance.](figures/fig01_raw_bars.png)

*The dotted line marks the Qwen3-14B-vanilla 0-shot reference. EnMed variants
consistently sit above or on the reference for MCQA and ExtQA; the AbsQA panel
reveals the EnMed-AbsQA collapse discussed in Limitations.*

---

### Per-task means (averaged over 0 / 3 / 5-shot)

| Model | MCQA acc. ↑ | ExtQA F₁ ↑ | AbsQA judge ↑ |
|---|---|---|---|
| **EnMed-Unified** ⭐ | 0.575 | 0.529 | 3.195 |
| EnMed-MCQA | 0.569 | 0.507 | **3.242** |
| EnMed-ExtQA | 0.572 | **0.533** | 3.082 |
| EnMed-DAPT | 0.546 | 0.504 | **3.242** |
| EnMed-AbsQA | **0.582** | 0.506 | 2.997 |
| Qwen3-14B-vanilla *(reference)* | 0.548 | 0.502 | 3.240 |
| Qwen3-8B | 0.466 | 0.511 | 3.144 |
| Mistral-7B-Instruct-v0.3 | 0.277 | 0.445 | 2.926 |

![Per-task means ± 1 std across the three shot counts. Hatched bar = Qwen3-14B-vanilla reference; red dashed line = its mean. Descriptive only.](figures/fig05_per_task_mean_std.png)

---

### Global descriptive ranking (normalised, 9 cells)

![Global descriptive ranking: mean normalised score across the 9 (task, shot) cells ± 1 std. The dashed line marks the Qwen3-14B-vanilla mean of 0.537. EnMed-Unified leads with mean 0.551 and the smallest standard deviation.](figures/fig06_global_mean_std.png)

| Model | Mean | Std |
|---|---|---|
| **EnMed-Unified** | **0.551** | **0.026** |
| EnMed-MCQA | 0.545 | 0.035 |
| EnMed-ExtQA | 0.542 | 0.028 |
| EnMed-DAPT | 0.537 | 0.034 |
| Qwen3-14B-vanilla | 0.537 | 0.034 |
| EnMed-AbsQA | 0.529 | 0.043 |
| Qwen3-8B | 0.505 | 0.041 |
| Mistral-7B-Instruct-v0.3 | 0.401 | 0.103 |

*This ranking is descriptive only — normalisation across incomparable metric scales
does not constitute a significance test.*
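
The card does not spell out the normalisation behind this ranking. One reading that reproduces the reported means to within rounding is to keep accuracy and F₁ as they are and to map the 1–5 judge score onto [0, 1] via (score − 1) / 4. The sketch below checks that assumption against the per-task means table; because each task contributes three cells, averaging the three per-task means is equivalent to averaging the nine cells. Treat it as an inferred reconstruction rather than the authors' stated method.

```python
# Per-task means copied from the table above: (MCQA accuracy, ExtQA F1, AbsQA judge score).
PER_TASK_MEANS = {
    "EnMed-Unified": (0.575, 0.529, 3.195),
    "EnMed-MCQA": (0.569, 0.507, 3.242),
    "EnMed-ExtQA": (0.572, 0.533, 3.082),
    "EnMed-DAPT": (0.546, 0.504, 3.242),
    "Qwen3-14B-vanilla": (0.548, 0.502, 3.240),
    "EnMed-AbsQA": (0.582, 0.506, 2.997),
    "Qwen3-8B": (0.466, 0.511, 3.144),
    "Mistral-7B-Instruct-v0.3": (0.277, 0.445, 2.926),
}

# Global means copied from the ranking table in this section.
REPORTED_GLOBAL_MEAN = {
    "EnMed-Unified": 0.551, "EnMed-MCQA": 0.545, "EnMed-ExtQA": 0.542,
    "EnMed-DAPT": 0.537, "Qwen3-14B-vanilla": 0.537, "EnMed-AbsQA": 0.529,
    "Qwen3-8B": 0.505, "Mistral-7B-Instruct-v0.3": 0.401,
}

def normalised_global_mean(mcqa_acc, extqa_f1, absqa_judge):
    """Average the three task scores after mapping the judge score to [0, 1]."""
    absqa_unit = (absqa_judge - 1.0) / 4.0  # assumption: linear map from [1, 5]
    return (mcqa_acc + extqa_f1 + absqa_unit) / 3.0

for name, scores in PER_TASK_MEANS.items():
    computed = normalised_global_mean(*scores)
    print(f"{name:26s} computed {computed:.3f}  reported {REPORTED_GLOBAL_MEAN[name]:.3f}")
```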

---

### Normalised scores across all 9 (task × shot) cells

![Normalised scores across the 9 (task, shot) cells. Each cell is rescaled so that the worst-performing system maps to 0 and the best to 1. Rows sorted by descending global mean.](figures/fig02_normalized_heatmap.png)
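
The per-cell rescaling described in the caption is a min-max normalisation. A minimal sketch follows, with made-up scores and placeholder system names; how ties across a whole cell were handled in the figure is not stated, so mapping them to 0 is an assumption.

```python
def min_max_normalise(cell_scores):
    """Rescale one (task, shot) cell so the worst system maps to 0 and the best to 1."""
    lo = min(cell_scores.values())
    hi = max(cell_scores.values())
    if hi == lo:
        # Degenerate cell where every system ties (assumed convention: all zeros).
        return {name: 0.0 for name in cell_scores}
    return {name: (score - lo) / (hi - lo) for name, score in cell_scores.items()}

# Made-up raw scores for a single cell; not taken from the paper.
example_cell = {"system_a": 0.60, "system_b": 0.50, "system_c": 0.30}
print(min_max_normalise(example_cell))
```

Because the rescaling is done cell by cell, a value in the heatmap says where a system sits between the worst and best system in that cell, not how large the raw gap is.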

---

### Per-cell deltas versus Qwen3-14B-vanilla

![Per-cell delta of each EnMed candidate against Qwen3-14B-vanilla. Positive (red) = candidate outperforms reference. Three panels: MCQA accuracy, ExtQA token-F1, AbsQA LLM-as-judge.](figures/fig03_delta_heatmaps.png)

---

### Item-level paired t-tests with 95 % confidence intervals

![Item-level paired t-tests against Qwen3-14B-vanilla. Each bar is the mean delta ± 95% CI computed from N=622 (MCQA), N=207 (ExtQA), N≈248 (AbsQA) paired observations. Stars: * p<0.05, ** p<0.01, *** p<0.001. Inferential figure.](figures/fig07_item_level_ttest.png)

*Positive bars mean the EnMed variant outperforms the reference; negative bars
mean the opposite. Only starred bars represent statistically significant differences.*
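
The quantity behind each bar can be sketched as follows: take the per-item score difference between a candidate and the reference, then report its mean, the paired *t* statistic, and an approximate 95 % interval. The sketch uses the normal critical value 1.96, which is a reasonable approximation only at sample sizes like those reported here (207 to 622 items); an exact *p*-value needs the *t* distribution, for example via `scipy.stats.ttest_rel`. The toy scores are made up.

```python
import math
from statistics import mean, stdev

def paired_delta_ci(candidate_scores, reference_scores, z=1.96):
    """Return (mean delta, paired t statistic, approximate 95 % CI) for item-level scores."""
    if len(candidate_scores) != len(reference_scores):
        raise ValueError("scores must be paired item by item")
    deltas = [c - r for c, r in zip(candidate_scores, reference_scores)]
    mean_delta = mean(deltas)
    # Standard error of the mean difference (sample standard deviation).
    se = stdev(deltas) / math.sqrt(len(deltas))
    t_stat = mean_delta / se if se > 0 else float("nan")
    return mean_delta, t_stat, (mean_delta - z * se, mean_delta + z * se)

# Toy 0/1 correctness vectors for eight MCQA items; NOT real evaluation data.
candidate = [1, 1, 0, 1, 1, 0, 1, 1]
reference = [1, 0, 0, 1, 0, 0, 1, 0]
delta, t_stat, (low, high) = paired_delta_ci(candidate, reference)
print(f"mean delta = {delta:.3f}, t = {t_stat:.2f}, approx. 95% CI = [{low:.3f}, {high:.3f}]")
```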

---

### Significance heatmap — per-cell annotated deltas

![Per-cell signed delta of each EnMed candidate against Qwen3-14B-vanilla annotated with paired-t significance (* p<0.05, ** p<0.01, *** p<0.001; ns otherwise). Reading a row gives the per-system win/loss record.](figures/fig08_sig_heatmap.png)

---

### Statistical significance record vs. Qwen3-14B-vanilla

*(9 independent item-level paired t-tests; α = 0.05; BH-corrected wins marked)*

| Model | Sig. wins / 9 | Sig. losses / 9 | Verdict |
|---|---|---|---|
| **EnMed-Unified** ⭐ | **4** ✅ BH-robust | **0** | Significantly better on MCQA-0, MCQA-3, ExtQA-0, ExtQA-3; never worse |
| EnMed-MCQA | 2 | 0 | Safe MCQA specialist |
| EnMed-ExtQA | 3 | 3 | Mixed: wins MCQA + ExtQA-0, loses all AbsQA cells |
| EnMed-AbsQA | 3 | 3 | Mixed: wins all MCQA, loses all AbsQA |
| EnMed-DAPT | 0 | 0 | Indistinguishable from reference; no evidence that DAPT harms downstream performance |

![Significance record across all 9 (task, shot) cells per system: dark green = sig. wins, light green = numeric wins, light red = numeric losses, dark red = sig. losses. Dotted line = 4.5-cell majority threshold.](figures/fig10_sig_summary.png)

---

### Best model at every (task × shot) cell

![Best-performing system at every (task, shot) cell. Each cell is coloured by system identity and labelled with the winning raw score. No single model wins all 9 cells.](figures/fig11_best_per_cell.png)

*No single system wins all nine cells: EnMed-AbsQA leads MCQA, EnMed-ExtQA leads
0- and 5-shot ExtQA, and AbsQA cells split across EnMed-DAPT, Qwen3-14B-vanilla
and EnMed-MCQA. EnMed-Unified does not lead any single cell but is never the worst.*

---

### Critical Difference diagrams — rank analysis per shot count

Average rank across the three tasks (lower = better). Critical difference CD = 6.06.

![Critical Difference diagram, 0-shot. Average rank of each system across 3 tasks. CD=6.06. EnMed-Unified and EnMed-ExtQA are tied best-ranked at 3.00; Mistral-7B is worst at 7.67.](figures/cd_0shot.png)

![Critical Difference diagram, 3-shot. EnMed-Unified leads at 2.83; Mistral-7B is worst at 8.00. CD=6.06.](figures/cd_3shot.png)

![Critical Difference diagram, 5-shot. EnMed-MCQA leads at 2.33; EnMed-Unified second at 3.00. Mistral-7B worst at 8.00. CD=6.06.](figures/cd_5shot.png)

*The CD (6.06) exceeds the observed rank spread, so these diagrams are descriptive
consensus rankings — they corroborate but do not independently prove the item-level
findings above.*
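
The reported CD of 6.06 is what the Nemenyi critical-difference formula yields for eight systems ranked over three tasks at α = 0.05. The sketch below assumes that this is the test behind the diagrams, which the card does not state explicitly; the critical values are the standard two-tailed Nemenyi constants tabulated by Demšar (2006), and the rank spreads are read from the three diagram captions above.

```python
import math

# Two-tailed Nemenyi critical values q_0.05, indexed by the number of systems k.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
               7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}

def critical_difference(num_systems, num_datasets):
    """Nemenyi critical difference: q_alpha * sqrt(k (k + 1) / (6 N))."""
    q_alpha = Q_ALPHA_005[num_systems]
    return q_alpha * math.sqrt(num_systems * (num_systems + 1) / (6.0 * num_datasets))

cd = critical_difference(num_systems=8, num_datasets=3)
print(round(cd, 2))  # 6.06

# Worst minus best average rank per shot count, taken from the captions.
RANK_SPREAD = {"0-shot": 7.67 - 3.00, "3-shot": 8.00 - 2.83, "5-shot": 8.00 - 2.33}
for shots, spread in RANK_SPREAD.items():
    print(shots, f"spread = {spread:.2f}", "exceeds CD" if spread > cd else "within CD")
```

With only three tasks, the critical difference is nearly as large as the maximum possible rank gap of 7, which is why no pair of systems can be separated by this test.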

---

## Limitations

**Multiplicity.** Benjamini–Hochberg correction at *q* = 0.05 confirms EnMed-Unified's
four headline wins. Weaker cells (e.g., ExtQA-3, MCQA-5) do not survive correction
and should be treated as suggestive.

**Distributional assumptions.** Paired *t*-tests assume approximately normal per-item
differences, which may not hold for binary MCQA outcomes or ordinal 1–5 judge scores.
A fully ordinal-aware treatment remains future work.
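
One way to probe sensitivity to the normality assumption is a paired sign-flip permutation test, which relies only on the per-item differences being symmetric around zero under the null hypothesis. The sketch below is an illustrative alternative, not an analysis reported in the paper; the function name, the number of resamples, and the toy deltas are assumptions.

```python
import random

def sign_flip_p_value(deltas, num_resamples=10000, seed=0):
    """Two-sided Monte Carlo p-value for the mean of paired per-item deltas."""
    rng = random.Random(seed)
    observed = abs(sum(deltas))
    hits = 0
    for _ in range(num_resamples):
        # Under the null, each paired difference is equally likely to have either sign.
        resampled = sum(d if rng.random() < 0.5 else -d for d in deltas)
        if abs(resampled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (num_resamples + 1)

# Toy per-item deltas (candidate minus reference); NOT real evaluation data.
toy_deltas = [0, 1, 0, 0, 1, 0, 0, 1, -1, 1, 0, 1]
print(f"p = {sign_flip_p_value(toy_deltas):.3f}")
```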

**Single-judge evaluation.** AbsQA scores were generated by a single Gemma-family
LLM-as-judge. Single-judge evaluations are susceptible to judge-specific biases; a
predominantly English-trained judge may under-reward answers correct under French
clinical conventions. Judge diversity and order-invariance checks have not been
conducted.

**Task-specific adapter paradox.** EnMed-AbsQA improves MCQA while significantly
degrading its own home task, AbsQA, under LLM-as-judge scoring. EnMed-ExtQA shows the
same AbsQA degradation, although AbsQA is not its home task. We attribute both effects
to over-fitting to a length/style register the judge penalises.
Multi-task training (EnMed-Unified) mitigates this.

**Phase 2 not yet released.** This is the Phase 1 model. The full cross-lingual
continual pre-training pipeline (English biomedical → French medical transfer)
will be released as EnMed-Phase2.

**⚠️ Not for clinical deployment.** This model has not been clinically validated.
Do not use it for patient-facing applications or clinical decision support.

---

## Citation

The associated paper has been **submitted** to Springer Lecture Notes in Computer
Science (LNCS) and is currently **under review**. If you use EnMed-Unified or any
member of the EnMed family, please cite the preprint version:

```bibtex
@unpublished{abodoeloundou2025enmed,
  title  = {Cross-Lingual Domain Adaptation and Multi-Task Fine-Tuning
            for High-Fidelity Medical Language Models},
  author = {Abodo Eloundou, Brice Donald and Malykh, Valentin},
  note   = {Submitted to Springer Lecture Notes in Computer Science (LNCS).
            Under review. ITMO University / MTS Web Services,
            Saint Petersburg, Russia},
  year   = {2026}
}
```

*This entry will be updated to a full `@inproceedings` citation upon acceptance.*

If you use the French health pre-training corpus, please also cite:

```bibtex
@article{mannion2026biomedical,
  title   = {Is biomedical specialization still worth it?
             Insights from domain-adaptive language modelling
             with a new French health corpus},
  author  = {Mannion, A. and Macaire, C. and Violle, A. and
             Ohayon, S. and Tannier, X. and Schwab, D. and others},
  journal = {arXiv preprint arXiv:2604.06903},
  year    = {2026}
}
```

---

## Acknowledgements

Research conducted at **ITMO University**, Saint Petersburg, Russia and
**MTS Web Services**, Saint Petersburg, Russia.

**Authors:**
- **Brice Donald Abodo Eloundou** — ITMO University &nbsp;|&nbsp; ORCID: [0009-0009-1845-5867](https://orcid.org/0009-0009-1845-5867)
- **Valentin Malykh** — MTS Web Services / ITMO University

Evaluation benchmarks: DrBenchmark (Labrak et al., 2024), FrenchMedMCQA
(Labrak et al., 2022), MediQAl (Bazoge, 2025), CAS corpus (Grabar et al., 2020).

---

## License

Released under **Apache 2.0**, consistent with the Qwen3-14B base model license.
The pre-training corpus license follows Mannion et al. (2026); users are responsible
for compliance with that corpus's terms.

> **Clinical use warning:** This model is a research artefact. Any use in clinical
> or patient-facing settings requires independent clinical validation and regulatory
> approval in the applicable jurisdiction.