---
language:
- fr
license: apache-2.0
library_name: transformers
tags:
- medical
- french
- question-answering
- lora
- peft
- qlora
- domain-adaptation
- clinical-nlp
- french-medical
- extractive-qa
- abstractive-qa
- multiple-choice-qa
base_model: Qwen/Qwen3-14B
pipeline_tag: text-generation
metrics:
- accuracy
- f1
inference: true
datasets:
- HealthDataHub/PARCOMED_research_only
---

# EnMed-Unified: French Medical LLM (Multi-Task)

> **Headline system of the EnMed family.**
> A Qwen3-14B decoder adapted for French medical question answering through
> domain-adaptive continual pre-training (DAPT) on a large French health corpus,
> followed by **multi-task LoRA fine-tuning** across three QA formats simultaneously.
>
> Phase 1 evaluation establishes **4 statistically significant wins** over the
> un-adapted Qwen3-14B-vanilla baseline (BH-corrected, *q* = 0.05) with
> **zero significant losses** across nine independent *(task × shot)* evaluation cells.

---

## Model Family Overview

The **EnMed** family consists of five variants, all built on Qwen3-14B:

| Model | Adapter | Description |
|---|---|---|
| **EnMed-Unified** ⭐ | DAPT + Mixed LoRA | **Headline system.** Multi-task adapter trained jointly on all three QA tasks. Recommended default variant: never significantly worse than the base model on any task/shot combination. |
| EnMed-DAPT | DAPT only | Domain-adapted backbone, no task-specific LoRA. Statistically indistinguishable from Qwen3-14B-vanilla, which suggests that DAPT does not cause catastrophic forgetting on these tasks. |
| EnMed-MCQA | DAPT + MCQA LoRA | Specialised for French medical multiple-choice QA. Safe specialist: 2 significant wins on its home task, zero losses. |
| EnMed-ExtQA | DAPT + ExtQA LoRA | Specialised for clinical span extraction. Gains on MCQA and 0-shot ExtQA but degrades abstractive QA. |
| EnMed-AbsQA | DAPT + AbsQA LoRA | Specialised for abstractive generation. Paradoxically degrades its home task under LLM-as-judge scoring while improving MCQA. See Limitations. |

---

## Intended Uses

### Supported tasks

- **French Medical Multiple-Choice QA**: select the best answer from 4–5 candidates (e.g., medical licensing exam questions from FrenchMedMCQA / DrBenchmark)
- **French Clinical Extractive QA**: identify and return verbatim answer spans from French clinical case narratives (CAS corpus format)
- **French Medical Abstractive QA**: generate free-form answers to open-ended French medical questions (MediQAl format)

### Out-of-scope uses

- ⚠️ **Clinical decision support / patient-facing deployment**: this is a **research prototype**. It has **not** been validated for real clinical use. Do not use outputs to guide patient care.
- **English-only medical QA**: the DAPT stage targets French; English capability may have drifted from the base model.
- **Languages other than French**: not evaluated.
- **NER, summarisation, or classification**: not part of the training or evaluation protocol.

---

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "brice-eloundou/EnMed-Unified"  # replace with your actual HF repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Multiple-Choice QA
prompt = """Tu es un expert médical francophone. Réponds à la question suivante
en choisissant la meilleure réponse parmi les options proposées.
Question: Quelle est la principale cause d'insuffisance rénale aiguë en réanimation ?
A) Glomérulonéphrite aiguë
B) Nécrose tubulaire aiguë ischémique
C) Pyélonéphrite aiguë
D) Lithiase urinaire
Réponse:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    # Greedy decoding. `temperature` has no effect when do_sample=False, so it is omitted.
    out = model.generate(**inputs, max_new_tokens=16, do_sample=False)

# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
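
Free-text generation still has to be mapped to an option letter before it can be scored. The sketch below shows one way to do that; `extract_choice` and its regular expression are illustrative assumptions, not part of the released evaluation code.

```python
import re

def extract_choice(generated: str, valid_letters: str = "ABCD"):
    """Return the first standalone option letter in `generated`, or None."""
    # Match a valid letter that is not part of a longer word: "B", "B)", "(B)".
    pattern = rf"(?<![A-Za-zÀ-ÿ])([{valid_letters}])(?![A-Za-zÀ-ÿ])"
    match = re.search(pattern, generated.strip())
    return match.group(1) if match else None

print(extract_choice(" B) Nécrose tubulaire aiguë ischémique"))
print(extract_choice("La réponse est (C)."))
print(extract_choice("Je ne sais pas."))
```

A `None` result is a format-compliance failure: the model answered, but not in a form that can be scored. A heuristic like this can also misfire, for example on a sentence that begins with the French word "A".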

### Log-probability decoding (recommended for MCQA)

For evaluation and benchmarking, score each option under teacher forcing and
select the option with the highest summed log-likelihood. This matches the
evaluation protocol used in the paper and avoids format-compliance failures.
The snippet reuses `model`, `tokenizer`, and `prompt` from the Quick Start.

```python
import torch
import torch.nn.functional as F

def score_option(model, tokenizer, prefix, option_text):
    """Return the summed log-probability of `option_text` given `prefix`."""
    text = prefix + option_text
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    # Assumes the tokens of `prefix` alone are a prefix of the tokens of `text`.
    prefix_len = tokenizer(prefix, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        # Logits at position i predict token i + 1, hence the shift by one.
        logits = model(**enc).logits[0, prefix_len - 1:-1]
    option_ids = enc["input_ids"][0, prefix_len:]
    lp = F.log_softmax(logits, dim=-1)
    # Pick out the log-probability of each actual option token and sum them.
    return lp.gather(-1, option_ids.unsqueeze(-1)).sum().item()

options = {"A": "Glomérulonéphrite aiguë",
           "B": "Nécrose tubulaire aiguë ischémique",
           "C": "Pyélonéphrite aiguë",
           "D": "Lithiase urinaire"}

scores = {k: score_option(model, tokenizer, prefix=prompt, option_text=v)
          for k, v in options.items()}
print("Predicted:", max(scores, key=scores.get))
```
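
Summed log-probabilities tend to favour shorter options, because every extra token adds a non-positive term. Dividing by the token count is one common way to check whether a prediction depends on option length. The sketch below is an assumption for illustration: `pick_option` is a hypothetical helper, the numbers are made up, and this card does not say whether the paper applies any length normalisation.

```python
def pick_option(scored, normalise=False):
    """`scored` maps an option label to (summed_logprob, n_tokens)."""
    def key(label):
        total, n_tokens = scored[label]
        # Mean log-probability per token when normalising, raw sum otherwise.
        return total / n_tokens if normalise else total
    return max(scored, key=key)

# Made-up scores for illustration only.
scored = {"A": (-6.0, 3), "B": (-8.0, 8)}
print(pick_option(scored))                  # raw sum prefers the short option
print(pick_option(scored, normalise=True))  # per-token mean prefers the long one
```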

---

## Training Details

### Base model

[Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B) (instruction-tuned release).

### Stage 1: Domain-Adaptive Continual Pre-training (DAPT)

The backbone undergoes continual pre-training on the **French health corpus**
introduced by Mannion et al. (2026), a large openly licensed collection of French
clinical and biomedical text. This stage uses no task supervision; it exposes the
model to French medical vocabulary and discourse without committing to a downstream
task format.

### Stage 2: Multi-Task LoRA Fine-tuning

A single LoRA adapter is trained jointly on all three downstream QA tasks,
with task identifiers embedded in the prompt. This design mitigates the
length/style register over-fitting that degrades single-task adapters under
LLM-as-judge evaluation (see Limitations).

| Hyperparameter | Value |
|---|---|
| LoRA rank *r* | 16 |
| LoRA scaling α | 32 |
| LoRA dropout | 0.05 |
| Target modules | Attention + MLP projection matrices |
| Quantisation | 4-bit NormalFloat (QLoRA / `bitsandbytes`) |
| Optimiser | AdamW (paged) |
| LR schedule | Cosine with linear warmup (3 % of steps) |
| Peak learning rate | 2 × 10⁻⁴ |
| Effective batch size | 16 (gradient accumulation) |
| Hardware | 1 × NVIDIA A100 80 GB |
| Framework | [Unsloth](https://github.com/unslothai/unsloth) + [HuggingFace PEFT](https://github.com/huggingface/peft) |
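
To make the schedule row concrete, here is a minimal pure-Python sketch of a cosine schedule with linear warmup, using the peak rate and warmup fraction from the table. The function name `lr_at`, the decay to exactly zero, and the warmup offset convention are assumptions; training frameworks differ in these details, and the total step count is not given in this card.

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_frac=0.03):
    """Learning rate at a 0-indexed `step`: linear warmup, then cosine decay to 0."""
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, from 0 to 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

total = 1000  # hypothetical number of optimiser steps
for s in (0, 29, 30, 515, 1000):
    print(s, f"{lr_at(s, total):.2e}")
```

With rank 16 and α = 32, the LoRA update is scaled by α / r = 2 under the standard LoRA formulation.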

---

## Evaluation

All eight systems were evaluated on three French medical QA tasks under
0-shot, 3-shot, and 5-shot prompting, giving a 3 × 3 grid of nine independent
*(task, shot)* cells. Item-level paired *t*-tests were conducted per cell
against Qwen3-14B-vanilla, with Benjamini–Hochberg FDR control (*q* = 0.05)
and the Bonferroni bound reported alongside.

| Task | Dataset | *N* (test) | Primary metric |
|---|---|---|---|
| Multiple-choice QA (MCQA) | FrenchMedMCQA / DrBenchmark | 622 | Accuracy |
| Extractive QA (ExtQA) | CAS clinical cases | 207 | Token-level F₁ |
| Abstractive QA (AbsQA) | MediQAl | 247–248 | LLM-as-judge 1–5 (Gemma) |
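
For readers unfamiliar with the ExtQA metric, the sketch below computes a SQuAD-style token-level F₁ between a predicted span and a reference span. Lowercasing and whitespace tokenisation are assumptions; the exact normalisation used in the evaluation is not specified in this card.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("une douleur thoracique aiguë", "douleur thoracique"), 3))
```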

---

### Raw scores across all models and shot counts

*[Figure: raw scores across all models and shot counts.]*

*The dotted line marks the Qwen3-14B-vanilla 0-shot reference. EnMed variants
consistently sit above or on the reference for MCQA and ExtQA; the AbsQA panel
reveals the EnMed-AbsQA collapse discussed in Limitations.*

---

### Per-task means (averaged over 0 / 3 / 5-shot)

| Model | MCQA acc. ↑ | ExtQA F₁ ↑ | AbsQA judge ↑ |
|---|---|---|---|
| **EnMed-Unified** ⭐ | 0.575 | 0.529 | 3.195 |
| EnMed-MCQA | 0.569 | 0.507 | **3.242** |
| EnMed-ExtQA | 0.572 | **0.533** | 3.082 |
| EnMed-DAPT | 0.546 | 0.504 | **3.242** |
| EnMed-AbsQA | **0.582** | 0.506 | 2.997 |
| Qwen3-14B-vanilla *(reference)* | 0.548 | 0.502 | 3.240 |
| Qwen3-8B | 0.466 | 0.511 | 3.144 |
| Mistral-7B-Instruct-v0.3 | 0.277 | 0.445 | 2.926 |

*Bold marks the best value in each column.*

---

### Global descriptive ranking (normalised, 9 cells)

*[Figure: global descriptive ranking.]*

| Model | Mean | Std |
|---|---|---|
| **EnMed-Unified** | **0.551** | **0.026** |
| EnMed-MCQA | 0.545 | 0.035 |
| EnMed-ExtQA | 0.542 | 0.028 |
| EnMed-DAPT | 0.537 | 0.034 |
| Qwen3-14B-vanilla | 0.537 | 0.034 |
| EnMed-AbsQA | 0.529 | 0.043 |
| Qwen3-8B | 0.505 | 0.041 |
| Mistral-7B-Instruct-v0.3 | 0.401 | 0.103 |

*This ranking is descriptive only: normalisation across incomparable metric scales
does not constitute a significance test.*
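
The Mean column is numerically consistent with mapping the 1–5 judge score onto [0, 1] via (score − 1) / 4, leaving accuracy and F₁ unchanged, and averaging the three tasks. This is inferred from the published numbers, not stated by the authors, so treat the sketch below as an assumption. The function name `normalised_mean` is hypothetical.

```python
def normalised_mean(mcqa_acc, extqa_f1, absqa_judge):
    """Average of the three task scores after rescaling the judge score to [0, 1]."""
    judge_01 = (absqa_judge - 1.0) / 4.0  # 1 maps to 0.0, 5 maps to 1.0
    return (mcqa_acc + extqa_f1 + judge_01) / 3.0

# Per-task means copied from the table above.
per_task_means = {
    "EnMed-Unified": (0.575, 0.529, 3.195),
    "Qwen3-14B-vanilla": (0.548, 0.502, 3.240),
    "Mistral-7B-Instruct-v0.3": (0.277, 0.445, 2.926),
}
for name, scores in per_task_means.items():
    print(f"{name}: {normalised_mean(*scores):.3f}")
```

Because the inputs are already rounded to three decimals, a recomputed mean can differ from the table in the last digit for some models.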

---

### Normalised scores across all 9 (task × shot) cells

*[Figure: normalised scores across all 9 (task × shot) cells.]*

---

### Per-cell deltas versus Qwen3-14B-vanilla

*[Figure: per-cell deltas versus Qwen3-14B-vanilla.]*

---

### Item-level paired t-tests with 95 % confidence intervals

*[Figure: item-level paired t-test differences with 95 % confidence intervals.]*

*Positive bars mean the EnMed variant outperforms the reference; negative bars
mean the opposite. Only starred bars represent statistically significant differences.*
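
A per-cell comparison of this kind can be sketched as follows. The code computes the mean per-item difference, its standard error, a 95 % interval, and a two-sided p-value. It uses a normal approximation in place of the exact *t* distribution, which is reasonable for test sets of a few hundred items but is an assumption here, not the paper's implementation. `paired_test` and the toy scores are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def paired_test(system_scores, baseline_scores, confidence=0.95):
    """Return (mean difference, confidence interval, two-sided p-value)."""
    diffs = [s - b for s, b in zip(system_scores, baseline_scores)]
    std_err = stdev(diffs) / math.sqrt(len(diffs))
    if std_err == 0:
        raise ValueError("all per-item differences are identical")
    mean_diff = mean(diffs)
    t_stat = mean_diff / std_err
    # Normal approximation to the t critical value and tail probability.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return mean_diff, (mean_diff - z * std_err, mean_diff + z * std_err), p_value

# Toy per-item correctness (1 = correct); far smaller than a real test set.
system = [1, 1, 0, 1, 1, 0, 1, 1]
baseline = [0, 1, 0, 0, 1, 0, 1, 0]
print(paired_test(system, baseline))
```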

---

### Significance heatmap: per-cell annotated deltas

*[Figure: significance heatmap with per-cell annotated deltas.]*

---

### Statistical significance record vs. Qwen3-14B-vanilla

*(9 independent item-level paired t-tests; α = 0.05; BH-corrected wins marked)*

| Model | Sig. wins / 9 | Sig. losses / 9 | Verdict |
|---|---|---|---|
| **EnMed-Unified** ⭐ | **4** ✅ BH-robust | **0** | Significantly better on MCQA-0, MCQA-3, ExtQA-0, ExtQA-3; never worse |
| EnMed-MCQA | 2 | 0 | Safe MCQA specialist |
| EnMed-ExtQA | 3 | 3 | Mixed: wins MCQA + ExtQA-0, loses all AbsQA cells |
| EnMed-AbsQA | 3 | 3 | Mixed: wins all MCQA, loses all AbsQA |
| EnMed-DAPT | 0 | 0 | Indistinguishable from reference, consistent with DAPT being safe |
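
The Benjamini–Hochberg step-up procedure behind the "BH-robust" label can be sketched in a few lines. The p-values below are hypothetical and are not the paper's; the Bonferroni line is included only to show how much stricter that bound is on the same inputs.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-indexed rank k with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values for nine (task, shot) cells.
p = [0.001, 0.004, 0.012, 0.020, 0.060, 0.20, 0.35, 0.50, 0.90]
print(benjamini_hochberg(p))
print([pv <= 0.05 / len(p) for pv in p])  # Bonferroni on the same inputs
```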

---

### Best model at every (task × shot) cell

*[Figure: best model at every (task × shot) cell.]*

*No single system wins all nine cells: EnMed-AbsQA leads MCQA, EnMed-ExtQA leads
0- and 5-shot ExtQA, and AbsQA cells split across EnMed-DAPT, Qwen3-14B-vanilla
and EnMed-MCQA. EnMed-Unified does not lead any single cell but is never the worst.*

---

### Critical Difference diagrams: rank analysis per shot count

Average rank across the three tasks (lower = better). Critical difference CD = 6.06.

*[Figures: critical difference diagrams, one per shot count.]*

*The CD (6.06) exceeds the observed rank spread, so these diagrams are descriptive
consensus rankings: they corroborate but do not independently prove the item-level
findings above.*
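
The reported CD is consistent with the Nemenyi critical difference for 8 systems ranked over 3 tasks at α = 0.05. The card does not name the test, so the formula and the tabulated constant below are assumptions that happen to reproduce the stated value.

```python
import math

def nemenyi_cd(q_alpha, n_systems, n_datasets):
    """Critical difference in average rank: q_alpha * sqrt(k (k + 1) / (6 N))."""
    return q_alpha * math.sqrt(n_systems * (n_systems + 1) / (6.0 * n_datasets))

# 3.031 is the tabulated Nemenyi constant for 8 systems at alpha = 0.05.
cd = nemenyi_cd(q_alpha=3.031, n_systems=8, n_datasets=3)
print(f"CD = {cd:.2f}")
```

With eight systems, average ranks can differ by at most 7, so a threshold of about 6.06 leaves almost no room for a significant rank difference when only three tasks are ranked.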

---

## Limitations

**Multiplicity.** Benjamini–Hochberg correction at *q* = 0.05 confirms EnMed-Unified's
four headline wins. Weaker cells (e.g., ExtQA-3, MCQA-5) do not survive correction
and should be treated as suggestive.

**Distributional assumptions.** Paired *t*-tests assume approximately normal per-item
differences, which may not hold for binary MCQA outcomes or ordinal 1–5 judge scores.
A fully ordinal-aware treatment remains future work.

**Single-judge evaluation.** AbsQA scores were generated by a single Gemma-family
LLM-as-judge. Single-judge evaluations are susceptible to judge-specific biases; a
predominantly English-trained judge may under-reward answers correct under French
clinical conventions. Judge diversity and order-invariance checks have not been
conducted.

**Task-specific adapter paradox.** EnMed-AbsQA improves MCQA while significantly
degrading its own home task, AbsQA, under LLM-as-judge scoring. EnMed-ExtQA shows a
related pattern: it gains on MCQA and 0-shot ExtQA but loses every AbsQA cell. We
attribute this to over-fitting to a length/style register the judge penalises.
Multi-task training (EnMed-Unified) mitigates this.

**Phase 2 not yet released.** This is the Phase 1 model. The full cross-lingual
continual pre-training pipeline (English biomedical → French medical transfer)
will be released as EnMed-Phase2.

**⚠️ Not for clinical deployment.** This model has not been clinically validated.
Do not use it for patient-facing applications or clinical decision support.

---

## Citation

The associated paper has been **submitted** to Springer Lecture Notes in Computer
Science (LNCS) and is currently **under review**. If you use EnMed-Unified or any
member of the EnMed family, please cite the preprint version:

```bibtex
@unpublished{abodoeloundou2025enmed,
  title  = {Cross-Lingual Domain Adaptation and Multi-Task Fine-Tuning
            for High-Fidelity Medical Language Models},
  author = {Abodo Eloundou, Brice Donald and Malykh, Valentin},
  note   = {Submitted to Springer Lecture Notes in Computer Science (LNCS).
            Under review. ITMO University / MTS Web Services,
            Saint Petersburg, Russia},
  year   = {2026}
}
```

*This entry will be updated to a full `@inproceedings` citation upon acceptance.*

If you use the French health pre-training corpus, please also cite:

```bibtex
@article{mannion2026biomedical,
  title   = {Is biomedical specialization still worth it?
             Insights from domain-adaptive language modelling
             with a new French health corpus},
  author  = {Mannion, A. and Macaire, C. and Violle, A. and
             Ohayon, S. and Tannier, X. and Schwab, D. and others},
  journal = {arXiv preprint arXiv:2604.06903},
  year    = {2026}
}
```

---

## Acknowledgements

Research conducted at **ITMO University**, Saint Petersburg, Russia and
**MTS Web Services**, Saint Petersburg, Russia.

**Authors:**

- **Brice Donald Abodo Eloundou**, ITMO University | ORCID: [0009-0009-1845-5867](https://orcid.org/0009-0009-1845-5867)
- **Valentin Malykh**, MTS Web Services / ITMO University

Evaluation benchmarks: DrBenchmark (Labrak et al., 2024), FrenchMedMCQA
(Labrak et al., 2022), MediQAl (Bazoge, 2025), CAS corpus (Grabar et al., 2020).

---

## License

Released under **Apache 2.0**, consistent with the Qwen3-14B base model license.
The pre-training corpus license follows Mannion et al. (2026); users are responsible
for compliance with that corpus's terms.

> **Clinical use warning:** This model is a research artefact. Any use in clinical
> or patient-facing settings requires independent clinical validation and regulatory
> approval in the applicable jurisdiction.