Instructions to use boods/EnToFrMedicaLLM-Multilingual with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use boods/EnToFrMedicaLLM-Multilingual with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="boods/EnToFrMedicaLLM-Multilingual")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("boods/EnToFrMedicaLLM-Multilingual")
model = AutoModelForCausalLM.from_pretrained("boods/EnToFrMedicaLLM-Multilingual")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use boods/EnToFrMedicaLLM-Multilingual with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "boods/EnToFrMedicaLLM-Multilingual"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "boods/EnToFrMedicaLLM-Multilingual",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/boods/EnToFrMedicaLLM-Multilingual

SGLang

How to use boods/EnToFrMedicaLLM-Multilingual with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "boods/EnToFrMedicaLLM-Multilingual" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "boods/EnToFrMedicaLLM-Multilingual",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "boods/EnToFrMedicaLLM-Multilingual" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "boods/EnToFrMedicaLLM-Multilingual",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use boods/EnToFrMedicaLLM-Multilingual with Docker Model Runner:
```
docker model run hf.co/boods/EnToFrMedicaLLM-Multilingual
```


	---
	language:
	- fr
	license: apache-2.0
	library_name: transformers
	tags:
	- medical
	- french
	- question-answering
	- lora
	- peft
	- qlora
	- domain-adaptation
	- clinical-nlp
	- french-medical
	- extractive-qa
	- abstractive-qa
	- multiple-choice-qa
	base_model: Qwen/Qwen3-14B
	pipeline_tag: text-generation
	metrics:
	- accuracy
	- f1
	inference: true
	datasets:
	- HealthDataHub/PARCOMED_research_only
	---

	# EnMed-Unified — French Medical LLM (Multi-Task)

	> Headline system of the EnMed family.
	> A Qwen3-14B decoder adapted for French medical question answering through
	> domain-adaptive continual pre-training (DAPT) on a large French health corpus,
	> followed by multi-task LoRA fine-tuning across three QA formats simultaneously.
	>
	> Phase 1 evaluation establishes 4 statistically significant wins over the
	> un-adapted Qwen3-14B-vanilla baseline (BH-corrected, q = 0.05) with
	> zero significant losses across nine independent (task × shot) evaluation cells.

	---

	## Model Family Overview

	The EnMed family consists of five variants, all built on Qwen3-14B:

	| Model | Adapter | Description |
	|---|---|---|
	| EnMed-Unified ⭐ | DAPT + Mixed LoRA | Headline system. Multi-task adapter trained jointly on all three QA tasks. Best deployment choice: never significantly worse than the base model on any task/shot combination. |
	| EnMed-DAPT | DAPT only | Domain-adapted backbone, no task-specific LoRA. Statistically indistinguishable from Qwen3-14B-vanilla, confirming that DAPT does not cause catastrophic forgetting. |
	| EnMed-MCQA | DAPT + MCQA LoRA | Specialised for French medical multiple-choice QA. Safe specialist: 2 significant wins on its home task, zero losses. |
	| EnMed-ExtQA | DAPT + ExtQA LoRA | Specialised for clinical span extraction. Gains on MCQA and 0-shot ExtQA but degrades abstractive QA. |
	| EnMed-AbsQA | DAPT + AbsQA LoRA | Specialised for abstractive generation. Paradoxically degrades its home task under LLM-as-judge scoring while improving MCQA. See Limitations. |

	---

	## Intended Uses

	### Supported tasks

	- French Medical Multiple-Choice QA — select the best answer from 4–5 candidates (e.g., medical licensing exam questions from FrenchMedMCQA / DrBenchmark)
	- French Clinical Extractive QA — identify and return verbatim answer spans from French clinical case narratives (CAS corpus format)
	- French Medical Abstractive QA — generate free-form answers to open-ended French medical questions (MediQAl format)

	### Out-of-scope uses

	- ⚠️ Clinical decision support / patient-facing deployment — this is a research prototype. It has not been validated for real clinical use. Do not use outputs to guide patient care.
	- English-only medical QA — the DAPT stage targets French; English capability may have drifted from the base model.
	- Languages other than French — not evaluated.
	- NER, summarisation, or classification — not part of the training or evaluation protocol.

	---

	## Quick Start

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	model_id = "brice-eloundou/EnMed-Unified"  # replace with your actual HF repo

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	)

	# ── Multiple-Choice QA ───────────────────────────────────────────────────────
	prompt = """Tu es un expert médical francophone. Réponds à la question suivante
	en choisissant la meilleure réponse parmi les options proposées.

	Question: Quelle est la principale cause d'insuffisance rénale aiguë en réanimation ?
	A) Glomérulonéphrite aiguë
	B) Nécrose tubulaire aiguë ischémique
	C) Pyélonéphrite aiguë
	D) Lithiase urinaire

	Réponse:"""

	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
	with torch.no_grad():
	    # Greedy decoding; `temperature` is omitted because it is ignored when do_sample=False.
	    out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
	# Decode only the newly generated tokens, not the prompt.
	print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
	```

	### Log-probability decoding (recommended for MCQA)

	For evaluation and benchmarking, score each option under teacher forcing and
	select the option with the highest summed log-likelihood. This matches the
	evaluation protocol used in the paper and avoids format-compliance failures.

	```python
	import torch
	import torch.nn.functional as F

	def score_option(model, tokenizer, prefix, option_text):
	    """Summed log-probability of `option_text` given `prefix` under teacher forcing."""
	    text = prefix + option_text
	    enc = tokenizer(text, return_tensors="pt").to(model.device)
	    # Assumes the prefix tokenises the same way alone and inside `text`.
	    prefix_len = tokenizer(prefix, return_tensors="pt")["input_ids"].shape[1]
	    with torch.no_grad():
	        # Logits at position i predict token i + 1, hence the shift by one.
	        logits = model(**enc).logits[0, prefix_len - 1:-1]
	    option_ids = enc["input_ids"][0, prefix_len:]
	    lp = F.log_softmax(logits.float(), dim=-1)
	    return lp.gather(-1, option_ids.unsqueeze(-1)).sum().item()

	# `model`, `tokenizer` and `prompt` come from the Quick Start snippet.
	options = {"A": "Glomérulonéphrite aiguë",
	           "B": "Nécrose tubulaire aiguë ischémique",
	           "C": "Pyélonéphrite aiguë",
	           "D": "Lithiase urinaire"}
	# The leading space separates the option from the trailing "Réponse:".
	scores = {k: score_option(model, tokenizer, prefix=prompt, option_text=" " + v)
	          for k, v in options.items()}
	print("Predicted:", max(scores, key=scores.get))
	```
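Summing log-probabilities favours shorter options, because every extra token adds a negative term. The card does not say whether the paper normalises by length, so the sketch below only illustrates the difference between the two conventions. The helper `pick_option` and the toy numbers are hypothetical and are not model outputs.

```python
from typing import Dict, List


def pick_option(token_logprobs: Dict[str, List[float]], normalise: bool = False) -> str:
    """Return the label with the highest summed (or mean per-token) log-probability."""
    scores = {}
    for label, lps in token_logprobs.items():
        total = sum(lps)
        scores[label] = total / len(lps) if normalise else total
    return max(scores, key=scores.get)


# Made-up per-token log-probabilities for two options of different length.
toy = {
    "A": [-1.0, -1.0],                     # sum -2.0, mean -1.0
    "B": [-0.5, -0.5, -0.5, -0.5, -0.5],   # sum -2.5, mean -0.5
}

print(pick_option(toy))                   # summed score prefers the short option
print(pick_option(toy, normalise=True))   # mean score prefers the long option
```

When reproducing reported numbers, use whichever convention the evaluation protocol specifies; the two can disagree, as the toy case shows.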

	---

	## Training Details

	### Base model

	[Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B) — instruction-tuned release.

	### Stage 1 — Domain-Adaptive Continual Pre-training (DAPT)

	The backbone undergoes continual pre-training on the French health corpus
	introduced by Mannion et al. (2026), a large openly licensed collection of French
	clinical and biomedical text. This stage uses no task supervision; it exposes the
	model to French medical vocabulary and discourse without committing to a downstream
	task format.

	### Stage 2 — Multi-Task LoRA Fine-tuning

	A single LoRA adapter is trained jointly on all three downstream QA tasks,
	with task identifiers embedded in the prompt. This design is intended to
	mitigate the length/style register over-fitting that degrades single-task
	adapters under LLM-as-judge evaluation (see Limitations).
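To make the idea of task identifiers concrete, here is a minimal sketch of how mixed-task training prompts could be assembled. The tag strings, the field labels and the `build_prompt` helper are all hypothetical; the card does not document the actual identifiers or template used in training.

```python
from typing import Dict, Optional

# Hypothetical task tags; the real identifiers are not given in the card.
TASK_TAGS = {
    "mcqa": "[TASK: MCQA]",
    "extqa": "[TASK: ExtQA]",
    "absqa": "[TASK: AbsQA]",
}


def build_prompt(task: str, question: str, context: str = "",
                 options: Optional[Dict[str, str]] = None) -> str:
    """Prefix the prompt with a task tag so one adapter can tell the formats apart."""
    if task not in TASK_TAGS:
        raise ValueError(f"unknown task: {task!r}")
    parts = [TASK_TAGS[task]]
    if context:
        parts.append(f"Contexte: {context}")
    parts.append(f"Question: {question}")
    for label, text in (options or {}).items():
        parts.append(f"{label}) {text}")
    parts.append("Réponse:")
    return "\n".join(parts)


example = build_prompt(
    "mcqa",
    "Quelle est la principale cause d'insuffisance rénale aiguë en réanimation ?",
    options={"A": "Glomérulonéphrite aiguë", "B": "Nécrose tubulaire aiguë ischémique"},
)
print(example)
```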

	| Hyperparameter | Value |
	|---|---|
	| LoRA rank r | 16 |
	| LoRA scaling α | 32 |
	| LoRA dropout | 0.05 |
	| Target modules | Attention + MLP projection matrices |
	| Quantisation | 4-bit NormalFloat (QLoRA / `bitsandbytes`) |
	| Optimiser | AdamW (paged) |
	| LR schedule | Cosine with linear warmup (3 % of steps) |
	| Peak learning rate | 2 × 10⁻⁴ |
	| Effective batch size | 16 (gradient accumulation) |
	| Hardware | 1 × NVIDIA A100 80 GB |
	| Framework | [Unsloth](https://github.com/unslothai/unsloth) + [HuggingFace PEFT](https://github.com/huggingface/peft) |
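The learning-rate row can be read as the following schedule. This is a plain-Python sketch under assumptions: the function `lr_at` is illustrative, the final learning rate is taken to decay to zero, and the total step count in the example is arbitrary. The training run itself presumably relied on the scheduler shipped with its framework.

```python
import math

PEAK_LR = 2e-4          # from the hyperparameter table
WARMUP_FRACTION = 0.03  # 3 % of steps, from the table
LORA_R, LORA_ALPHA = 16, 32
LORA_SCALE = LORA_ALPHA / LORA_R  # LoRA updates are scaled by alpha / r


def lr_at(step: int, total_steps: int,
          peak_lr: float = PEAK_LR,
          warmup_fraction: float = WARMUP_FRACTION) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed floor)."""
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Arbitrary example: 1000 optimiser steps, so warmup covers the first 30.
for s in (0, 15, 30, 515, 1000):
    print(s, f"{lr_at(s, 1000):.2e}")
```

With rank 16 and α = 32, the adapter update is scaled by a factor of 2. The effective batch size of 16 is the product of the per-device batch size and the number of gradient-accumulation steps; the card does not give the split.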

	---

	## Evaluation

	All eight systems were evaluated on three French medical QA tasks under
	0-shot, 3-shot, and 5-shot prompting — a 3 × 3 grid of nine independent
	(task, shot) cells. Item-level paired t-tests were conducted per cell
	against Qwen3-14B-vanilla, with Benjamini–Hochberg FDR control (q = 0.05)
	and Bonferroni bound reported alongside.
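The multiplicity control over nine cells can be sketched as below, using the standard Benjamini–Hochberg step-up rule and a Bonferroni threshold for comparison. The p-values are invented for illustration and are not the paper's results.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], q: float = 0.05) -> List[bool]:
    """Return, per hypothesis, whether it is rejected under BH at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= (k / m) * q.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


def bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Reject only p-values below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Invented p-values for nine (task, shot) cells.
toy_p = [0.001, 0.004, 0.012, 0.020, 0.030, 0.20, 0.45, 0.60, 0.90]
print("BH rejections:", sum(benjamini_hochberg(toy_p)))
print("Bonferroni rejections:", sum(bonferroni(toy_p)))
```

On this toy input BH keeps four cells while Bonferroni keeps two, which shows why the two are reported side by side: Bonferroni is the stricter bound.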

	| Task | Dataset | N (test) | Primary metric |
	|---|---|---|---|
	| Multiple-choice QA (MCQA) | FrenchMedMCQA / DrBenchmark | 622 | Accuracy |
	| Extractive QA (ExtQA) | CAS clinical cases | 207 | Token-level F₁ |
	| Abstractive QA (AbsQA) | MediQAl | 247–248 | LLM-as-judge 1–5 (Gemma) |
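For readers unfamiliar with the ExtQA metric, a common SQuAD-style definition of token-level F₁ is sketched below. It assumes lowercasing and whitespace tokenisation; the card does not specify the exact normalisation used in the evaluation, so scores from this function may differ slightly from the reported ones.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall between two answer strings."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = token_f1("nécrose tubulaire aiguë", "Nécrose tubulaire aiguë ischémique")
print(round(score, 3))
```

In the example the prediction has precision 1.0 (all three tokens appear in the reference) and recall 0.75 (three of four reference tokens are recovered).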

	---

	### Raw scores across all models and shot counts

	![Raw scores per model per shot count across MCQA (accuracy), ExtQA (token-F1) and AbsQA (LLM-as-judge). The dotted line marks Qwen3-14B-vanilla 0-shot performance.](figures/fig01_raw_bars.png)

	*The dotted line marks the Qwen3-14B-vanilla 0-shot reference. EnMed variants
	consistently sit above or on the reference for MCQA and ExtQA; the AbsQA panel
	reveals the EnMed-AbsQA collapse discussed in Limitations.*

	---

	### Per-task means (averaged over 0 / 3 / 5-shot)

	| Model | MCQA acc. ↑ | ExtQA F₁ ↑ | AbsQA judge ↑ |
	|---|---|---|---|
	| EnMed-Unified ⭐ | 0.575 | 0.529 | 3.195 |
	| EnMed-MCQA | 0.569 | 0.507 | 3.242 |
	| EnMed-ExtQA | 0.572 | 0.533 | 3.082 |
	| EnMed-DAPT | 0.546 | 0.504 | 3.242 |
	| EnMed-AbsQA | 0.582 | 0.506 | 2.997 |
	| Qwen3-14B-vanilla (reference) | 0.548 | 0.502 | 3.240 |
	| Qwen3-8B | 0.466 | 0.511 | 3.144 |
	| Mistral-7B-Instruct-v0.3 | 0.277 | 0.445 | 2.926 |

	![Per-task means ± 1 std across the three shot counts. Hatched bar = Qwen3-14B-vanilla reference; red dashed line = its mean. Descriptive only.](figures/fig05_per_task_mean_std.png)

	---

	### Global descriptive ranking (normalised, 9 cells)

	![Global descriptive ranking: mean normalised score across the 9 (task, shot) cells ± 1 std. The dashed line marks the Qwen3-14B-vanilla mean of 0.537. EnMed-Unified leads with mean 0.551 and the smallest standard deviation.](figures/fig06_global_mean_std.png)

	| Model | Mean | Std |
	|---|---|---|
	| EnMed-Unified | 0.551 | 0.026 |
	| EnMed-MCQA | 0.545 | 0.035 |
	| EnMed-ExtQA | 0.542 | 0.028 |
	| EnMed-DAPT | 0.537 | 0.034 |
	| Qwen3-14B-vanilla | 0.537 | 0.034 |
	| EnMed-AbsQA | 0.529 | 0.043 |
	| Qwen3-8B | 0.505 | 0.041 |
	| Mistral-7B-Instruct-v0.3 | 0.401 | 0.103 |

	*This ranking is descriptive only — normalisation across incomparable metric scales
	does not constitute a significance test.*
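The card does not state how the three metrics are put on a common scale for this ranking. One reading that reproduces the Mean column to within rounding is to keep accuracy and F₁ as they are, map the 1–5 judge score to [0, 1] via (score − 1) / 4, and average. The sketch below checks that reading against the per-task means table; it is an inference from the published numbers, not a description of the authors' code.

```python
# Per-task means copied from the table in this card: (MCQA acc, ExtQA F1, AbsQA judge).
PER_TASK_MEANS = {
    "EnMed-Unified": (0.575, 0.529, 3.195),
    "EnMed-MCQA": (0.569, 0.507, 3.242),
    "EnMed-ExtQA": (0.572, 0.533, 3.082),
    "EnMed-DAPT": (0.546, 0.504, 3.242),
    "Qwen3-14B-vanilla": (0.548, 0.502, 3.240),
    "EnMed-AbsQA": (0.582, 0.506, 2.997),
    "Qwen3-8B": (0.466, 0.511, 3.144),
    "Mistral-7B-Instruct-v0.3": (0.277, 0.445, 2.926),
}

# Mean column of the global ranking table in this card.
REPORTED_GLOBAL_MEAN = {
    "EnMed-Unified": 0.551, "EnMed-MCQA": 0.545, "EnMed-ExtQA": 0.542,
    "EnMed-DAPT": 0.537, "Qwen3-14B-vanilla": 0.537, "EnMed-AbsQA": 0.529,
    "Qwen3-8B": 0.505, "Mistral-7B-Instruct-v0.3": 0.401,
}


def global_score(mcqa_acc: float, extqa_f1: float, absqa_judge: float) -> float:
    """Average of the three metrics after mapping the 1-5 judge score to [0, 1]."""
    judge_unit = (absqa_judge - 1.0) / 4.0  # assumed normalisation
    return (mcqa_acc + extqa_f1 + judge_unit) / 3.0


for name, means in PER_TASK_MEANS.items():
    recomputed = global_score(*means)
    print(f"{name:28s} recomputed={recomputed:.4f} reported={REPORTED_GLOBAL_MEAN[name]:.3f}")
```

Because the mapping is linear, averaging the per-task means gives the same value as averaging the nine cells directly. The Std column cannot be recomputed this way, since it needs the individual per-cell scores.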

	---

	### Normalised scores across all 9 (task × shot) cells

	![Normalised scores across the 9 (task, shot) cells. Each cell is rescaled so that the worst-performing system maps to 0 and the best to 1. Rows sorted by descending global mean.](figures/fig02_normalized_heatmap.png)
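The per-cell rescaling described in the heatmap caption is ordinary min-max normalisation. The sketch below applies it to the MCQA per-task means from this card purely as stand-in input; the figure itself is built from per-cell scores, which are not tabulated here.

```python
from typing import Dict


def min_max_normalise(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale so the worst system maps to 0 and the best to 1."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All systems tie; there is no spread to rescale.
        return {name: 0.0 for name in scores}
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}


# MCQA accuracy averaged over shots (from the per-task means table), used as example input.
mcqa_means = {
    "EnMed-Unified": 0.575, "EnMed-MCQA": 0.569, "EnMed-ExtQA": 0.572,
    "EnMed-DAPT": 0.546, "EnMed-AbsQA": 0.582, "Qwen3-14B-vanilla": 0.548,
    "Qwen3-8B": 0.466, "Mistral-7B-Instruct-v0.3": 0.277,
}

normalised = min_max_normalise(mcqa_means)
for name, value in sorted(normalised.items(), key=lambda kv: -kv[1]):
    print(f"{name:28s} {value:.3f}")
```

Min-max rescaling depends on which systems are in the comparison: a single weak outlier compresses the remaining systems towards 1, so differences between strong systems look small.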

	---

	### Per-cell deltas versus Qwen3-14B-vanilla

	![Per-cell delta of each EnMed candidate against Qwen3-14B-vanilla. Positive (red) = candidate outperforms reference. Three panels: MCQA accuracy, ExtQA token-F1, AbsQA LLM-as-judge.](figures/fig03_delta_heatmaps.png)

	---

	### Item-level paired t-tests with 95 % confidence intervals

	![Item-level paired t-tests against Qwen3-14B-vanilla. Each bar is the mean delta ± 95% CI computed from N=622 (MCQA), N=207 (ExtQA), N≈248 (AbsQA) paired observations. Stars: \* p<0.05, \*\* p<0.01, \*\*\* p<0.001. Inferential figure.](figures/fig07_item_level_ttest.png)

	*Positive bars mean the EnMed variant outperforms the reference; negative bars
	mean the opposite. Only starred bars represent statistically significant differences.*
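A bar in this figure corresponds to a mean paired difference with a confidence interval. The sketch below computes both from two aligned score lists. It uses a normal quantile instead of the Student-t quantile, which is an approximation that is close for the sample sizes quoted (N ≥ 207) but not for the tiny toy input; the toy data are invented.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def paired_delta_ci(candidate: Sequence[float], reference: Sequence[float],
                    confidence: float = 0.95) -> Tuple[float, Tuple[float, float], float]:
    """Mean per-item delta, its normal-approximation CI, and the paired t statistic."""
    if len(candidate) != len(reference) or len(candidate) < 2:
        raise ValueError("need two aligned score lists with at least two items")
    deltas = [c - r for c, r in zip(candidate, reference)]
    d_mean = mean(deltas)
    se = stdev(deltas) / math.sqrt(len(deltas))  # standard error of the mean delta
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    t_stat = d_mean / se if se > 0 else math.nan
    return d_mean, (d_mean - z * se, d_mean + z * se), t_stat


# Invented per-item correctness (1 = correct) for a candidate and the reference.
cand = [1, 1, 0, 1, 1, 0, 1, 1]
ref = [1, 0, 0, 1, 0, 0, 1, 1]
delta, (low, high), t_stat = paired_delta_ci(cand, ref)
print(f"delta={delta:.3f} CI=({low:.3f}, {high:.3f}) t={t_stat:.2f}")
```

A difference is significant at the chosen level when the interval excludes zero, which is what the stars in the figure encode before any multiplicity correction.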

	---

	### Significance heatmap — per-cell annotated deltas

	![Per-cell signed delta of each EnMed candidate against Qwen3-14B-vanilla annotated with paired-t significance (\* p<0.05, \*\* p<0.01, \*\*\* p<0.001; ns otherwise). Reading a row gives the per-system win/loss record.](figures/fig08_sig_heatmap.png)

	---

	### Statistical significance record vs. Qwen3-14B-vanilla

	(9 independent item-level paired t-tests; α = 0.05; BH-corrected wins marked)

	| Model | Sig. wins / 9 | Sig. losses / 9 | Verdict |
	|---|---|---|---|
	| EnMed-Unified ⭐ | 4 ✅ BH-robust | 0 | Significantly better on MCQA-0, MCQA-3, ExtQA-0, ExtQA-3; never worse |
	| EnMed-MCQA | 2 | 0 | Safe MCQA specialist |
	| EnMed-ExtQA | 3 | 3 | Mixed: wins MCQA + ExtQA-0, loses all AbsQA cells |
	| EnMed-AbsQA | 3 | 3 | Mixed: wins all MCQA, loses all AbsQA |
	| EnMed-DAPT | 0 | 0 | Indistinguishable from reference, confirming DAPT safety |

	![Significance record across all 9 (task, shot) cells per system: dark green = sig. wins, light green = numeric wins, light red = numeric losses, dark red = sig. losses. Dotted line = 4.5-cell majority threshold.](figures/fig10_sig_summary.png)

	---

	### Best model at every (task × shot) cell

	![Best-performing system at every (task, shot) cell. Each cell is coloured by system identity and labelled with the winning raw score. No single model wins all 9 cells.](figures/fig11_best_per_cell.png)

	*No single system wins all nine cells: EnMed-AbsQA leads MCQA, EnMed-ExtQA leads
	0- and 5-shot ExtQA, and AbsQA cells split across EnMed-DAPT, Qwen3-14B-vanilla
	and EnMed-MCQA. EnMed-Unified does not lead any single cell but is never the worst.*

	---

	### Critical Difference diagrams — rank analysis per shot count

	Average rank across the three tasks (lower = better). Critical difference CD = 6.06.

	![Critical Difference diagram, 0-shot. Average rank of each system across 3 tasks. CD=6.06. EnMed-Unified and EnMed-ExtQA are tied best-ranked at 3.00; Mistral-7B is worst at 7.67.](figures/cd_0shot.png)

	![Critical Difference diagram, 3-shot. EnMed-Unified leads at 2.83; Mistral-7B is worst at 8.00. CD=6.06.](figures/cd_3shot.png)

	![Critical Difference diagram, 5-shot. EnMed-MCQA leads at 2.33; EnMed-Unified second at 3.00. Mistral-7B worst at 8.00. CD=6.06.](figures/cd_5shot.png)

	*The CD (6.06) exceeds the observed rank spread, so these diagrams are descriptive
	consensus rankings — they corroborate but do not independently prove the item-level
	findings above.*

	---

	## Limitations

	Multiplicity. Benjamini–Hochberg correction at q = 0.05 confirms EnMed-Unified's
	four headline wins. Weaker cells (e.g., ExtQA-3, MCQA-5) do not survive correction
	and should be treated as suggestive.

	Distributional assumptions. Paired t-tests assume approximately normal per-item
	differences, which may not hold for binary MCQA outcomes or ordinal 1–5 judge scores.
	A fully ordinal-aware treatment remains future work.

	Single-judge evaluation. AbsQA scores were generated by a single Gemma-family
	LLM-as-judge. Single-judge evaluations are susceptible to judge-specific biases; a
	predominantly English-trained judge may under-reward answers correct under French
	clinical conventions. Judge diversity and order-invariance checks have not been
	conducted.

	Task-specific adapter paradox. EnMed-AbsQA improves MCQA while significantly
	degrading its own nominal home task under LLM-as-judge scoring, and EnMed-ExtQA
	likewise loses every AbsQA cell. We attribute this to over-fitting to a
	length/style register the judge penalises. Multi-task training (EnMed-Unified)
	mitigates this.

	Phase 2 not yet released. This is the Phase 1 model. The full cross-lingual
	continual pre-training pipeline (English biomedical → French medical transfer)
	will be released as EnMed-Phase2.

	⚠️ Not for clinical deployment. This model has not been clinically validated.
	Do not use it for patient-facing applications or clinical decision support.

	---

	## Citation

	The associated paper has been submitted to Springer Lecture Notes in Computer
	Science (LNCS) and is currently under review. If you use EnMed-Unified or any
	member of the EnMed family, please cite the preprint version:

	```bibtex
	@unpublished{abodoeloundou2025enmed,
	title = {Cross-Lingual Domain Adaptation and Multi-Task Fine-Tuning
	for High-Fidelity Medical Language Models},
	author = {Abodo Eloundou, Brice Donald and Malykh, Valentin},
	note = {Submitted to Springer Lecture Notes in Computer Science (LNCS).
	Under review. ITMO University / MTS Web Services,
	Saint Petersburg, Russia},
	year = {2026}
	}
	```

	This entry will be updated to a full `@inproceedings` citation upon acceptance.

	If you use the French health pre-training corpus, please also cite:

	```bibtex
	@article{mannion2026biomedical,
	title = {Is biomedical specialization still worth it?
	Insights from domain-adaptive language modelling
	with a new French health corpus},
	author = {Mannion, A. and Macaire, C. and Violle, A. and
	Ohayon, S. and Tannier, X. and Schwab, D. and others},
	journal = {arXiv preprint arXiv:2604.06903},
	year = {2026}
	}
	```

	---

	## Acknowledgements

	Research conducted at ITMO University, Saint Petersburg, Russia and
	MTS Web Services, Saint Petersburg, Russia.

	Authors:
	- Brice Donald Abodo Eloundou, ITMO University | ORCID: [0009-0009-1845-5867](https://orcid.org/0009-0009-1845-5867)
	- Valentin Malykh, MTS Web Services / ITMO University

	Evaluation benchmarks: DrBenchmark (Labrak et al., 2024), FrenchMedMCQA
	(Labrak et al., 2022), MediQAl (Bazoge, 2025), CAS corpus (Grabar et al., 2020).

	---

	## License

	Released under Apache 2.0, consistent with the Qwen3-14B base model license.
	The pre-training corpus license follows Mannion et al. (2026); users are responsible
	for compliance with that corpus's terms.

	> Clinical use warning: This model is a research artefact. Any use in clinical
	> or patient-facing settings requires independent clinical validation and regulatory
	> approval in the applicable jurisdiction.