---
title: README
emoji: 👁
colorFrom: purple
colorTo: gray
sdk: static
pinned: false
---

# 🩹 MedInjection-FR

A **French biomedical instruction dataset and model suite** for studying how data provenance (**native, synthetic, translated**) impacts instruction-tuning of LLMs.

## 📊 Dataset Stats

**Total size**: 571,436 instruction–response pairs

**Components**:
- Native: 77,247
- Synthetic: 76,506
- Translated: 417,674

**Tasks**:
- MCQU (single-answer)
- MCQ (multi-answer)
- OEQ (open-ended)

## Paper

```bibtex
@misc{belmadani2026medinjectionfrexploringrolenative,
      title={MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning},
      author={Ikram Belmadani and Oumaima El Khettari and Pacôme Constant dit Beaufils and Benoit Favre and Richard Dufour},
      year={2026},
      eprint={2603.06905},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.06905},
}
```

***