metadata
title: README
emoji: 👁
colorFrom: purple
colorTo: gray
sdk: static
pinned: false
🩹 MedInjection-FR
A French biomedical instruction dataset and model suite for studying how data provenance (native, synthetic, translated) impacts instruction-tuning of LLMs.
📊 Dataset Stats
Total size: 571,436 instruction–response pairs
Components:
- Native: 77,247
- Synthetic: 76,506
- Translated: 417,674
Tasks:
- MCQU: multiple-choice questions with a single correct answer
- MCQ: multiple-choice questions with multiple correct answers
- OEQ: open-ended questions
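To make the balance between provenance types concrete, here is a minimal sketch that computes each component's share of the dataset from the counts listed above. The names `COMPONENT_COUNTS`, `STATED_TOTAL`, and `component_shares` are illustrative and not part of any released code. The three listed counts sum to 571,427, so shares computed against the stated total of 571,436 come out marginally below 100% overall.

```python
# Counts copied from the "Components" list above.
COMPONENT_COUNTS = {
    "Native": 77_247,
    "Synthetic": 76_506,
    "Translated": 417_674,
}
STATED_TOTAL = 571_436  # "Total size" as reported above


def component_shares(counts, total):
    """Return each component's fraction of `total`."""
    return {name: n / total for name, n in counts.items()}


shares = component_shares(COMPONENT_COUNTS, STATED_TOTAL)
for name, share in shares.items():
    # One row per component: name, raw count, percentage of the stated total.
    print(f"{name:<10} {COMPONENT_COUNTS[name]:>8,} {share:6.1%}")
```

Translated pairs make up roughly 73% of the data, which is worth keeping in mind when comparing models tuned on individual components.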
Paper
@misc{belmadani2026medinjectionfrexploringrolenative,
title={MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning},
author={Ikram Belmadani and Oumaima El Khettari and Pacôme Constant dit Beaufils and Benoit Favre and Richard Dufour},
year={2026},
eprint={2603.06905},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2603.06905},
}