Model Card: Nllb-200-distilled-600M_236K
This model is a domain-adapted version of facebook/nllb-200-distilled-600M, fine-tuned on 236k English-French sentence pairs from the bioinformatics and biomedical domains.
It is designed for English → French Machine Translation.
⚙️ Model Details
Model Description
- Developed by: Jurgi Giraud
- Model type: Multilingual language model
- Language(s) (NLP): English to French
- License: CC-BY-NC-4.0
- Finetuned from model: facebook/nllb-200-distilled-600M
This model was fine-tuned as part of a PhD research project investigating domain adaptation for Machine Translation (MT) in low-resource scenarios within the bioinformatics domain (English → French). The project explores the performance of compact MT models and Large Language Models (LLMs), including architectures under 1B parameters as well as models in the 3B–8B range, with a strong emphasis on resource-efficient fine-tuning strategies. The fine-tuning process made use of Parameter-Efficient Fine-Tuning (PEFT) and quantization, in particular QLoRA (Quantized Low-Rank Adaptation), for the larger models (Dettmers et al., 2023).
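For readers unfamiliar with QLoRA, the sketch below shows what such a setup typically looks like with the `peft` and `bitsandbytes` libraries. All hyperparameter values here are illustrative assumptions; the card does not publish the exact LoRA configuration used for the larger (3B–8B) models.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Low-rank adapters trained on top of the quantized model (the "LoRA" part)
lora_config = LoraConfig(
    r=16,                                 # assumed adapter rank
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
```

In practice, `bnb_config` would be passed as `quantization_config` to `from_pretrained`, and the loaded model wrapped with `peft.get_peft_model(model, lora_config)` so that only the small adapter matrices are trained.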
In total, 5 models were fine-tuned on in-domain data: t5_236k | nllb-200-distilled-600M_236K (current model) | madlad400-3b-mt_236k | TowerInstruct-7B-v0.2_236k | Llama-3.1-8B-Instruct_236K
🚀 Usage
This model is intended to be used for English → French Machine Translation in the bioinformatics domain.
Example (GPU)
Below is a basic GPU usage example with Hugging Face's Transformers library.
First, install the dependencies:

```bash
pip install torch transformers
```
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("jurgiraud/nllb-200-distilled-600M_236K")
model = AutoModelForSeq2SeqLM.from_pretrained("jurgiraud/nllb-200-distilled-600M_236K").to("cuda")

def translate(text):
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
        max_new_tokens=128,  # feel free to change max_new_tokens
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(translate("The deletion of a gene may result in death or in a block of cell division."))
# La suppression d'un gène peut entraîner la mort ou un blocage de la division cellulaire.
```
🔧 Fine-tuning Details
Fine-tuning Data
The model was fine-tuned on a set of 236k English-French parallel examples consisting of:
- Natural parallel data (bioinformatics and biomedical data)
- Synthetic data, including:
- Back-translation of in-domain monolingual texts
- Paraphrased data
- Terminology-constrained synthetic generation
Fine-tuning dataset available 👉 here.
Fine-tuning Procedure
The model was fine-tuned using the Hugging Face `transformers` `Seq2SeqTrainer`.
Fine-tuning Hyperparameters
Key hyperparameters and training setup:
- Approach: `Seq2SeqTrainer`
- Training: 8 epochs, learning rate = 2e-5, batch size = 16 (per device)
- Precision: bfloat16 (bf16)
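As a rough sketch, the hyperparameters above map onto a `Seq2SeqTrainingArguments` configuration like the following. The argument names are standard `transformers` API; the `output_dir` and `predict_with_generate` settings are assumptions, and dataset preprocessing is omitted.

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="nllb-200-distilled-600M_236K",  # assumed output path
    num_train_epochs=8,                         # as listed above
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    bf16=True,                                  # bfloat16 precision
    predict_with_generate=True,                 # assumed: generate during eval
)
```

These arguments would then be passed to `Seq2SeqTrainer` together with the model, the tokenized 236k-pair dataset, and a `DataCollatorForSeq2Seq`.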
📊 Evaluation
The model was evaluated on an in-domain bioinformatics test set using standard MT metrics.
Testing Data & Metrics
Testing Data
Test set available 👉 here.
Metrics
- BLEU
- chrF++ (chrF2)
- TER
- COMET
Results
Results from automated metrics. Baseline vs domain-adapted model. Best scores in bold.
| Models | BLEU↑ | chrF2↑ | TER↓ | COMET↑ |
|---|---|---|---|---|
| Baseline model Nllb-200-distilled-600M | 41.00 | 68.48 | 49.26 | 84.54 |
| Domain-adapted model Nllb-200-distilled-600M_236K | **44.77** | **71.64** | **46.24** | **85.84** |
🌱 Environmental Impact
The fine-tuning carbon footprint was estimated using the Green Algorithms framework (Lannelongue et al., 2021).
- Carbon emissions: 960.23 gCO₂e
- Energy consumption: 4.15 kWh
📚 Citation
BibTeX:
```bibtex
@phdthesis{giraud2026bioinformaticsMT,
  title  = {Developing Machine Translation for Bioinformatics: An Exploration into Domain-Specific Terminology, Domain Adaptation, and Evaluation},
  author = {Giraud, Jurgi},
  school = {The Open University},
  year   = {2026},
  type   = {Doctor of Philosophy ({PhD}) thesis},
  doi    = {10.21954/ou.ro.00109555},
  url    = {https://doi.org/10.21954/ou.ro.00109555},
}
```