Model Card: Nllb-200-distilled-600M_236K
This model is a domain-adapted version of facebook/nllb-200-distilled-600M, fine-tuned on 236k English-French sentence pairs from the bioinformatics and biomedical domains.
It is designed for English → French Machine Translation.
⚙️ Model Details
Model Description
- Developed by: Jurgi Giraud
- Model type: Multilingual language model
- Language(s) (NLP): English to French
- License: CC-BY-NC-4.0
- Finetuned from model: facebook/nllb-200-distilled-600M
This model was fine-tuned as part of a PhD research project investigating domain adaptation for Machine Translation (MT) in low-resource scenarios within the bioinformatics domain (English → French). The project explores the performance of compact MT models and Large Language Models (LLMs), including architectures under 1B parameters as well as models in the 3B–8B range, with a strong emphasis on resource-efficient fine-tuning strategies. The fine-tuning process made use of Parameter-Efficient Fine-Tuning (PEFT) and quantization, in particular QLoRA (Quantized Low-Rank Adaptation), for the larger models (Dettmers et al., 2023).
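For readers unfamiliar with QLoRA, the sketch below shows what such a setup typically looks like with the `peft` and `bitsandbytes` libraries. All hyperparameter values here are illustrative assumptions; the card does not publish the exact LoRA configuration used for the larger (3B–8B) models.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Low-rank adapters trained on top of the quantized model (the "LoRA" part)
lora_config = LoraConfig(
    r=16,                                 # assumed adapter rank
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
```

In practice, `bnb_config` would be passed as `quantization_config` to `from_pretrained`, and the loaded model wrapped with `peft.get_peft_model(model, lora_config)` so that only the small adapter matrices are trained.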
In total, 5 models were fine-tuned on in-domain data: t5_236k | nllb-200-distilled-600M_236K (current model) | madlad400-3b-mt_236k | TowerInstruct-7B-v0.2_236k | Llama-3.1-8B-Instruct_236K
🚀 Usage
This model is intended to be used for English → French Machine Translation in the bioinformatics domain.
Example (GPU)
Below is a basic GPU usage example with Hugging Face's Transformers library.
First, install the dependencies:

```bash
pip install torch transformers
```
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("jurgiraud/nllb-200-distilled-600M_236K")
model = AutoModelForSeq2SeqLM.from_pretrained("jurgiraud/nllb-200-distilled-600M_236K").to("cuda")

def translate(text):
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
        max_new_tokens=128,  # feel free to change max_new_tokens
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(translate("The deletion of a gene may result in death or in a block of cell division."))
# La suppression d'un gène peut entraîner la mort ou un blocage de la division cellulaire.
```
🔧 Fine-tuning Details
Fine-tuning Data
The model was fine-tuned on a set of 236k English-French parallel examples consisting of:
- Natural parallel data (bioinformatics and biomedical data)
- Synthetic data, including:
- Back-translation of in-domain monolingual texts
- Paraphrased data
- Terminology-constrained synthetic generation
Fine-tuning dataset available 👉 here.
Fine-tuning Procedure
The model was fine-tuned using the Hugging Face `transformers` `Seq2SeqTrainer`.
Fine-tuning Hyperparameters
Key hyperparameters and training setup:
- Approach: `Seq2SeqTrainer`
- Training: 8 epochs, learning rate = 2e-5, batch size = 16 (per device)
- Precision: bfloat16 (bf16)
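As a rough sketch, the hyperparameters above map onto a `Seq2SeqTrainingArguments` configuration like the following. The argument names are standard `transformers` API; the `output_dir` and `predict_with_generate` settings are assumptions, and dataset preprocessing is omitted.

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="nllb-200-distilled-600M_236K",  # assumed output path
    num_train_epochs=8,                         # as listed above
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    bf16=True,                                  # bfloat16 precision
    predict_with_generate=True,                 # assumed: generate during eval
)
```

These arguments would then be passed to `Seq2SeqTrainer` together with the model, the tokenized 236k-pair dataset, and a `DataCollatorForSeq2Seq`.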
📊 Evaluation
The model was evaluated on an in-domain bioinformatics test set using standard MT metrics.
Testing Data & Metrics
Testing Data
Test set available 👉 here.
Metrics
- BLEU
- chrF++ (chrF2)
- TER
- COMET
Results
Results from automated metrics. Baseline vs domain-adapted model. Best scores in bold.
| Models | BLEU↑ | chrF2↑ | TER↓ | COMET↑ |
|---|---|---|---|---|
| Baseline model Nllb-200-distilled-600M | 41.00 | 68.48 | 49.26 | 84.54 |
| Domain-adapted model Nllb-200-distilled-600M_236K | **44.77** | **71.64** | **46.24** | **85.84** |
🌱 Environmental Impact
The fine-tuning carbon footprint was estimated using the Green Algorithms framework (Lannelongue et al., 2021).
- Carbon emissions: 960.23 gCO₂e
- Energy consumption: 4.15 kWh
📚 Citation
BibTeX:
```bibtex
@phdthesis{giraud2026bioinformaticsMT,
  title  = {Developing Machine Translation for Bioinformatics: An Exploration into Domain-Specific Terminology, Domain Adaptation, and Evaluation},
  author = {Giraud, Jurgi},
  school = {The Open University},
  year   = {2026},
  type   = {Doctor of Philosophy ({PhD}) thesis},
  doi    = {10.21954/ou.ro.00109555},
  url    = {https://doi.org/10.21954/ou.ro.00109555},
}
```