# m2m100_1.2B fine-tuned: Welsh β†’ English (cy β†’ en)

A fine-tuned version of [`facebook/m2m100_1.2B`](https://huggingface.co/facebook/m2m100_1.2B) for Welsh-to-English translation.

## Evaluation

Best checkpoint at step 70,000 (epoch ~0.77):

| Metric    | Value  |
|-----------|--------|
| BLEU      | 69.72  |
| Eval loss | 0.3294 |
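
The card does not say which tooling produced the BLEU score. As a hedged illustration, a comparable corpus-level score can be computed with sacreBLEU (a common choice for machine-translation evaluation); the file names below are placeholders, not artefacts shipped with this model.

```python
# Hedged sketch: corpus-level BLEU with sacreBLEU (pip install sacrebleu).
# "hypotheses.txt" and "references.txt" are hypothetical file names.
import sacrebleu

with open("hypotheses.txt", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("references.txt", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```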

## Training data

The model was trained on a combination of the following datasets:

| Dataset | Description |
|---------|-------------|
| `techiaith/llyw-cymru-en-cy-ogl` | Welsh Government / Llywodraeth Cymru parallel corpus |
| `mgrbyte/bydtermcymru-tm-en-cy` | Byd Term Cymru terminology translation memory |
| `mgrbyte/cardiff-university-tm-en-cy` | Cardiff University translation memory |
| `mgrbyte/cwm-taf-morgannwg-university-health-board-tm-en-cy` | Cwm Taf Morgannwg University Health Board translation memory |
| `HuggingFaceFW/finetranslations` (`cym_Latn` subset) | FineWeb-derived Welsh/English parallel sentences |

All pairs were filtered to a maximum of 200 characters per side, and pairs with a length ratio greater than 3:1 were removed. Tokenised sequences longer than 128 tokens were also excluded.
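
The filtering script itself is not included in the card; the following minimal sketch implements the three filters as described, assuming sentence pairs arrive as `(cy, en)` string tuples and that tokenisation uses the base model's tokenizer.

```python
# Hedged sketch of the filtering described above. The pair format and
# the choice of tokenizer (the base model's) are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/m2m100_1.2B")

def keep_pair(cy: str, en: str, max_chars: int = 200,
              max_ratio: float = 3.0, max_tokens: int = 128) -> bool:
    # Maximum of 200 characters per side.
    if len(cy) > max_chars or len(en) > max_chars:
        return False
    # Drop pairs whose character-length ratio exceeds 3:1.
    longer = max(len(cy), len(en))
    shorter = max(1, min(len(cy), len(en)))
    if longer / shorter > max_ratio:
        return False
    # Exclude tokenised sequences longer than 128 tokens on either side.
    if (len(tokenizer(cy).input_ids) > max_tokens
            or len(tokenizer(en).input_ids) > max_tokens):
        return False
    return True
```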

## Training configuration

| Parameter | Value |
|-----------|-------|
| Base model | `facebook/m2m100_1.2B` |
| Epochs | 1 |
| Batch size (per device) | 32 |
| Gradient accumulation steps | 4 |
| Learning rate | 2e-5 |
| LR scheduler | cosine |
| Warmup steps | 6,000 |
| Precision | fp16 |
| Best checkpoint step | 70,000 |
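
For readers reproducing the setup, the table above maps roughly onto the following `Seq2SeqTrainingArguments`. This is a reconstruction from the card, not the author's training script; `output_dir` and anything not listed in the table are assumptions.

```python
# Hedged reconstruction of the configuration in the table above.
# output_dir is a placeholder; unlisted arguments are left at defaults.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="m2m100_1.2B-ft-cy-to-en",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,  # effective batch: 32 x 4 = 128 per device
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_steps=6000,
    fp16=True,
)
```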

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("DewiBrynJones/m2m100_1.2B-ft-cy-to-en")
model = AutoModelForSeq2SeqLM.from_pretrained("DewiBrynJones/m2m100_1.2B-ft-cy-to-en")

# Set the source language, then force the decoder to start in English.
tokenizer.src_lang = "cy"
inputs = tokenizer("Sut mae?", return_tensors="pt")
generated = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("en"))
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```