# m2m100_1.2B fine-tuned: Welsh → English (cy → en)
A fine-tuned version of facebook/m2m100_1.2B for Welsh-to-English translation.
## Evaluation
Best checkpoint at step 70,000 (epoch ~0.77):
| Metric | Value |
|---|---|
| BLEU | 69.72 |
| Eval loss | 0.3294 |
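The card does not state how BLEU was computed; assuming a sacrebleu-style corpus BLEU over detokenised model output, a minimal sketch would be:

```python
# Hedged sketch: hypotheses/references below are illustrative placeholders,
# and the use of sacrebleu (rather than another BLEU implementation) is an
# assumption, not confirmed by this card.
import sacrebleu

hypotheses = ["How are you?"]        # model translations (English)
references = [["How are you?"]]      # one reference stream, aligned with hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```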
## Training data
The model was trained on a combination of the following datasets:
| Dataset | Description |
|---|---|
| techiaith/llyw-cymru-en-cy-ogl | Welsh Government / Llywodraeth Cymru parallel corpus |
| mgrbyte/bydtermcymru-tm-en-cy | Byd Term Cymru terminology translation memory |
| mgrbyte/cardiff-university-tm-en-cy | Cardiff University translation memory |
| mgrbyte/cwm-taf-morgannwg-university-health-board-tm-en-cy | Cwm Taf Morgannwg University Health Board translation memory |
| HuggingFaceFW/finetranslations (cym_Latn subset) | FineWeb-derived Welsh/English parallel sentences |
All pairs were filtered to a maximum of 200 characters per side, and pairs with a length ratio greater than 3:1 were removed. Tokenised sequences longer than 128 tokens were also excluded.
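A minimal sketch of that filter is below. The thresholds (200 characters, 3:1 ratio, 128 tokens) come from this card, but the helper name is hypothetical and the character-based ratio check is an assumption; the card does not say whether the ratio was measured in characters or tokens.

```python
def keep_pair(src: str, tgt: str, tokenizer,
              max_chars: int = 200, max_ratio: float = 3.0,
              max_tokens: int = 128) -> bool:
    """Return True if a (Welsh, English) pair passes the filters described above."""
    # Maximum of 200 characters per side
    if len(src) > max_chars or len(tgt) > max_chars:
        return False
    # Drop pairs whose length ratio exceeds 3:1 (character-based here; assumption)
    longer = max(len(src), len(tgt))
    shorter = max(1, min(len(src), len(tgt)))
    if longer / shorter > max_ratio:
        return False
    # Exclude tokenised sequences longer than 128 tokens on either side
    if len(tokenizer(src).input_ids) > max_tokens:
        return False
    if len(tokenizer(tgt).input_ids) > max_tokens:
        return False
    return True
```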
## Training configuration
| Parameter | Value |
|---|---|
| Base model | facebook/m2m100_1.2B |
| Epochs | 1 |
| Batch size (per device) | 32 |
| Gradient accumulation steps | 4 |
| Learning rate | 2e-5 |
| LR scheduler | cosine |
| Warmup steps | 6,000 |
| Precision | fp16 |
| Best checkpoint | step 70,000 |
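If training used the Hugging Face `Trainer` (an assumption; the card lists only hyperparameters), the table above maps directly onto `Seq2SeqTrainingArguments`. In the sketch below, `output_dir` and any argument not in the table are illustrative:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="m2m100_1.2B-ft-cy-to-en",  # illustrative path, not from the card
    num_train_epochs=1,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_steps=6000,
    fp16=True,
)
```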
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("DewiBrynJones/m2m100_1.2B-ft-cy-to-en")
model = AutoModelForSeq2SeqLM.from_pretrained("DewiBrynJones/m2m100_1.2B-ft-cy-to-en")

# Set the source language before tokenising the Welsh input ("Sut mae?" = "How are you?")
tokenizer.src_lang = "cy"
inputs = tokenizer("Sut mae?", return_tensors="pt")

# Force the decoder to start generating in English
generated = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id("en"))
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```
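For batched input, add padding and decode all outputs at once. This reuses the model and tokenizer loaded above; the beam size and `max_length` are illustrative choices, not values from this card:

```python
sentences = ["Bore da.", "Diolch yn fawr am eich help."]

# Pad the batch to a common length so it can be stacked into one tensor
batch = tokenizer(sentences, return_tensors="pt", padding=True)

outputs = model.generate(
    **batch,
    forced_bos_token_id=tokenizer.get_lang_id("en"),
    num_beams=5,      # illustrative beam size
    max_length=128,   # matches the training-time token limit, but is an assumption here
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```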