---
language:
- eo
- en
- es
- ca
tags:
- translation
- machine-translation
- marian
- opus-mt
- multilingual
license: cc-by-4.0
pipeline_tag: translation
metrics:
- bleu
- chrf
---

# Esperanto -> Catalan, English, Spanish MT Model

## Model description

This repository contains a **multilingual MarianMT** model for **Esperanto → (English, Spanish, Catalan)** translation. The target language is selected with a language tag at the start of each source sentence.

## Usage

The model can be loaded and used with `transformers` as follows:

```python
from transformers import MarianMTModel, MarianTokenizer
import torch

model_name = "Helsinki-NLP/opus-mt-eo-caenes"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = MarianMTModel.from_pretrained(model_name).to(device)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Each source sentence starts with the tag of the desired target language.
source_texts = [
    ">>spa<< Saluton, kiel vi fartas?",
    ">>eng<< Saluton, kiel vi fartas?",
    ">>cat<< Saluton, kiel vi fartas?",
]

inputs = tokenizer(source_texts, return_tensors="pt", padding=True, truncation=True)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Pass the attention mask together with the input IDs so padding is ignored.
translated_ids = model.generate(**inputs)
translated_texts = tokenizer.batch_decode(translated_ids, skip_special_tokens=True)

for src, tgt in zip(source_texts, translated_texts):
    print(f"Source: {src} => Translated: {tgt}")
```

### Supported target languages (via tags)

You control the target language by prefixing the source sentence with one of the following tags:

* `>>eng<<` → English
* `>>spa<<` → Spanish
* `>>cat<<` → Catalan
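
The tag convention above can be wrapped in a small helper so that unsupported targets fail early instead of being sent to the model. This is a minimal sketch: `TARGET_TAGS`, `add_target_tag`, and `fan_out` are illustrative names and are not part of `transformers` or of this repository.

```python
# Tags listed in this model card; the helper names are hypothetical.
TARGET_TAGS = {"eng": "English", "spa": "Spanish", "cat": "Catalan"}


def add_target_tag(text: str, target: str) -> str:
    """Prefix an Esperanto sentence with the >>xxx<< tag for `target`."""
    if target not in TARGET_TAGS:
        raise ValueError(
            f"unsupported target {target!r}; expected one of {sorted(TARGET_TAGS)}"
        )
    return f">>{target}<< {text.strip()}"


def fan_out(text: str) -> list[str]:
    """Build one tagged copy of `text` per supported target language."""
    return [add_target_tag(text, target) for target in TARGET_TAGS]


tagged = fan_out("Saluton, kiel vi fartas?")
for line in tagged:
    print(line)
```

The resulting list can be passed to the tokenizer in place of a hand-written `source_texts` list.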

## Training data

The model was trained using **Tatoeba** parallel data, with **FLORES-200** used as the development set.

Training sentence-pair counts:

* **ca-eo**: 672,931
* **es-eo**: 4,677,945
* **eo-en**: 5,000,000
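
To see how unevenly the three language pairs are represented, the counts above can be turned into shares of the total. The snippet below is only a worked calculation on the numbers listed here; the variable names are illustrative.

```python
# Sentence-pair counts copied from the list above.
pair_counts = {
    "ca-eo": 672_931,
    "es-eo": 4_677_945,
    "eo-en": 5_000_000,
}

total = sum(pair_counts.values())
shares = {pair: count / total for pair, count in pair_counts.items()}

print(f"total sentence pairs: {total:,}")
for pair, share in shares.items():
    print(f"{pair}: {share:.1%}")
```

Catalan accounts for roughly 6.5% of the training pairs, compared with about 45.2% for Spanish and 48.3% for English.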

## Evaluation on FLORES

| Language Pair | BLEU | ChrF++ |
| ------------- | ----: | -----: |
| epo-spa | 19.98 | 49.11 |
| epo-cat | 28.35 | 55.42 |
| epo-eng | 37.47 | 63.09 |