# marian-mt-en-ru-high-precision

This is a specialized version of the Helsinki-NLP/opus-mt-en-ru model, fine-tuned on a deeply cleaned parallel corpus. This version prioritizes semantic accuracy and faithfulness to the source text.

## Process Description

The model was trained on a selection of 4 million sentence pairs. The key feature of the data preparation was rigorous filtering with LaBSE semantic embeddings: only pairs whose source and target embeddings had a cosine similarity above 0.8 were kept. This process eliminated "noisy" pairs, loose translations, and structural mismatches commonly found in standard open datasets.
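The filtering step described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the helper names are hypothetical, and the toy vectors stand in for real LaBSE sentence embeddings, which would be produced by an embedding model.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def filter_pairs(pairs, src_embs, tgt_embs, threshold=0.8):
    """Keep only sentence pairs whose embedding similarity exceeds the threshold."""
    return [
        pair
        for pair, u, v in zip(pairs, src_embs, tgt_embs)
        if cosine_similarity(u, v) > threshold
    ]

# Toy 2-d vectors standing in for LaBSE embeddings: the first pair is
# well-aligned (high similarity), the second is a mismatch.
pairs = [
    ("Hello world", "Привет, мир"),
    ("Hello world", "Совсем другое предложение"),
]
src = [[1.0, 0.0], [1.0, 0.0]]
tgt = [[0.9, 0.1], [0.1, 0.9]]
print(filter_pairs(pairs, src, tgt))  # keeps only the well-aligned pair
```

In the real pipeline, `src_embs` and `tgt_embs` would be LaBSE embeddings of the English and Russian sides, and pairs at or below the 0.8 threshold would be discarded.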

As a result, the model demonstrates a stricter, "surgical" translation style that remains as close as possible to the original in both meaning and structure.

## Final Metrics

Quality was verified on a hold-out test set of 1,000 pairs from the same data distribution, which the model did not encounter during training.

| Metric | Original Model | High-Precision Model | Improvement |
|---|---|---|---|
| SacreBLEU | 40.62 | 46.14 | +5.52 |
| COMET (wmt22-da) | 0.8945 | 0.9046 | +0.0101 |

## Usage

```python
from transformers import MarianMTModel, MarianTokenizer

model_name = "KvaytG/marian-mt-en-ru-high-precision"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Example of precise translation
text = "Workplace harmony is crucial, emphasizing group effort rather than individual accomplishments."
inputs = tokenizer(text, return_tensors="pt", padding=True)
output = model.generate(**inputs)

# Decode the first (and only) sequence, dropping padding/EOS tokens
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## License

This model is released under the Apache License 2.0.

## Citation

```bibtex
@misc{kvaytg_marian_mt_en_ru_high_precision,
  author       = {KvaytG},
  title        = {High-precision English-Russian MarianMT model},
  year         = {2026},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Models},
  url          = {https://huggingface.co/KvaytG/marian-mt-en-ru-high-precision},
  note         = {Fine-tuned on 4 million LaBSE-filtered (>0.8) sentence pairs from the en-ru-parallel-20m corpus. Base model: Helsinki-NLP/opus-mt-en-ru.}
}
```