Model Card for ViT5 Translation Model
A sequence-to-sequence translation model based on VietAI ViT5-base, fine-tuned for Vietnamese to English machine translation.
This model is intended for general-purpose translation tasks, both academic and production-oriented.
Model Details
Model Description
This model is an encoder–decoder Transformer designed for text-to-text generation tasks such as translation.
It is fine-tuned from VietAI/vit5-base, a pretrained T5 variant optimized for Vietnamese NLP tasks.
- Developed by: tnguyen20604
- Funded by [optional]: N/A
- Shared by: tnguyen20604
- Model type: Encoder–Decoder (Text-to-Text Transformer)
- Language(s): Vietnamese, English
- License: apache-2.0
- Fine-tuned from: VietAI/vit5-base
Model Sources
- Repository: https://huggingface.co/tnguyen20604/vit5-translation-vi2en-v2.1
- Base Model: https://huggingface.co/VietAI/vit5-base
- Dataset: https://huggingface.co/datasets/Eugenememe/mix-en-vi-500k
- Synthetic Dataset: https://www.kaggle.com/datasets/nguyentran20604/synthentic-data
Synthetic Data
In addition to real bilingual corpora, this model was also trained using a synthetic English–Vietnamese dataset created to improve translation robustness and domain diversity.
Synthetic Dataset
- Source: https://www.kaggle.com/datasets/nguyentran20604/synthentic-data
- Contains machine-generated Vietnamese–English parallel sentences.
- Designed to supplement low-resource Vietnamese MT tasks and expand linguistic diversity.
Synthetic Data Generation Process
Synthetic data was generated using a large language model (LLM) through a multi-step workflow:
Prompting an LLM (e.g., GPT or Qwen) to generate bilingual sentence pairs with:
- correct semantic alignment
- natural fluency in both languages
- diverse sentence structures
- a variety of writing styles (formal, casual, conversational, instructional)
Filtering and Cleaning
- removed low-quality or incomplete generations
- removed hallucinated or irrelevant translations
- used automatic scoring (LLM-based + heuristic rules) to ensure source–target consistency
- deduplicated repeated patterns
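The heuristic side of the filtering described above can be sketched with plain Python. This is an illustrative example, not the actual filtering code: the function names (`length_ratio_ok`, `deduplicate`) and the ratio thresholds are assumptions chosen for demonstration.

```python
def length_ratio_ok(src: str, tgt: str, lo: float = 0.5, hi: float = 2.0) -> bool:
    """Heuristic source-target consistency check based on token-length ratio.

    Drastically mismatched lengths often indicate truncated or
    hallucinated generations.
    """
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return False  # incomplete generation
    return lo <= tgt_len / src_len <= hi

def deduplicate(pairs):
    """Drop exact repeats of (source, target) pairs, preserving order."""
    seen = set()
    unique = []
    for src, tgt in pairs:
        key = (src.strip().lower(), tgt.strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))
    return unique

pairs = [
    ("Tôi yêu học máy.", "I love machine learning."),
    ("Tôi yêu học máy.", "I love machine learning."),  # duplicate
    ("Xin chào.", ""),                                 # incomplete generation
]
kept = [p for p in deduplicate(pairs) if length_ratio_ok(*p)]
print(kept)
```

In practice these rules would be combined with LLM-based scoring, which is harder to show in a self-contained snippet.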
Formatting
- normalized Unicode
- standardized punctuation
- removed extremely short or extremely long sentences
- converted examples into a clean parallel translation format compatible with Seq2Seq training
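A minimal sketch of the formatting step, assuming the common Hugging Face `{"translation": {"vi": ..., "en": ...}}` record layout; the length bounds and the helper name `format_example` are illustrative, not taken from the actual pipeline.

```python
import unicodedata

def format_example(vi: str, en: str, min_len: int = 3, max_len: int = 128):
    """Normalize a sentence pair and convert it into a seq2seq training record.

    Returns None when either side falls outside the token-length bounds.
    """
    # NFC normalization matters for Vietnamese, where diacritics can be
    # encoded as combining characters or precomposed code points
    vi = unicodedata.normalize("NFC", vi).strip()
    en = unicodedata.normalize("NFC", en).strip()
    # drop extremely short or extremely long sentences
    if not (min_len <= len(vi.split()) <= max_len):
        return None
    if not (min_len <= len(en.split()) <= max_len):
        return None
    return {"translation": {"vi": vi, "en": en}}

record = format_example("Tôi yêu học máy rất nhiều.", "I love machine learning a lot.")
print(record)
```

Records in this shape can be loaded directly by standard Seq2Seq training scripts.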
Purpose of Synthetic Data
The synthetic dataset was introduced to:
- improve generalization on unseen sentence structures
- increase translation robustness across multiple domains
- expand vocabulary coverage and idiomatic expressions
- reduce overfitting to the original dataset
Uses
Direct Use
- Vietnamese-to-English machine translation
- Text rewriting (via Seq2Seq generation)
- Academic NLP experiments
- MT benchmarking
Downstream Use
- Further fine-tuning for domain-specific translation (medical, legal, technical)
- Use in automated translation pipelines or multilingual chatbots
Out-of-Scope Use
- Not suitable for translating legal or medical content without expert review.
- Does not guarantee full accuracy for texts containing highly specialized terminology.
- Not appropriate for processing sensitive data or personally identifiable information (PII).
Bias, Risks, and Limitations
- The training data originates from open-source corpora, which may introduce stylistic or domain bias.
- The model may produce incorrect translations in cases such as:
- ambiguous sentences
- culturally specific expressions
- very long or structurally complex sentences
- Some translations may lose nuance, tone, or contextual meaning.
Recommendations
Users should manually review translations used in professional, safety-critical, or high-stakes scenarios.
How to Get Started
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "tnguyen20604/vit5-translation-vi2en-v2.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Translate a Vietnamese sentence to English
text = "Tôi yêu học máy."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)  # raise the default length cap
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```