Model Card for ViT5 Translation Model
A sequence-to-sequence translation model based on VietAI ViT5-base, fine-tuned for Vietnamese to English machine translation.
This model is intended for general-purpose translation tasks, both academic and production-oriented.
Model Details
Model Description
This model is an encoder–decoder Transformer designed for text-to-text generation tasks such as translation.
It is fine-tuned from VietAI/vit5-base, a pretrained T5 variant optimized for Vietnamese NLP tasks.
- Developed by: tnguyen20604
- Funded by [optional]: N/A
- Shared by: tnguyen20604
- Model type: Encoder–Decoder (Text-to-Text Transformer)
- Language(s): Vietnamese, English
- License: apache-2.0
- Fine-tuned from: VietAI/vit5-base
Model Sources
- Repository: https://huggingface.co/tnguyen20604/vit5-translation-vi2en-v2.1
- Base Model: https://huggingface.co/VietAI/vit5-base
- Dataset: https://huggingface.co/datasets/Eugenememe/mix-en-vi-500k
- Synthetic Dataset: https://www.kaggle.com/datasets/nguyentran20604/synthentic-data
Synthetic Data
In addition to real bilingual corpora, this model was also trained using a synthetic English–Vietnamese dataset created to improve translation robustness and domain diversity.
Synthetic Dataset
- Source: https://www.kaggle.com/datasets/nguyentran20604/synthentic-data
- Contains machine-generated Vietnamese–English parallel sentences.
- Designed to supplement low-resource Vietnamese MT tasks and expand linguistic diversity.
Synthetic Data Generation Process
Synthetic data was generated using a large language model (LLM) through a multi-step workflow:
Prompting an LLM (e.g., GPT or Qwen) to generate bilingual sentence pairs with:
- correct semantic alignment
- natural fluency in both languages
- diverse sentence structures
- a variety of writing styles (formal, casual, conversational, instructional)
Filtering and Cleaning
- removed low-quality or incomplete generations
- removed hallucinated or irrelevant translations
- used automatic scoring (LLM-based + heuristic rules) to ensure source–target consistency
- deduplicated repeated patterns
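The heuristic side of the filtering described above can be sketched with plain Python. This is an illustrative example, not the actual filtering code: the function names (`length_ratio_ok`, `deduplicate`) and the ratio thresholds are assumptions chosen for demonstration.

```python
def length_ratio_ok(src: str, tgt: str, lo: float = 0.5, hi: float = 2.0) -> bool:
    """Heuristic source-target consistency check based on token-length ratio.

    Drastically mismatched lengths often indicate truncated or
    hallucinated generations.
    """
    src_len, tgt_len = len(src.split()), len(tgt.split())
    if src_len == 0 or tgt_len == 0:
        return False  # incomplete generation
    return lo <= tgt_len / src_len <= hi

def deduplicate(pairs):
    """Drop exact repeats of (source, target) pairs, preserving order."""
    seen = set()
    unique = []
    for src, tgt in pairs:
        key = (src.strip().lower(), tgt.strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))
    return unique

pairs = [
    ("Tôi yêu học máy.", "I love machine learning."),
    ("Tôi yêu học máy.", "I love machine learning."),  # duplicate
    ("Xin chào.", ""),                                 # incomplete generation
]
kept = [p for p in deduplicate(pairs) if length_ratio_ok(*p)]
print(kept)
```

In practice these rules would be combined with LLM-based scoring, which is harder to show in a self-contained snippet.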
Formatting
- normalized Unicode
- standardized punctuation
- removed extremely short or extremely long sentences
- converted examples into a clean parallel translation format compatible with Seq2Seq training
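A minimal sketch of the formatting step, assuming the common Hugging Face `{"translation": {"vi": ..., "en": ...}}` record layout; the length bounds and the helper name `format_example` are illustrative, not taken from the actual pipeline.

```python
import unicodedata

def format_example(vi: str, en: str, min_len: int = 3, max_len: int = 128):
    """Normalize a sentence pair and convert it into a seq2seq training record.

    Returns None when either side falls outside the token-length bounds.
    """
    # NFC normalization matters for Vietnamese, where diacritics can be
    # encoded as combining characters or precomposed code points
    vi = unicodedata.normalize("NFC", vi).strip()
    en = unicodedata.normalize("NFC", en).strip()
    # drop extremely short or extremely long sentences
    if not (min_len <= len(vi.split()) <= max_len):
        return None
    if not (min_len <= len(en.split()) <= max_len):
        return None
    return {"translation": {"vi": vi, "en": en}}

record = format_example("Tôi yêu học máy rất nhiều.", "I love machine learning a lot.")
print(record)
```

Records in this shape can be loaded directly by standard Seq2Seq training scripts.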
Purpose of Synthetic Data
The synthetic dataset was introduced to:
- improve generalization on unseen sentence structures
- increase translation robustness across multiple domains
- expand vocabulary coverage and idiomatic expressions
- reduce overfitting to the original dataset
Uses
Direct Use
- Vietnamese-to-English machine translation
- Text rewriting (via Seq2Seq generation)
- Academic NLP experiments
- MT benchmarking
Downstream Use
- Further fine-tuning for domain-specific translation (medical, legal, technical)
- Use in automated translation pipelines or multilingual chatbots
Out-of-Scope Use
- Not suitable for translating legal or medical content without expert review.
- Does not guarantee full accuracy for texts containing highly specialized terminology.
- Not appropriate for processing sensitive data or personally identifiable information (PII).
Bias, Risks, and Limitations
- The training data originates from open-source corpora, which may introduce stylistic or domain bias.
- The model may produce incorrect translations in cases such as:
- ambiguous sentences
- culturally specific expressions
- very long or structurally complex sentences
- Some translations may lose nuance, tone, or contextual meaning.
Recommendations
Users should manually review translations used in professional, safety-critical, or high-stakes scenarios.
How to Get Started
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "tnguyen20604/vit5-translation-vi2en-v2.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Translate a Vietnamese sentence to English
text = "Tôi yêu học máy."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=128)  # raise the default length cap
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```