Model Card for ViT5 Translation Model

A sequence-to-sequence translation model based on VietAI ViT5-base, fine-tuned for Vietnamese to English machine translation.
This model is intended for general-purpose translation tasks, both academic and production-oriented.


Model Details

Model Description

This model is an encoder–decoder Transformer designed for text-to-text generation tasks such as translation.
It is fine-tuned from VietAI/vit5-base, a pretrained T5-style encoder–decoder model optimized for Vietnamese NLP tasks.

  • Developed by: tnguyen20604
  • Shared by: tnguyen20604
  • Model type: Encoder–Decoder (Text-to-Text Transformer)
  • Language(s): Vietnamese, English
  • License: apache-2.0
  • Fine-tuned from: VietAI/vit5-base
  • Model size: ~0.2B parameters (F32, Safetensors)

Synthetic Data

In addition to real bilingual corpora, this model was also trained using a synthetic English–Vietnamese dataset created to improve translation robustness and domain diversity.

Synthetic Data Generation Process

Synthetic data was generated using a large language model (LLM) through a multi-step workflow:

  1. Prompting an LLM (e.g., GPT or Qwen) to generate bilingual sentence pairs with:

    • correct semantic alignment
    • natural fluency in both languages
    • diverse sentence structures
    • a variety of writing styles (formal, casual, conversational, instructional)
  2. Filtering and Cleaning

    • removed low-quality or incomplete generations
    • removed hallucinated or irrelevant translations
    • used automatic scoring (LLM-based + heuristic rules) to ensure source–target consistency
    • deduplicated repeated patterns
  3. Formatting

    • normalized Unicode
    • standardized punctuation
    • removed extremely short or extremely long sentences
    • converted examples into a clean parallel translation format compatible with Seq2Seq training
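The filtering and formatting steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library; the `clean_pairs` helper, its length thresholds, and the output schema are hypothetical and are not the exact pipeline used to build the dataset:

```python
import unicodedata

def clean_pairs(pairs, min_len=3, max_len=200):
    """Filter and format (vi, en) sentence pairs for Seq2Seq training.

    A rough sketch of the cleaning steps described above; the word-count
    thresholds here are illustrative, not the ones actually used.
    """
    seen = set()
    cleaned = []
    for vi, en in pairs:
        # Normalize Unicode (NFC) and collapse whitespace
        vi = " ".join(unicodedata.normalize("NFC", vi).split())
        en = " ".join(unicodedata.normalize("NFC", en).split())
        # Drop incomplete generations
        if not vi or not en:
            continue
        # Remove extremely short or extremely long sentences
        if not (min_len <= len(vi.split()) <= max_len):
            continue
        if not (min_len <= len(en.split()) <= max_len):
            continue
        # Deduplicate repeated patterns (case-insensitive)
        key = (vi.lower(), en.lower())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"vi": vi, "en": en})
    return cleaned
```

A real pipeline would additionally apply the LLM-based and heuristic consistency scoring mentioned above before keeping a pair.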

Purpose of Synthetic Data

The synthetic dataset was introduced to:

  • improve generalization on unseen sentence structures
  • increase translation robustness across multiple domains
  • expand vocabulary coverage and idiomatic expressions
  • reduce overfitting to the original dataset

Uses

Direct Use

  • Vietnamese → English machine translation
  • Text rewriting (via Seq2Seq generation)
  • Academic NLP experiments
  • MT benchmarking

Downstream Use

  • Further fine-tuning for domain-specific translation (medical, legal, technical)
  • Use in automated translation pipelines or multilingual chatbots
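For the domain-specific fine-tuning use case, parallel data is commonly serialized as JSON Lines with one source/target pair per line, a format that standard Seq2Seq training tools can consume. A minimal sketch; the `write_parallel_jsonl` helper, the `translation`/`vi`/`en` field names, and the sample sentence are illustrative assumptions, not a required schema:

```python
import json

def write_parallel_jsonl(pairs, path):
    """Write (vi, en) pairs as JSON Lines, one translation pair per line."""
    with open(path, "w", encoding="utf-8") as f:
        for vi, en in pairs:
            # Each record holds the Vietnamese source and English target
            record = {"translation": {"vi": vi, "en": en}}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Example: a tiny (hypothetical) medical-domain corpus
write_parallel_jsonl(
    [("Bệnh nhân bị sốt cao.", "The patient has a high fever.")],
    "medical_vi_en.jsonl",
)
```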

Out-of-Scope Use

  • Not suitable for legal or medical content where translation errors could cause harm.
  • Does not guarantee full accuracy for texts containing highly specialized terminology.
  • Not appropriate for processing sensitive data or personally identifiable information (PII).

Bias, Risks, and Limitations

  • The training data originates from open-source corpora, which may introduce stylistic or domain bias.
  • The model may produce incorrect translations in cases such as:
    • ambiguous sentences
    • culturally specific expressions
    • very long or structurally complex sentences
  • Some translations may lose nuance, tone, or contextual meaning.

Recommendations

Users should manually review translations when the model is used in professional, safety-critical, or otherwise high-stakes scenarios.


How to Get Started

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "tnguyen20604/vit5-translation-vi2en-v2.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Translate a Vietnamese sentence into English
text = "Tôi yêu học máy."  # "I love machine learning."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=256)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))