FLAN-T5 Base Dhivehi to Latin Transliteration Model

Model Description

This model is a fine-tuned version of alakxender/flan-t5-base-dhivehi-en-latin, trained to transliterate Dhivehi (Thaana script) into Latin script. It was trained on 20,000 Thaana-Latin transliteration pairs scraped from Mihaaru News.

Training Details

The model was trained for 5 epochs, reaching a training loss of 0.146100 and a validation loss of 0.155982. For detailed training information and access to the dataset, please refer to the GitHub repository.

Usage

from transformers import (
    AutoTokenizer,
    T5ForConditionalGeneration
)

MODEL_CHECKPOINT = "politecat314/flan-t5-base-dv2latin-mihaaru"

# Load the fine-tuned model and tokenizer
fine_tuned_tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
fine_tuned_model = T5ForConditionalGeneration.from_pretrained(MODEL_CHECKPOINT)

# Example text
source_text = "އިންޑިއާގައި ފޭކް އެމްބަސީއެއް ހަދައިގެން އުޅުނު މީހަކު ހައްޔަރުކޮށްފި"
prompt = f"dv2latin: {source_text.strip()}"

# Generate the transliteration
inputs = fine_tuned_tokenizer(prompt, return_tensors="pt")
output_ids = fine_tuned_model.generate(**inputs, max_length=128)
result = fine_tuned_tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(f"\nSource (Dhivehi): {source_text}")
print(f"Result (Latin): {result}")

# Prints:
# Source (Dhivehi): އިންޑިއާގައި ފޭކް އެމްބަސީއެއް ހަދައިގެން އުޅުނު މީހަކު ހައްޔަރުކޮށްފި
# Result (Latin): India gai fake embassy eh hadhaigen ulhunu meehaku hayyaru koffi
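
To transliterate several sentences at once, the single-sentence example above can be extended to a padded batch. This is a minimal sketch, not part of the model repository: `build_prompts` and `transliterate_batch` are hypothetical helper names, and the sketch assumes the same `dv2latin: ` prefix and `max_length=128` used above.

```python
from typing import List

# Task prefix taken from the usage example above.
PREFIX = "dv2latin: "


def build_prompts(texts: List[str]) -> List[str]:
    """Strip each source sentence and prepend the task prefix."""
    return [f"{PREFIX}{text.strip()}" for text in texts]


def transliterate_batch(texts: List[str], model, tokenizer, max_length: int = 128) -> List[str]:
    """Transliterate several Thaana sentences with one generate() call.

    `model` and `tokenizer` are the objects loaded in the usage example
    (fine_tuned_model and fine_tuned_tokenizer).
    """
    # padding=True pads shorter prompts so the batch forms one tensor.
    inputs = tokenizer(
        build_prompts(texts),
        return_tensors="pt",
        padding=True,
        truncation=True,
    )
    output_ids = model.generate(**inputs, max_length=max_length)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

With the model and tokenizer loaded as above, a call would look like `transliterate_batch([source_text], fine_tuned_model, fine_tuned_tokenizer)`, which returns one Latin string per input sentence.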

Examples

Successful Transliterations

Input: މޯލްޑިވިއަންގެ ބޯޓުތަކުގައި ތާނައިން ރާއްޖޭގެ ނަން ލިޔެފި  
Output: Maldivian ge boat thakugai Thaanain Raajjeyge nan liyefi

Input: ޗައިނާގެ ގްރޫޕަކުން ސިންގަޕޫރަށް ވަރުގަދަ ސައިބާ ހަމަލާތަކެއް ދީފި  
Output: China ge group eh Singapore ah varugadha saba hamalaathakeh dheefi

Failure Case

Input: މެލޭޝިޔާގައި ބޮޑުވަޒީރުގެ އިސްތިއުފާއަށް ގޮވާ، ސަރުކާރާ ދެކޮޅަށް މުޒާހަރާ ކުރަނީ  
Output: sarukaaraa dhekolhah muzaaharaa kuranee

Note: This is a failure case. The output covers only the clause after the comma; the words before the comma are missing from the transliteration.

Model Details

  • Base Model: alakxender/flan-t5-base-dhivehi-en-latin
  • Task: Text-to-text generation (transliteration)
  • Language: Dhivehi (Thaana) → Latin script
  • Training Data: 20K Thaana-Latin pairs from Mihaaru News
  • Architecture: T5 (Text-to-Text Transfer Transformer)

Limitations

As demonstrated in the failure case above, the model may occasionally drop words during transliteration. Users should check outputs against the source text before relying on the model in critical applications.
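
One cheap way to catch outputs like the failure case is to compare word counts. The sketch below is a heuristic of our own, not something shipped with the model: `looks_truncated` is a hypothetical helper, and the default threshold of 1.0 is an assumption based only on the examples on this page, where successful outputs have at least as many space-separated words as their source.

```python
def looks_truncated(source: str, output: str, min_ratio: float = 1.0) -> bool:
    """Return True when the output has suspiciously few words for its source.

    Heuristic only: it counts whitespace-separated words and flags the pair
    when len(output words) / len(source words) falls below min_ratio.
    """
    source_words = source.split()
    output_words = output.split()
    if not source_words:
        return False  # nothing to compare against
    return len(output_words) / len(source_words) < min_ratio


# The failure case and the first successful example from this page.
failure_source = "މެލޭޝިޔާގައި ބޮޑުވަޒީރުގެ އިސްތިއުފާއަށް ގޮވާ، ސަރުކާރާ ދެކޮޅަށް މުޒާހަރާ ކުރަނީ"
failure_output = "sarukaaraa dhekolhah muzaaharaa kuranee"
success_source = "މޯލްޑިވިއަންގެ ބޯޓުތަކުގައި ތާނައިން ރާއްޖޭގެ ނަން ލިޔެފި"
success_output = "Maldivian ge boat thakugai Thaanain Raajjeyge nan liyefi"

print(looks_truncated(failure_source, failure_output))
print(looks_truncated(success_source, success_output))
```

A flagged pair is only a hint that words may be missing; the threshold may need tuning on other text, and a passing check does not guarantee a correct transliteration.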

Resources

  • Model size: 0.2B params (Safetensors, tensor type F32)