FLAN-T5 Base Dhivehi to Latin Transliteration Model

Model Description

This model is a fine-tuned version of alakxender/flan-t5-base-dhivehi-en-latin for transliterating Dhivehi written in Thaana script into Latin script. It was trained on 20,000 Thaana-Latin transliteration pairs scraped from Mihaaru News.

Training Details

The model was trained for 5 epochs, with a training loss of 0.146100 and a validation loss of 0.155982. For detailed training information and access to the dataset, see the GitHub repository.

Usage

from transformers import (
    AutoTokenizer,
    T5ForConditionalGeneration
)

MODEL_CHECKPOINT = "politecat314/flan-t5-base-dv2latin-mihaaru"

# Load the fine-tuned model and tokenizer
fine_tuned_tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
fine_tuned_model = T5ForConditionalGeneration.from_pretrained(MODEL_CHECKPOINT)

# Example text
source_text = "އިންޑިއާގައި ފޭކް އެމްބަސީއެއް ހަދައިގެން އުޅުނު މީހަކު ހައްޔަރުކޮށްފި"
prompt = f"dv2latin: {source_text.strip()}"

# Generate transliteration
inputs = fine_tuned_tokenizer(prompt, return_tensors="pt")
output_ids = fine_tuned_model.generate(**inputs, max_length=128)
result = fine_tuned_tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(f"\nSource (Dhivehi): {source_text}")
print(f"Result (Latin): {result}")

# PRINTS
# Source (Dhivehi): އިންޑިއާގައި ފޭކް އެމްބަސީއެއް ހަދައިގެން އުޅުނު މީހަކު ހައްޔަރުކޮށްފި
# Result (Latin): India gai fake embassy eh hadhaigen ulhunu meehaku hayyaru koffi
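To process several headlines at once, the single-sentence example above can be wrapped in a small helper. This is a minimal sketch, not part of the original training code: `build_prompts` and `transliterate_batch` are hypothetical names, and the only thing taken from this card is the `dv2latin: ` prefix and the `max_length=128` setting. It assumes a model and tokenizer loaded as in the usage example.

```python
from typing import Iterable, List

# Task prefix taken from the usage example on this card.
TASK_PREFIX = "dv2latin: "


def build_prompts(texts: Iterable[str]) -> List[str]:
    """Strip each input and prepend the task prefix (hypothetical helper)."""
    return [f"{TASK_PREFIX}{text.strip()}" for text in texts]


def transliterate_batch(texts, model, tokenizer, max_length=128):
    """Transliterate several Thaana strings with a single generate() call.

    `model` and `tokenizer` are assumed to be the objects returned by
    T5ForConditionalGeneration.from_pretrained and AutoTokenizer.from_pretrained.
    """
    prompts = build_prompts(texts)
    if not prompts:
        return []
    # padding=True pads the batch to its longest prompt so it can be stacked.
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    output_ids = model.generate(**inputs, max_length=max_length)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

With the objects from the example above, a call would look like `transliterate_batch([source_text], fine_tuned_model, fine_tuned_tokenizer)`, which returns one Latin string per input in the same order.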

Examples

Successful Transliterations

Input: މޯލްޑިވިއަންގެ ބޯޓުތަކުގައި ތާނައިން ރާއްޖޭގެ ނަން ލިޔެފި  
Output: Maldivian ge boat thakugai Thaanain Raajjeyge nan liyefi

Input: ޗައިނާގެ ގްރޫޕަކުން ސިންގަޕޫރަށް ވަރުގަދަ ސައިބާ ހަމަލާތަކެއް ދީފި  
Output: China ge group eh Singapore ah varugadha saba hamalaathakeh dheefi

Failure Case

Input: މެލޭޝިޔާގައި ބޮޑުވަޒީރުގެ އިސްތިއުފާއަށް ގޮވާ، ސަރުކާރާ ދެކޮޅަށް މުޒާހަރާ ކުރަނީ  
Output: sarukaaraa dhekolhah muzaaharaa kuranee

Note: This is a failure case. The output covers only the clause after the comma (،); the four words before it are missing from the transliteration.
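Omissions like this can be flagged with a simple token-count comparison. The sketch below is a heuristic, not something the model provides: `looks_truncated` is a hypothetical helper, and the 0.8 threshold is an assumption based only on the examples on this card, where complete outputs have at least as many whitespace-separated words as the source because suffixes such as "ge" and "eh" are written separately.

```python
def looks_truncated(source: str, output: str, min_ratio: float = 0.8) -> bool:
    """Return True when the output has noticeably fewer words than the source.

    Words are whitespace-separated tokens. `min_ratio` is an assumed
    threshold and may need tuning on real data.
    """
    source_words = source.split()
    output_words = output.split()
    if not source_words:
        return False
    return len(output_words) / len(source_words) < min_ratio


# The failure case from this card: 8 source words, 4 output words.
failed_source = "މެލޭޝިޔާގައި ބޮޑުވަޒީރުގެ އިސްތިއުފާއަށް ގޮވާ، ސަރުކާރާ ދެކޮޅަށް މުޒާހަރާ ކުރަނީ"
failed_output = "sarukaaraa dhekolhah muzaaharaa kuranee"
print(looks_truncated(failed_source, failed_output))
```

A flagged output is only a candidate for review; a low ratio does not prove that words were dropped, and a normal ratio does not prove that none were.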

Model Details

Base Model: alakxender/flan-t5-base-dhivehi-en-latin
Task: Text-to-text generation (transliteration)
Language: Dhivehi (Thaana) → Latin script
Training Data: 20K Thaana-Latin pairs from Mihaaru News
Architecture: T5 (Text-to-Text Transfer Transformer)

Limitations

As the failure case above shows, the model can occasionally drop words during transliteration. For critical applications, check the output against the source text.
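Because the failure case drops the clause before a comma, one possible workaround is to transliterate each clause separately and rejoin the results. This is an untested assumption rather than a documented fix: `split_clauses` and `transliterate_by_clause` are hypothetical helpers, and `transliterate` stands for any function that maps one Thaana string to its Latin output, such as a wrapper around the usage example.

```python
import re
from typing import Callable, List

# Split on the Arabic comma (U+060C), Arabic semicolon (U+061B), "," and ";".
CLAUSE_BREAK = re.compile(r"\s*[\u060C,;\u061B]\s*")


def split_clauses(text: str) -> List[str]:
    """Split text at clause punctuation and drop empty pieces."""
    return [part for part in CLAUSE_BREAK.split(text.strip()) if part]


def transliterate_by_clause(text: str, transliterate: Callable[[str], str]) -> str:
    """Transliterate each clause on its own and join the results with ', '.

    The original punctuation is not preserved; every break becomes a comma.
    """
    return ", ".join(transliterate(clause) for clause in split_clauses(text))
```

Whether this reduces omissions for this model has not been measured, and splitting removes cross-clause context, so it is worth comparing both outputs before relying on it.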

Resources

Training Code & Dataset: GitHub Repository
Base Model: alakxender/flan-t5-base-dhivehi-en-latin
Data Source: Mihaaru News


Safetensors

Model size: 0.2B params
Tensor type: F32
