FLAN-T5 Base Dhivehi to Latin Transliteration Model
Model Description
This model is a fine-tuned version of alakxender/flan-t5-base-dhivehi-en-latin for Dhivehi (Thaana script) to Latin script transliteration. It was trained on 20,000 Thaana-Latin transliteration pairs scraped from Mihaaru News.
Training Details
The model was trained for 5 epochs, reaching a training loss of 0.146100 and a validation loss of 0.155982. For detailed training information and access to the dataset, please refer to the GitHub repository.
Usage
```python
from transformers import (
    AutoTokenizer,
    T5ForConditionalGeneration,
)

MODEL_CHECKPOINT = "politecat314/flan-t5-base-dv2latin-mihaaru"

# Load the fine-tuned model and tokenizer
fine_tuned_tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
fine_tuned_model = T5ForConditionalGeneration.from_pretrained(MODEL_CHECKPOINT)

# Example text (Thaana script)
source_text = "އިންޑިއާގައި ފޭކް އެމްބަސީއެއް ހަދައިގެން އުޅުނު މީހަކު ހައްޔަރުކޮށްފި"

# The model expects the "dv2latin:" task prefix used during fine-tuning
prompt = f"dv2latin: {source_text.strip()}"

# Generate the transliteration
inputs = fine_tuned_tokenizer(prompt, return_tensors="pt")
output_ids = fine_tuned_model.generate(**inputs, max_length=128)
result = fine_tuned_tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(f"\nSource (Dhivehi): {source_text}")
print(f"Result (Latin): {result}")

# PRINTS
# Source (Dhivehi): އިންޑިއާގައި ފޭކް އެމްބަސީއެއް ހަދައިގެން އުޅުނު މީހަކު ހައްޔަރުކޮށްފި
# Result (Latin): India gai fake embassy eh hadhaigen ulhunu meehaku hayyaru koffi
```
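For transliterating many sentences, batching the inputs is more efficient than one call per sentence. The sketch below is illustrative and not part of the released code: `build_prompt` and `transliterate_batch` are hypothetical helpers that assume the tokenizer and model loaded above.

```python
def build_prompt(text: str) -> str:
    # Prefix Thaana text with the "dv2latin:" task tag the model was trained on
    return f"dv2latin: {text.strip()}"


def transliterate_batch(texts, tokenizer, model, max_length=128):
    # Tokenize all prompts together; padding aligns sequences of unequal length
    prompts = [build_prompt(t) for t in texts]
    inputs = tokenizer(prompts, return_tensors="pt", padding=True)
    output_ids = model.generate(**inputs, max_length=max_length)
    return [
        tokenizer.decode(ids, skip_special_tokens=True) for ids in output_ids
    ]
```

Call it as `transliterate_batch(sentences, fine_tuned_tokenizer, fine_tuned_model)` to get a list of Latin-script strings in the same order as the input.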
Examples
Successful Transliterations
Input: މޯލްޑިވިއަންގެ ބޯޓުތަކުގައި ތާނައިން ރާއްޖޭގެ ނަން ލިޔެފި
Output: Maldivian ge boat thakugai Thaanain Raajjeyge nan liyefi
Input: ޗައިނާގެ ގްރޫޕަކުން ސިންގަޕޫރަށް ވަރުގަދަ ސައިބާ ހަމަލާތަކެއް ދީފި
Output: China ge group eh Singapore ah varugadha saba hamalaathakeh dheefi
Failure Case
Input: މެލޭޝިޔާގައި ބޮޑުވަޒީރުގެ އިސްތިއުފާއަށް ގޮވާ، ސަރުކާރާ ދެކޮޅަށް މުޒާހަރާ ކުރަނީ
Output: sarukaaraa dhekolhah muzaaharaa kuranee
Note: This is a failure case where some words are missing from the transliteration.
Model Details
- Base Model: alakxender/flan-t5-base-dhivehi-en-latin
- Task: Text-to-text generation (transliteration)
- Language: Dhivehi (Thaana) → Latin script
- Training Data: 20K Thaana-Latin pairs from Mihaaru News
- Architecture: T5 (Text-to-Text Transfer Transformer)
Limitations
As the failure case above demonstrates, the model may occasionally drop words during transliteration. Users should verify outputs before relying on the model in critical applications.
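Because the transliteration is roughly word-for-word, one cheap way to flag this failure mode is to compare word counts between source and output. The helper below is an illustrative heuristic, not part of the model or its released code, and the 0.6 ratio is an assumed threshold rather than a tuned value.

```python
def likely_truncated(source: str, output: str, ratio: float = 0.6) -> bool:
    """Flag outputs whose word count falls far below the source's.

    Transliteration is roughly one output word per input word, so a much
    shorter output suggests the model skipped part of the input. The 0.6
    threshold is an illustrative guess, not a tuned value.
    """
    src_words = len(source.split())
    out_words = len(output.split())
    return out_words < ratio * src_words
```

For the failure case above (8 source words, 4 output words), this check would return `True` and the sentence could be retried or routed for manual review.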
Resources
- Training Code & Dataset: GitHub Repository
- Base Model: alakxender/flan-t5-base-dhivehi-en-latin
- Data Source: Mihaaru News