Dhivehi ByT5 Latin to Thaana (v1)

This model transliterates Latin-script (romanized) Dhivehi into Thaana. It is optimized for Maldivian news media and handles both formal journalistic text and casual social media typing.

It functions as a "Hybrid" model: it respects the casual spacing of chat messages while correctly applying formal grammatical rules (such as compounding verbs) when it detects a news context.

🧠 Training Strategy

  1. Base Model: google/byt5-small
  2. General Fine-tuning: Tuned on the alakxender/dhivehi-transliteration-pairs dataset (~150k pairs) to learn general phonetics and spelling.
  3. Domain Adaptation: Further fine-tuned on a high-quality dataset of 10k news headlines to recognize formal entities and apply correct grammatical spacing for official terms.
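
Because the base model is ByT5, it reads and writes raw UTF-8 bytes rather than subword tokens, so sequence lengths are counted in bytes. The sketch below mimics that mapping in plain Python, assuming the usual ByT5 convention of three reserved IDs (pad, EOS, unknown) ahead of the 256 byte values. `byt5_ids` is a hypothetical helper name for illustration, not a library function. It suggests why Thaana output tends to be longer in tokens than its Latin input: each Thaana character occupies two bytes, each ASCII Latin character one.

```python
# Illustrative only: mimics how a ByT5-style tokenizer maps text to IDs.
# Assumption: IDs 0-2 are reserved (pad, EOS, unknown), so byte b maps to b + 3.
BYTE_OFFSET = 3
EOS_ID = 1


def byt5_ids(text: str) -> list[int]:
    """Return byte-level IDs for `text`, with an EOS marker appended."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


latin = "miadhu"
thaana = "މިއަދު"

latin_ids = byt5_ids(latin)
thaana_ids = byt5_ids(thaana)

print(len(latin), len(latin_ids))    # characters vs. tokens for the Latin input
print(len(thaana), len(thaana_ids))  # characters vs. tokens for the Thaana output
```

In practice the real tokenizer loaded in the Usage section does this work; the sketch is only meant to make the byte-level length budget visible.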

📊 Performance Samples

| Category | Latin Input | Model Output |
| --- | --- | --- |
| Formal Grammar | Raeesul jumhooriyya... thasdheegu kuravvaifi | ރައީސުލް ޖުމްހޫރިއްޔާ... ތަސްދީގުކުރައްވައިފި |
| Casual / Chat | Aharen miadhu varah ban'du hai | އަހަރެން މިއަދު ވަރަށް ބަނޑުހައި |
| Official Titles | Minister of Foreign Affairs Moosa Zameer | މިނިސްޓަރު އޮފް ފޮރިން އެފެއާސް މޫސަ ޒަމީރު |
| News Phrasing | Police service in vanee ekan kuhveri kohfa | ޕޮލިސް ސާވިސްއިން ވަނީ އެކަން ކުށްވެރިކޮށްފައި |
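
One way to put numbers on samples like these is character error rate (CER): the edit distance between a model output and a reference transliteration, divided by the reference length. The following is only a sketch of how such a check might be written, not a description of how this model was evaluated. `edit_distance` and `cer` are hypothetical helper names, and distance is counted over Unicode code points, so each Thaana consonant and each vowel sign counts as a separate unit. The hypothesis string is a made-up example with a stray space, not actual model output.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of `hypothesis` against a non-empty `reference`."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(hypothesis, reference) / len(reference)


# Compounded verb form (reference) versus a made-up variant with a stray space.
reference = "ތަސްދީގުކުރައްވައިފި"
hypothesis = "ތަސްދީގު ކުރައްވައިފި"
print(f"CER = {cer(hypothesis, reference):.3f}")
```

A spacing mistake such as the one above costs a single edit, so CER alone may understate grammatical errors; word-level exact match could be tracked alongside it.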

💻 Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("Neobe/dhivehi-byt5-latin2thaana-v1")
model = AutoModelForSeq2SeqLM.from_pretrained("Neobe/dhivehi-byt5-latin2thaana-v1")

# Latin-script Dhivehi input
text = "Raeesul jumhooriyya miadhu ganoonu thasdheegu kuravvaifi"
inputs = tokenizer(text, return_tensors="pt")

# ByT5 tokens are UTF-8 bytes and each Thaana character takes two bytes,
# so max_length=256 allows roughly 128 Thaana characters of output.
outputs = model.generate(**inputs, max_length=256)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: ރައީސުލް ޖުމްހޫރިއްޔާ މިއަދު ގާނޫނު ތަސްދީގުކުރައްވައިފި
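
The `max_length=256` cap in the snippet above limits the number of generated tokens, so long inputs may come back truncated. One possible workaround is to split the Latin text on whitespace into smaller pieces and transliterate each piece separately. The helper below is a minimal sketch of that idea; `split_by_bytes` and the byte budget are hypothetical choices for illustration, not settings recommended by the model's author.

```python
def split_by_bytes(text: str, max_bytes: int = 100) -> list[str]:
    """Greedily group whitespace-separated words into chunks of at most
    `max_bytes` UTF-8 bytes. A single word longer than the budget is kept
    whole in its own chunk rather than being cut mid-word."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if current and len(candidate.encode("utf-8")) > max_bytes:
            chunks.append(current)  # budget exceeded: close the current chunk
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "Raeesul jumhooriyya miadhu ganoonu thasdheegu kuravvaifi"
for piece in split_by_bytes(text, max_bytes=30):
    print(piece)
```

Each piece can then be passed through the tokenizer and `model.generate` as shown earlier and the results joined with spaces. A chunk boundary may fall between words the model would otherwise compound (such as a noun and its verb), so splitting at sentence boundaries is likely the safer choice when the input allows it.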