Dhivehi ByT5 Latin to Thaana (v1)
This is a specialized Latin-to-Thaana transliteration model for Dhivehi. It is optimized for Maldivian news media and handles both formal journalistic text and casual social media typing.
It functions as a "Hybrid" model: it respects the casual spacing of chat messages while correctly applying formal grammatical rules (such as compounding verbs) when it detects a news context.
🧠 Training Strategy
- Base Model: google/byt5-small
- General Fine-tuning: Tuned on the alakxender/dhivehi-transliteration-pairs dataset (~150k pairs) to learn general phonetics and spelling.
- Domain Adaptation: Further fine-tuned on a high-quality dataset of 10k news headlines to recognize formal entities and apply correct grammatical spacing for official terms.
📊 Performance Samples
| Category | Latin Input | Model Output |
|---|---|---|
| Formal Grammar | Raeesul jumhooriyya... thasdheegu kuravvaifi | ރައީސުލް ޖުމްހޫރިއްޔާ... ތަސްދީގުކުރައްވައިފި |
| Casual / Chat | Aharen miadhu varah ban'du hai | އަހަރެން މިއަދު ވަރަށް ބަނޑުހައި |
| Official Titles | Minister of Foreign Affairs Moosa Zameer | މިނިސްޓަރު އޮފް ފޮރިން އެފެއާސް މޫސަ ޒަމީރު |
| News Phrasing | Police service in vanee ekan kuhveri kohfa | ޕޮލިސް ސާވިސްއިން ވަނީ އެކަން ކުށްވެރިކޮށްފައި |
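The samples above are qualitative. To score model outputs against reference transliterations, one common option is character error rate (CER): edit distance divided by reference length. The sketch below is an illustration only, not the evaluation used for this model; the function names `edit_distance` and `cer` are hypothetical. It compares Unicode code points, so a Thaana consonant and its vowel sign (fili) count as two separate characters.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of a hypothesis against a non-empty reference."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(hypothesis, reference) / len(reference)
```

A CER of 0.0 means the output matches the reference exactly; values can exceed 1.0 when the output is much longer than the reference.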
💻 Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Neobe/dhivehi-byt5-latin2thaana-v1")
model = AutoModelForSeq2SeqLM.from_pretrained("Neobe/dhivehi-byt5-latin2thaana-v1")

text = "Raeesul jumhooriyya miadhu ganoonu thasdheegu kuravvaifi"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: ރައީސުލް ޖުމްހޫރިއްޔާ މިއަދު ގާނޫނު ތަސްދީގުކުރައްވައިފި
```
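The `max_length=256` passed to `generate` is counted in tokens, and ByT5 tokens are UTF-8 bytes rather than characters. Thaana letters and vowel signs each encode to two bytes, so Thaana output uses roughly twice as many tokens as it has characters, and long inputs may be truncated. The sketch below estimates how many tokens a given Thaana string would need. The helper names and the allowance of two extra tokens (decoder start and end-of-sequence) are assumptions made for illustration, not part of the model's API.

```python
def estimate_byt5_tokens(text: str, extra_tokens: int = 2) -> int:
    """Approximate ByT5 token count: one token per UTF-8 byte.

    extra_tokens is an assumed allowance for the decoder start and
    end-of-sequence tokens; adjust it if your setup differs.
    """
    return len(text.encode("utf-8")) + extra_tokens


def fits_budget(text: str, max_length: int = 256) -> bool:
    """Return True if the text is expected to fit within max_length tokens."""
    return estimate_byt5_tokens(text) <= max_length


if __name__ == "__main__":
    expected_output = "ރައީސުލް ޖުމްހޫރިއްޔާ މިއަދު ގާނޫނު ތަސްދީގުކުރައްވައިފި"
    print(estimate_byt5_tokens(expected_output), fits_budget(expected_output))
```

If an expected output does not fit, one option is to split the Latin input into sentences and transliterate each separately, or to raise `max_length`.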