Kurdish Sorani Spellchecker (ByT5-base)
A ByT5-base sequence-to-sequence model fine-tuned for automatic Kurdish Sorani spellchecking.
Model Description
This model takes a potentially misspelled Kurdish Sorani sentence as input and outputs the corrected version. It operates at the byte level using Google's ByT5 architecture, making it robust to the character-level variations common in Kurdish text.
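To illustrate what "byte level" means here, the sketch below shows how two visually similar letters, Arabic kaf (U+0643) and the Kurdish kaf form (U+06A9), map to different byte sequences. The helper `byt5_ids` is hypothetical and only approximates the ByT5 tokenizer, on the assumption that each UTF-8 byte is shifted by 3 to leave room for the pad, EOS, and unknown tokens; it does not append EOS.

```python
from typing import List

# Two letters that are easy to confuse in Kurdish text.
ARABIC_KAF = "\u0643"   # Arabic kaf
KURDISH_KAF = "\u06a9"  # Kurdish kaf (keheh)

def byt5_ids(text: str) -> List[int]:
    """Approximate ByT5 input IDs: one ID per UTF-8 byte, offset by 3.

    Assumption: IDs 0, 1, 2 are reserved for pad, EOS, and unknown,
    and no EOS token is appended here.
    """
    return [byte + 3 for byte in text.encode("utf-8")]

if __name__ == "__main__":
    for name, letter in [("Arabic kaf", ARABIC_KAF), ("Kurdish kaf", KURDISH_KAF)]:
        print(name, letter.encode("utf-8").hex(), byt5_ids(letter))
```

Each letter occupies two bytes, so the model sees two tokens per letter and needs no vocabulary entry for either variant.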
Usage
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "akam-ot/kurdish-spellchecker-sorani-byt5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def spellcheck(text):
    # Tokenize the sentence; ByT5 tokens are UTF-8 bytes.
    inputs = tokenizer(text, return_tensors="pt")
    input_len = inputs["input_ids"].shape[1]
    # Inference only, so gradients are not needed.
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=input_len + 20,  # allow some headroom over the input length
            num_beams=4,
            early_stopping=True,
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(spellcheck("كوردستان ولاتێكە"))
# → کوردستان وڵاتێکە
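When reviewing corrections, it can help to see exactly which characters the model changed. The following is a minimal sketch using Python's standard `difflib`; the helper name `changed_spans` is hypothetical and not part of the model or of `transformers`. It compares any input string with any corrected string, so it works on the output of `spellcheck` without loading the model itself.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

def changed_spans(original: str, corrected: str) -> List[Tuple[str, str]]:
    """Return (before, after) pairs for every span that differs.

    An insertion has an empty 'before'; a deletion has an empty 'after'.
    """
    matcher = SequenceMatcher(None, original, corrected, autojunk=False)
    return [
        (original[i1:i2], corrected[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

if __name__ == "__main__":
    # Plain strings stand in for a model input and its correction.
    for before, after in changed_spans("\u0643a", "\u06a9a"):
        print(repr(before), "->", repr(after))
```

An empty result means the model left the sentence unchanged, which is a cheap way to flag sentences that needed no correction.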
Limitations
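- Optimized for Sorani Kurdish; not tested on Kurmanji or other dialects
- Works best on sentences of up to ~128 tokens (with ByT5, one token is one UTF-8 byte, and Sorani letters take two bytes each); split longer texts into sentences first
- May occasionally hallucinate, that is, produce words not supported by the input, on very short or ambiguous inputs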
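The advice to split longer texts can be sketched as follows. This is an illustrative helper, not part of the model: the names `split_sentences`, `byte_length`, and `spellcheck_long` are hypothetical, the set of sentence terminators is an assumption, and the UTF-8 byte count is used only as a proxy for the ByT5 token count. The `spellcheck` argument is any function that maps a sentence to its corrected form, such as the one defined under Usage.

```python
import re
from typing import Callable, List

# Assumed sentence terminators: ".", "!", "?" and the Arabic question mark (U+061F).
_SENTENCE_END = re.compile("(?<=[.!?\u061f])\\s+")

def split_sentences(text: str) -> List[str]:
    """Split after sentence-ending punctuation and drop empty pieces."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]

def byte_length(sentence: str) -> int:
    """UTF-8 byte count, a rough proxy for the ByT5 token count (EOS excluded)."""
    return len(sentence.encode("utf-8"))

def spellcheck_long(text: str, spellcheck: Callable[[str], str]) -> str:
    """Correct each sentence separately and rejoin the results with single spaces."""
    return " ".join(spellcheck(sentence) for sentence in split_sentences(text))
```

Note that rejoining with single spaces discards the original line breaks, and a sentence whose `byte_length` is still well above 128 may need further splitting, for example at commas.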
Author
Akam GitHub · HuggingFace
Base model
google/byt5-base