Kurdish Sorani Spellchecker (ByT5-base)

A ByT5-base sequence-to-sequence model fine-tuned for automatic Kurdish Sorani spellchecking.

Model Description

This model takes a potentially misspelled Kurdish Sorani sentence as input and outputs the corrected version. It is built on Google's ByT5 architecture, which reads raw UTF-8 bytes instead of a subword vocabulary. This makes it robust to the character-level variations common in Kurdish text, such as the Arabic ك typed in place of the Kurdish ک.
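To make the byte-level point concrete, here is a minimal pure-Python sketch of how a ByT5-style tokenizer turns text into ids. It assumes the usual ByT5 convention (ids 0 to 2 reserved for padding, end-of-sequence and unknown, so each byte is shifted by 3, and an end-of-sequence id of 1 is appended). The helper name `to_byte_ids` is hypothetical and is not part of the model's code.

```python
# Assumption: ByT5 reserves ids 0 (pad), 1 (eos), 2 (unk), so byte b maps to b + 3.
BYTE_OFFSET = 3
EOS_ID = 1

def to_byte_ids(text):
    """Hypothetical helper: mimic byte-level tokenization without loading a model."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")] + [EOS_ID]

arabic_kaf = "\u0643"   # ك, Arabic letter kaf
kurdish_kaf = "\u06a9"  # ک, keheh, the form used in Sorani

# Both letters take two UTF-8 bytes, so each becomes two tokens plus EOS.
print(to_byte_ids(arabic_kaf))
print(to_byte_ids(kurdish_kaf))
```

Because the two letters differ only in their byte values, the model sees a small, local difference rather than two unrelated vocabulary entries.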

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "akam-ot/kurdish-spellchecker-sorani-byt5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

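def spellcheck(text):
    """Return the corrected version of a single Sorani sentence."""
    inputs = tokenizer(text, return_tensors="pt")
    # ByT5 tokens are bytes, so this is the input length in bytes (plus EOS).
    input_len = inputs["input_ids"].shape[1]
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model.generate(
            **inputs,
            # A correction should be about as long as the input; allow some slack.
            max_length=input_len + 20,
            num_beams=4,
            early_stopping=True,
        )
    # Keep the best beam and drop padding and EOS tokens.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)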

print(spellcheck("كوردستان ولاتێكە"))
# → کوردستان وڵاتێکە

Limitations

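Optimized for Sorani Kurdish; not tested on Kurmanji or other dialects
Best on sentences up to ~128 tokens; ByT5 tokens are UTF-8 bytes and each Arabic-script letter takes two bytes, so this is well under 128 characters. Longer texts should be split into sentences
May occasionally hallucinate (return text that departs from the input beyond spelling fixes) on very short or ambiguous inputs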
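One way to work within the last two limitations is to split long text into sentences before correction and to discard any correction that drifts too far from its input. The sketch below illustrates that idea and is not part of the model. The helper names, the punctuation set and the `min_ratio` threshold of 0.6 are all assumptions, and `spellcheck_fn` stands for any function that maps one sentence to its correction (for example `spellcheck` from the Usage section).

```python
import re
from difflib import SequenceMatcher

# Split at whitespace that follows ". ! ?" or the Arabic question mark.
_SENTENCE_END = re.compile(r"(?<=[.!?؟])\s+")

def split_sentences(text):
    """Hypothetical helper: naive sentence splitter for Sorani text."""
    return [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]

def similarity(a, b):
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a, b).ratio()

def correct_long_text(text, spellcheck_fn, min_ratio=0.6):
    """Correct text sentence by sentence, keeping the original sentence
    whenever the correction looks too different (assumed threshold)."""
    results = []
    for sentence in split_sentences(text):
        corrected = spellcheck_fn(sentence)
        if similarity(sentence, corrected) >= min_ratio:
            results.append(corrected)
        else:
            results.append(sentence)  # likely hallucination, fall back
    return " ".join(results)
```

The threshold trades recall for safety: a higher `min_ratio` rejects more hallucinations but also rejects legitimate corrections of heavily misspelled sentences, so it is worth tuning on real data.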

Author

Akam (GitHub · HuggingFace)
