Kurdish NER with XLM-R

This is a fine-tuned xlm-roberta-base model for Named Entity Recognition (NER) in Kurmanji Kurdish. It was trained on a manually annotated dataset of 7,919 sentences covering news and other text sources. The model identifies the following entity types:

PER: Person
LOC: Location
ORG: Organization

Model Details

Base model: xlm-roberta-base (270M parameters)
Fine-tuning hyperparameters:
- Epochs: 5
- Batch size: 8
- Max seq length: 128 tokens
- Optimizer: AdamW
- Learning rate: 2e-5
- Weight decay: 0.01
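
For readers who want to reproduce a similar setup, the values above map onto the Transformers Trainer API roughly as follows. This is a minimal sketch, not the author's training script. The helper names `build_training_args` and `tokenize_words` and the output directory are hypothetical, and it assumes the batch size of 8 is per device and that the Trainer's default AdamW optimizer was used.

```python
# Hyperparameters as listed in this card, keyed by the matching
# transformers.TrainingArguments keyword names.
HPARAMS = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 8,  # assumption: 8 is the per-device batch size
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
}

# The maximum sequence length is applied at tokenization time, not by the Trainer.
MAX_SEQ_LENGTH = 128


def build_training_args(output_dir="ku-ner-xlmr-out"):  # hypothetical directory name
    """Build TrainingArguments from HPARAMS (requires transformers and torch)."""
    from transformers import TrainingArguments

    # The Trainer uses an AdamW optimizer by default, so none is set explicitly.
    return TrainingArguments(output_dir=output_dir, **HPARAMS)


def tokenize_words(tokenizer, words):
    """Tokenize a pre-split sentence, truncating to MAX_SEQ_LENGTH subword tokens."""
    return tokenizer(
        words,
        is_split_into_words=True,
        truncation=True,
        max_length=MAX_SEQ_LENGTH,
    )
```

Keeping the values in a plain dictionary makes them easy to log or compare across runs before any heavy library is imported.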

Intended Use

- Extract named entities from Kurmanji Kurdish text (news, social media, etc.)
- Aid in information extraction, digital humanities, and low-resource language research

Evaluation Metrics

Final held-out test set: 716 sentences (≈12k tokens)

Metric	Value
Precision	0.8668
Recall	0.8803
F1 Score	0.8735
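
As a quick sanity check, the reported F1 score can be recomputed as the harmonic mean of the reported precision and recall. The sketch below only verifies that the three numbers are mutually consistent. It does not reproduce the evaluation itself, and the function name `f1_score` is just an illustrative helper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table above.
precision, recall = 0.8668, 0.8803
f1 = f1_score(precision, recall)
print(f"{f1:.4f}")  # prints 0.8735
```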

Try it Online

👉 Streamlit Demo
Paste a sentence in Kurmanji Kurdish (Latin script) and explore the model’s predictions in your browser.

How to Use

You can load and use the model via Hugging Face 🤗 Transformers:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
model_id = "akam-ot/ku-ner-xlmr"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# aggregation_strategy="simple" merges subword tokens into whole entity spans
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

sentence = "Navê min Hejar e û ez li Hewlêr dijîm."
results = ner(sentence)

# Each result is a dict with the entity text, its type, and a confidence score
for ent in results:
    print(f"{ent['word']} → {ent['entity_group']} (score: {ent['score']:.2f})")
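
Downstream code often needs the entities grouped by type rather than printed one by one. The sketch below shows one possible way to do that on a list shaped like the pipeline output. The helper `group_entities` and its `min_score` threshold are hypothetical, and the sample records are hand-written placeholders with made-up scores, not real predictions from this model.

```python
from collections import defaultdict


def group_entities(results, min_score=0.5):
    """Group entity strings by type, dropping predictions below min_score."""
    grouped = defaultdict(list)
    for ent in results:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Placeholder records mimicking the pipeline's output format.
# The scores are invented for illustration only.
sample_results = [
    {"entity_group": "PER", "word": "Hejar", "score": 0.99, "start": 9, "end": 14},
    {"entity_group": "LOC", "word": "Hewlêr", "score": 0.98, "start": 25, "end": 31},
    {"entity_group": "ORG", "word": "e", "score": 0.30, "start": 15, "end": 16},
]

grouped = group_entities(sample_results)
print(grouped)  # the low-score ORG placeholder is filtered out
```

With real predictions, `results` from the pipeline call can be passed in place of `sample_results`; a suitable threshold depends on the application and would need to be tuned.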
