Kurdish NER with XLM-R

This is a fine-tuned xlm-roberta-base model for Named Entity Recognition (NER) in Kurmanji Kurdish. It was trained on a manually annotated dataset of 7,919 sentences covering news and other text sources. The model identifies the following entity types:

  • PER: Person
  • LOC: Location
  • ORG: Organization

Model Details

  • Base model: xlm-roberta-base (270M parameters)
  • Fine-tuning
    • Epochs: 5
    • Batch size: 8
    • Max seq length: 128 tokens
    • Optimizer: AdamW
    • Learning rate: 2e-5
    • Weight decay: 0.01
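
The settings above can be sketched as a Hugging Face `Trainer` configuration. This is only an illustration: the card does not say whether `Trainer` was used, so `build_training_args`, `tokenize_words`, the output directory name, and the mapping of "Batch size: 8" to a single-device `per_device_train_batch_size` are all assumptions.

```python
# Hyperparameters as listed in the model card, using TrainingArguments keyword names.
# Assumption: training ran on a single device, so the batch size is per-device.
HYPERPARAMS = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 8,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
}
MAX_SEQ_LENGTH = 128  # applied at tokenization time, not via TrainingArguments


def build_training_args(output_dir="ku-ner-xlmr-out"):
    """Build TrainingArguments; output_dir is a hypothetical path."""
    # Imported lazily so this module loads without transformers/torch installed.
    from transformers import TrainingArguments

    # Trainer uses AdamW by default, so no optimizer argument is passed.
    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)


def tokenize_words(tokenizer, words):
    """Tokenize one pre-split sentence, truncating to MAX_SEQ_LENGTH tokens."""
    return tokenizer(
        words,
        is_split_into_words=True,
        truncation=True,
        max_length=MAX_SEQ_LENGTH,
    )
```

Keeping the values in a plain dictionary makes it easy to log them or reuse them in a custom training loop instead of `Trainer`.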

Intended Use

  • Extract named entities from Kurmanji Kurdish text (news, social media, etc.)
  • Aid in information extraction, digital humanities, and low-resource language research

Evaluation Metrics

Final held-out test set: 716 sentences (≈12k tokens)

Metric      Value
Precision   0.8668
Recall      0.8803
F1 Score    0.8735
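
As a quick sanity check, the reported F1 score matches the harmonic mean of the reported precision and recall. The snippet below only recomputes that relationship from the numbers above. It does not reproduce the evaluation, and the card does not state whether the metrics are entity-level or token-level.

```python
# Values copied from the evaluation table in this card.
precision = 0.8668
recall = 0.8803

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 = {f1:.4f}")
```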

Try it Online

👉 Streamlit Demo
Paste a sentence in Kurmanji Kurdish (Latin script) and explore the model’s predictions in your browser.


How to Use

You can load and use the model via Hugging Face 🤗 Transformers:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
model_id = "akam-ot/ku-ner-xlmr"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# aggregation_strategy="simple" merges subword tokens into whole entity spans
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

sentence = "Navê min Hejar e û ez li Hewlêr dijîm."
results = ner(sentence)

# Print each detected entity with its type and confidence score
for ent in results:
    print(f"{ent['word']} -> {ent['entity_group']} (score: {ent['score']:.2f})")
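
For downstream information extraction it is often handy to collect the predictions by entity type. The following is a minimal sketch: `group_by_type` and its `min_score` threshold are hypothetical helpers, not part of the model or of Transformers, and the `sample` list is hand-written in the shape the aggregated pipeline returns, not real model output.

```python
from collections import defaultdict


def group_by_type(results, min_score=0.0):
    """Map each entity type (PER, LOC, ORG) to the entity strings found."""
    grouped = defaultdict(list)
    for ent in results:
        # Skip predictions below the (hypothetical) confidence threshold.
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# Illustrative, hand-written input in the pipeline's output shape
# (scores are made up; this is not real model output).
sample = [
    {"entity_group": "PER", "word": "Hejar", "score": 0.99, "start": 9, "end": 14},
    {"entity_group": "LOC", "word": "Hewlêr", "score": 0.98, "start": 25, "end": 31},
]

grouped = group_by_type(sample, min_score=0.5)
print(grouped)
```

In practice, pass the `results` list returned by `ner(sentence)` instead of `sample`.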