Kurdish NER with XLM-R
This is a fine-tuned xlm-roberta-base model for Named Entity Recognition (NER) in Kurmanji Kurdish. It was trained on a manually annotated dataset of 7,919 sentences covering news and other text sources. The model identifies the following entity types:
- PER: Person
- LOC: Location
- ORG: Organization
Model Details
- Base model: xlm-roberta-base (270M parameters)
- Fine-tuning:
  - Epochs: 5
  - Batch size: 8
  - Max seq length: 128 tokens
  - Optimizer: AdamW
  - Learning rate: 2e-5
  - Weight decay: 0.01
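For anyone trying to reproduce a similar run, the hyperparameters listed above can be written down as plain keyword arguments. This is only a sketch: the card does not say which training loop was used, so mapping the values onto the field names of `transformers.TrainingArguments` is an assumption, as are the `output_dir` value and the single-device reading of "batch size 8".

```python
# Hyperparameters from the model card, keyed by the matching
# transformers.TrainingArguments field names (an assumed mapping).
training_kwargs = {
    "output_dir": "ku-ner-xlmr-out",    # hypothetical path, not from the card
    "num_train_epochs": 5,
    "per_device_train_batch_size": 8,   # assumes training on a single device
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
}

# The 128-token limit is applied at tokenization time, not in the trainer.
tokenizer_kwargs = {"truncation": True, "max_length": 128}

# With transformers installed, these could be passed on as:
#   from transformers import TrainingArguments
#   args = TrainingArguments(**training_kwargs)
# Trainer uses AdamW by default, which matches the optimizer listed above.
for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```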
Intended Use
- Extract named entities from Kurmanji Kurdish text (news, social media, etc.)
- Aid in information extraction, digital humanities, and low-resource language research
Evaluation Metrics
Final held-out test set: 716 sentences (≈12k tokens)
| Metric | Value |
|---|---|
| Precision | 0.8668 |
| Recall | 0.8803 |
| F1 Score | 0.8735 |
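As a quick consistency check, the reported F1 can be recomputed from the precision and recall in the table. The sketch below assumes the standard definition of F1 as the harmonic mean of the two; the card does not state whether the scores are entity-level or token-level, or how they are averaged across entity types.

```python
# Values copied from the evaluation table.
precision = 0.8668
recall = 0.8803

# F1 as the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")  # prints "F1 = 0.8735"
```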
Try it Online
👉 Streamlit Demo
Paste a sentence in Kurmanji Kurdish (Latin script) and explore the model’s predictions in your browser.
How to Use
You can load and use the model via Hugging Face 🤗 Transformers:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
model_id = "akam-ot/ku-ner-xlmr"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
sentence = "Navê min Hejar e û ez li Hewlêr dijîm."
results = ner(sentence)
for ent in results:
    print(f"{ent['word']} → {ent['entity_group']} (score: {ent['score']:.2f})")
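With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries with the keys `entity_group`, `score`, `word`, `start`, and `end`. A common next step is to group the spans by entity type and drop low-confidence predictions. The helper below is a minimal sketch of that idea: `group_entities` and `min_score` are hypothetical names, and the sample list is hand-written in the pipeline's output shape with invented scores, not real model output.

```python
from collections import defaultdict

def group_entities(results, min_score=0.0):
    """Group aggregated NER spans by entity type, keeping scores >= min_score."""
    grouped = defaultdict(list)
    for ent in results:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)

# Hand-written sample in the pipeline's output shape; scores are invented
# for illustration and are NOT actual predictions of the model.
sample = [
    {"entity_group": "PER", "score": 0.99, "word": "Hejar", "start": 9, "end": 14},
    {"entity_group": "LOC", "score": 0.42, "word": "Hewlêr", "start": 25, "end": 31},
]

print(group_entities(sample, min_score=0.5))  # {'PER': ['Hejar']}
```

In real use, `sample` would be replaced by the `results` list returned by `ner(sentence)`; a suitable threshold depends on the application and is best chosen on held-out data.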