NagaNLP NER (XLM-RoBERTa)

NagaNLP-NER is a Named Entity Recognition model for Nagamese (Naga Pidgin). It is a fine-tuned XLM-RoBERTa model trained to identify four entity types: persons (PER), locations (LOC), organizations (ORG), and miscellaneous entities (MISC).

This model is part of the NagaNLP project, which aims to provide foundational NLP resources for the low-resource languages of Nagaland.

Model Details

Developer: Agniva Maiti
Base Architecture: XLM-RoBERTa Base
Task: Token Classification (NER)
Language: Nagamese (nag)
Dataset: agnivamaiti/naganlp-ner-annotated-corpus

Training Data

The model was fine-tuned on a manually annotated corpus containing 214 sentences (approx. 4,800 tokens).

Source: NagaNLP Conversational Corpus subset.
Tags: CoNLL-2003 format (PER, LOC, ORG, MISC).
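To make the tag format concrete, here is a minimal sketch of how a CoNLL-style tag sequence could be collapsed into entity spans. It assumes the corpus uses the common B-/I-/O prefix scheme (IOB2), which this card does not state explicitly; the helper name `bio_to_spans` and the example tags are illustrative and not taken from the corpus.

```python
from typing import List, Tuple

# (start_token, end_token_exclusive, label)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Collapse a BIO tag sequence into (start, end, label) spans."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # "B-PER" splits into prefix "B" and type "PER"; "O" has an empty type.
        prefix, _, ent_type = tag.partition("-")
        inside = prefix == "I" and ent_type == label
        if not inside:
            # Close any open span before handling the current tag.
            if label is not None:
                spans.append((start, i, label))
            # B- opens a new span; a stray I- is leniently treated as an opening too.
            if prefix in ("B", "I"):
                start, label = i, ent_type
            else:
                start, label = None, None
    # Close a span that runs to the end of the sentence.
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


# Hypothetical tagging of one sentence (illustrative only).
tokens = ["Etu", "retreating", "monsoon", "normally", "October", "mahina", "start", "hoi", "."]
tags = ["O", "O", "B-MISC", "O", "B-MISC", "O", "O", "O", "O"]
for start, end, label in bio_to_spans(tags):
    print(label, tokens[start:end])
```

If the corpus turns out to use a different scheme (for example IOB1 or plain type labels without prefixes), the prefix handling would need to be adjusted.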

Intended Use

This model is intended for:

Extracting entities from Nagamese text.
Benchmarking multilingual models (like XLM-R) on extremely low-resource creole languages.
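For the benchmarking use case, NER systems are usually compared with entity-level precision, recall, and F1 under exact span matching. The sketch below shows one way to compute these scores in plain Python. It assumes gold and predicted entities have already been reduced to `(sentence_id, start, end, label)` tuples; the function name `span_prf` and the example spans are hypothetical, and this is not necessarily the evaluation procedure used for this model.

```python
from typing import Dict, Set, Tuple

# (sentence_id, start_token, end_token_exclusive, label)
Span = Tuple[int, int, int, str]


def span_prf(gold: Set[Span], pred: Set[Span]) -> Dict[str, float]:
    """Exact-match entity-level precision, recall, and F1."""
    # A prediction counts only if boundaries AND label both match a gold span.
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up spans: one exact match, one span with the wrong label.
gold = {(0, 0, 2, "PER"), (0, 5, 6, "LOC")}
pred = {(0, 0, 2, "PER"), (0, 5, 6, "ORG")}
print(span_prf(gold, pred))
```

With a test set as small as the one available here, scores computed this way can shift noticeably from a handful of entities, so differences between models should be read with caution.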

How to Get Started

You can use this model with the Hugging Face pipeline:

from transformers import pipeline

# Load the pipeline
ner_pipeline = pipeline("ner", model="agnivamaiti/naganlp-ner", aggregation_strategy="simple")

# Inference
text = "Etu retreating monsoon normally October mahina start hoi."
results = ner_pipeline(text)

# Print results
for entity in results:
    print(entity)
# Expected output (one dict printed per entity), for example:
# {'entity_group': 'MISC', 'word': 'monsoon', ...}
# {'entity_group': 'MISC', 'word': 'October', ...}
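With `aggregation_strategy="simple"`, each result is a dict with the keys `entity_group`, `score`, `word`, `start`, and `end`. As a possible post-processing step, the sketch below groups entities by type and drops low-confidence predictions. The helper `group_entities`, the 0.5 threshold, and the scores in `mock_results` are assumptions for illustration, not actual model output.

```python
from collections import defaultdict
from typing import Dict, List


def group_entities(results: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Group pipeline entities by type, keeping those at or above min_score."""
    grouped = defaultdict(list)
    for entity in results:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Mock results in the shape the pipeline returns; scores are invented.
mock_results = [
    {"entity_group": "MISC", "score": 0.91, "word": "monsoon", "start": 15, "end": 22},
    {"entity_group": "MISC", "score": 0.42, "word": "October", "start": 32, "end": 39},
]
print(group_entities(mock_results))
```

In real use, `mock_results` would be replaced by the list returned from `ner_pipeline(text)`. A suitable threshold depends on the application and would need to be tuned on held-out data.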

Limitations

Data Scarcity: Trained on a very small dataset (214 sentences). It serves as a baseline proof-of-concept and may struggle with vocabulary not seen during training.
Generalization: May perform poorly on dialects significantly different from the training corpus (Kohima/Dimapur standard).