Nyishi OCR

OCR model for the Nyishi language of Arunachal Pradesh, India. Developed by MWire Labs as part of the Northeast India OCR initiative.

Model Details

  • Architecture: DocTR ViTSTR-Base (85.3M parameters)
  • Script: Latin
  • Language: Nyishi (Nishi), spoken by ~300,000 people in Arunachal Pradesh
  • License: CC-BY-4.0

Performance

Split        Char Accuracy
Validation   94.60%
Test         95.51%
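
The card does not say how character accuracy was computed. A common convention is one minus the character error rate, where errors are counted with Levenshtein edit distance over all reference characters. The sketch below assumes that definition; the function names are illustrative and are not taken from the MWire Labs evaluation code.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(predictions, references) -> float:
    """Assumed metric: 1 - (total edits / total reference characters)."""
    total_edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    # Clamp at 0 because edits can exceed the reference length.
    return max(0.0, 1.0 - total_edits / total_chars)
```

Under this definition, one wrong character in a three-character reference gives an accuracy of about 66.7%. Other definitions (for example, exact position-wise matching) would yield different numbers, so the reported figures should only be compared against results computed the same way.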

Training Data

Fine-tuned on 84k unique images, a mix of synthetic images and 5k curated images. The synthetic images were generated with 21 fonts and augmented with blur, noise, rotation, and brightness variation.
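
As a rough illustration of how such a synthetic pipeline might pick a font and augmentation settings for each image, the sketch below samples one recipe per sample. Only the font count (21) and the four augmentation types come from this card; the class name, field names, and every numeric range are hypothetical placeholders, not the values used in training.

```python
import random
from dataclasses import dataclass

NUM_FONTS = 21  # from the model card; the font files themselves are not listed


@dataclass(frozen=True)
class AugmentationParams:
    font_index: int      # which of the 21 fonts to render with
    blur_radius: float   # blur strength in pixels (hypothetical range)
    noise_sigma: float   # noise standard deviation (hypothetical range)
    rotation_deg: float  # small rotation in degrees (hypothetical range)
    brightness: float    # multiplicative brightness factor (hypothetical range)


def sample_params(rng: random.Random) -> AugmentationParams:
    """Draw one random rendering and augmentation recipe."""
    return AugmentationParams(
        font_index=rng.randrange(NUM_FONTS),
        blur_radius=rng.uniform(0.0, 1.5),
        noise_sigma=rng.uniform(0.0, 10.0),
        rotation_deg=rng.uniform(-3.0, 3.0),
        brightness=rng.uniform(0.7, 1.3),
    )


# A seeded generator makes the synthetic set reproducible.
rng = random.Random(0)
recipes = [sample_params(rng) for _ in range(5)]
```

Each recipe would then be handed to an image library to render the text and apply the effects. That rendering step is omitted here because the card does not describe the tooling.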

Usage

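import json

import torch
from doctr.models import vitstr_base

# Load the character set the model was trained with
with open("nyishi_charset.json", encoding="utf-8") as f:
    charset = json.load(f)["charset"]

# Build the architecture without pretrained weights, then load the fine-tuned checkpoint
model = vitstr_base(pretrained=False, vocab=charset)
model.load_state_dict(torch.load("nyishi_doctr_best.pt", map_location="cpu"))
model.eval()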
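
Once the model is loaded, one way to run it is through docTR's `recognition_predictor`, which handles resizing and normalization. The sketch below assumes the inputs are already cropped to single words or lines, since a recognition model does not locate text on a page. The helper names and the 0.5 threshold are illustrative and not part of this model's release.

```python
def read_crops(model, image_paths):
    """Run a loaded recognition model on pre-cropped word or line images."""
    # Imported lazily so keep_confident can be used without docTR installed.
    from doctr.io import DocumentFile
    from doctr.models import recognition_predictor

    crops = DocumentFile.from_images(image_paths)  # list of HxWx3 uint8 arrays
    predictor = recognition_predictor(arch=model, pretrained=False)
    return predictor(crops)  # list of (text, confidence) pairs


def keep_confident(predictions, threshold=0.5):
    """Drop predictions below a confidence threshold (0.5 is an arbitrary default)."""
    return [(text, conf) for text, conf in predictions if conf >= threshold]
```

A call might look like `keep_confident(read_crops(model, ["word_crop.png"]))`, where `word_crop.png` is a placeholder file name. For full pages, a text detector would need to run first to produce the crops.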

Citation

If you use this model, please cite:

@misc{mwirelabs2026nyishiocr,
  title={NyishiOCR: First OCR Model for the Nyishi Language},
  author={MWire Labs},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/MWirelabs/nyishi-ocr}
}

About MWire Labs

MWire Labs is an AI research organization based in Shillong, Meghalaya, India, specializing in language technology for Northeast India's indigenous languages.
