
+---
+language:
+  - ckb
+  - ar
+  - ur
+license: cc-by-nc-4.0
+tags:
+  - handwritten-text-recognition
+  - kurdish
+  - arabic
+  - urdu
+  - densenet
+  - transformer
+  - pytorch
+  - safetensors
+datasets:
+  - DASTNUS
+  - KHATT
+  - PUCIT
+metrics:
+  - cer
+  - wer
+pipeline_tag: image-to-text
+---
+# KHLR: Kurdish Handwritten Line Recognition
+**A DenseNet121-Transformer Architecture with Constrained Synthetic Line Generation**
+This repository contains the source code, trained models, and vocabularies for Kurdish handwritten line recognition. It also includes models fine-tuned to evaluate cross-dataset generalization on the Arabic (KHATT) and Urdu (PUCIT) handwritten datasets.
+---
+## Repository Structure
+```
+KHLR/
+├── Kurdish-HLR-Model/      # Best Kurdish model (safetensors + config)
+├── Arabic-HLR-Model/         # Fine-tuned on KHATT Arabic dataset
+├── Urdu-HLR-Model/           # Fine-tuned on PUCIT Urdu dataset
+├── Scripts/
+│   ├── train.py                # Main training script
+│   ├── synthetic_line_generator.py  # Recipe-based synthetic line generation
+│   └── inference.py            # Single image / batch inference
+├── Sample/
+│   ├── sample_image.tif        # Example handwritten line image
+│   └── sample_image.txt        # Corresponding ground truth
+├── requirements.txt
+└── README.md
+```
+## Architecture
+| Component | Details |
+|-----------|---------|
+| CNN Backbone | DenseNet-121 (ImageNet pre-trained) |
+| Encoder | 3 Transformer encoder layers |
+| Decoder | 3 Transformer decoder layers |
+| Attention Heads | 8 |
+| Hidden Size | 256 |
+| Feed-Forward Dim | 1024 |
+| Total Parameters | ~12.8M |
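The hyperparameters in the table can be collected in a small configuration object, which makes derived quantities such as the per-head attention dimension explicit. This is a minimal sketch: the class name `KHLRConfig` and its field names are hypothetical and are not taken from `Scripts/train.py`; only the values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KHLRConfig:
    """Hyperparameters copied from the architecture table (hypothetical class)."""

    backbone: str = "densenet121"
    encoder_layers: int = 3
    decoder_layers: int = 3
    attention_heads: int = 8
    hidden_size: int = 256
    feed_forward_dim: int = 1024

    def __post_init__(self):
        # Multi-head attention splits the hidden size evenly across heads.
        if self.hidden_size % self.attention_heads != 0:
            raise ValueError("hidden_size must be divisible by attention_heads")

    @property
    def head_dim(self) -> int:
        """Dimension handled by each attention head."""
        return self.hidden_size // self.attention_heads


config = KHLRConfig()
print(config.head_dim)  # 256 // 8 = 32
```

With 8 heads and a hidden size of 256, each head works on 32 dimensions, and the feed-forward layer is four times wider than the hidden size.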
+## Performance
+### Kurdish (DASTNUS)
+| Configuration | CER | WER | CRR (%) |
+|--------------|-----|-----|---------|
+| +AA+SKHL+FHL-50 | 0.0593 | 0.3083 | 94.07 |
+| +AA+SKHL+FHL-50 + 8-gram LM | 0.0534 | 0.2746 | 94.66 |
+### Cross-Dataset Generalization
+| Dataset | Language | CER | WER | CRR (%) |
+|---------|----------|-----|-----|---------|
+| KHATT | Arabic | 0.1135 | 0.4156 | 88.65 |
+| PUCIT | Urdu | 0.0932 | 0.2799 | 90.68 |
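The CRR column in both tables is consistent with `100 × (1 − CER)`. The sketch below shows how such metrics are commonly computed from a reference and a predicted transcription using Levenshtein distance. It is an illustration only: the function names are hypothetical, and the exact normalization used for the reported numbers (for example, corpus-level versus per-line averaging) is not stated in this README.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, using a rolling DP row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)


def wer(ref, hyp):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


def crr_percent(cer_value):
    """Character recognition rate in percent, assuming CRR = 100 * (1 - CER)."""
    return 100.0 * (1.0 - cer_value)


print(cer("abcd", "abed"))        # one substitution over four characters
print(wer("a b c d", "a x c d"))  # one substitution over four words
print(round(crr_percent(0.0593), 2))
```

A single wrong character makes the whole word count as an error, so WER is typically well above CER for the same output, which matches the pattern in the tables.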
+## Installation
+```bash
+git clone https://huggingface.co/karez/KHLR
+cd KHLR
+pip install -r requirements.txt
+```
+## Quick Start
+### Inference
+```bash
+# Single image (using the safetensors checkpoint)
+python Scripts/inference.py \
+    --image Sample/sample_image.tif \
+    --model_path Kurdish-HLR-Model/model.safetensors \
+    --vocab_path Kurdish-HLR-Model/vocab.json
+# Directory of images
+python Scripts/inference.py \
+    --image_dir ./test_images \
+    --model_path Kurdish-HLR-Model/model.safetensors \
+    --vocab_path Kurdish-HLR-Model/vocab.json
+```
+### Training
+```bash
+# Basic training (unique handwritten lines only)
+python Scripts/train.py \
+    --data_dir ./data/DASTNUS \
+    --vocab_path Kurdish-HLR-Model/vocab.json
+# Full training with synthetic lines + writer mixing (best configuration)
+python Scripts/train.py \
+    --data_dir ./data/DASTNUS \
+    --vocab_path Kurdish-HLR-Model/vocab.json \
+    --use_synthetic \
+    --synthetic_dir ./data/Synthetic-Lines \
+    --use_writer_mixing \
+    --fixed_lines_dir ./data/Fixed-Lines \
+    --num_writers 50
+```
+### Synthetic Line Generation
+```bash
+python Scripts/synthetic_line_generator.py \
+    --unique_words_dir ./data/Unique-Words \
+    --person_names_dir ./data/Person-Names \
+    --output_dir ./data/Synthetic-Lines \
+    --training_writers ./writers/Training.txt \
+    --validation_writers ./writers/Validation.txt \
+    --testing_writers ./writers/Testing.txt
+```
+## Models
+| Model | Language | Vocabulary | Format |
+|-------|----------|-----------|--------|
+| Kurdish-HLR-Model | Kurdish (Sorani) | 114 tokens | safetensors |
+| Arabic-HLR-Model | Arabic | 192 tokens (unified) | safetensors |
+| Urdu-HLR-Model | Urdu | 192 tokens (unified) | safetensors |
+The Arabic and Urdu models share a single unified 192-token vocabulary that combines the Kurdish, Arabic, and Urdu token sets, which enables cross-script transfer learning.
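One way to build such a unified vocabulary is to keep the ids of the base (Kurdish) tokens unchanged and append only the tokens that the other languages add. The sketch below illustrates that idea under loud assumptions: it assumes each vocabulary is a token-to-id mapping, the function `build_unified_vocab` and the tiny example vocabularies are hypothetical, and this is not necessarily how the released `vocab.json` files were produced.

```python
def build_unified_vocab(base_vocab, *extra_vocabs):
    """Merge token->id maps, keeping base ids fixed and appending unseen tokens."""
    unified = dict(base_vocab)
    next_id = max(unified.values(), default=-1) + 1
    for vocab in extra_vocabs:
        # Visit tokens in id order so the merged result is deterministic.
        for token in sorted(vocab, key=vocab.get):
            if token not in unified:
                unified[token] = next_id
                next_id += 1
    return unified


# Toy vocabularies for illustration only (not the released ones).
kurdish = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "ک": 3, "ێ": 4}
arabic = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "ك": 3, "ة": 4}
urdu = {"<pad>": 0, "<sos>": 1, "<eos>": 2, "ک": 3, "ے": 4}

unified = build_unified_vocab(kurdish, arabic, urdu)
print(len(unified))  # 5 Kurdish tokens + 2 new Arabic + 1 new Urdu = 8
```

Keeping the base ids stable means the embedding and output rows learned for Kurdish tokens can be reused when fine-tuning, while rows for the appended tokens are newly initialized.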
+## Dataset
+The models were trained using the following subsets of the **DASTNUS** Kurdish handwritten dataset:
+| Data Source | Training | Validation | Testing |
+|-------------|----------|------------|---------|
+| Unique handwritten lines | 3,575 | 655 | 649 |
+| Synthetic handwritten lines | 3,762 | - | - |
+| Fixed-content lines (50 writers) | 512 | - | - |
+| **Total** | **7,849** | **655** | **649** |
+The data used in this research is available upon request for non-commercial scientific research purposes only.
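As a quick sanity check, the training total in the table can be recomputed from its three rows, along with each source's share of the training set. The dictionary keys are illustrative labels; the counts are copied from the table.

```python
# Training line counts copied from the dataset table.
training = {"unique": 3575, "synthetic": 3762, "fixed_content": 512}

total_training = sum(training.values())
for name, count in training.items():
    share = 100 * count / total_training
    print(f"{name}: {count} lines ({share:.1f}% of training)")

print("total:", total_training)  # 7849, matching the table
```

Synthetic lines make up a little under half of the training set, while validation and testing use unique handwritten lines only.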
+## Citation
+```bibtex
+[]
+```
+## License
+This repository is released under the CC BY-NC 4.0 license, for **non-commercial scientific research purposes only**.