---
language:
- ckb
- ar
- ur
license: cc-by-nc-4.0
tags:
- handwritten-text-recognition
- kurdish
- arabic
- urdu
- densenet
- transformer
- pytorch
- safetensors
datasets:
- DASTNUS
- KHATT
- PUCIT
metrics:
- cer
- wer
pipeline_tag: image-to-text
---

# KHLR: Kurdish Handwritten Line Recognition

**A DenseNet121-Transformer Architecture with Constrained Synthetic Line Generation**

This repository contains the source code, trained models, and vocabularies for Kurdish handwritten line recognition, with cross-dataset generalization to Arabic (KHATT) and Urdu (PUCIT) handwritten datasets.

---

## Repository Structure

```
KHLR/
├── Kurdish-HLR-Model/                 # Best Kurdish model (safetensors + config)
├── Arabic-HLR-Model/                  # Fine-tuned on KHATT Arabic dataset
├── Urdu-HLR-Model/                    # Fine-tuned on PUCIT Urdu dataset
├── Scripts/
│   ├── train.py                       # Main training script
│   ├── synthetic_line_generator.py    # Recipe-based synthetic line generation
│   └── inference.py                   # Single image / batch inference
├── Sample/
│   ├── sample_image.tif               # Example handwritten line image
│   └── sample_image.txt               # Corresponding ground truth
├── requirements.txt
└── README.md
```

## Architecture

| Component | Details |
|-----------|---------|
| CNN Backbone | DenseNet-121 (ImageNet pre-trained) |
| Encoder | 3 Transformer encoder layers |
| Decoder | 3 Transformer decoder layers |
| Attention Heads | 8 |
| Hidden Size | 256 |
| Feed-Forward Dim | 1024 |
| Total Parameters | ~12.8M |

## Performance

### Kurdish (DASTNUS)

| Configuration | CER | WER | CRR (%) |
|--------------|-----|-----|---------|
| +AA+SKHL+FHL-50 | 0.0593 | 0.3083 | 94.07 |
| +AA+SKHL+FHL-50 + 8-gram LM | 0.0534 | 0.2746 | 94.66 |

### Cross-Dataset Generalization

| Dataset | Language | CER | WER | CRR (%) |
|---------|----------|-----|-----|---------|
| KHATT | Arabic | 0.1135 | 0.4156 | 88.65 |
| PUCIT | Urdu | 0.0932 | 0.2799 | 90.68 |

## Installation

```bash
git clone https://huggingface.co/karez/KHLR
cd KHLR
pip install -r requirements.txt
```
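After cloning, a quick way to confirm the checkout is complete is to check for the files this README refers to. The sketch below is a hypothetical helper, not part of the repository: the paths are taken from the repository tree and the Quick Start commands, while `EXPECTED_FILES` and `missing_files` are names introduced here for illustration.

```python
from pathlib import Path

# Paths taken from the repository tree and the Quick Start commands in this README.
# The list is illustrative and may not cover every file the scripts need.
EXPECTED_FILES = [
    "Kurdish-HLR-Model/model.safetensors",
    "Kurdish-HLR-Model/vocab.json",
    "Scripts/train.py",
    "Scripts/synthetic_line_generator.py",
    "Scripts/inference.py",
    "Sample/sample_image.tif",
    "Sample/sample_image.txt",
    "requirements.txt",
]


def missing_files(repo_root, expected=EXPECTED_FILES):
    """Return the expected relative paths that are not regular files under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from inside the cloned KHLR directory.
    missing = missing_files(".")
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected files are present.")
```

The check only tests for existence. If the model weights are stored with Git LFS and `git-lfs` is not installed, `model.safetensors` may be present as a small pointer file rather than the actual weights, which this check would not catch.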
## Quick Start

### Inference

```bash
# Single image (using the safetensors checkpoint)
python Scripts/inference.py \
    --image Sample/sample_image.tif \
    --model_path Kurdish-HLR-Model/model.safetensors \
    --vocab_path Kurdish-HLR-Model/vocab.json

# Directory of images
python Scripts/inference.py \
    --image_dir ./test_images \
    --model_path Kurdish-HLR-Model/model.safetensors \
    --vocab_path Kurdish-HLR-Model/vocab.json
```

### Training

```bash
# Basic training (unique handwritten lines only)
python Scripts/train.py \
    --data_dir ./data/DASTNUS \
    --vocab_path Kurdish-HLR-Model/vocab.json

# Full training with synthetic lines + writer mixing (best configuration)
python Scripts/train.py \
    --data_dir ./data/DASTNUS \
    --vocab_path Kurdish-HLR-Model/vocab.json \
    --use_synthetic \
    --synthetic_dir ./data/Synthetic-Lines \
    --use_writer_mixing \
    --fixed_lines_dir ./data/Fixed-Lines \
    --num_writers 50
```

### Synthetic Line Generation

```bash
python Scripts/synthetic_line_generator.py \
    --unique_words_dir ./data/Unique-Words \
    --person_names_dir ./data/Person-Names \
    --output_dir ./data/Synthetic-Lines \
    --training_writers ./writers/Training.txt \
    --validation_writers ./writers/Validation.txt \
    --testing_writers ./writers/Testing.txt
```

## Models

| Model | Language | Vocabulary | Format |
|-------|----------|-----------|--------|
| Kurdish-HLR-Model | Kurdish (Sorani) | 114 tokens | safetensors |
| Arabic-HLR-Model | Arabic | 192 tokens (unified) | safetensors |
| Urdu-HLR-Model | Urdu | 192 tokens (unified) | safetensors |

The Arabic and Urdu models share a single unified vocabulary that combines the Kurdish, Arabic, and Urdu token sets, which enables cross-script transfer learning.
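To illustrate the idea of a unified vocabulary, here is a minimal sketch of merging several token-to-index mappings so that characters shared across scripts receive a single index. It is not the repository's actual procedure: the format of `vocab.json` is not documented here, so the sketch assumes a plain `{token: index}` mapping, and the function name `build_unified_vocab`, the special tokens, and the toy vocabularies are all hypothetical.

```python
def build_unified_vocab(*vocabs, special_tokens=("<pad>", "<sos>", "<eos>", "<unk>")):
    """Merge token->index vocabularies into one mapping, keeping first-seen order.

    Assumption: each vocabulary is a dict mapping a token to an integer index.
    The special tokens are placeholders chosen for this sketch.
    """
    unified = {}
    # Reserve the lowest indices for the special tokens.
    for tok in special_tokens:
        unified.setdefault(tok, len(unified))
    for vocab in vocabs:
        # Walk each vocabulary in index order so its relative ordering is preserved.
        for tok in sorted(vocab, key=vocab.get):
            # A token already seen in an earlier vocabulary keeps its existing index.
            unified.setdefault(tok, len(unified))
    return unified


# Toy vocabularies (illustrative only; the real ones have 114 and 192 tokens).
kurdish = {"<pad>": 0, "ا": 1, "ب": 2, "ڕ": 3}
arabic = {"<pad>": 0, "ا": 1, "ب": 2, "ث": 3}
urdu = {"<pad>": 0, "ا": 1, "ٹ": 2}

unified = build_unified_vocab(kurdish, arabic, urdu)
print(len(unified))
print(unified)
```

Because shared characters are counted once, the unified vocabulary is smaller than the sum of the individual ones. Placing the first vocabulary's tokens ahead of the others keeps the indexing consistent for characters the scripts have in common, which is one plausible way to make weights learned on one script reusable for another.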
## Dataset

The models were trained using the following subsets of the **DASTNUS** Kurdish handwritten dataset:

| Data Source | Training | Validation | Testing |
|-------------|----------|------------|---------|
| Unique handwritten lines | 3,575 | 655 | 649 |
| Synthetic handwritten lines | 3,762 | - | - |
| Fixed-content lines (50 writers) | 512 | - | - |
| **Total** | **7,849** | **655** | **649** |

The data used in this research is available upon request for non-commercial scientific research purposes only.

## Citation

```bibtex
[]
```

## License

This repository is released for **non-commercial scientific research purposes only**.