Karez committed
Commit 1bf8c0b · verified · 1 Parent(s): 042bc54

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +168 -0
README.md ADDED
@@ -0,0 +1,168 @@
---
language:
- ckb
- ar
- ur
license: cc-by-nc-4.0
tags:
- handwritten-text-recognition
- kurdish
- arabic
- urdu
- densenet
- transformer
- pytorch
- safetensors
datasets:
- DASTNUS
- KHATT
- PUCIT
metrics:
- cer
- wer
pipeline_tag: image-to-text
---

# KHLR: Kurdish Handwritten Line Recognition

**A DenseNet121-Transformer Architecture with Constrained Synthetic Line Generation**

This repository contains the source code, trained models, and vocabularies for Kurdish handwritten line recognition, with cross-dataset generalization to Arabic (KHATT) and Urdu (PUCIT) handwritten datasets.

---

## Repository Structure

```
KHLR/
├── Kurdish-HLR-Model/              # Best Kurdish model (safetensors + config)
├── Arabic-HLR-Model/               # Fine-tuned on KHATT Arabic dataset
├── Urdu-HLR-Model/                 # Fine-tuned on PUCIT Urdu dataset
├── Scripts/
│   ├── train.py                    # Main training script
│   ├── synthetic_line_generator.py # Recipe-based synthetic line generation
│   └── inference.py                # Single image / batch inference
├── Sample/
│   ├── sample_image.tif            # Example handwritten line image
│   └── sample_image.txt            # Corresponding ground truth
├── requirements.txt
└── README.md
```

## Architecture

| Component | Details |
|-----------|---------|
| CNN Backbone | DenseNet-121 (ImageNet pre-trained) |
| Encoder | 3 Transformer encoder layers |
| Decoder | 3 Transformer decoder layers |
| Attention Heads | 8 |
| Hidden Size | 256 |
| Feed-Forward Dim | 1024 |
| Total Parameters | ~12.8M |

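As a rough sanity check on the table above, the following sketch estimates how many parameters the six Transformer layers contribute. It assumes each layer follows the standard PyTorch `nn.TransformerEncoderLayer` / `nn.TransformerDecoderLayer` parameter layout (biased linear projections, two or three layer norms per layer); the README does not confirm this, and the helper names are hypothetical. Embeddings, output projections, and any final stack norms are deliberately left out.

```python
# Hypothetical helpers: a back-of-the-envelope parameter count, not the author's code.

def attention_params(d_model: int) -> int:
    """Packed Q/K/V projection plus output projection, each with a bias."""
    in_proj = 3 * d_model * d_model + 3 * d_model
    out_proj = d_model * d_model + d_model
    return in_proj + out_proj


def feed_forward_params(d_model: int, d_ff: int) -> int:
    """Two linear layers (d_model -> d_ff -> d_model), each with a bias."""
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)


def layer_norm_params(d_model: int) -> int:
    """Scale and shift vectors of one LayerNorm."""
    return 2 * d_model


def encoder_layer_params(d_model: int, d_ff: int) -> int:
    # Self-attention + feed-forward + two layer norms.
    return (attention_params(d_model)
            + feed_forward_params(d_model, d_ff)
            + 2 * layer_norm_params(d_model))


def decoder_layer_params(d_model: int, d_ff: int) -> int:
    # Self-attention + cross-attention + feed-forward + three layer norms.
    return (2 * attention_params(d_model)
            + feed_forward_params(d_model, d_ff)
            + 3 * layer_norm_params(d_model))


# Values taken from the architecture table.
D_MODEL, D_FF, N_ENC, N_DEC = 256, 1024, 3, 3

transformer_total = (N_ENC * encoder_layer_params(D_MODEL, D_FF)
                     + N_DEC * decoder_layer_params(D_MODEL, D_FF))

print(encoder_layer_params(D_MODEL, D_FF))  # 789760
print(decoder_layer_params(D_MODEL, D_FF))  # 1053440
print(transformer_total)                    # 5529600
```

Under these assumptions the six layers hold about 5.5M parameters, which would leave roughly 7.3M of the reported ~12.8M for the DenseNet-121 backbone, embeddings, and projection layers. The number of attention heads does not change the count, since the heads split the same 256-dimensional projections.
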
## Performance

### Kurdish (DASTNUS)

| Configuration | CER | WER | CRR (%) |
|--------------|-----|-----|---------|
| +AA+SKHL+FHL-50 | 0.0593 | 0.3083 | 94.07 |
| +AA+SKHL+FHL-50 + 8-gram LM | 0.0534 | 0.2746 | 94.66 |

### Cross-Dataset Generalization

| Dataset | Language | CER | WER | CRR (%) |
|---------|----------|-----|-----|---------|
| KHATT | Arabic | 0.1135 | 0.4156 | 88.65 |
| PUCIT | Urdu | 0.0932 | 0.2799 | 90.68 |

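For readers unfamiliar with the metrics, here is a minimal sketch of how CER and WER are commonly computed from Levenshtein edit distance. The CRR column in the tables above matches 100 × (1 − CER) for every row, so the sketch uses that relation. The function names are hypothetical, and how the repository's own scripts aggregate errors (per line or over the whole test set) is not stated in this README.

```python
# Hypothetical helpers illustrating the usual metric definitions.

def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def crr_percent(cer_value: float) -> float:
    """Character recognition rate in percent, taken as 100 * (1 - CER)."""
    return 100.0 * (1.0 - cer_value)


print(cer("kitten", "sitting"))          # 0.5
print(round(crr_percent(0.0593), 2))     # 94.07
```

Note that this sketch counts Unicode code points as characters; Arabic-script text with combining marks may need normalization first, depending on how the ground truth was prepared.
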
## Installation

```bash
git clone https://huggingface.co/karez/KHLR
cd KHLR
pip install -r requirements.txt
```

## Quick Start

### Inference

```bash
# Single image (using the safetensors checkpoint)
python Scripts/inference.py \
    --image Sample/sample_image.tif \
    --model_path Kurdish-HLR-Model/model.safetensors \
    --vocab_path Kurdish-HLR-Model/vocab.json

# Directory of images
python Scripts/inference.py \
    --image_dir ./test_images \
    --model_path Kurdish-HLR-Model/model.safetensors \
    --vocab_path Kurdish-HLR-Model/vocab.json
```

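The `Sample/` folder pairs a line image with a same-named `.txt` ground-truth file. If a test directory follows the same convention (an assumption; the README does not describe the layout of `./test_images`), a small helper like the hypothetical `pair_images_with_ground_truth` below can collect the pairs needed to score predictions afterwards. It is independent of `inference.py` and uses only the standard library.

```python
from pathlib import Path


def pair_images_with_ground_truth(image_dir, image_suffix=".tif"):
    """Return (image_path, ground_truth_text) pairs for images that have
    a same-named .txt file next to them. Hypothetical helper, assuming the
    Sample/ naming convention applies to the whole directory."""
    pairs = []
    for image_path in sorted(Path(image_dir).glob(f"*{image_suffix}")):
        text_path = image_path.with_suffix(".txt")
        if text_path.is_file():
            text = text_path.read_text(encoding="utf-8").strip()
            pairs.append((image_path, text))
    return pairs
```

For example, `pair_images_with_ground_truth("Sample")` would be expected to return a single pair for `sample_image.tif`, assuming both sample files are present in the cloned repository.
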
### Training

```bash
# Basic training (unique handwritten lines only)
python Scripts/train.py \
    --data_dir ./data/DASTNUS \
    --vocab_path Kurdish-HLR-Model/vocab.json

# Full training with synthetic lines + writer mixing (best configuration)
python Scripts/train.py \
    --data_dir ./data/DASTNUS \
    --vocab_path Kurdish-HLR-Model/vocab.json \
    --use_synthetic \
    --synthetic_dir ./data/Synthetic-Lines \
    --use_writer_mixing \
    --fixed_lines_dir ./data/Fixed-Lines \
    --num_writers 50
```

### Synthetic Line Generation

```bash
python Scripts/synthetic_line_generator.py \
    --unique_words_dir ./data/Unique-Words \
    --person_names_dir ./data/Person-Names \
    --output_dir ./data/Synthetic-Lines \
    --training_writers ./writers/Training.txt \
    --validation_writers ./writers/Validation.txt \
    --testing_writers ./writers/Testing.txt
```

## Models

| Model | Language | Vocabulary | Format |
|-------|----------|-----------|--------|
| Kurdish-HLR-Model | Kurdish (Sorani) | 114 tokens | safetensors |
| Arabic-HLR-Model | Arabic | 192 tokens (unified) | safetensors |
| Urdu-HLR-Model | Urdu | 192 tokens (unified) | safetensors |

The Arabic and Urdu models share a unified three-language vocabulary (Kurdish + Arabic + Urdu), which enables cross-script transfer learning.

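One way such a unified vocabulary could be built is sketched below. It assumes each vocabulary is a token-to-id mapping and that tokens shared between languages should appear only once; the actual format of `vocab.json` and the merging procedure used for these models are not documented here, and `build_unified_vocab` is a hypothetical name.

```python
def build_unified_vocab(*vocabs):
    """Merge token->id mappings into one mapping with contiguous ids.

    Tokens keep the order of first appearance, so the first vocabulary
    passed in retains its original ids as long as they are contiguous
    from zero. Hypothetical sketch, not the repository's procedure.
    """
    unified = {}
    for vocab in vocabs:
        # Visit tokens in id order so the result is deterministic.
        for token in sorted(vocab, key=vocab.get):
            if token not in unified:
                unified[token] = len(unified)
    return unified


# Toy vocabularies for illustration only (not the real token sets).
kurdish = {"<pad>": 0, "ا": 1, "ڵ": 2}
arabic = {"<pad>": 0, "ا": 1, "ض": 2}
urdu = {"<pad>": 0, "ا": 1, "ٹ": 2}

unified = build_unified_vocab(kurdish, arabic, urdu)
print(len(unified))  # 5
```

Keeping the ids of the first vocabulary stable is one possible design choice: it would let embedding and output rows learned for Kurdish tokens be reused when fine-tuning on another language, with new rows added only for the extra tokens.
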
## Dataset

The models were trained using the following subsets of the **DASTNUS** Kurdish handwritten dataset:

| Data Source | Training | Validation | Testing |
|-------------|----------|------------|---------|
| Unique handwritten lines | 3,575 | 655 | 649 |
| Synthetic handwritten lines | 3,762 | - | - |
| Fixed-content lines (50 writers) | 512 | - | - |
| **Total** | **7,849** | **655** | **649** |

The data used in this research is available upon request for non-commercial scientific research purposes only.

## Citation

```bibtex
[]
```

## License

This repository is released under the CC BY-NC 4.0 license, for **non-commercial scientific research purposes only**.