Commit 5928fb8 (verified) by Karez · 1 parent: 143b612

Upload folder using huggingface_hub

Urdu-HLR-Model/README.md ADDED
@@ -0,0 +1,79 @@
+ ---
+ language:
+ - ur
+ license: cc-by-nc-4.0
+ tags:
+ - handwritten-text-recognition
+ - urdu
+ - pucit
+ - densenet
+ - transformer
+ - transfer-learning
+ - pytorch
+ - safetensors
+ datasets:
+ - PUCIT
+ - DASTNUS
+ metrics:
+ - cer
+ - wer
+ pipeline_tag: image-to-text
+ ---
+
+ # Urdu Handwritten Text Recognition: DenseNet121-Transformer (Fine-tuned on PUCIT)
+
+ ## Model Description
+ A lightweight DenseNet121-Transformer model for Urdu handwritten line recognition,
+ pre-trained on the Kurdish DASTNUS dataset and fine-tuned on the PUCIT Urdu handwritten dataset.
+ It uses a triple unified vocabulary of 192 tokens covering Kurdish, Arabic, and Urdu text.
+
+ ## Architecture
+ - **CNN Backbone:** DenseNet-121 (pretrained on ImageNet)
+ - **Encoder:** 3 Transformer encoder layers
+ - **Decoder:** 3 Transformer decoder layers
+ - **Attention Heads:** 8
+ - **Hidden Size:** 256
+ - **Parameters:** ~12.8M (`config.json` records a `parameters` value of 14,209,145)
+ - **Vocabulary:** 192 tokens (Triple unified: Kurdish + Arabic + Urdu)
+
+ ## Transfer Learning Pipeline
+ 1. Pre-trained on Kurdish DASTNUS dataset (with unified vocabulary)
+ 2. Fine-tuned on PUCIT Urdu handwritten line dataset
+
+ ## Performance on PUCIT Test Set
+ | Metric | Value |
+ |--------|-------|
+ | CER | 0.0932 |
+ | WER | 0.2799 |
+ | CRR | 90.68% |
+
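The three metrics in the table are related: CER and WER are edit-distance error rates at character and word level, and the reported CRR matches 1 − CER (1 − 0.0932 = 90.68%). The sketch below shows one common way to compute them. It is an illustration, not the author's evaluation script: the function names are hypothetical, and it assumes corpus-level aggregation (total edits divided by total reference length) and character counting by Unicode code point.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="char"):
    """Total edits over total reference length (assumed aggregation scheme)."""
    split = list if unit == "char" else str.split
    edits = length = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units, hyp_units = split(ref), split(hyp)
        edits += edit_distance(ref_units, hyp_units)
        length += len(ref_units)
    return edits / length


cer = error_rate(["abcd"], ["abxd"])                # 1 edit over 4 characters
wer = error_rate(["a b c d"], ["a x c d"], "word")  # 1 edit over 4 words
crr_percent = round(100 * (1 - 0.0932), 2)          # the table's CER expressed as CRR
```

A per-line average of error rates would give different numbers from the corpus-level ratio used here, so the aggregation choice matters when comparing against the table.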
+ ## Training Data
+ - **Pre-training:** DASTNUS Kurdish handwritten dataset
+ - **Fine-tuning:** PUCIT Urdu handwritten dataset (5,554 training, 935 validation, 912 testing)
+
+ ## Usage
+ ```python
+ import json
+
+ from safetensors.torch import load_file
+
+ # Load model weights (a dict mapping parameter names to tensors).
+ # Note: this upload does not include the model class; the weights must be
+ # loaded into a matching DenseNet121-Transformer implementation.
+ state_dict = load_file("model.safetensors")
+
+ # Load config (contains non-ASCII text, so read it as UTF-8)
+ with open("config.json", "r", encoding="utf-8") as f:
+     config = json.load(f)
+
+ # Load vocabulary (character -> token index)
+ with open("vocab.json", "r", encoding="utf-8") as f:
+     vocab = json.load(f)
+
+ # Load full unified vocabulary info (token list, both mappings, overlap statistics)
+ with open("unified_vocabulary.json", "r", encoding="utf-8") as f:
+     unified_vocab = json.load(f)
+ ```
+
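Because the vocabulary is shipped in several JSON files, it can help to cross-check them when loading. The following is a minimal sketch under stated assumptions: `load_sidecars` is a hypothetical helper (not part of this repository), and it assumes the three files sit in one directory under the names used in this commit.

```python
import json
from pathlib import Path


def load_sidecars(model_dir):
    """Load config.json, vocab.json and idx_to_char.json and check they agree."""
    model_dir = Path(model_dir)

    def read(name):
        with open(model_dir / name, "r", encoding="utf-8") as f:
            return json.load(f)

    config = read("config.json")
    vocab = read("vocab.json")              # character -> index
    idx_to_char = read("idx_to_char.json")  # index (as a string key) -> character

    if len(vocab) != config["vocab_size"]:
        raise ValueError("vocab.json size does not match config['vocab_size']")
    for char, idx in vocab.items():
        # JSON object keys are strings, so indices must be looked up as str(idx).
        if idx_to_char.get(str(idx)) != char:
            raise ValueError(f"mapping mismatch at index {idx}")
    return config, vocab, idx_to_char
```

Calling `load_sidecars("Urdu-HLR-Model")` on a local copy of the folder would return the three objects, or raise if the mappings disagree.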
+ ## Citation
+ []
+
+ ## License
+ This model is released under the CC BY-NC 4.0 license, for non-commercial scientific research purposes only.
Urdu-HLR-Model/config.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "architecture": "DenseNet121-Transformer",
+   "model_type": "custom",
+   "task": "handwritten-text-recognition",
+   "language": "Urdu",
+   "script": "Arabic",
+   "transfer_learning": "Kurdish (DASTNUS) → Urdu (PUCIT)",
+   "vocabulary": "Triple unified (Kurdish + Arabic + Urdu)",
+   "hidden_size": 256,
+   "num_encoder_layers": 3,
+   "num_decoder_layers": 3,
+   "num_attention_heads": 8,
+   "feed_forward_dim": 1024,
+   "dropout": 0.4,
+   "vocab_size": 192,
+   "max_sequence_length": 150,
+   "image_height": 96,
+   "image_width": 1235,
+   "cnn_backbone": "densenet121",
+   "parameters": 14209145,
+   "training": {
+     "best_val_cer": 0.061720455254131584,
+     "best_val_loss": null,
+     "best_epoch": 74,
+     "optimizer": "AdamW",
+     "learning_rate": 0.0005,
+     "batch_size": 16,
+     "pretrained_from": "Kurdish DASTNUS (unified vocab)"
+   }
+ }
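A few quantities follow directly from these hyperparameters. The sketch below derives them from a copy of the relevant config values; `derived_shapes` is a hypothetical helper, and it assumes standard multi-head attention, where the hidden size is split evenly across heads.

```python
# Values copied from config.json above.
config = {
    "hidden_size": 256,
    "num_attention_heads": 8,
    "feed_forward_dim": 1024,
    "image_height": 96,
    "image_width": 1235,
}


def derived_shapes(cfg):
    """Derive per-head width, feed-forward expansion and input aspect ratio."""
    d_model, heads = cfg["hidden_size"], cfg["num_attention_heads"]
    if d_model % heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return {
        "head_dim": d_model // heads,
        "ffn_expansion": cfg["feed_forward_dim"] / d_model,
        "aspect_ratio": cfg["image_width"] / cfg["image_height"],
    }


shapes = derived_shapes(config)
print(shapes["head_dim"], shapes["ffn_expansion"])
```

With these values each attention head works on 32 dimensions and the feed-forward layer is four times the hidden size, which is the conventional Transformer ratio.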
Urdu-HLR-Model/idx_to_char.json ADDED
@@ -0,0 +1,194 @@
1
+ {
2
+ "0": "<PAD>",
3
+ "1": "<SOS>",
4
+ "2": "<EOS>",
5
+ "3": " ",
6
+ "4": "!",
7
+ "5": "\"",
8
+ "6": "#",
9
+ "7": "%",
10
+ "8": "&",
11
+ "9": "'",
12
+ "10": "(",
13
+ "11": ")",
14
+ "12": "*",
15
+ "13": "+",
16
+ "14": ",",
17
+ "15": "-",
18
+ "16": ".",
19
+ "17": "/",
20
+ "18": "0",
21
+ "19": "1",
22
+ "20": "2",
23
+ "21": "3",
24
+ "22": "4",
25
+ "23": "5",
26
+ "24": "6",
27
+ "25": "7",
28
+ "26": "8",
29
+ "27": "9",
30
+ "28": ":",
31
+ "29": ";",
32
+ "30": "=",
33
+ "31": ">",
34
+ "32": "?",
35
+ "33": "@",
36
+ "34": "A",
37
+ "35": "B",
38
+ "36": "C",
39
+ "37": "D",
40
+ "38": "E",
41
+ "39": "F",
42
+ "40": "H",
43
+ "41": "I",
44
+ "42": "K",
45
+ "43": "M",
46
+ "44": "P",
47
+ "45": "R",
48
+ "46": "S",
49
+ "47": "Y",
50
+ "48": "[",
51
+ "49": "\\",
52
+ "50": "]",
53
+ "51": "_",
54
+ "52": "a",
55
+ "53": "b",
56
+ "54": "c",
57
+ "55": "d",
58
+ "56": "e",
59
+ "57": "f",
60
+ "58": "g",
61
+ "59": "h",
62
+ "60": "i",
63
+ "61": "l",
64
+ "62": "m",
65
+ "63": "n",
66
+ "64": "o",
67
+ "65": "p",
68
+ "66": "r",
69
+ "67": "s",
70
+ "68": "t",
71
+ "69": "u",
72
+ "70": "v",
73
+ "71": "w",
74
+ "72": "x",
75
+ "73": "y",
76
+ "74": "{",
77
+ "75": "|",
78
+ "76": "}",
79
+ "77": " ",
80
+ "78": "×",
81
+ "79": "÷",
82
+ "80": "،",
83
+ "81": "؎",
84
+ "82": "ؐ",
85
+ "83": "ؑ",
86
+ "84": "ؒ",
87
+ "85": "ؓ",
88
+ "86": "؛",
89
+ "87": "؟",
90
+ "88": "ء",
91
+ "89": "آ",
92
+ "90": "أ",
93
+ "91": "ؤ",
94
+ "92": "إ",
95
+ "93": "ئ",
96
+ "94": "ا",
97
+ "95": "ب",
98
+ "96": "ة",
99
+ "97": "ت",
100
+ "98": "ث",
101
+ "99": "ج",
102
+ "100": "ح",
103
+ "101": "خ",
104
+ "102": "د",
105
+ "103": "ذ",
106
+ "104": "ر",
107
+ "105": "ز",
108
+ "106": "س",
109
+ "107": "ش",
110
+ "108": "ص",
111
+ "109": "ض",
112
+ "110": "ط",
113
+ "111": "ظ",
114
+ "112": "ع",
115
+ "113": "غ",
116
+ "114": "ـ",
117
+ "115": "ف",
118
+ "116": "ق",
119
+ "117": "ك",
120
+ "118": "ل",
121
+ "119": "م",
122
+ "120": "ن",
123
+ "121": "ه",
124
+ "122": "و",
125
+ "123": "وو",
126
+ "124": "ى",
127
+ "125": "ي",
128
+ "126": "ً",
129
+ "127": "ٌ",
130
+ "128": "ٍ",
131
+ "129": "َ",
132
+ "130": "ُ",
133
+ "131": "ِ",
134
+ "132": "ّ",
135
+ "133": "ْ",
136
+ "134": "ٓ",
137
+ "135": "ٔ",
138
+ "136": "٠",
139
+ "137": "١",
140
+ "138": "٢",
141
+ "139": "٣",
142
+ "140": "٤",
143
+ "141": "٥",
144
+ "142": "٦",
145
+ "143": "٧",
146
+ "144": "٨",
147
+ "145": "٩",
148
+ "146": "٪",
149
+ "147": "٬",
150
+ "148": "ٰ",
151
+ "149": "ٹ",
152
+ "150": "پ",
153
+ "151": "چ",
154
+ "152": "ڈ",
155
+ "153": "ڑ",
156
+ "154": "ڕ",
157
+ "155": "ژ",
158
+ "156": "ڤ",
159
+ "157": "ک",
160
+ "158": "گ",
161
+ "159": "ڵ",
162
+ "160": "ں",
163
+ "161": "ھ",
164
+ "162": "ہ",
165
+ "163": "ۂ",
166
+ "164": "ۃ",
167
+ "165": "ۆ",
168
+ "166": "ی",
169
+ "167": "ێ",
170
+ "168": "ے",
171
+ "169": "ۓ",
172
+ "170": "۔",
173
+ "171": "ە",
174
+ "172": "۰",
175
+ "173": "۱",
176
+ "174": "۲",
177
+ "175": "۳",
178
+ "176": "۴",
179
+ "177": "۵",
180
+ "178": "۷",
181
+ "179": "۹",
182
+ "180": "‌",
183
+ "181": "‎",
184
+ "182": "‏",
185
+ "183": "–",
186
+ "184": "‘",
187
+ "185": "’",
188
+ "186": "“",
189
+ "187": "”",
190
+ "188": "…",
191
+ "189": "ﷲ",
192
+ "190": "ﺅ",
193
+ "191": ""
194
+ }
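This mapping is what turns predicted token indices back into text. A minimal decoding sketch follows, using a small excerpt of the table above. `decode` is a hypothetical helper, and stopping at the first `<EOS>` while skipping `<PAD>` and `<SOS>` is an assumption about how the model's outputs are meant to be read.

```python
# Excerpt of idx_to_char.json; JSON object keys are strings.
idx_to_char = {
    "0": "<PAD>", "1": "<SOS>", "2": "<EOS>", "3": " ",
    "94": "ا", "95": "ب", "150": "پ", "166": "ی",
}


def decode(token_ids, idx_to_char):
    """Map token indices to text, stopping at <EOS> and skipping <PAD>/<SOS>."""
    pieces = []
    for idx in token_ids:
        token = idx_to_char[str(idx)]
        if token == "<EOS>":
            break
        if token in ("<PAD>", "<SOS>"):
            continue
        pieces.append(token)
    return "".join(pieces)


text = decode([1, 95, 94, 95, 2, 0, 0], idx_to_char)
```

The joined string is in logical order; right-to-left display is left to the rendering layer.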
Urdu-HLR-Model/model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8fb2a51ae87b5d215ee7841854208f76f0230633625255137d961e00ae713122
+ size 56952648
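The pointer's `size` can be compared against the `parameters` value in `config.json` as a rough sanity check. The sketch below assumes the weights are stored as 4-byte float32 values, which the commit does not state; `size_breakdown` is a hypothetical helper.

```python
PARAMETERS = 14_209_145  # "parameters" in config.json
FILE_SIZE = 56_952_648   # "size" in the Git LFS pointer, in bytes


def size_breakdown(n_params, file_size, bytes_per_param=4):
    """Split the file size into weight bytes and whatever is left over."""
    weight_bytes = n_params * bytes_per_param  # assumption: float32 storage
    return weight_bytes, file_size - weight_bytes


weight_bytes, remainder = size_breakdown(PARAMETERS, FILE_SIZE)
print(weight_bytes, remainder)
```

Under that assumption the weights account for 56,836,580 bytes, leaving about 116 KB (roughly 0.2% of the file). One plausible reading is that the remainder is the safetensors header plus any tensors not included in the count, but the file itself would have to be inspected to confirm this.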
Urdu-HLR-Model/unified_vocabulary.json ADDED
@@ -0,0 +1,1008 @@
1
+ {
2
+ "dataset": "Unified_Kurdish_Arabic_Urdu",
3
+ "description": "Triple unified vocabulary for Kurdish (DASTNUS), Arabic (KHATT), and Urdu (PUCIT)",
4
+ "vocab_list": [
5
+ "<PAD>",
6
+ "<SOS>",
7
+ "<EOS>",
8
+ " ",
9
+ "!",
10
+ "\"",
11
+ "#",
12
+ "%",
13
+ "&",
14
+ "'",
15
+ "(",
16
+ ")",
17
+ "*",
18
+ "+",
19
+ ",",
20
+ "-",
21
+ ".",
22
+ "/",
23
+ "0",
24
+ "1",
25
+ "2",
26
+ "3",
27
+ "4",
28
+ "5",
29
+ "6",
30
+ "7",
31
+ "8",
32
+ "9",
33
+ ":",
34
+ ";",
35
+ "=",
36
+ ">",
37
+ "?",
38
+ "@",
39
+ "A",
40
+ "B",
41
+ "C",
42
+ "D",
43
+ "E",
44
+ "F",
45
+ "H",
46
+ "I",
47
+ "K",
48
+ "M",
49
+ "P",
50
+ "R",
51
+ "S",
52
+ "Y",
53
+ "[",
54
+ "\\",
55
+ "]",
56
+ "_",
57
+ "a",
58
+ "b",
59
+ "c",
60
+ "d",
61
+ "e",
62
+ "f",
63
+ "g",
64
+ "h",
65
+ "i",
66
+ "l",
67
+ "m",
68
+ "n",
69
+ "o",
70
+ "p",
71
+ "r",
72
+ "s",
73
+ "t",
74
+ "u",
75
+ "v",
76
+ "w",
77
+ "x",
78
+ "y",
79
+ "{",
80
+ "|",
81
+ "}",
82
+ " ",
83
+ "×",
84
+ "÷",
85
+ "،",
86
+ "؎",
87
+ "ؐ",
88
+ "ؑ",
89
+ "ؒ",
90
+ "ؓ",
91
+ "؛",
92
+ "؟",
93
+ "ء",
94
+ "آ",
95
+ "أ",
96
+ "ؤ",
97
+ "إ",
98
+ "ئ",
99
+ "ا",
100
+ "ب",
101
+ "ة",
102
+ "ت",
103
+ "ث",
104
+ "ج",
105
+ "ح",
106
+ "خ",
107
+ "د",
108
+ "ذ",
109
+ "ر",
110
+ "ز",
111
+ "س",
112
+ "ش",
113
+ "ص",
114
+ "ض",
115
+ "ط",
116
+ "ظ",
117
+ "ع",
118
+ "غ",
119
+ "ـ",
120
+ "ف",
121
+ "ق",
122
+ "ك",
123
+ "ل",
124
+ "م",
125
+ "ن",
126
+ "ه",
127
+ "و",
128
+ "وو",
129
+ "ى",
130
+ "ي",
131
+ "ً",
132
+ "ٌ",
133
+ "ٍ",
134
+ "َ",
135
+ "ُ",
136
+ "ِ",
137
+ "ّ",
138
+ "ْ",
139
+ "ٓ",
140
+ "ٔ",
141
+ "٠",
142
+ "١",
143
+ "٢",
144
+ "٣",
145
+ "٤",
146
+ "٥",
147
+ "٦",
148
+ "٧",
149
+ "٨",
150
+ "٩",
151
+ "٪",
152
+ "٬",
153
+ "ٰ",
154
+ "ٹ",
155
+ "پ",
156
+ "چ",
157
+ "ڈ",
158
+ "ڑ",
159
+ "ڕ",
160
+ "ژ",
161
+ "ڤ",
162
+ "ک",
163
+ "گ",
164
+ "ڵ",
165
+ "ں",
166
+ "ھ",
167
+ "ہ",
168
+ "ۂ",
169
+ "ۃ",
170
+ "ۆ",
171
+ "ی",
172
+ "ێ",
173
+ "ے",
174
+ "ۓ",
175
+ "۔",
176
+ "ە",
177
+ "۰",
178
+ "۱",
179
+ "۲",
180
+ "۳",
181
+ "۴",
182
+ "۵",
183
+ "۷",
184
+ "۹",
185
+ "‌",
186
+ "‎",
187
+ "‏",
188
+ "–",
189
+ "‘",
190
+ "’",
191
+ "“",
192
+ "”",
193
+ "…",
194
+ "ﷲ",
195
+ "ﺅ",
196
+ ""
197
+ ],
198
+ "vocab_size": 192,
199
+ "char_to_idx": {
200
+ "<PAD>": 0,
201
+ "<SOS>": 1,
202
+ "<EOS>": 2,
203
+ " ": 3,
204
+ "!": 4,
205
+ "\"": 5,
206
+ "#": 6,
207
+ "%": 7,
208
+ "&": 8,
209
+ "'": 9,
210
+ "(": 10,
211
+ ")": 11,
212
+ "*": 12,
213
+ "+": 13,
214
+ ",": 14,
215
+ "-": 15,
216
+ ".": 16,
217
+ "/": 17,
218
+ "0": 18,
219
+ "1": 19,
220
+ "2": 20,
221
+ "3": 21,
222
+ "4": 22,
223
+ "5": 23,
224
+ "6": 24,
225
+ "7": 25,
226
+ "8": 26,
227
+ "9": 27,
228
+ ":": 28,
229
+ ";": 29,
230
+ "=": 30,
231
+ ">": 31,
232
+ "?": 32,
233
+ "@": 33,
234
+ "A": 34,
235
+ "B": 35,
236
+ "C": 36,
237
+ "D": 37,
238
+ "E": 38,
239
+ "F": 39,
240
+ "H": 40,
241
+ "I": 41,
242
+ "K": 42,
243
+ "M": 43,
244
+ "P": 44,
245
+ "R": 45,
246
+ "S": 46,
247
+ "Y": 47,
248
+ "[": 48,
249
+ "\\": 49,
250
+ "]": 50,
251
+ "_": 51,
252
+ "a": 52,
253
+ "b": 53,
254
+ "c": 54,
255
+ "d": 55,
256
+ "e": 56,
257
+ "f": 57,
258
+ "g": 58,
259
+ "h": 59,
260
+ "i": 60,
261
+ "l": 61,
262
+ "m": 62,
263
+ "n": 63,
264
+ "o": 64,
265
+ "p": 65,
266
+ "r": 66,
267
+ "s": 67,
268
+ "t": 68,
269
+ "u": 69,
270
+ "v": 70,
271
+ "w": 71,
272
+ "x": 72,
273
+ "y": 73,
274
+ "{": 74,
275
+ "|": 75,
276
+ "}": 76,
277
+ " ": 77,
278
+ "×": 78,
279
+ "÷": 79,
280
+ "،": 80,
281
+ "؎": 81,
282
+ "ؐ": 82,
283
+ "ؑ": 83,
284
+ "ؒ": 84,
285
+ "ؓ": 85,
286
+ "؛": 86,
287
+ "؟": 87,
288
+ "ء": 88,
289
+ "آ": 89,
290
+ "أ": 90,
291
+ "ؤ": 91,
292
+ "إ": 92,
293
+ "ئ": 93,
294
+ "ا": 94,
295
+ "ب": 95,
296
+ "ة": 96,
297
+ "ت": 97,
298
+ "ث": 98,
299
+ "ج": 99,
300
+ "ح": 100,
301
+ "خ": 101,
302
+ "د": 102,
303
+ "ذ": 103,
304
+ "ر": 104,
305
+ "ز": 105,
306
+ "س": 106,
307
+ "ش": 107,
308
+ "ص": 108,
309
+ "ض": 109,
310
+ "ط": 110,
311
+ "ظ": 111,
312
+ "ع": 112,
313
+ "غ": 113,
314
+ "ـ": 114,
315
+ "ف": 115,
316
+ "ق": 116,
317
+ "ك": 117,
318
+ "ل": 118,
319
+ "م": 119,
320
+ "ن": 120,
321
+ "ه": 121,
322
+ "و": 122,
323
+ "وو": 123,
324
+ "ى": 124,
325
+ "ي": 125,
326
+ "ً": 126,
327
+ "ٌ": 127,
328
+ "ٍ": 128,
329
+ "َ": 129,
330
+ "ُ": 130,
331
+ "ِ": 131,
332
+ "ّ": 132,
333
+ "ْ": 133,
334
+ "ٓ": 134,
335
+ "ٔ": 135,
336
+ "٠": 136,
337
+ "١": 137,
338
+ "٢": 138,
339
+ "٣": 139,
340
+ "٤": 140,
341
+ "٥": 141,
342
+ "٦": 142,
343
+ "٧": 143,
344
+ "٨": 144,
345
+ "٩": 145,
346
+ "٪": 146,
347
+ "٬": 147,
348
+ "ٰ": 148,
349
+ "ٹ": 149,
350
+ "پ": 150,
351
+ "چ": 151,
352
+ "ڈ": 152,
353
+ "ڑ": 153,
354
+ "ڕ": 154,
355
+ "ژ": 155,
356
+ "ڤ": 156,
357
+ "ک": 157,
358
+ "گ": 158,
359
+ "ڵ": 159,
360
+ "ں": 160,
361
+ "ھ": 161,
362
+ "ہ": 162,
363
+ "ۂ": 163,
364
+ "ۃ": 164,
365
+ "ۆ": 165,
366
+ "ی": 166,
367
+ "ێ": 167,
368
+ "ے": 168,
369
+ "ۓ": 169,
370
+ "۔": 170,
371
+ "ە": 171,
372
+ "۰": 172,
373
+ "۱": 173,
374
+ "۲": 174,
375
+ "۳": 175,
376
+ "۴": 176,
377
+ "۵": 177,
378
+ "۷": 178,
379
+ "۹": 179,
380
+ "‌": 180,
381
+ "‎": 181,
382
+ "‏": 182,
383
+ "–": 183,
384
+ "‘": 184,
385
+ "’": 185,
386
+ "“": 186,
387
+ "”": 187,
388
+ "…": 188,
389
+ "ﷲ": 189,
390
+ "ﺅ": 190,
391
+ "": 191
392
+ },
393
+ "idx_to_char": {
394
+ "0": "<PAD>",
395
+ "1": "<SOS>",
396
+ "2": "<EOS>",
397
+ "3": " ",
398
+ "4": "!",
399
+ "5": "\"",
400
+ "6": "#",
401
+ "7": "%",
402
+ "8": "&",
403
+ "9": "'",
404
+ "10": "(",
405
+ "11": ")",
406
+ "12": "*",
407
+ "13": "+",
408
+ "14": ",",
409
+ "15": "-",
410
+ "16": ".",
411
+ "17": "/",
412
+ "18": "0",
413
+ "19": "1",
414
+ "20": "2",
415
+ "21": "3",
416
+ "22": "4",
417
+ "23": "5",
418
+ "24": "6",
419
+ "25": "7",
420
+ "26": "8",
421
+ "27": "9",
422
+ "28": ":",
423
+ "29": ";",
424
+ "30": "=",
425
+ "31": ">",
426
+ "32": "?",
427
+ "33": "@",
428
+ "34": "A",
429
+ "35": "B",
430
+ "36": "C",
431
+ "37": "D",
432
+ "38": "E",
433
+ "39": "F",
434
+ "40": "H",
435
+ "41": "I",
436
+ "42": "K",
437
+ "43": "M",
438
+ "44": "P",
439
+ "45": "R",
440
+ "46": "S",
441
+ "47": "Y",
442
+ "48": "[",
443
+ "49": "\\",
444
+ "50": "]",
445
+ "51": "_",
446
+ "52": "a",
447
+ "53": "b",
448
+ "54": "c",
449
+ "55": "d",
450
+ "56": "e",
451
+ "57": "f",
452
+ "58": "g",
453
+ "59": "h",
454
+ "60": "i",
455
+ "61": "l",
456
+ "62": "m",
457
+ "63": "n",
458
+ "64": "o",
459
+ "65": "p",
460
+ "66": "r",
461
+ "67": "s",
462
+ "68": "t",
463
+ "69": "u",
464
+ "70": "v",
465
+ "71": "w",
466
+ "72": "x",
467
+ "73": "y",
468
+ "74": "{",
469
+ "75": "|",
470
+ "76": "}",
471
+ "77": " ",
472
+ "78": "×",
473
+ "79": "÷",
474
+ "80": "،",
475
+ "81": "؎",
476
+ "82": "ؐ",
477
+ "83": "ؑ",
478
+ "84": "ؒ",
479
+ "85": "ؓ",
480
+ "86": "؛",
481
+ "87": "؟",
482
+ "88": "ء",
483
+ "89": "آ",
484
+ "90": "أ",
485
+ "91": "ؤ",
486
+ "92": "إ",
487
+ "93": "ئ",
488
+ "94": "ا",
489
+ "95": "ب",
490
+ "96": "ة",
491
+ "97": "ت",
492
+ "98": "ث",
493
+ "99": "ج",
494
+ "100": "ح",
495
+ "101": "خ",
496
+ "102": "د",
497
+ "103": "ذ",
498
+ "104": "ر",
499
+ "105": "ز",
500
+ "106": "س",
501
+ "107": "ش",
502
+ "108": "ص",
503
+ "109": "ض",
504
+ "110": "ط",
505
+ "111": "ظ",
506
+ "112": "ع",
507
+ "113": "غ",
508
+ "114": "ـ",
509
+ "115": "ف",
510
+ "116": "ق",
511
+ "117": "ك",
512
+ "118": "ل",
513
+ "119": "م",
514
+ "120": "ن",
515
+ "121": "ه",
516
+ "122": "و",
517
+ "123": "وو",
518
+ "124": "ى",
519
+ "125": "ي",
520
+ "126": "ً",
521
+ "127": "ٌ",
522
+ "128": "ٍ",
523
+ "129": "َ",
524
+ "130": "ُ",
525
+ "131": "ِ",
526
+ "132": "ّ",
527
+ "133": "ْ",
528
+ "134": "ٓ",
529
+ "135": "ٔ",
530
+ "136": "٠",
531
+ "137": "١",
532
+ "138": "٢",
533
+ "139": "٣",
534
+ "140": "٤",
535
+ "141": "٥",
536
+ "142": "٦",
537
+ "143": "٧",
538
+ "144": "٨",
539
+ "145": "٩",
540
+ "146": "٪",
541
+ "147": "٬",
542
+ "148": "ٰ",
543
+ "149": "ٹ",
544
+ "150": "پ",
545
+ "151": "چ",
546
+ "152": "ڈ",
547
+ "153": "ڑ",
548
+ "154": "ڕ",
549
+ "155": "ژ",
550
+ "156": "ڤ",
551
+ "157": "ک",
552
+ "158": "گ",
553
+ "159": "ڵ",
554
+ "160": "ں",
555
+ "161": "ھ",
556
+ "162": "ہ",
557
+ "163": "ۂ",
558
+ "164": "ۃ",
559
+ "165": "ۆ",
560
+ "166": "ی",
561
+ "167": "ێ",
562
+ "168": "ے",
563
+ "169": "ۓ",
564
+ "170": "۔",
565
+ "171": "ە",
566
+ "172": "۰",
567
+ "173": "۱",
568
+ "174": "۲",
569
+ "175": "۳",
570
+ "176": "۴",
571
+ "177": "۵",
572
+ "178": "۷",
573
+ "179": "۹",
574
+ "180": "‌",
575
+ "181": "‎",
576
+ "182": "‏",
577
+ "183": "–",
578
+ "184": "‘",
579
+ "185": "’",
580
+ "186": "“",
581
+ "187": "”",
582
+ "188": "…",
583
+ "189": "ﷲ",
584
+ "190": "ﺅ",
585
+ "191": ""
586
+ },
587
+ "kurdish_char_count": 112,
588
+ "arabic_char_count": 84,
589
+ "urdu_char_count": 143,
590
+ "all_three_overlap_count": 49,
591
+ "all_three_overlap_chars": [
592
+ " ",
593
+ "!",
594
+ "\"",
595
+ "(",
596
+ ")",
597
+ "*",
598
+ "-",
599
+ ".",
600
+ "/",
601
+ "0",
602
+ "1",
603
+ "2",
604
+ "4",
605
+ ":",
606
+ ";",
607
+ "[",
608
+ "]",
609
+ "x",
610
+ "،",
611
+ "؛",
612
+ "؟",
613
+ "ء",
614
+ "أ",
615
+ "ؤ",
616
+ "ئ",
617
+ "ا",
618
+ "ب",
619
+ "ت",
620
+ "ث",
621
+ "ج",
622
+ "ح",
623
+ "خ",
624
+ "د",
625
+ "ذ",
626
+ "ر",
627
+ "ز",
628
+ "س",
629
+ "ش",
630
+ "ص",
631
+ "ط",
632
+ "ع",
633
+ "غ",
634
+ "ف",
635
+ "ق",
636
+ "ل",
637
+ "م",
638
+ "ن",
639
+ "و",
640
+ "ي"
641
+ ],
642
+ "kurdish_only_count": 28,
643
+ "arabic_only_count": 7,
644
+ "urdu_only_count": 53,
645
+ "kurdish_only_chars": [
646
+ "&",
647
+ "@",
648
+ "C",
649
+ "D",
650
+ "H",
651
+ "_",
652
+ "h",
653
+ "{",
654
+ "|",
655
+ "}",
656
+ "÷",
657
+ "وو",
658
+ "٠",
659
+ "١",
660
+ "٢",
661
+ "٣",
662
+ "٤",
663
+ "٥",
664
+ "٧",
665
+ "٩",
666
+ "٪",
667
+ "ڕ",
668
+ "ڤ",
669
+ "ڵ",
670
+ "ۆ",
671
+ "ێ",
672
+ "ە",
673
+ "‎"
674
+ ],
675
+ "arabic_only_chars": [
676
+ ">",
677
+ "?",
678
+ "\\",
679
+ " ",
680
+ "إ",
681
+ "ٌ",
682
+ "ٍ"
683
+ ],
684
+ "urdu_only_chars": [
685
+ "A",
686
+ "B",
687
+ "E",
688
+ "I",
689
+ "K",
690
+ "M",
691
+ "R",
692
+ "S",
693
+ "Y",
694
+ "b",
695
+ "f",
696
+ "g",
697
+ "i",
698
+ "l",
699
+ "n",
700
+ "r",
701
+ "u",
702
+ "v",
703
+ "w",
704
+ "y",
705
+ "؎",
706
+ "ؐ",
707
+ "ؑ",
708
+ "ؒ",
709
+ "ؓ",
710
+ "ٓ",
711
+ "ٔ",
712
+ "٬",
713
+ "ٰ",
714
+ "ٹ",
715
+ "ڈ",
716
+ "ڑ",
717
+ "ں",
718
+ "ہ",
719
+ "ۂ",
720
+ "ۃ",
721
+ "ے",
722
+ "ۓ",
723
+ "۰",
724
+ "۱",
725
+ "۲",
726
+ "۳",
727
+ "۴",
728
+ "۵",
729
+ "۷",
730
+ "۹",
731
+ "’",
732
+ "“",
733
+ "”",
734
+ "…",
735
+ "ﷲ",
736
+ "ﺅ",
737
+ ""
738
+ ],
739
+ "kurdish_arabic_overlap_count": 60,
740
+ "kurdish_urdu_overlap_count": 73,
741
+ "arabic_urdu_overlap_count": 66,
742
+ "kurdish_arabic_overlap_chars": [
743
+ " ",
744
+ "!",
745
+ "\"",
746
+ "#",
747
+ "%",
748
+ "(",
749
+ ")",
750
+ "*",
751
+ "+",
752
+ "-",
753
+ ".",
754
+ "/",
755
+ "0",
756
+ "1",
757
+ "2",
758
+ "4",
759
+ ":",
760
+ ";",
761
+ "=",
762
+ "[",
763
+ "]",
764
+ "x",
765
+ "×",
766
+ "،",
767
+ "؛",
768
+ "؟",
769
+ "ء",
770
+ "أ",
771
+ "ؤ",
772
+ "ئ",
773
+ "ا",
774
+ "ب",
775
+ "ة",
776
+ "ت",
777
+ "ث",
778
+ "ج",
779
+ "ح",
780
+ "خ",
781
+ "د",
782
+ "ذ",
783
+ "ر",
784
+ "ز",
785
+ "س",
786
+ "ش",
787
+ "ص",
788
+ "ط",
789
+ "ع",
790
+ "غ",
791
+ "ـ",
792
+ "ف",
793
+ "ق",
794
+ "ك",
795
+ "ل",
796
+ "م",
797
+ "ن",
798
+ "ه",
799
+ "و",
800
+ "ى",
801
+ "ي",
802
+ "–"
803
+ ],
804
+ "kurdish_urdu_overlap_chars": [
805
+ " ",
806
+ "!",
807
+ "\"",
808
+ "'",
809
+ "(",
810
+ ")",
811
+ "*",
812
+ "-",
813
+ ".",
814
+ "/",
815
+ "0",
816
+ "1",
817
+ "2",
818
+ "4",
819
+ ":",
820
+ ";",
821
+ "F",
822
+ "P",
823
+ "[",
824
+ "]",
825
+ "a",
826
+ "c",
827
+ "d",
828
+ "e",
829
+ "m",
830
+ "o",
831
+ "p",
832
+ "s",
833
+ "t",
834
+ "x",
835
+ "،",
836
+ "؛",
837
+ "؟",
838
+ "ء",
839
+ "أ",
840
+ "ؤ",
841
+ "ئ",
842
+ "ا",
843
+ "ب",
844
+ "ت",
845
+ "ث",
846
+ "ج",
847
+ "ح",
848
+ "خ",
849
+ "د",
850
+ "ذ",
851
+ "ر",
852
+ "ز",
853
+ "س",
854
+ "ش",
855
+ "ص",
856
+ "ط",
857
+ "ع",
858
+ "غ",
859
+ "ف",
860
+ "ق",
861
+ "ل",
862
+ "م",
863
+ "ن",
864
+ "و",
865
+ "ي",
866
+ "٦",
867
+ "٨",
868
+ "پ",
869
+ "چ",
870
+ "ژ",
871
+ "ک",
872
+ "گ",
873
+ "ھ",
874
+ "ی",
875
+ "۔",
876
+ "‌",
877
+ "‏"
878
+ ],
879
+ "arabic_urdu_overlap_chars": [
880
+ " ",
881
+ "!",
882
+ "\"",
883
+ "(",
884
+ ")",
885
+ "*",
886
+ ",",
887
+ "-",
888
+ ".",
889
+ "/",
890
+ "0",
891
+ "1",
892
+ "2",
893
+ "3",
894
+ "4",
895
+ "5",
896
+ "6",
897
+ "7",
898
+ "8",
899
+ "9",
900
+ ":",
901
+ ";",
902
+ "[",
903
+ "]",
904
+ "x",
905
+ "،",
906
+ "؛",
907
+ "؟",
908
+ "ء",
909
+ "آ",
910
+ "أ",
911
+ "ؤ",
912
+ "ئ",
913
+ "ا",
914
+ "ب",
915
+ "ت",
916
+ "ث",
917
+ "ج",
918
+ "ح",
919
+ "خ",
920
+ "د",
921
+ "ذ",
922
+ "ر",
923
+ "ز",
924
+ "س",
925
+ "ش",
926
+ "ص",
927
+ "ض",
928
+ "ط",
929
+ "ظ",
930
+ "ع",
931
+ "غ",
932
+ "ف",
933
+ "ق",
934
+ "ل",
935
+ "م",
936
+ "ن",
937
+ "و",
938
+ "ي",
939
+ "ً",
940
+ "َ",
941
+ "ُ",
942
+ "ِ",
943
+ "ّ",
944
+ "ْ",
945
+ "‘"
946
+ ],
947
+ "kurdish_arabic_only_count": 11,
948
+ "kurdish_urdu_only_count": 24,
949
+ "arabic_urdu_only_count": 17,
950
+ "kurdish_arabic_only_chars": [
951
+ "#",
952
+ "%",
953
+ "+",
954
+ "=",
955
+ "×",
956
+ "ة",
957
+ "ـ",
958
+ "ك",
959
+ "ه",
960
+ "ى",
961
+ "–"
962
+ ],
963
+ "kurdish_urdu_only_chars": [
964
+ "'",
965
+ "F",
966
+ "P",
967
+ "a",
968
+ "c",
969
+ "d",
970
+ "e",
971
+ "m",
972
+ "o",
973
+ "p",
974
+ "s",
975
+ "t",
976
+ "٦",
977
+ "٨",
978
+ "پ",
979
+ "چ",
980
+ "ژ",
981
+ "ک",
982
+ "گ",
983
+ "ھ",
984
+ "ی",
985
+ "۔",
986
+ "‌",
987
+ "‏"
988
+ ],
989
+ "arabic_urdu_only_chars": [
990
+ ",",
991
+ "3",
992
+ "5",
993
+ "6",
994
+ "7",
995
+ "8",
996
+ "9",
997
+ "آ",
998
+ "ض",
999
+ "ظ",
1000
+ "ً",
1001
+ "َ",
1002
+ "ُ",
1003
+ "ِ",
1004
+ "ّ",
1005
+ "ْ",
1006
+ "‘"
1007
+ ]
1008
+ }
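The overlap statistics in this file are mutually consistent, which can be verified by inclusion-exclusion over the seven disjoint regions of the three character sets. The sketch below copies the counts from the JSON above; the helper name `language_totals` is hypothetical.

```python
# Disjoint regions of the three-set Venn diagram, copied from the file above.
regions = {
    "kurdish_only": 28,
    "arabic_only": 7,
    "urdu_only": 53,
    "kurdish_arabic_only": 11,
    "kurdish_urdu_only": 24,
    "arabic_urdu_only": 17,
    "all_three": 49,
}


def language_totals(r):
    """Rebuild each language's character count from the disjoint regions."""
    return {
        "kurdish": r["kurdish_only"] + r["kurdish_arabic_only"]
                   + r["kurdish_urdu_only"] + r["all_three"],
        "arabic": r["arabic_only"] + r["kurdish_arabic_only"]
                  + r["arabic_urdu_only"] + r["all_three"],
        "urdu": r["urdu_only"] + r["kurdish_urdu_only"]
                + r["arabic_urdu_only"] + r["all_three"],
    }


totals = language_totals(regions)
union_size = sum(regions.values())  # distinct characters across all three sets
vocab_size = union_size + 3         # plus <PAD>, <SOS>, <EOS>
```

The totals reproduce `kurdish_char_count` (112), `arabic_char_count` (84) and `urdu_char_count` (143), and the union of 189 characters plus the three special tokens gives the stated `vocab_size` of 192. The pairwise overlap counts follow the same pattern, for example 11 + 49 = 60 for Kurdish and Arabic.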
Urdu-HLR-Model/vocab.json ADDED
@@ -0,0 +1,194 @@
1
+ {
2
+ "<PAD>": 0,
3
+ "<SOS>": 1,
4
+ "<EOS>": 2,
5
+ " ": 3,
6
+ "!": 4,
7
+ "\"": 5,
8
+ "#": 6,
9
+ "%": 7,
10
+ "&": 8,
11
+ "'": 9,
12
+ "(": 10,
13
+ ")": 11,
14
+ "*": 12,
15
+ "+": 13,
16
+ ",": 14,
17
+ "-": 15,
18
+ ".": 16,
19
+ "/": 17,
20
+ "0": 18,
21
+ "1": 19,
22
+ "2": 20,
23
+ "3": 21,
24
+ "4": 22,
25
+ "5": 23,
26
+ "6": 24,
27
+ "7": 25,
28
+ "8": 26,
29
+ "9": 27,
30
+ ":": 28,
31
+ ";": 29,
32
+ "=": 30,
33
+ ">": 31,
34
+ "?": 32,
35
+ "@": 33,
36
+ "A": 34,
37
+ "B": 35,
38
+ "C": 36,
39
+ "D": 37,
40
+ "E": 38,
41
+ "F": 39,
42
+ "H": 40,
43
+ "I": 41,
44
+ "K": 42,
45
+ "M": 43,
46
+ "P": 44,
47
+ "R": 45,
48
+ "S": 46,
49
+ "Y": 47,
50
+ "[": 48,
51
+ "\\": 49,
52
+ "]": 50,
53
+ "_": 51,
54
+ "a": 52,
55
+ "b": 53,
56
+ "c": 54,
57
+ "d": 55,
58
+ "e": 56,
59
+ "f": 57,
60
+ "g": 58,
61
+ "h": 59,
62
+ "i": 60,
63
+ "l": 61,
64
+ "m": 62,
65
+ "n": 63,
66
+ "o": 64,
67
+ "p": 65,
68
+ "r": 66,
69
+ "s": 67,
70
+ "t": 68,
71
+ "u": 69,
72
+ "v": 70,
73
+ "w": 71,
74
+ "x": 72,
75
+ "y": 73,
76
+ "{": 74,
77
+ "|": 75,
78
+ "}": 76,
79
+ " ": 77,
80
+ "×": 78,
81
+ "÷": 79,
82
+ "،": 80,
83
+ "؎": 81,
84
+ "ؐ": 82,
85
+ "ؑ": 83,
86
+ "ؒ": 84,
87
+ "ؓ": 85,
88
+ "؛": 86,
89
+ "؟": 87,
90
+ "ء": 88,
91
+ "آ": 89,
92
+ "أ": 90,
93
+ "ؤ": 91,
94
+ "إ": 92,
95
+ "ئ": 93,
96
+ "ا": 94,
97
+ "ب": 95,
98
+ "ة": 96,
99
+ "ت": 97,
100
+ "ث": 98,
101
+ "ج": 99,
102
+ "ح": 100,
103
+ "خ": 101,
104
+ "د": 102,
105
+ "ذ": 103,
106
+ "ر": 104,
107
+ "ز": 105,
108
+ "س": 106,
109
+ "ش": 107,
110
+ "ص": 108,
111
+ "ض": 109,
112
+ "ط": 110,
113
+ "ظ": 111,
114
+ "ع": 112,
115
+ "غ": 113,
116
+ "ـ": 114,
117
+ "ف": 115,
118
+ "ق": 116,
119
+ "ك": 117,
120
+ "ل": 118,
121
+ "م": 119,
122
+ "ن": 120,
123
+ "ه": 121,
124
+ "و": 122,
125
+ "وو": 123,
126
+ "ى": 124,
127
+ "ي": 125,
128
+ "ً": 126,
129
+ "ٌ": 127,
130
+ "ٍ": 128,
131
+ "َ": 129,
132
+ "ُ": 130,
133
+ "ِ": 131,
134
+ "ّ": 132,
135
+ "ْ": 133,
136
+ "ٓ": 134,
137
+ "ٔ": 135,
138
+ "٠": 136,
139
+ "١": 137,
140
+ "٢": 138,
141
+ "٣": 139,
142
+ "٤": 140,
143
+ "٥": 141,
144
+ "٦": 142,
145
+ "٧": 143,
146
+ "٨": 144,
147
+ "٩": 145,
148
+ "٪": 146,
149
+ "٬": 147,
150
+ "ٰ": 148,
151
+ "ٹ": 149,
152
+ "پ": 150,
153
+ "چ": 151,
154
+ "ڈ": 152,
155
+ "ڑ": 153,
156
+ "ڕ": 154,
157
+ "ژ": 155,
158
+ "ڤ": 156,
159
+ "ک": 157,
160
+ "گ": 158,
161
+ "ڵ": 159,
162
+ "ں": 160,
163
+ "ھ": 161,
164
+ "ہ": 162,
165
+ "ۂ": 163,
166
+ "ۃ": 164,
167
+ "ۆ": 165,
168
+ "ی": 166,
169
+ "ێ": 167,
170
+ "ے": 168,
171
+ "ۓ": 169,
172
+ "۔": 170,
173
+ "ە": 171,
174
+ "۰": 172,
175
+ "۱": 173,
176
+ "۲": 174,
177
+ "۳": 175,
178
+ "۴": 176,
179
+ "۵": 177,
180
+ "۷": 178,
181
+ "۹": 179,
182
+ "‌": 180,
183
+ "‎": 181,
184
+ "‏": 182,
185
+ "–": 183,
186
+ "‘": 184,
187
+ "’": 185,
188
+ "“": 186,
189
+ "”": 187,
190
+ "…": 188,
191
+ "ﷲ": 189,
192
+ "ﺅ": 190,
193
+ "": 191
194
+ }
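Encoding text with this table is not a plain per-character lookup, because index 123 is the two-character token "وو" and index 191 is an empty string. The sketch below shows one way to handle both, using greedy longest-match over a small excerpt of the vocabulary. This is an assumption for illustration: the commit does not say how the author tokenizes text, `encode` is a hypothetical helper, and whether `max_sequence_length` (150 in `config.json`) counts `<SOS>` and `<EOS>` is also assumed here.

```python
SPECIALS = ("<PAD>", "<SOS>", "<EOS>")

# Excerpt of vocab.json (character -> index).
vocab = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, " ": 3, "د": 102, "و": 122, "وو": 123}


def encode(text, vocab, max_len=150):
    """Greedy longest-match encoding with <SOS>/<EOS> and right padding."""
    # Skip special tokens and empty keys so they can never match input text.
    pieces = {tok: idx for tok, idx in vocab.items() if tok and tok not in SPECIALS}
    longest = max(len(tok) for tok in pieces)
    ids = [vocab["<SOS>"]]
    pos = 0
    while pos < len(text):
        for size in range(min(longest, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if candidate in pieces:
                ids.append(pieces[candidate])
                pos += size
                break
        else:
            raise KeyError(f"character {text[pos]!r} is not in the vocabulary")
    ids.append(vocab["<EOS>"])
    if len(ids) > max_len:
        raise ValueError("encoded sequence exceeds max_len")
    return ids + [vocab["<PAD>"]] * (max_len - len(ids))


double_waw = encode("دوو", vocab, max_len=6)
single_waw = encode("دو", vocab, max_len=5)
```

With longest-match, a doubled "و" becomes the single token 123 instead of two copies of 122. A per-character tokenizer would produce a different sequence, so this choice would need to match whatever was used in training.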