yoriis committed · Commit 910a2f8 · verified · 1 Parent(s): d5a0bac

Add new CrossEncoder model

Files changed (7)
  1. README.md +385 -0
  2. config.json +34 -0
  3. model.safetensors +3 -0
  4. special_tokens_map.json +37 -0
  5. tokenizer.json +0 -0
  6. tokenizer_config.json +94 -0
  7. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,385 @@
+ ---
+ tags:
+ - sentence-transformers
+ - cross-encoder
+ - generated_from_trainer
+ - dataset_size:12128
+ - loss:BinaryCrossEntropyLoss
+ base_model: yoriis/ce-tydi
+ pipeline_tag: text-ranking
+ library_name: sentence-transformers
+ metrics:
+ - accuracy
+ - accuracy_threshold
+ - f1
+ - f1_threshold
+ - precision
+ - recall
+ - average_precision
+ model-index:
+ - name: CrossEncoder based on yoriis/ce-tydi
+   results:
+   - task:
+       type: cross-encoder-classification
+       name: Cross Encoder Classification
+     dataset:
+       name: eval
+       type: eval
+     metrics:
+     - type: accuracy
+       value: 0.9347181008902077
+       name: Accuracy
+     - type: accuracy_threshold
+       value: 0.641675591468811
+       name: Accuracy Threshold
+     - type: f1
+       value: 0.8668639053254438
+       name: F1
+     - type: f1_threshold
+       value: 0.303142249584198
+       name: F1 Threshold
+     - type: precision
+       value: 0.8643067846607669
+       name: Precision
+     - type: recall
+       value: 0.8694362017804155
+       name: Recall
+     - type: average_precision
+       value: 0.9277836243055002
+       name: Average Precision
+ ---
+
+ # CrossEncoder based on yoriis/ce-tydi
+
+ This is a [Cross Encoder](https://www.sbert.net/docs/cross_encoder/usage/usage.html) model finetuned from [yoriis/ce-tydi](https://huggingface.co/yoriis/ce-tydi) using the [sentence-transformers](https://www.SBERT.net) library. It computes scores for pairs of texts, which can be used for text reranking and semantic search.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Cross Encoder
+ - **Base model:** [yoriis/ce-tydi](https://huggingface.co/yoriis/ce-tydi) <!-- at revision adbd5e3122de4b21a7cd3fc2e5f7fa1aed62b1ee -->
+ - **Maximum Sequence Length:** 512 tokens
+ - **Number of Output Labels:** 1 label
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Documentation:** [Cross Encoder Documentation](https://www.sbert.net/docs/cross_encoder/usage/usage.html)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Cross Encoders on Hugging Face](https://huggingface.co/models?library=sentence-transformers&other=cross-encoder)
+
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import CrossEncoder
+
+ # Download from the 🤗 Hub
+ model = CrossEncoder("yoriis/ce-tydi-quqa")
+ # Get scores for pairs of texts
+ pairs = [
+     ['ما وقت حلول صاعقة العذاب بقوم لوط\xa0عليه السلام؟', 'قل إنما حرم ربي الفواحش ما ظهر منها وما بطن والإثم والبغي بغير الحق وأن تشركوا بالله ما لم ينزل به سلطانا وأن تقولوا على الله ما لا تعلمون{33}الأعراف.'],
+     ['ما أول دعاء في القرآن ؟', 'كذلك يوحي إليك وإلى الذين من قبلك الله العزيز الحكيم {3}الشورى'],
+     ['ما هي شروط قبول التوبة؟', 'إن الذين يكفرون بالله ورسله ويريدون أن يفرقوا بين الله ورسله ويقولون نؤمن ببعض ونكفر ببعض ويريدون أن يتخذوا بين ذلك سبيلا{150} أولـئك هم الكافرون حقا وأعتدنا للكافرين عذابا مهينا{151} النساء.'],
+     ['ما هي شروط شهادة لا إله الا الله، وأن محمدا رسول الله ؟', 'ثم تولوا عنه وقالوا معلم مجنون{14} الدخان'],
+     ['ما هي اسماء المدن المذكورة في القرآن؟', 'فلولا كانت قرية آمنت فنفعها إيمانها إلا قوم يونس لما آمنوا كشفنا عنهم عذاب الخزي في الحياة الدنيا ومتعناهم إلى حين{98} يونس'],
+ ]
+ scores = model.predict(pairs)
+ print(scores.shape)
+ # (5,)
+
+ # Or rank different texts based on similarity to a single text
+ ranks = model.rank(
+     'ما وقت حلول صاعقة العذاب بقوم لوط\xa0عليه السلام؟',
+     [
+         'قل إنما حرم ربي الفواحش ما ظهر منها وما بطن والإثم والبغي بغير الحق وأن تشركوا بالله ما لم ينزل به سلطانا وأن تقولوا على الله ما لا تعلمون{33}الأعراف.',
+         'كذلك يوحي إليك وإلى الذين من قبلك الله العزيز الحكيم {3}الشورى',
+         'إن الذين يكفرون بالله ورسله ويريدون أن يفرقوا بين الله ورسله ويقولون نؤمن ببعض ونكفر ببعض ويريدون أن يتخذوا بين ذلك سبيلا{150} أولـئك هم الكافرون حقا وأعتدنا للكافرين عذابا مهينا{151} النساء.',
+         'ثم تولوا عنه وقالوا معلم مجنون{14} الدخان',
+         'فلولا كانت قرية آمنت فنفعها إيمانها إلا قوم يونس لما آمنوا كشفنا عنهم عذاب الخزي في الحياة الدنيا ومتعناهم إلى حين{98} يونس',
+     ]
+ )
+ # [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
+ ```
+
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ ## Evaluation
+
+ ### Metrics
+
+ #### Cross Encoder Classification
+
+ * Dataset: `eval`
+ * Evaluated with [<code>CrossEncoderClassificationEvaluator</code>](https://sbert.net/docs/package_reference/cross_encoder/evaluation.html#sentence_transformers.cross_encoder.evaluation.CrossEncoderClassificationEvaluator)
+
+ | Metric                | Value      |
+ |:----------------------|:-----------|
+ | accuracy              | 0.9347     |
+ | accuracy_threshold    | 0.6417     |
+ | f1                    | 0.8669     |
+ | f1_threshold          | 0.3031     |
+ | precision             | 0.8643     |
+ | recall                | 0.8694     |
+ | **average_precision** | **0.9278** |
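
As a rough illustration of how the two thresholds in the table might be used, the sketch below turns scores into binary labels and rechecks the reported F1 from the precision and recall values in the front matter. The example scores are made up, and treating `score >= threshold` as the positive class is an assumption rather than something the card states.

```python
# Thresholds reported in the metrics table (rounded to four decimals).
ACCURACY_THRESHOLD = 0.6417
F1_THRESHOLD = 0.3031


def label_scores(scores, threshold):
    """Map each score to 1 (relevant) or 0 (not relevant).

    Assumption: a score at or above the threshold counts as positive.
    """
    return [1 if score >= threshold else 0 for score in scores]


# Hypothetical scores in [0, 1]; not produced by the real model.
example_scores = [0.05, 0.35, 0.70]
labels_at_accuracy_threshold = label_scores(example_scores, ACCURACY_THRESHOLD)
labels_at_f1_threshold = label_scores(example_scores, F1_THRESHOLD)

# F1 is the harmonic mean of precision and recall (values from the card).
precision = 0.8643067846607669
recall = 0.8694362017804155
f1 = 2 * precision * recall / (precision + recall)
print(labels_at_accuracy_threshold, labels_at_f1_threshold, round(f1, 4))
```

The lower F1 threshold labels more pairs as positive, which trades some precision for recall.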
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Dataset
+
+ #### Unnamed Dataset
+
+ * Size: 12,128 training samples
+ * Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>label</code>
+ * Approximate statistics based on the first 1000 samples:
+   | | sentence_0 | sentence_1 | label |
+   |:--------|:-----------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------|:---------------------------------------------------------------|
+   | type | string | string | float |
+   | details | <ul><li>min: 9 characters</li><li>mean: 74.83 characters</li><li>max: 659 characters</li></ul> | <ul><li>min: 15 characters</li><li>mean: 130.39 characters</li><li>max: 1279 characters</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.27</li><li>max: 1.0</li></ul> |
+ * Samples:
+   | sentence_0 | sentence_1 | label |
+   |:------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------|
+   | <code>ما وقت حلول صاعقة العذاب بقوم لوط عليه السلام؟</code> | <code>قل إنما حرم ربي الفواحش ما ظهر منها وما بطن والإثم والبغي بغير الحق وأن تشركوا بالله ما لم ينزل به سلطانا وأن تقولوا على الله ما لا تعلمون{33}الأعراف.</code> | <code>0.0</code> |
+   | <code>ما أول دعاء في القرآن ؟</code> | <code>كذلك يوحي إليك وإلى الذين من قبلك الله العزيز الحكيم {3}الشورى</code> | <code>0.0</code> |
+   | <code>ما هي شروط قبول التوبة؟</code> | <code>إن الذين يكفرون بالله ورسله ويريدون أن يفرقوا بين الله ورسله ويقولون نؤمن ببعض ونكفر ببعض ويريدون أن يتخذوا بين ذلك سبيلا{150} أولـئك هم الكافرون حقا وأعتدنا للكافرين عذابا مهينا{151} النساء.</code> | <code>0.0</code> |
+ * Loss: [<code>BinaryCrossEntropyLoss</code>](https://sbert.net/docs/package_reference/cross_encoder/losses.html#binarycrossentropyloss) with these parameters:
+   ```json
+   {
+       "activation_fn": "torch.nn.modules.linear.Identity",
+       "pos_weight": null
+   }
+   ```
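
To make the loss setting concrete: an `Identity` activation with no `pos_weight` suggests the loss is applied to raw logits, while the `Sigmoid` in `config.json` maps logits to scores at inference time. The sketch below is a plain-Python version of binary cross-entropy on a logit, written under that assumption. It is not the library's implementation, and the function names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(logit: float, label: float) -> float:
    """Binary cross-entropy computed directly on a logit.

    Uses the numerically stable form
    max(x, 0) - x * y + log(1 + exp(-|x|)),
    which equals -[y * log(s) + (1 - y) * log(1 - s)] with s = sigmoid(x).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


# A confident, correct positive prediction gives a small loss.
stable_loss = bce_with_logits(2.0, 1.0)
naive_loss = -math.log(sigmoid(2.0))

# An undecided logit of 0 costs log(2) regardless of the label.
undecided_loss = bce_with_logits(0.0, 0.0)
print(stable_loss, naive_loss, undecided_loss)
```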
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `eval_strategy`: steps
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `num_train_epochs`: 4
+ - `fp16`: True
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: steps
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1
+ - `num_train_epochs`: 4
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.0
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: True
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `hub_revision`: None
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `liger_kernel_config`: None
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: proportional
+
+ </details>
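
The hyperparameters `learning_rate: 5e-05`, `lr_scheduler_type: linear` and `warmup_steps: 0` describe a learning rate that decays linearly to zero. A minimal sketch of that schedule follows, assuming the run lasts 3032 optimizer steps (the last step in the training logs). The helper `linear_lr` is illustrative and not a library function.

```python
BASE_LR = 5e-05      # `learning_rate`
WARMUP_STEPS = 0     # `warmup_steps`
TOTAL_STEPS = 3032   # assumption: final step reported in the training logs


def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step under linear decay."""
    if step < WARMUP_STEPS:
        # Linear ramp-up; unused here because WARMUP_STEPS is 0.
        return BASE_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, TOTAL_STEPS - step)
    return BASE_LR * remaining / max(1, TOTAL_STEPS - WARMUP_STEPS)


lr_start = linear_lr(0)
lr_halfway = linear_lr(1516)
lr_end = linear_lr(3032)
print(lr_start, lr_halfway, lr_end)
```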
+
+ ### Training Logs
+ | Epoch  | Step | Training Loss | eval_average_precision |
+ |:------:|:----:|:-------------:|:----------------------:|
+ | 0.6596 | 500  | 0.3554        | 0.8973                 |
+ | 1.0    | 758  | -             | 0.9116                 |
+ | 1.3193 | 1000 | 0.2635        | 0.9163                 |
+ | 1.9789 | 1500 | 0.2561        | 0.9224                 |
+ | 2.0    | 1516 | -             | 0.9227                 |
+ | 2.6385 | 2000 | 0.2284        | 0.9248                 |
+ | 3.0    | 2274 | -             | 0.9270                 |
+ | 3.2982 | 2500 | 0.2316        | 0.9275                 |
+ | 3.9578 | 3000 | 0.2068        | 0.9278                 |
+ | 4.0    | 3032 | -             | 0.9278                 |
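
The step and epoch columns follow from the dataset size and batch size. The short check below reproduces them, assuming a single device (so the effective batch size equals `per_device_train_batch_size`) together with `gradient_accumulation_steps: 1`.

```python
import math

train_samples = 12128   # training set size from the card
batch_size = 16         # `per_device_train_batch_size`, single device assumed
num_epochs = 4          # `num_train_epochs`

# With `dataloader_drop_last: False`, a partial final batch would still count.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Fractional epoch at the logged steps 500 and 1000.
epoch_at_500 = round(500 / steps_per_epoch, 4)
epoch_at_1000 = round(1000 / steps_per_epoch, 4)
print(steps_per_epoch, total_steps, epoch_at_500, epoch_at_1000)
```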
+
+
+ ### Framework Versions
+ - Python: 3.11.13
+ - Sentence Transformers: 4.1.0
+ - Transformers: 4.54.0
+ - PyTorch: 2.6.0+cu124
+ - Accelerate: 1.9.0
+ - Datasets: 4.0.0
+ - Tokenizers: 0.21.2
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,34 @@
+ {
+   "architectures": [
+     "BertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "sentence_transformers": {
+     "activation_fn": "torch.nn.modules.activation.Sigmoid",
+     "version": "4.1.0"
+   },
+   "torch_dtype": "float32",
+   "transformers_version": "4.54.0",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 64000
+ }
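
As a sanity check on this configuration, the sketch below estimates the parameter count of a standard BERT encoder with a pooler and a single-output classification head, then converts it to bytes at 4 bytes per `float32` value. The layer layout is the usual BERT one and is assumed here rather than read from the checkpoint. The result, about 135.2M parameters or 540,776,452 bytes, falls just under the 540,799,996-byte `model.safetensors` file, with the small remainder plausibly being file metadata.

```python
# Values taken from config.json.
vocab_size = 64000
hidden = 768
intermediate = 3072
num_layers = 12
max_positions = 512
type_vocab = 2
num_labels = 1  # one entry in id2label


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

# Word, position and token-type embeddings, followed by a LayerNorm.
embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

# One encoder layer: Q, K, V and output projections, feed-forward, two LayerNorms.
per_layer = (
    4 * linear(hidden, hidden)
    + layer_norm
    + linear(hidden, intermediate)
    + linear(intermediate, hidden)
    + layer_norm
)

pooler = linear(hidden, hidden)
classifier = linear(hidden, num_labels)

total_params = embeddings + num_layers * per_layer + pooler + classifier
fp32_bytes = total_params * 4
print(total_params, fp32_bytes)
```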
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:23f0942abc5f46f29baceacc92142939117c29176b6319110775a67846d8ff8b
+ size 540799996
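
The three lines above are a Git LFS pointer, not the weights themselves: each line is a key followed by a value. A small sketch of reading such a pointer is shown below. The function name `parse_lfs_pointer` is illustrative and not part of any Git LFS tooling.

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:23f0942abc5f46f29baceacc92142939117c29176b6319110775a67846d8ff8b
size 540799996
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split a pointer file into its version, hash algorithm, digest and size."""
    # Each non-empty line is "<key> <value>", separated by the first space.
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["hash_algo"], pointer["size_bytes"])
```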
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,94 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "5": {
+       "content": "[رابط]",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": true,
+       "special": true
+     },
+     "6": {
+       "content": "[بريد]",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": true,
+       "special": true
+     },
+     "7": {
+       "content": "[مستخدم]",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": true,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": false,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "max_len": 512,
+   "max_length": 512,
+   "model_max_length": 512,
+   "never_split": [
+     "[بريد]",
+     "[مستخدم]",
+     "[رابط]"
+   ],
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff