Hizlan committed · Commit 87631a9 · verified · 1 Parent(s): 956bf31

Upload 7 files

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
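The added line routes `tokenizer.json` through Git LFS, alongside the patterns already present. As a rough illustration, the sketch below uses Python's `fnmatch` to approximate which file names the four patterns visible in this hunk would catch. `fnmatch` is not the gitattributes matcher, and the helper name `is_lfs_tracked` is hypothetical, so this is an approximation for simple globs only.

```python
from fnmatch import fnmatch

# Patterns visible in this hunk of .gitattributes (the full file has more).
LFS_PATTERNS = ["*.zip", "*.zst", "*tfevents*", "tokenizer.json"]


def is_lfs_tracked(filename: str) -> bool:
    """Approximate check: fnmatch only mimics gitattributes for simple globs."""
    return any(fnmatch(filename, pattern) for pattern in LFS_PATTERNS)


for name in ["tokenizer.json", "config.json", "events.out.tfevents.123"]:
    print(name, is_lfs_tracked(name))
```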
README.md CHANGED
@@ -1,3 +1,333 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ tags:
3
+ - sentence-transformers
4
+ - cross-encoder
5
+ - generated_from_trainer
6
+ - dataset_size:610
7
+ - loss:FitMixinLoss
8
+ base_model: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
9
+ pipeline_tag: text-ranking
10
+ library_name: sentence-transformers
11
+ ---
12
+
13
+ # CrossEncoder based on cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
14
+
15
+ This is a [Cross Encoder](https://www.sbert.net/docs/cross_encoder/usage/usage.html) model finetuned from [cross-encoder/mmarco-mMiniLMv2-L12-H384-v1](https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1) using the [sentence-transformers](https://www.SBERT.net) library. It computes scores for pairs of texts, which can be used for text reranking and semantic search.
16
+
17
+ ## Model Details
18
+
19
+ ### Model Description
20
+ - **Model Type:** Cross Encoder
21
+ - **Base model:** [cross-encoder/mmarco-mMiniLMv2-L12-H384-v1](https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1) <!-- at revision 1427fd652930e4ba29e8149678df786c240d8825 -->
22
+ - **Maximum Sequence Length:** 512 tokens
23
+ - **Number of Output Labels:** 1 label
24
+ <!-- - **Training Dataset:** Unknown -->
25
+ <!-- - **Language:** Unknown -->
26
+ <!-- - **License:** Unknown -->
27
+
28
+ ### Model Sources
29
+
30
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
31
+ - **Documentation:** [Cross Encoder Documentation](https://www.sbert.net/docs/cross_encoder/usage/usage.html)
32
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
33
+ - **Hugging Face:** [Cross Encoders on Hugging Face](https://huggingface.co/models?library=sentence-transformers&other=cross-encoder)
34
+
35
+ ## Usage
36
+
37
+ ### Direct Usage (Sentence Transformers)
38
+
39
+ First install the Sentence Transformers library:
40
+
41
+ ```bash
42
+ pip install -U sentence-transformers
43
+ ```
44
+
45
+ Then you can load this model and run inference.
46
+ ```python
47
+ from sentence_transformers import CrossEncoder
48
+
49
+ # Download from the 🤗 Hub
50
+ model = CrossEncoder("cross_encoder_model_id")
51
+ # Get scores for pairs of texts
52
+ pairs = [
53
+ ['Die Einzellimiten von Ziffer 3 und 5 jedoch dürfen mit der vorliegenden Limite von 35% nicht kumuliert werden.', 'Die Einzellimiten von Ziffer 3 und 5 jedoch dürfen mit der vorliegenden Limite von 35% nicht kumuliert werden.'],
54
+ ['3. Die Fondsleitung darf einschliesslich der Derivate und strukturierten Produkte höchstens 10% des Vermögens eines Teilvermogens in Effekten und Geldmarktinstrumenten desselben Emittenten anlegen.', '3. Die Fondsleitung darf einschliesslich der Derivate und strukturierten Produkte höchstens 10% des Vermögens eines Teilvermogens in Effekten und Geldmarktinstrumenten desselben Emittenten anlegen.'],
55
+ ['Das Teilvermögen BLKB iQ Fund (CH) iQ Responsible Vorsorge Balanced darf bis zu 100% der Anteile des BLKB iQ Fund (CH) iQ Responsible Bond Fund CHF erwerben.', 'Maximalanteil pro Emission: Höchstens 30 Prozent des Fondsvermögens dürfen in Effekten und Geldmarktinstrumenten derselben Emission angelegt werden.'],
56
+ ['e) Geldmarktinstrumente, wenn diese liquide und bewertbar sind sowie an einer Börse oder an einem anderen geregelten, dem Publikum offenstehenden Markt gehandelt werden; Geldmarktinstrumente, die nicht an einer Börse oder an einem anderen geregelten, dem Publikum offenstehenden Markt gehandelt werden, dürfen nur erworben werden, wenn die Emission oder der Emittent Vorschriften über den Gläubigerund den Anlegerschutz unterliegt und wenn die Geldmarktinstrumente von Emittenten gemäss Artikel 74 Absatz 2 KKV begeben oder garantiert sind.', 'Institution: beaufsichtigte Bank oder Institut: Geldmarktinstrumente müssen von einer Bank, einem Effektenhändler oder einem anderen beaufsichtigten Institut begeben oder garantiert sein, das einer Aufsicht untersteht, die der Schweizer Aufsicht gleichwertig ist.'],
57
+ ['d) Termingeschäfte (Futures und Forwards), deren Wert linear vom Wert des Basiswertes abhängt.', 'Derivate: Derivative Finanzinstrumente sind zulässig, wenn ihnen als Basiswerte Anlagen im Sinne von Artikel 70 Absatz 1 Buchstaben a-d, Finanzindizes, Zinssätze, Wechselkurse, Kredite oder Währungen zugrunde liegen.'],
58
+ ]
59
+ scores = model.predict(pairs)
60
+ print(scores.shape)
61
+ # (5,)
62
+
63
+ # Or rank different texts based on similarity to a single text
64
+ ranks = model.rank(
65
+ 'Die Einzellimiten von Ziffer 3 und 5 jedoch dürfen mit der vorliegenden Limite von 35% nicht kumuliert werden.',
66
+ [
67
+ 'Die Einzellimiten von Ziffer 3 und 5 jedoch dürfen mit der vorliegenden Limite von 35% nicht kumuliert werden.',
68
+ '3. Die Fondsleitung darf einschliesslich der Derivate und strukturierten Produkte höchstens 10% des Vermögens eines Teilvermogens in Effekten und Geldmarktinstrumenten desselben Emittenten anlegen.',
69
+ 'Maximalanteil pro Emission: Höchstens 30 Prozent des Fondsvermögens dürfen in Effekten und Geldmarktinstrumenten derselben Emission angelegt werden.',
70
+ 'Institution: beaufsichtigte Bank oder Institut: Geldmarktinstrumente müssen von einer Bank, einem Effektenhändler oder einem anderen beaufsichtigten Institut begeben oder garantiert sein, das einer Aufsicht untersteht, die der Schweizer Aufsicht gleichwertig ist.',
71
+ 'Derivate: Derivative Finanzinstrumente sind zulässig, wenn ihnen als Basiswerte Anlagen im Sinne von Artikel 70 Absatz 1 Buchstaben a-d, Finanzindizes, Zinssätze, Wechselkurse, Kredite oder Währungen zugrunde liegen.',
72
+ ]
73
+ )
74
+ # [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
75
+ ```
76
+
77
+ <!--
78
+ ### Direct Usage (Transformers)
79
+
80
+ <details><summary>Click to see the direct usage in Transformers</summary>
81
+
82
+ </details>
83
+ -->
84
+
85
+ <!--
86
+ ### Downstream Usage (Sentence Transformers)
87
+
88
+ You can finetune this model on your own dataset.
89
+
90
+ <details><summary>Click to expand</summary>
91
+
92
+ </details>
93
+ -->
94
+
95
+ <!--
96
+ ### Out-of-Scope Use
97
+
98
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
99
+ -->
100
+
101
+ <!--
102
+ ## Bias, Risks and Limitations
103
+
104
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
105
+ -->
106
+
107
+ <!--
108
+ ### Recommendations
109
+
110
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
111
+ -->
112
+
113
+ ## Training Details
114
+
115
+ ### Training Dataset
116
+
117
+ #### Unnamed Dataset
118
+
119
+ * Size: 610 training samples
120
+ * Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>label</code>
121
+ * Approximate statistics based on the first 610 samples:
122
+ | | sentence_0 | sentence_1 | label |
123
+ |:--------|:--------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------|:--------------------------------------------------------------|
124
+ | type | string | string | float |
125
+ | details | <ul><li>min: 30 characters</li><li>mean: 233.99 characters</li><li>max: 1055 characters</li></ul> | <ul><li>min: 30 characters</li><li>mean: 202.09 characters</li><li>max: 696 characters</li></ul> | <ul><li>min: 0.0</li><li>mean: 0.5</li><li>max: 1.0</li></ul> |
126
+ * Samples:
127
+ | sentence_0 | sentence_1 | label |
128
+ |:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------|
129
+ | <code>Die Einzellimiten von Ziffer 3 und 5 jedoch dürfen mit der vorliegenden Limite von 35% nicht kumuliert werden.</code> | <code>Die Einzellimiten von Ziffer 3 und 5 jedoch dürfen mit der vorliegenden Limite von 35% nicht kumuliert werden.</code> | <code>1.0</code> |
130
+ | <code>3. Die Fondsleitung darf einschliesslich der Derivate und strukturierten Produkte höchstens 10% des Vermögens eines Teilvermogens in Effekten und Geldmarktinstrumenten desselben Emittenten anlegen.</code> | <code>3. Die Fondsleitung darf einschliesslich der Derivate und strukturierten Produkte höchstens 10% des Vermögens eines Teilvermogens in Effekten und Geldmarktinstrumenten desselben Emittenten anlegen.</code> | <code>1.0</code> |
131
+ | <code>Das Teilvermögen BLKB iQ Fund (CH) iQ Responsible Vorsorge Balanced darf bis zu 100% der Anteile des BLKB iQ Fund (CH) iQ Responsible Bond Fund CHF erwerben.</code> | <code>Maximalanteil pro Emission: Höchstens 30 Prozent des Fondsvermögens dürfen in Effekten und Geldmarktinstrumenten derselben Emission angelegt werden.</code> | <code>0.0</code> |
132
+ * Loss: [<code>FitMixinLoss</code>](https://sbert.net/docs/package_reference/cross_encoder/losses.html#fitmixinloss)
133
+
134
+ ### Training Hyperparameters
135
+ #### Non-Default Hyperparameters
136
+
137
+ - `per_device_train_batch_size`: 1
138
+ - `per_device_eval_batch_size`: 1
139
+ - `num_train_epochs`: 20
140
+
141
+ #### All Hyperparameters
142
+ <details><summary>Click to expand</summary>
143
+
144
+ - `overwrite_output_dir`: False
145
+ - `do_predict`: False
146
+ - `eval_strategy`: no
147
+ - `prediction_loss_only`: True
148
+ - `per_device_train_batch_size`: 1
149
+ - `per_device_eval_batch_size`: 1
150
+ - `per_gpu_train_batch_size`: None
151
+ - `per_gpu_eval_batch_size`: None
152
+ - `gradient_accumulation_steps`: 1
153
+ - `eval_accumulation_steps`: None
154
+ - `torch_empty_cache_steps`: None
155
+ - `learning_rate`: 5e-05
156
+ - `weight_decay`: 0.0
157
+ - `adam_beta1`: 0.9
158
+ - `adam_beta2`: 0.999
159
+ - `adam_epsilon`: 1e-08
160
+ - `max_grad_norm`: 1
161
+ - `num_train_epochs`: 20
162
+ - `max_steps`: -1
163
+ - `lr_scheduler_type`: linear
164
+ - `lr_scheduler_kwargs`: {}
165
+ - `warmup_ratio`: 0.0
166
+ - `warmup_steps`: 0
167
+ - `log_level`: passive
168
+ - `log_level_replica`: warning
169
+ - `log_on_each_node`: True
170
+ - `logging_nan_inf_filter`: True
171
+ - `save_safetensors`: True
172
+ - `save_on_each_node`: False
173
+ - `save_only_model`: False
174
+ - `restore_callback_states_from_checkpoint`: False
175
+ - `no_cuda`: False
176
+ - `use_cpu`: False
177
+ - `use_mps_device`: False
178
+ - `seed`: 42
179
+ - `data_seed`: None
180
+ - `jit_mode_eval`: False
181
+ - `use_ipex`: False
182
+ - `bf16`: False
183
+ - `fp16`: False
184
+ - `fp16_opt_level`: O1
185
+ - `half_precision_backend`: auto
186
+ - `bf16_full_eval`: False
187
+ - `fp16_full_eval`: False
188
+ - `tf32`: None
189
+ - `local_rank`: 0
190
+ - `ddp_backend`: None
191
+ - `tpu_num_cores`: None
192
+ - `tpu_metrics_debug`: False
193
+ - `debug`: []
194
+ - `dataloader_drop_last`: False
195
+ - `dataloader_num_workers`: 0
196
+ - `dataloader_prefetch_factor`: None
197
+ - `past_index`: -1
198
+ - `disable_tqdm`: False
199
+ - `remove_unused_columns`: True
200
+ - `label_names`: None
201
+ - `load_best_model_at_end`: False
202
+ - `ignore_data_skip`: False
203
+ - `fsdp`: []
204
+ - `fsdp_min_num_params`: 0
205
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
206
+ - `fsdp_transformer_layer_cls_to_wrap`: None
207
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
208
+ - `deepspeed`: None
209
+ - `label_smoothing_factor`: 0.0
210
+ - `optim`: adamw_torch
211
+ - `optim_args`: None
212
+ - `adafactor`: False
213
+ - `group_by_length`: False
214
+ - `length_column_name`: length
215
+ - `ddp_find_unused_parameters`: None
216
+ - `ddp_bucket_cap_mb`: None
217
+ - `ddp_broadcast_buffers`: False
218
+ - `dataloader_pin_memory`: True
219
+ - `dataloader_persistent_workers`: False
220
+ - `skip_memory_metrics`: True
221
+ - `use_legacy_prediction_loop`: False
222
+ - `push_to_hub`: False
223
+ - `resume_from_checkpoint`: None
224
+ - `hub_model_id`: None
225
+ - `hub_strategy`: every_save
226
+ - `hub_private_repo`: None
227
+ - `hub_always_push`: False
228
+ - `hub_revision`: None
229
+ - `gradient_checkpointing`: False
230
+ - `gradient_checkpointing_kwargs`: None
231
+ - `include_inputs_for_metrics`: False
232
+ - `include_for_metrics`: []
233
+ - `eval_do_concat_batches`: True
234
+ - `fp16_backend`: auto
235
+ - `push_to_hub_model_id`: None
236
+ - `push_to_hub_organization`: None
237
+ - `mp_parameters`:
238
+ - `auto_find_batch_size`: False
239
+ - `full_determinism`: False
240
+ - `torchdynamo`: None
241
+ - `ray_scope`: last
242
+ - `ddp_timeout`: 1800
243
+ - `torch_compile`: False
244
+ - `torch_compile_backend`: None
245
+ - `torch_compile_mode`: None
246
+ - `include_tokens_per_second`: False
247
+ - `include_num_input_tokens_seen`: False
248
+ - `neftune_noise_alpha`: None
249
+ - `optim_target_modules`: None
250
+ - `batch_eval_metrics`: False
251
+ - `eval_on_start`: False
252
+ - `use_liger_kernel`: False
253
+ - `liger_kernel_config`: None
254
+ - `eval_use_gather_object`: False
255
+ - `average_tokens_across_devices`: False
256
+ - `prompts`: None
257
+ - `batch_sampler`: batch_sampler
258
+ - `multi_dataset_batch_sampler`: proportional
259
+
260
+ </details>
261
+
262
+ ### Training Logs
263
+ | Epoch | Step | Training Loss |
264
+ |:-------:|:-----:|:-------------:|
265
+ | 0.8197 | 500 | 1.9143 |
266
+ | 1.6393 | 1000 | 0.7914 |
267
+ | 2.4590 | 1500 | 0.5883 |
268
+ | 3.2787 | 2000 | 0.3915 |
269
+ | 4.0984 | 2500 | 0.2119 |
270
+ | 4.9180 | 3000 | 0.2049 |
271
+ | 5.7377 | 3500 | 0.1157 |
272
+ | 6.5574 | 4000 | 0.1367 |
273
+ | 7.3770 | 4500 | 0.0336 |
274
+ | 8.1967 | 5000 | 0.0912 |
275
+ | 9.0164 | 5500 | 0.0517 |
276
+ | 9.8361 | 6000 | 0.1057 |
277
+ | 10.6557 | 6500 | 0.037 |
278
+ | 11.4754 | 7000 | 0.0875 |
279
+ | 12.2951 | 7500 | 0.057 |
280
+ | 13.1148 | 8000 | 0.0274 |
281
+ | 13.9344 | 8500 | 0.0277 |
282
+ | 14.7541 | 9000 | 0.0133 |
283
+ | 15.5738 | 9500 | 0.0473 |
284
+ | 16.3934 | 10000 | 0.0272 |
285
+ | 17.2131 | 10500 | 0.025 |
286
+ | 18.0328 | 11000 | 0.0481 |
287
+ | 18.8525 | 11500 | 0.0111 |
288
+ | 19.6721 | 12000 | 0.0226 |
289
+
290
+
291
+ ### Framework Versions
292
+ - Python: 3.11.13
293
+ - Sentence Transformers: 4.1.0
294
+ - Transformers: 4.53.3
295
+ - PyTorch: 2.6.0+cu124
296
+ - Accelerate: 1.9.0
297
+ - Datasets: 4.0.0
298
+ - Tokenizers: 0.21.2
299
+
300
+ ## Citation
301
+
302
+ ### BibTeX
303
+
304
+ #### Sentence Transformers
305
+ ```bibtex
306
+ @inproceedings{reimers-2019-sentence-bert,
307
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
308
+ author = "Reimers, Nils and Gurevych, Iryna",
309
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
310
+ month = "11",
311
+ year = "2019",
312
+ publisher = "Association for Computational Linguistics",
313
+ url = "https://arxiv.org/abs/1908.10084",
314
+ }
315
+ ```
316
+
317
+ <!--
318
+ ## Glossary
319
+
320
+ *Clearly define terms in order to be accessible across audiences.*
321
+ -->
322
+
323
+ <!--
324
+ ## Model Card Authors
325
+
326
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
327
+ -->
328
+
329
+ <!--
330
+ ## Model Card Contact
331
+
332
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
333
+ -->
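The Epoch column of the training log in the model card follows from the dataset size and batch size it reports. A minimal sketch of that arithmetic, assuming a single device and `gradient_accumulation_steps` of 1 as listed in the hyperparameters (the helper name `epoch_at` is hypothetical):

```python
TRAIN_SAMPLES = 610  # "Size: 610 training samples"
BATCH_SIZE = 1       # per_device_train_batch_size
NUM_EPOCHS = 20      # num_train_epochs

# Assumption: one device and no gradient accumulation, so one step = one sample.
steps_per_epoch = TRAIN_SAMPLES // BATCH_SIZE
total_steps = steps_per_epoch * NUM_EPOCHS


def epoch_at(step: int) -> float:
    """Fractional epoch reached after `step` optimizer steps."""
    return round(step / steps_per_epoch, 4)


print(steps_per_epoch, total_steps)
print(epoch_at(500), epoch_at(12000))
```

Under these assumptions, step 500 maps to epoch 0.8197 and step 12000 to epoch 19.6721, which matches the first and last rows of the log.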
config.json ADDED
@@ -0,0 +1,36 @@
1
+ {
2
+ "architectures": [
3
+ "XLMRobertaForSequenceClassification"
4
+ ],
5
+ "attention_probs_dropout_prob": 0.1,
6
+ "bos_token_id": 0,
7
+ "classifier_dropout": null,
8
+ "eos_token_id": 2,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 384,
12
+ "id2label": {
13
+ "0": "LABEL_0"
14
+ },
15
+ "initializer_range": 0.02,
16
+ "intermediate_size": 1536,
17
+ "label2id": {
18
+ "LABEL_0": 0
19
+ },
20
+ "layer_norm_eps": 1e-05,
21
+ "max_position_embeddings": 514,
22
+ "model_type": "xlm-roberta",
23
+ "num_attention_heads": 12,
24
+ "num_hidden_layers": 12,
25
+ "pad_token_id": 1,
26
+ "position_embedding_type": "absolute",
27
+ "sentence_transformers": {
28
+ "activation_fn": "torch.nn.modules.linear.Identity",
29
+ "version": "4.1.0"
30
+ },
31
+ "torch_dtype": "float32",
32
+ "transformers_version": "4.53.3",
33
+ "type_vocab_size": 1,
34
+ "use_cache": true,
35
+ "vocab_size": 250002
36
+ }
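The `sentence_transformers.activation_fn` entry in this config is `torch.nn.modules.linear.Identity`, which suggests that scores come back as unbounded raw logits rather than values in [0, 1]. If that reading is right, and since the training labels lie between 0.0 and 1.0, a sigmoid is one plausible way to map scores onto that range. The sketch below uses made-up logits and plain Python; it does not call the model.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw logit to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical raw scores for illustration, not outputs of this model.
example_logits = [-4.0, 0.0, 4.0]
probabilities = [sigmoid(x) for x in example_logits]
print([round(p, 3) for p in probabilities])
```

The ordering of scores is unchanged by the sigmoid, so ranking results are the same either way; the transform only matters when a threshold or a probability-like value is needed.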
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:c206936da6eb9a47d569c7756e7e997840ab610b4e2236d41d9c06f8c8624812
3
+ size 470588492
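The pointer size can be sanity-checked against the architecture in `config.json`. The sketch below estimates the parameter count from those values, assuming the standard XLM-RoBERTa encoder layout and a classification head made of one hidden-size dense layer plus an output projection, then multiplies by 4 bytes for `float32`. The layer breakdown is an assumption, not something this commit states.

```python
# Values copied from config.json in this commit.
hidden, layers, ffn = 384, 12, 1536
vocab, positions, type_vocab, labels = 250002, 514, 1, 1


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors
embeddings = (vocab + positions + type_vocab) * hidden + layer_norm
attention = 4 * linear(hidden, hidden) + layer_norm  # Q, K, V, output + LayerNorm
feed_forward = linear(hidden, ffn) + linear(ffn, hidden) + layer_norm
# Assumed head: dense(hidden -> hidden) followed by out_proj(hidden -> labels).
head = linear(hidden, hidden) + linear(hidden, labels)

n_params = embeddings + layers * (attention + feed_forward) + head
estimated_bytes = 4 * n_params  # torch_dtype is float32
pointer_size = 470588492        # "size" field of the LFS pointer

print(n_params, estimated_bytes, pointer_size - estimated_bytes)
```

A small positive remainder is expected, because a safetensors file also stores a header describing each tensor.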
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
3
+ size 5069051
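The three added lines are a Git LFS pointer: the repository stores this small text stub, while the binary itself lives in LFS storage. A minimal sketch of reading such a pointer, using the values shown above (the function name `parse_lfs_pointer` is hypothetical):

```python
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
size 5069051
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each 'key value' line, then break the oid into algorithm and digest."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algorithm"], pointer["size"])
```

After downloading the real file, its byte length and SHA-256 digest can be compared with these fields to confirm the download is intact.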
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "cls_token": {
10
+ "content": "<s>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "eos_token": {
17
+ "content": "</s>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "mask_token": {
24
+ "content": "<mask>",
25
+ "lstrip": true,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "pad_token": {
31
+ "content": "<pad>",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ },
37
+ "sep_token": {
38
+ "content": "</s>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false
43
+ },
44
+ "unk_token": {
45
+ "content": "<unk>",
46
+ "lstrip": false,
47
+ "normalized": false,
48
+ "rstrip": false,
49
+ "single_word": false
50
+ }
51
+ }
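Note that `cls_token` shares its content with `bos_token` (`<s>`), and `sep_token` with `eos_token` (`</s>`). For a cross-encoder, both texts of a pair go into one sequence framed by these tokens. The sketch below shows the layout commonly used by XLM-RoBERTa tokenizers, `<s> A </s></s> B </s>`; it operates on plain strings rather than real subword tokens, and `build_pair` is a hypothetical helper, not a library call.

```python
CLS, SEP = "<s>", "</s>"  # cls_token and sep_token from special_tokens_map.json


def build_pair(tokens_a: list[str], tokens_b: list[str]) -> list[str]:
    """Assumed XLM-RoBERTa pair layout: <s> A </s></s> B </s>."""
    return [CLS, *tokens_a, SEP, SEP, *tokens_b, SEP]


print(build_pair(["Limite", "35%"], ["Maximalanteil"]))
```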
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:883b037111086fd4dfebbbc9b7cee11e1517b5e0c0514879478661440f137085
3
+ size 17082987
tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "<s>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "<pad>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "</s>",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "<unk>",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "250001": {
36
+ "content": "<mask>",
37
+ "lstrip": true,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "bos_token": "<s>",
45
+ "clean_up_tokenization_spaces": false,
46
+ "cls_token": "<s>",
47
+ "eos_token": "</s>",
48
+ "extra_special_tokens": {},
49
+ "mask_token": "<mask>",
50
+ "model_max_length": 512,
51
+ "pad_token": "<pad>",
52
+ "sep_token": "</s>",
53
+ "tokenizer_class": "XLMRobertaTokenizer",
54
+ "unk_token": "<unk>"
55
+ }
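With `model_max_length` set to 512, a text pair has to share that budget with the special tokens. The sketch below assumes four special tokens per pair (the `<s> A </s></s> B </s>` layout) and a trim-the-longer-text-first policy; the truncation strategy a real tokenizer applies depends on how it is called, so `content_budget` is an illustrative helper only.

```python
MODEL_MAX_LENGTH = 512   # from tokenizer_config.json
PAIR_SPECIAL_TOKENS = 4  # assumed layout: <s> A </s></s> B </s>


def content_budget(len_a: int, len_b: int) -> tuple[int, int]:
    """Return how many tokens of each text fit, trimming the longer text first."""
    budget = MODEL_MAX_LENGTH - PAIR_SPECIAL_TOKENS
    while len_a + len_b > budget:
        if len_a >= len_b:
            len_a -= 1
        else:
            len_b -= 1
    return len_a, len_b


print(content_budget(600, 100))
print(content_budget(400, 400))
```

Long fund-prospectus clauses paired with long rule descriptions may exceed this budget, in which case the tail of one or both texts would not be seen by the model.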