Token Classification
Transformers
Safetensors
openai_privacy_filter
pii
ner
privacy
redaction
multilingual
openmed
openai-privacy-filter
MaziyarPanahi committed on
Commit
7656d47
·
verified ·
1 Parent(s): 3f2710f

Initial publish of OpenMed/privacy-filter-multilingual. Renamed from privacy-filter-openmed-pii54-multilingual; clean README matching OpenMed/privacy-filter-nemotron format. Training artifact JSONs intentionally not redistributed here.

Files changed (6)
  1. .gitattributes +1 -0
  2. README.md +211 -0
  3. config.json +770 -0
  4. model.safetensors +3 -0
  5. tokenizer.json +3 -0
  6. tokenizer_config.json +11 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,211 @@
1
+ ---
2
+ license: other
3
+ library_name: transformers
4
+ base_model: openai/privacy-filter
5
+ datasets:
6
+ - ai4privacy/pii-masking-200k
7
+ - ai4privacy/pii-masking-400k
8
+ - ai4privacy/open-pii-masking-500k-ai4privacy
9
+ pipeline_tag: token-classification
10
+ tags:
11
+ - token-classification
12
+ - pii
13
+ - ner
14
+ - privacy
15
+ - redaction
16
+ - multilingual
17
+ - openmed
18
+ - openai-privacy-filter
19
+ language:
20
+ - ar
21
+ - bn
22
+ - de
23
+ - en
24
+ - es
25
+ - fr
26
+ - hi
27
+ - it
28
+ - ja
29
+ - ko
30
+ - nl
31
+ - pt
32
+ - te
33
+ - tr
34
+ - vi
35
+ - zh
36
+ ---
37
+
38
+ # privacy-filter-multilingual
39
+
40
+ Fine-tuned [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)
41
+ for **fine-grained PII extraction** across **54 categories** in **16 languages**.
42
+
43
+ - **Base model**: [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) — 1.4B-parameter MoE (50M active per token), BIOES token-classification head
44
+ - **Task**: Token classification for PII detection (BIOES scheme)
45
+ - **Languages (16)**: Arabic, Bengali, Chinese, Dutch, English, French, German, Hindi, Italian, Japanese, Korean, Portuguese, Spanish, Telugu, Turkish, Vietnamese
46
+ - **Training data**: Multilingual mix from [AI4Privacy](https://huggingface.co/ai4privacy) — `pii-masking-200k`, `pii-masking-400k`, and `open-pii-masking-500k-ai4privacy`, language-balanced
47
+ - **Recipe**: `opf train` (OpenAI's official fine-tuning CLI) — full fine-tune, AdamW, balanced language sampling, 5 epochs, bf16
48
+ - **Labels**: 54 PII categories → 217 BIOES classes (1 `O` + 54 × B/I/E/S)
49
+
50
+ The base model ships with 8 coarse PII categories and English-only training. This
51
+ model trades that for a **6.75× more granular vocabulary** spanning identity,
52
+ contact, address, financial, vehicle, digital, and crypto labels — all evaluated
53
+ across 16 languages.
54
+
55
+ > **Family at a glance.** Same architecture, three runtimes:
56
+ > - **PyTorch (this repo)** — CPU + CUDA, anywhere transformers runs.
57
+ > - **MLX BF16** — [`OpenMed/privacy-filter-multilingual-mlx`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx) — Apple Silicon, full precision.
58
+ > - **MLX 8-bit** — [`OpenMed/privacy-filter-multilingual-mlx-8bit`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx-8bit) — Apple Silicon, smaller + faster.
59
+
60
+ ## Quick start
61
+
62
+ ### With [OpenMed](https://github.com/maziyarpanahi/openmed) — recommended
63
+
64
+ OpenMed gives you `extract_pii()` / `deidentify()` with built-in BIOES Viterbi
65
+ decoding, span refinement, and a Faker-backed obfuscation engine. Same call
66
+ on every host — Apple Silicon picks up MLX automatically; everywhere else uses
67
+ this PyTorch checkpoint.
68
+
69
+ ```bash
70
+ pip install -U "openmed[hf]"
71
+ ```
72
+
73
+ ```python
74
+ from openmed import extract_pii, deidentify
75
+
76
+ text = (
77
+     "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
78
+     "phone 415-555-0123, email sarah.johnson@example.com."
79
+ )
80
+
81
+ # Extract grouped entity spans
82
+ result = extract_pii(text, model_name="OpenMed/privacy-filter-multilingual")
83
+ for ent in result.entities:
84
+     print(f"{ent.label:30s} {ent.text!r} conf={ent.confidence:.2f}")
85
+
86
+ # De-identify with any of the supported methods
87
+ masked = deidentify(text, method="mask", model_name="OpenMed/privacy-filter-multilingual")
88
+ removed = deidentify(text, method="remove", model_name="OpenMed/privacy-filter-multilingual")
89
+ hashed = deidentify(text, method="hash", model_name="OpenMed/privacy-filter-multilingual")
90
+
91
+ # Faker-backed locale-aware obfuscation, deterministic with consistent=True+seed
92
+ fake = deidentify(
93
+     text,
94
+     method="replace",
95
+     model_name="OpenMed/privacy-filter-multilingual",
96
+     consistent=True,
97
+     seed=42,
98
+ )
99
+ print(fake.deidentified_text)
100
+ ```
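The `consistent=True` + `seed` combination above means the same (label, surface form) pair always maps to the same surrogate across calls. A minimal sketch of that idea, using a hash in place of OpenMed's actual Faker-backed engine (the `consistent_replacement` helper and the name pool here are hypothetical):

```python
import hashlib

def consistent_replacement(label, value, seed, pool):
    """Deterministically map (label, value, seed) to one surrogate from
    pool, so repeated mentions of an entity get the same replacement."""
    digest = hashlib.sha256(f"{seed}:{label}:{value}".encode()).hexdigest()
    return pool[int(digest, 16) % len(pool)]

pool = ["Alex Kim", "Jordan Lee", "Sam Park"]  # stand-in for a Faker locale
first = consistent_replacement("FIRSTNAME", "Sarah Johnson", 42, pool)
again = consistent_replacement("FIRSTNAME", "Sarah Johnson", 42, pool)
assert first == again  # deterministic: same seed, same entity, same surrogate
```

Changing the seed re-keys the whole mapping, which is why the snippet above pins `seed=42` for reproducible output.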
101
+
102
+ `OpenMed/privacy-filter-multilingual-mlx*` model names also work in the same
103
+ `extract_pii()` / `deidentify()` calls — on a non-Apple-Silicon host they
104
+ automatically fall back to **this PyTorch checkpoint** with a one-time warning.
105
+ So you can ship MLX names in code and still run on Linux/Windows.
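That fallback can be pictured as a simple name-resolution step before loading (a hypothetical sketch; OpenMed's real logic also emits the one-time warning):

```python
import platform

# Hypothetical mapping from the MLX repo ids to this PyTorch checkpoint.
MLX_TO_TORCH = {
    "OpenMed/privacy-filter-multilingual-mlx": "OpenMed/privacy-filter-multilingual",
    "OpenMed/privacy-filter-multilingual-mlx-8bit": "OpenMed/privacy-filter-multilingual",
}

def resolve_model_name(name):
    """Return the PyTorch repo id when an MLX id is requested off Apple Silicon."""
    on_apple_silicon = platform.system() == "Darwin" and platform.machine() == "arm64"
    if name in MLX_TO_TORCH and not on_apple_silicon:
        return MLX_TO_TORCH[name]
    return name
```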
106
+
107
+ The OpenMed wrapper passes `trust_remote_code=True` for you, runs the model's
108
+ own BIOES Viterbi decoder, and skips OpenMed's regex smart-merging (the model
109
+ already produces clean spans).
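The point of Viterbi decoding over per-token argmax is that it rules out invalid BIOES transitions, such as a `B-` tag followed directly by `O`. A toy single-category sketch of the constraint logic, not the model's actual decoder:

```python
import math

LABELS = ["O", "B-X", "I-X", "E-X", "S-X"]  # one category, for brevity

def allowed(prev, nxt):
    # After O, E-, or S-, a new span (B-/S-) or O may start;
    # after B- or I-, the same span must continue (I-) or end (E-).
    if prev[0] in ("O", "E", "S"):
        return nxt[0] in ("O", "B", "S")
    return nxt[0] in ("I", "E")

def viterbi(scores):
    """scores[t][j]: log-score of LABELS[j] at token t. Returns the best
    label path that never violates a BIOES transition constraint."""
    n = len(scores)
    best = [{} for _ in range(n)]
    back = [{} for _ in range(n)]
    for j, lab in enumerate(LABELS):
        # a path may not start mid-span (I-/E-)
        best[0][lab] = scores[0][j] if lab[0] in ("O", "B", "S") else -math.inf
    for t in range(1, n):
        for j, lab in enumerate(LABELS):
            best[t][lab], back[t][lab] = max(
                (best[t - 1][p] + scores[t][j], p)
                for p in LABELS if allowed(p, lab)
            )
    # a path may not end mid-span (B-/I-) either
    end = max((l for l in LABELS if l[0] in ("O", "E", "S")),
              key=lambda l: best[-1][l])
    path = [end]
    for t in range(n - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

scores = [
    [0, 5, -1, -1, 3],  # argmax says B-X ...
    [5, 0, 0, 4, 0],    # ... then O, which BIOES forbids right after B-
    [5, 0, 0, 0, 0],
]
assert viterbi(scores) == ["B-X", "E-X", "O"]  # repaired to a legal span
```

Greedy decoding on these scores would emit `B-X, O, O`, an impossible sequence; the constrained path closes the span with `E-X` instead.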
110
+
111
+ ## Label space (54 categories)
112
+
113
+ | Category | Typical examples |
114
+ |---|---|
115
+ | **Identity** | `FIRSTNAME`, `MIDDLENAME`, `LASTNAME`, `PREFIX`, `AGE`, `GENDER`, `SEX`, `EYECOLOR`, `HEIGHT`, `USERNAME`, `OCCUPATION`, `JOBTITLE`, `JOBDEPARTMENT`, `ORGANIZATION`, `USERAGENT` |
116
+ | **Contact** | `EMAIL`, `PHONE`, `URL` |
117
+ | **Address** | `STREET`, `BUILDINGNUMBER`, `SECONDARYADDRESS`, `CITY`, `COUNTY`, `STATE`, `ZIPCODE`, `GPSCOORDINATES`, `ORDINALDIRECTION` |
118
+ | **Dates & time** | `DATE`, `DATEOFBIRTH`, `TIME` |
119
+ | **Government IDs** | `SSN` |
120
+ | **Financial** | `ACCOUNTNAME`, `BANKACCOUNT`, `IBAN`, `BIC`, `CREDITCARD`, `CREDITCARDISSUER`, `CVV`, `PIN`, `MASKEDNUMBER`, `AMOUNT`, `CURRENCY`, `CURRENCYCODE`, `CURRENCYNAME`, `CURRENCYSYMBOL` |
121
+ | **Crypto** | `BITCOINADDRESS`, `ETHEREUMADDRESS`, `LITECOINADDRESS` |
122
+ | **Vehicle** | `VIN`, `VRM` |
123
+ | **Digital** | `IPADDRESS`, `MACADDRESS`, `IMEI` |
124
+ | **Auth** | `PASSWORD` |
125
+
126
+ The output space is `O` plus `B-`, `I-`, `E-`, `S-` for each of the 54 categories
127
+ (4 × 54 + 1 = 217). The `id2label` mapping is shipped with the model.
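Because the layout is purely positional (`O` first, then a B/I/E/S block per category in order), a class index can be decoded arithmetically. A sketch of that arithmetic; in practice just read `id2label` from the shipped config:

```python
def bioes_label(index, categories):
    """Index -> tag for the layout used here: O first, then a
    B/I/E/S block per category (1 + 4 * 54 = 217 classes total)."""
    if index == 0:
        return "O"
    return f"{'BIES'[(index - 1) % 4]}-{categories[(index - 1) // 4]}"

cats = ["ACCOUNTNAME", "AGE"]  # first two of the 54 categories
assert bioes_label(0, cats) == "O"
assert bioes_label(1, cats) == "B-ACCOUNTNAME"
assert bioes_label(8, cats) == "S-AGE"  # matches id2label["8"] in config.json
```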
128
+
129
+ ## Limitations & intended use
130
+
131
+ - **Multilingual but uneven.** Strongest on languages with rich PII training
132
+ data (German, Spanish, French, Italian, Hindi, Telugu, English). CJK languages
133
+ (Japanese, Korean, Chinese) and some morphologically-marked low-resource
134
+ languages remain the main bottleneck on the current training mix.
135
+ - **Synthetic training data.** The AI4Privacy datasets are template-synthesized;
136
+ real clinical notes, legal documents, and web text may show different
137
+ surface forms. For high-stakes deployments, collect a domain-specific eval
138
+ set and re-calibrate thresholds.
139
+ - **Not a substitute for legal compliance review.** Use alongside a governance
140
+ layer (human review, deterministic regex pre-filters, etc.).
141
+ - **Not a clinical PHI model.** Healthcare-specific PHI and clinical entity
142
+ training is planned as a separate branch.
143
+
144
+ **Head initialization**: `opf`'s default "copy-from-matching-base" head init.
145
+ Of the 217 new BIOES classes, the few with exact base-vocabulary matches
146
+ (`O`, `B/I/E/S-account_name`, etc.) were copied directly; the rest were copied
147
+ from semantically-adjacent coarse rows and fine-tuned end-to-end.
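A rough sketch of that row-copying initialization (the `fallback_of` mapping from fine labels to coarse base rows is hypothetical here; `opf` selects its own adjacency):

```python
def init_head(new_labels, base_rows, fallback_of):
    """Build a new classifier head row-by-row: copy the base row for labels
    that exist in the base vocab, else copy a semantically-adjacent coarse row."""
    return {
        lab: list(base_rows[lab if lab in base_rows else fallback_of[lab]])
        for lab in new_labels
    }

base = {"O": [0.0, 0.1], "B-name": [0.2, 0.3]}  # toy 2-dim base head
head = init_head(
    ["O", "B-FIRSTNAME"],
    base,
    fallback_of={"B-FIRSTNAME": "B-name"},  # hypothetical adjacency choice
)
assert head["O"] == [0.0, 0.1]             # exact match: copied directly
assert head["B-FIRSTNAME"] == [0.2, 0.3]   # copied from the coarse row
```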
148
+
149
+ **Router**: base model has 128 MoE experts per layer with top-4 routing.
150
+ Routers were kept trainable during full fine-tuning; no collapse was observed.
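The routing step itself is small: score all experts, keep the top 4, renormalize. A sketch of one common formulation (softmax over just the selected logits); this is illustrative, not the model's exact code:

```python
import math

def route_top_k(logits, k=4):
    """Select the top-k experts and softmax-normalize their logits,
    so the k routing weights sum to 1."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# 8 experts stand in for the 128 per layer; only 4 receive the token.
picks = route_top_k([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, 3.0, -0.5])
assert [i for i, _ in picks] == [6, 1, 4, 3]         # top-4 expert ids
assert abs(sum(w for _, w in picks) - 1.0) < 1e-9    # weights renormalized
```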
151
+
152
+ ## Credits & Acknowledgements
153
+
154
+ This model wouldn't exist without two open-source releases — sincere thanks
155
+ to both teams:
156
+
157
+ - **OpenAI** for [open-sourcing the Privacy Filter](https://huggingface.co/openai/privacy-filter)
158
+ (architecture, modeling code, and `opf` training/eval CLI). Everything in
159
+ this repo is a fine-tune on top of that release.
160
+ - **AI4Privacy** for releasing the multilingual PII masking datasets used as
161
+ training data:
162
+ [`pii-masking-200k`](https://huggingface.co/datasets/ai4privacy/pii-masking-200k),
163
+ [`pii-masking-400k`](https://huggingface.co/datasets/ai4privacy/pii-masking-400k),
164
+ [`open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy).
165
+
166
+ Additional thanks to the **Hugging Face** team for the `transformers` /
167
+ `huggingface_hub` ecosystem this model ships through.
168
+
169
+ ## License
170
+
171
+ Inherits the licensing of its inputs. The base model `openai/privacy-filter`
172
+ is Apache 2.0; the AI4Privacy datasets carry their own terms — please consult
173
+ those sources before redistribution.
174
+
175
+ ## Citation
176
+
177
+ If you use this model, please cite **this model**, the organization behind it
178
+ (**OpenMed**), and the upstream base model + datasets:
179
+
180
+ ```bibtex
181
+ @misc{openmed_privacy_filter_multilingual_2026,
182
+ author = {OpenMed},
183
+ title = {{OpenMed/privacy-filter-multilingual}: multilingual fine-grained PII extraction across 16 languages and 54 categories},
184
+ year = {2026},
185
+ publisher = {Hugging Face},
186
+ howpublished = {\url{https://huggingface.co/OpenMed/privacy-filter-multilingual}}
187
+ }
188
+
189
+ @misc{openmed_2026,
190
+ author = {OpenMed},
191
+ title = {{OpenMed}: open models and resources for healthcare NLP},
192
+ year = {2026},
193
+ publisher = {Hugging Face},
194
+ howpublished = {\url{https://huggingface.co/OpenMed}}
195
+ }
196
+
197
+ @misc{openai_privacy_filter_2025,
198
+ author = {OpenAI},
199
+ title = {{openai/privacy-filter}},
200
+ year = {2025},
201
+ publisher = {Hugging Face},
202
+ howpublished = {\url{https://huggingface.co/openai/privacy-filter}}
203
+ }
204
+
205
+ @misc{ai4privacy_pii_masking,
206
+ author = {AI4Privacy},
207
+ title = {{AI4Privacy PII Masking Datasets}},
208
+ publisher = {Hugging Face},
209
+ howpublished = {\url{https://huggingface.co/ai4privacy}}
210
+ }
211
+ ```
config.json ADDED
@@ -0,0 +1,770 @@
1
+ {
2
+ "architectures": [
3
+ "OpenAIPrivacyFilterForTokenClassification"
4
+ ],
5
+ "attention_bias": true,
6
+ "attention_dropout": 0.0,
7
+ "bos_token_id": null,
8
+ "classifier_dropout": 0.0,
9
+ "default_n_ctx": 128000,
10
+ "dtype": "bfloat16",
11
+ "eos_token_id": 199999,
12
+ "head_dim": 64,
13
+ "hidden_act": "silu",
14
+ "hidden_size": 640,
15
+ "id2label": {
16
+ "0": "O",
17
+ "1": "B-ACCOUNTNAME",
18
+ "2": "I-ACCOUNTNAME",
19
+ "3": "E-ACCOUNTNAME",
20
+ "4": "S-ACCOUNTNAME",
21
+ "5": "B-AGE",
22
+ "6": "I-AGE",
23
+ "7": "E-AGE",
24
+ "8": "S-AGE",
25
+ "9": "B-AMOUNT",
26
+ "10": "I-AMOUNT",
27
+ "11": "E-AMOUNT",
28
+ "12": "S-AMOUNT",
29
+ "13": "B-BANKACCOUNT",
30
+ "14": "I-BANKACCOUNT",
31
+ "15": "E-BANKACCOUNT",
32
+ "16": "S-BANKACCOUNT",
33
+ "17": "B-BIC",
34
+ "18": "I-BIC",
35
+ "19": "E-BIC",
36
+ "20": "S-BIC",
37
+ "21": "B-BITCOINADDRESS",
38
+ "22": "I-BITCOINADDRESS",
39
+ "23": "E-BITCOINADDRESS",
40
+ "24": "S-BITCOINADDRESS",
41
+ "25": "B-BUILDINGNUMBER",
42
+ "26": "I-BUILDINGNUMBER",
43
+ "27": "E-BUILDINGNUMBER",
44
+ "28": "S-BUILDINGNUMBER",
45
+ "29": "B-CITY",
46
+ "30": "I-CITY",
47
+ "31": "E-CITY",
48
+ "32": "S-CITY",
49
+ "33": "B-COUNTY",
50
+ "34": "I-COUNTY",
51
+ "35": "E-COUNTY",
52
+ "36": "S-COUNTY",
53
+ "37": "B-CREDITCARD",
54
+ "38": "I-CREDITCARD",
55
+ "39": "E-CREDITCARD",
56
+ "40": "S-CREDITCARD",
57
+ "41": "B-CREDITCARDISSUER",
58
+ "42": "I-CREDITCARDISSUER",
59
+ "43": "E-CREDITCARDISSUER",
60
+ "44": "S-CREDITCARDISSUER",
61
+ "45": "B-CURRENCY",
62
+ "46": "I-CURRENCY",
63
+ "47": "E-CURRENCY",
64
+ "48": "S-CURRENCY",
65
+ "49": "B-CURRENCYCODE",
66
+ "50": "I-CURRENCYCODE",
67
+ "51": "E-CURRENCYCODE",
68
+ "52": "S-CURRENCYCODE",
69
+ "53": "B-CURRENCYNAME",
70
+ "54": "I-CURRENCYNAME",
71
+ "55": "E-CURRENCYNAME",
72
+ "56": "S-CURRENCYNAME",
73
+ "57": "B-CURRENCYSYMBOL",
74
+ "58": "I-CURRENCYSYMBOL",
75
+ "59": "E-CURRENCYSYMBOL",
76
+ "60": "S-CURRENCYSYMBOL",
77
+ "61": "B-CVV",
78
+ "62": "I-CVV",
79
+ "63": "E-CVV",
80
+ "64": "S-CVV",
81
+ "65": "B-DATE",
82
+ "66": "I-DATE",
83
+ "67": "E-DATE",
84
+ "68": "S-DATE",
85
+ "69": "B-DATEOFBIRTH",
86
+ "70": "I-DATEOFBIRTH",
87
+ "71": "E-DATEOFBIRTH",
88
+ "72": "S-DATEOFBIRTH",
89
+ "73": "B-EMAIL",
90
+ "74": "I-EMAIL",
91
+ "75": "E-EMAIL",
92
+ "76": "S-EMAIL",
93
+ "77": "B-ETHEREUMADDRESS",
94
+ "78": "I-ETHEREUMADDRESS",
95
+ "79": "E-ETHEREUMADDRESS",
96
+ "80": "S-ETHEREUMADDRESS",
97
+ "81": "B-EYECOLOR",
98
+ "82": "I-EYECOLOR",
99
+ "83": "E-EYECOLOR",
100
+ "84": "S-EYECOLOR",
101
+ "85": "B-FIRSTNAME",
102
+ "86": "I-FIRSTNAME",
103
+ "87": "E-FIRSTNAME",
104
+ "88": "S-FIRSTNAME",
105
+ "89": "B-GENDER",
106
+ "90": "I-GENDER",
107
+ "91": "E-GENDER",
108
+ "92": "S-GENDER",
109
+ "93": "B-GPSCOORDINATES",
110
+ "94": "I-GPSCOORDINATES",
111
+ "95": "E-GPSCOORDINATES",
112
+ "96": "S-GPSCOORDINATES",
113
+ "97": "B-HEIGHT",
114
+ "98": "I-HEIGHT",
115
+ "99": "E-HEIGHT",
116
+ "100": "S-HEIGHT",
117
+ "101": "B-IBAN",
118
+ "102": "I-IBAN",
119
+ "103": "E-IBAN",
120
+ "104": "S-IBAN",
121
+ "105": "B-IMEI",
122
+ "106": "I-IMEI",
123
+ "107": "E-IMEI",
124
+ "108": "S-IMEI",
125
+ "109": "B-IPADDRESS",
126
+ "110": "I-IPADDRESS",
127
+ "111": "E-IPADDRESS",
128
+ "112": "S-IPADDRESS",
129
+ "113": "B-JOBDEPARTMENT",
130
+ "114": "I-JOBDEPARTMENT",
131
+ "115": "E-JOBDEPARTMENT",
132
+ "116": "S-JOBDEPARTMENT",
133
+ "117": "B-JOBTITLE",
134
+ "118": "I-JOBTITLE",
135
+ "119": "E-JOBTITLE",
136
+ "120": "S-JOBTITLE",
137
+ "121": "B-LASTNAME",
138
+ "122": "I-LASTNAME",
139
+ "123": "E-LASTNAME",
140
+ "124": "S-LASTNAME",
141
+ "125": "B-LITECOINADDRESS",
142
+ "126": "I-LITECOINADDRESS",
143
+ "127": "E-LITECOINADDRESS",
144
+ "128": "S-LITECOINADDRESS",
145
+ "129": "B-MACADDRESS",
146
+ "130": "I-MACADDRESS",
147
+ "131": "E-MACADDRESS",
148
+ "132": "S-MACADDRESS",
149
+ "133": "B-MASKEDNUMBER",
150
+ "134": "I-MASKEDNUMBER",
151
+ "135": "E-MASKEDNUMBER",
152
+ "136": "S-MASKEDNUMBER",
153
+ "137": "B-MIDDLENAME",
154
+ "138": "I-MIDDLENAME",
155
+ "139": "E-MIDDLENAME",
156
+ "140": "S-MIDDLENAME",
157
+ "141": "B-OCCUPATION",
158
+ "142": "I-OCCUPATION",
159
+ "143": "E-OCCUPATION",
160
+ "144": "S-OCCUPATION",
161
+ "145": "B-ORDINALDIRECTION",
162
+ "146": "I-ORDINALDIRECTION",
163
+ "147": "E-ORDINALDIRECTION",
164
+ "148": "S-ORDINALDIRECTION",
165
+ "149": "B-ORGANIZATION",
166
+ "150": "I-ORGANIZATION",
167
+ "151": "E-ORGANIZATION",
168
+ "152": "S-ORGANIZATION",
169
+ "153": "B-PASSWORD",
170
+ "154": "I-PASSWORD",
171
+ "155": "E-PASSWORD",
172
+ "156": "S-PASSWORD",
173
+ "157": "B-PHONE",
174
+ "158": "I-PHONE",
175
+ "159": "E-PHONE",
176
+ "160": "S-PHONE",
177
+ "161": "B-PIN",
178
+ "162": "I-PIN",
179
+ "163": "E-PIN",
180
+ "164": "S-PIN",
181
+ "165": "B-PREFIX",
182
+ "166": "I-PREFIX",
183
+ "167": "E-PREFIX",
184
+ "168": "S-PREFIX",
185
+ "169": "B-SECONDARYADDRESS",
186
+ "170": "I-SECONDARYADDRESS",
187
+ "171": "E-SECONDARYADDRESS",
188
+ "172": "S-SECONDARYADDRESS",
189
+ "173": "B-SEX",
190
+ "174": "I-SEX",
191
+ "175": "E-SEX",
192
+ "176": "S-SEX",
193
+ "177": "B-SSN",
194
+ "178": "I-SSN",
195
+ "179": "E-SSN",
196
+ "180": "S-SSN",
197
+ "181": "B-STATE",
198
+ "182": "I-STATE",
199
+ "183": "E-STATE",
200
+ "184": "S-STATE",
201
+ "185": "B-STREET",
202
+ "186": "I-STREET",
203
+ "187": "E-STREET",
204
+ "188": "S-STREET",
205
+ "189": "B-TIME",
206
+ "190": "I-TIME",
207
+ "191": "E-TIME",
208
+ "192": "S-TIME",
209
+ "193": "B-URL",
210
+ "194": "I-URL",
211
+ "195": "E-URL",
212
+ "196": "S-URL",
213
+ "197": "B-USERAGENT",
214
+ "198": "I-USERAGENT",
215
+ "199": "E-USERAGENT",
216
+ "200": "S-USERAGENT",
217
+ "201": "B-USERNAME",
218
+ "202": "I-USERNAME",
219
+ "203": "E-USERNAME",
220
+ "204": "S-USERNAME",
221
+ "205": "B-VIN",
222
+ "206": "I-VIN",
223
+ "207": "E-VIN",
224
+ "208": "S-VIN",
225
+ "209": "B-VRM",
226
+ "210": "I-VRM",
227
+ "211": "E-VRM",
228
+ "212": "S-VRM",
229
+ "213": "B-ZIPCODE",
230
+ "214": "I-ZIPCODE",
231
+ "215": "E-ZIPCODE",
232
+ "216": "S-ZIPCODE"
233
+ },
234
+ "initial_context_length": 4096,
235
+ "initializer_range": 0.02,
236
+ "intermediate_size": 640,
237
+ "label2id": {
238
+ "O": 0,
239
+ "B-ACCOUNTNAME": 1,
240
+ "I-ACCOUNTNAME": 2,
241
+ "E-ACCOUNTNAME": 3,
242
+ "S-ACCOUNTNAME": 4,
243
+ "B-AGE": 5,
244
+ "I-AGE": 6,
245
+ "E-AGE": 7,
246
+ "S-AGE": 8,
247
+ "B-AMOUNT": 9,
248
+ "I-AMOUNT": 10,
249
+ "E-AMOUNT": 11,
250
+ "S-AMOUNT": 12,
251
+ "B-BANKACCOUNT": 13,
252
+ "I-BANKACCOUNT": 14,
253
+ "E-BANKACCOUNT": 15,
254
+ "S-BANKACCOUNT": 16,
255
+ "B-BIC": 17,
256
+ "I-BIC": 18,
257
+ "E-BIC": 19,
258
+ "S-BIC": 20,
259
+ "B-BITCOINADDRESS": 21,
260
+ "I-BITCOINADDRESS": 22,
261
+ "E-BITCOINADDRESS": 23,
262
+ "S-BITCOINADDRESS": 24,
263
+ "B-BUILDINGNUMBER": 25,
264
+ "I-BUILDINGNUMBER": 26,
265
+ "E-BUILDINGNUMBER": 27,
266
+ "S-BUILDINGNUMBER": 28,
267
+ "B-CITY": 29,
268
+ "I-CITY": 30,
269
+ "E-CITY": 31,
270
+ "S-CITY": 32,
271
+ "B-COUNTY": 33,
272
+ "I-COUNTY": 34,
273
+ "E-COUNTY": 35,
274
+ "S-COUNTY": 36,
275
+ "B-CREDITCARD": 37,
276
+ "I-CREDITCARD": 38,
277
+ "E-CREDITCARD": 39,
278
+ "S-CREDITCARD": 40,
279
+ "B-CREDITCARDISSUER": 41,
280
+ "I-CREDITCARDISSUER": 42,
281
+ "E-CREDITCARDISSUER": 43,
282
+ "S-CREDITCARDISSUER": 44,
283
+ "B-CURRENCY": 45,
284
+ "I-CURRENCY": 46,
285
+ "E-CURRENCY": 47,
286
+ "S-CURRENCY": 48,
287
+ "B-CURRENCYCODE": 49,
288
+ "I-CURRENCYCODE": 50,
289
+ "E-CURRENCYCODE": 51,
290
+ "S-CURRENCYCODE": 52,
291
+ "B-CURRENCYNAME": 53,
292
+ "I-CURRENCYNAME": 54,
293
+ "E-CURRENCYNAME": 55,
294
+ "S-CURRENCYNAME": 56,
295
+ "B-CURRENCYSYMBOL": 57,
296
+ "I-CURRENCYSYMBOL": 58,
297
+ "E-CURRENCYSYMBOL": 59,
298
+ "S-CURRENCYSYMBOL": 60,
299
+ "B-CVV": 61,
300
+ "I-CVV": 62,
301
+ "E-CVV": 63,
302
+ "S-CVV": 64,
303
+ "B-DATE": 65,
304
+ "I-DATE": 66,
305
+ "E-DATE": 67,
306
+ "S-DATE": 68,
307
+ "B-DATEOFBIRTH": 69,
308
+ "I-DATEOFBIRTH": 70,
309
+ "E-DATEOFBIRTH": 71,
310
+ "S-DATEOFBIRTH": 72,
311
+ "B-EMAIL": 73,
312
+ "I-EMAIL": 74,
313
+ "E-EMAIL": 75,
314
+ "S-EMAIL": 76,
315
+ "B-ETHEREUMADDRESS": 77,
316
+ "I-ETHEREUMADDRESS": 78,
317
+ "E-ETHEREUMADDRESS": 79,
318
+ "S-ETHEREUMADDRESS": 80,
319
+ "B-EYECOLOR": 81,
320
+ "I-EYECOLOR": 82,
321
+ "E-EYECOLOR": 83,
322
+ "S-EYECOLOR": 84,
323
+ "B-FIRSTNAME": 85,
324
+ "I-FIRSTNAME": 86,
325
+ "E-FIRSTNAME": 87,
326
+ "S-FIRSTNAME": 88,
327
+ "B-GENDER": 89,
328
+ "I-GENDER": 90,
329
+ "E-GENDER": 91,
330
+ "S-GENDER": 92,
331
+ "B-GPSCOORDINATES": 93,
332
+ "I-GPSCOORDINATES": 94,
333
+ "E-GPSCOORDINATES": 95,
334
+ "S-GPSCOORDINATES": 96,
335
+ "B-HEIGHT": 97,
336
+ "I-HEIGHT": 98,
337
+ "E-HEIGHT": 99,
338
+ "S-HEIGHT": 100,
339
+ "B-IBAN": 101,
340
+ "I-IBAN": 102,
341
+ "E-IBAN": 103,
342
+ "S-IBAN": 104,
343
+ "B-IMEI": 105,
344
+ "I-IMEI": 106,
345
+ "E-IMEI": 107,
346
+ "S-IMEI": 108,
347
+ "B-IPADDRESS": 109,
348
+ "I-IPADDRESS": 110,
349
+ "E-IPADDRESS": 111,
350
+ "S-IPADDRESS": 112,
351
+ "B-JOBDEPARTMENT": 113,
352
+ "I-JOBDEPARTMENT": 114,
353
+ "E-JOBDEPARTMENT": 115,
354
+ "S-JOBDEPARTMENT": 116,
355
+ "B-JOBTITLE": 117,
356
+ "I-JOBTITLE": 118,
357
+ "E-JOBTITLE": 119,
358
+ "S-JOBTITLE": 120,
359
+ "B-LASTNAME": 121,
360
+ "I-LASTNAME": 122,
361
+ "E-LASTNAME": 123,
362
+ "S-LASTNAME": 124,
363
+ "B-LITECOINADDRESS": 125,
364
+ "I-LITECOINADDRESS": 126,
365
+ "E-LITECOINADDRESS": 127,
366
+ "S-LITECOINADDRESS": 128,
367
+ "B-MACADDRESS": 129,
368
+ "I-MACADDRESS": 130,
369
+ "E-MACADDRESS": 131,
370
+ "S-MACADDRESS": 132,
371
+ "B-MASKEDNUMBER": 133,
372
+ "I-MASKEDNUMBER": 134,
373
+ "E-MASKEDNUMBER": 135,
374
+ "S-MASKEDNUMBER": 136,
375
+ "B-MIDDLENAME": 137,
376
+ "I-MIDDLENAME": 138,
377
+ "E-MIDDLENAME": 139,
378
+ "S-MIDDLENAME": 140,
379
+ "B-OCCUPATION": 141,
380
+ "I-OCCUPATION": 142,
381
+ "E-OCCUPATION": 143,
382
+ "S-OCCUPATION": 144,
383
+ "B-ORDINALDIRECTION": 145,
384
+ "I-ORDINALDIRECTION": 146,
385
+ "E-ORDINALDIRECTION": 147,
386
+ "S-ORDINALDIRECTION": 148,
387
+ "B-ORGANIZATION": 149,
388
+ "I-ORGANIZATION": 150,
389
+ "E-ORGANIZATION": 151,
390
+ "S-ORGANIZATION": 152,
391
+ "B-PASSWORD": 153,
392
+ "I-PASSWORD": 154,
393
+ "E-PASSWORD": 155,
394
+ "S-PASSWORD": 156,
395
+ "B-PHONE": 157,
396
+ "I-PHONE": 158,
397
+ "E-PHONE": 159,
398
+ "S-PHONE": 160,
399
+ "B-PIN": 161,
400
+ "I-PIN": 162,
401
+ "E-PIN": 163,
402
+ "S-PIN": 164,
403
+ "B-PREFIX": 165,
404
+ "I-PREFIX": 166,
405
+ "E-PREFIX": 167,
406
+ "S-PREFIX": 168,
407
+ "B-SECONDARYADDRESS": 169,
408
+ "I-SECONDARYADDRESS": 170,
409
+ "E-SECONDARYADDRESS": 171,
410
+ "S-SECONDARYADDRESS": 172,
411
+ "B-SEX": 173,
412
+ "I-SEX": 174,
413
+ "E-SEX": 175,
414
+ "S-SEX": 176,
415
+ "B-SSN": 177,
416
+ "I-SSN": 178,
417
+ "E-SSN": 179,
418
+ "S-SSN": 180,
419
+ "B-STATE": 181,
420
+ "I-STATE": 182,
421
+ "E-STATE": 183,
422
+ "S-STATE": 184,
423
+ "B-STREET": 185,
424
+ "I-STREET": 186,
425
+ "E-STREET": 187,
426
+ "S-STREET": 188,
427
+ "B-TIME": 189,
428
+ "I-TIME": 190,
429
+ "E-TIME": 191,
430
+ "S-TIME": 192,
431
+ "B-URL": 193,
432
+ "I-URL": 194,
433
+ "E-URL": 195,
434
+ "S-URL": 196,
435
+ "B-USERAGENT": 197,
436
+ "I-USERAGENT": 198,
437
+ "E-USERAGENT": 199,
438
+ "S-USERAGENT": 200,
439
+ "B-USERNAME": 201,
440
+ "I-USERNAME": 202,
441
+ "E-USERNAME": 203,
442
+ "S-USERNAME": 204,
443
+ "B-VIN": 205,
444
+ "I-VIN": 206,
445
+ "E-VIN": 207,
446
+ "S-VIN": 208,
447
+ "B-VRM": 209,
448
+ "I-VRM": 210,
449
+ "E-VRM": 211,
450
+ "S-VRM": 212,
451
+ "B-ZIPCODE": 213,
452
+ "I-ZIPCODE": 214,
453
+ "E-ZIPCODE": 215,
454
+ "S-ZIPCODE": 216
455
+ },
456
+ "max_position_embeddings": 131072,
457
+ "model_type": "openai_privacy_filter",
458
+ "num_attention_heads": 14,
459
+ "num_experts_per_tok": 4,
460
+ "num_hidden_layers": 8,
461
+ "num_key_value_heads": 2,
462
+ "num_local_experts": 128,
463
+ "output_router_logits": false,
464
+ "pad_token_id": 199999,
465
+ "rms_norm_eps": 1e-05,
466
+ "rope_parameters": {
467
+ "beta_fast": 32.0,
468
+ "beta_slow": 1.0,
469
+ "factor": 32.0,
470
+ "original_max_position_embeddings": 4096,
471
+ "rope_theta": 150000.0,
472
+ "rope_type": "yarn",
473
+ "truncate": false
474
+ },
475
+ "router_aux_loss_coef": 0.001,
476
+ "sliding_window": 128,
477
+ "tie_word_embeddings": false,
478
+ "transformers_version": "5.6.0.dev0",
479
+ "use_cache": true,
480
+ "vocab_size": 200064,
481
+ "transformers.js_config": {
482
+ "use_external_data_format": {
483
+ "model.onnx": 3,
484
+ "model_fp16.onnx": 2,
485
+ "model": 1
486
+ }
487
+ },
488
+ "num_labels": 217,
489
+ "opf_metadata": {
490
+ "category_version": "openmed_pii54_multi_v1",
491
+ "encoding": "o200k_base",
492
+ "span_class_names": [
493
+ "O",
494
+ "ACCOUNTNAME",
495
+ "AGE",
496
+ "AMOUNT",
497
+ "BANKACCOUNT",
498
+ "BIC",
499
+ "BITCOINADDRESS",
500
+ "BUILDINGNUMBER",
501
+ "CITY",
502
+ "COUNTY",
503
+ "CREDITCARD",
504
+ "CREDITCARDISSUER",
505
+ "CURRENCY",
506
+ "CURRENCYCODE",
507
+ "CURRENCYNAME",
508
+ "CURRENCYSYMBOL",
509
+ "CVV",
510
+ "DATE",
511
+ "DATEOFBIRTH",
512
+ "EMAIL",
513
+ "ETHEREUMADDRESS",
514
+ "EYECOLOR",
515
+ "FIRSTNAME",
516
+ "GENDER",
517
+ "GPSCOORDINATES",
518
+ "HEIGHT",
519
+ "IBAN",
520
+ "IMEI",
521
+ "IPADDRESS",
522
+ "JOBDEPARTMENT",
523
+ "JOBTITLE",
524
+ "LASTNAME",
525
+ "LITECOINADDRESS",
526
+ "MACADDRESS",
527
+ "MASKEDNUMBER",
528
+ "MIDDLENAME",
529
+ "OCCUPATION",
530
+ "ORDINALDIRECTION",
531
+ "ORGANIZATION",
532
+ "PASSWORD",
533
+ "PHONE",
534
+ "PIN",
535
+ "PREFIX",
536
+ "SECONDARYADDRESS",
537
+ "SEX",
538
+ "SSN",
539
+ "STATE",
540
+ "STREET",
541
+ "TIME",
542
+ "URL",
543
+ "USERAGENT",
544
+ "USERNAME",
545
+ "VIN",
546
+ "VRM",
547
+ "ZIPCODE"
548
+ ],
549
+ "ner_class_names": [
550
+ "O",
551
+ "B-ACCOUNTNAME",
552
+ "I-ACCOUNTNAME",
553
+ "E-ACCOUNTNAME",
554
+ "S-ACCOUNTNAME",
555
+ "B-AGE",
556
+ "I-AGE",
557
+ "E-AGE",
558
+ "S-AGE",
559
+ "B-AMOUNT",
560
+ "I-AMOUNT",
561
+ "E-AMOUNT",
562
+ "S-AMOUNT",
563
+ "B-BANKACCOUNT",
564
+ "I-BANKACCOUNT",
565
+ "E-BANKACCOUNT",
566
+ "S-BANKACCOUNT",
567
+ "B-BIC",
568
+ "I-BIC",
569
+ "E-BIC",
570
+ "S-BIC",
571
+ "B-BITCOINADDRESS",
572
+ "I-BITCOINADDRESS",
573
+ "E-BITCOINADDRESS",
574
+ "S-BITCOINADDRESS",
575
+ "B-BUILDINGNUMBER",
576
+ "I-BUILDINGNUMBER",
577
+ "E-BUILDINGNUMBER",
578
+ "S-BUILDINGNUMBER",
579
+ "B-CITY",
580
+ "I-CITY",
581
+ "E-CITY",
582
+ "S-CITY",
583
+ "B-COUNTY",
584
+ "I-COUNTY",
585
+ "E-COUNTY",
586
+ "S-COUNTY",
587
+ "B-CREDITCARD",
588
+ "I-CREDITCARD",
589
+ "E-CREDITCARD",
590
+ "S-CREDITCARD",
591
+ "B-CREDITCARDISSUER",
592
+ "I-CREDITCARDISSUER",
593
+ "E-CREDITCARDISSUER",
594
+ "S-CREDITCARDISSUER",
595
+ "B-CURRENCY",
596
+ "I-CURRENCY",
597
+ "E-CURRENCY",
598
+ "S-CURRENCY",
599
+ "B-CURRENCYCODE",
600
+ "I-CURRENCYCODE",
601
+ "E-CURRENCYCODE",
602
+ "S-CURRENCYCODE",
603
+ "B-CURRENCYNAME",
604
+ "I-CURRENCYNAME",
605
+ "E-CURRENCYNAME",
606
+ "S-CURRENCYNAME",
607
+ "B-CURRENCYSYMBOL",
608
+ "I-CURRENCYSYMBOL",
609
+ "E-CURRENCYSYMBOL",
610
+ "S-CURRENCYSYMBOL",
611
+ "B-CVV",
612
+ "I-CVV",
613
+ "E-CVV",
614
+ "S-CVV",
615
+ "B-DATE",
616
+ "I-DATE",
617
+ "E-DATE",
618
+ "S-DATE",
619
+ "B-DATEOFBIRTH",
620
+ "I-DATEOFBIRTH",
621
+ "E-DATEOFBIRTH",
622
+ "S-DATEOFBIRTH",
623
+ "B-EMAIL",
624
+ "I-EMAIL",
625
+ "E-EMAIL",
626
+ "S-EMAIL",
627
+ "B-ETHEREUMADDRESS",
628
+ "I-ETHEREUMADDRESS",
629
+ "E-ETHEREUMADDRESS",
630
+ "S-ETHEREUMADDRESS",
631
+ "B-EYECOLOR",
632
+ "I-EYECOLOR",
633
+ "E-EYECOLOR",
634
+ "S-EYECOLOR",
635
+ "B-FIRSTNAME",
636
+ "I-FIRSTNAME",
637
+ "E-FIRSTNAME",
638
+ "S-FIRSTNAME",
639
+ "B-GENDER",
640
+ "I-GENDER",
641
+ "E-GENDER",
642
+ "S-GENDER",
643
+ "B-GPSCOORDINATES",
644
+ "I-GPSCOORDINATES",
645
+ "E-GPSCOORDINATES",
646
+ "S-GPSCOORDINATES",
647
+ "B-HEIGHT",
648
+ "I-HEIGHT",
649
+ "E-HEIGHT",
650
+ "S-HEIGHT",
651
+ "B-IBAN",
652
+ "I-IBAN",
653
+ "E-IBAN",
654
+ "S-IBAN",
655
+ "B-IMEI",
656
+ "I-IMEI",
657
+ "E-IMEI",
658
+ "S-IMEI",
659
+ "B-IPADDRESS",
660
+ "I-IPADDRESS",
661
+ "E-IPADDRESS",
662
+ "S-IPADDRESS",
663
+ "B-JOBDEPARTMENT",
664
+ "I-JOBDEPARTMENT",
665
+ "E-JOBDEPARTMENT",
666
+ "S-JOBDEPARTMENT",
667
+ "B-JOBTITLE",
668
+ "I-JOBTITLE",
669
+ "E-JOBTITLE",
670
+ "S-JOBTITLE",
671
+ "B-LASTNAME",
672
+ "I-LASTNAME",
673
+ "E-LASTNAME",
674
+ "S-LASTNAME",
675
+ "B-LITECOINADDRESS",
676
+ "I-LITECOINADDRESS",
677
+ "E-LITECOINADDRESS",
678
+ "S-LITECOINADDRESS",
679
+ "B-MACADDRESS",
680
+ "I-MACADDRESS",
681
+ "E-MACADDRESS",
682
+ "S-MACADDRESS",
683
+ "B-MASKEDNUMBER",
684
+ "I-MASKEDNUMBER",
685
+ "E-MASKEDNUMBER",
686
+ "S-MASKEDNUMBER",
687
+ "B-MIDDLENAME",
688
+ "I-MIDDLENAME",
689
+ "E-MIDDLENAME",
690
+ "S-MIDDLENAME",
691
+ "B-OCCUPATION",
692
+ "I-OCCUPATION",
693
+ "E-OCCUPATION",
694
+ "S-OCCUPATION",
695
+ "B-ORDINALDIRECTION",
696
+ "I-ORDINALDIRECTION",
697
+ "E-ORDINALDIRECTION",
698
+ "S-ORDINALDIRECTION",
699
+ "B-ORGANIZATION",
700
+ "I-ORGANIZATION",
701
+ "E-ORGANIZATION",
702
+ "S-ORGANIZATION",
703
+ "B-PASSWORD",
704
+ "I-PASSWORD",
705
+ "E-PASSWORD",
706
+ "S-PASSWORD",
707
+ "B-PHONE",
708
+ "I-PHONE",
709
+ "E-PHONE",
710
+ "S-PHONE",
711
+ "B-PIN",
712
+ "I-PIN",
713
+ "E-PIN",
714
+ "S-PIN",
715
+ "B-PREFIX",
716
+ "I-PREFIX",
717
+ "E-PREFIX",
718
+ "S-PREFIX",
719
+ "B-SECONDARYADDRESS",
720
+ "I-SECONDARYADDRESS",
721
+ "E-SECONDARYADDRESS",
722
+ "S-SECONDARYADDRESS",
723
+ "B-SEX",
724
+ "I-SEX",
725
+ "E-SEX",
726
+ "S-SEX",
727
+ "B-SSN",
728
+ "I-SSN",
729
+ "E-SSN",
730
+ "S-SSN",
731
+ "B-STATE",
732
+ "I-STATE",
733
+ "E-STATE",
734
+ "S-STATE",
735
+ "B-STREET",
736
+ "I-STREET",
737
+ "E-STREET",
738
+ "S-STREET",
739
+ "B-TIME",
740
+ "I-TIME",
741
+ "E-TIME",
742
+ "S-TIME",
743
+ "B-URL",
744
+ "I-URL",
745
+ "E-URL",
746
+ "S-URL",
747
+ "B-USERAGENT",
748
+ "I-USERAGENT",
749
+ "E-USERAGENT",
750
+ "S-USERAGENT",
751
+ "B-USERNAME",
752
+ "I-USERNAME",
753
+ "E-USERNAME",
754
+ "S-USERNAME",
755
+ "B-VIN",
756
+ "I-VIN",
757
+ "E-VIN",
758
+ "S-VIN",
759
+ "B-VRM",
760
+ "I-VRM",
761
+ "E-VRM",
762
+ "S-VRM",
763
+ "B-ZIPCODE",
764
+ "I-ZIPCODE",
765
+ "E-ZIPCODE",
766
+ "S-ZIPCODE"
767
+ ],
768
+ "inference_contract_version": 1
769
+ }
770
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f0b62052eaabe7918b0b1ba24564432902b4c4a3c61404cf2b494679d21e97b6
3
+ size 2799225250
tokenizer.json ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0614fe83cadab421296e664e1f48f4261fa8fef6e03e63bb75c20f38e37d07d3
3
+ size 27868174
tokenizer_config.json ADDED
@@ -0,0 +1,11 @@
1
+ {
2
+ "backend": "tokenizers",
3
+ "eos_token": "<|endoftext|>",
4
+ "model_input_names": [
5
+ "input_ids",
6
+ "attention_mask"
7
+ ],
8
+ "model_max_length": 128000,
9
+ "pad_token": "<|endoftext|>",
10
+ "tokenizer_class": "TokenizersBackend"
11
+ }