MaziyarPanahi committed
Commit 61f8bf4 · verified · Parent(s): 9b641d8

Initial upload: privacy-filter fine-tuned for 55 Nemotron-PII labels

Files changed (6):
  1. .gitattributes +1 -0
  2. README.md +233 -0
  3. config.json +783 -0
  4. model.safetensors +3 -0
  5. tokenizer.json +3 -0
  6. tokenizer_config.json +11 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,233 @@
---
license: apache-2.0
library_name: transformers
base_model: openai/privacy-filter
datasets:
- nvidia/Nemotron-PII
pipeline_tag: token-classification
tags:
- token-classification
- pii
- ner
- privacy
- redaction
- nemotron
- privacy-filter
language:
- en
---

# privacy-filter-nemotron-55

Fine-tuned [openai/privacy-filter](https://huggingface.co/openai/privacy-filter)
for **fine-grained PII extraction** across **55 categories** from
[nvidia/Nemotron-PII](https://huggingface.co/datasets/nvidia/Nemotron-PII).

- **Base model**: [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) — 1.4B-parameter MoE (50M active per token), BIOES token-classification head
- **Task**: Token classification for PII detection (BIOES scheme)
- **Training data**: Full 100K rows of the `nvidia/Nemotron-PII` train split
- **Held-out val**: 10K label-stratified rows from the Nemotron `test` split (every label has ≥229 entities)
- **Recipe**: `opf train` (OpenAI's official fine-tuning CLI) — full fine-tune, AdamW, lr=1e-4, 5 epochs, bf16, weight decay 0.0
- **Labels**: 55 fine-grained PII categories → 221 BIOES classes (1 `O` + 55 × B/I/E/S)

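The 55-category → 221-class arithmetic above can be reproduced directly. A minimal sketch — the category list mirrors `opf_metadata.span_class_names` in this repo's `config.json`, and the ordering (alphabetical categories, B/I/E/S per category) matches its `id2label`:

```python
# The 55 Nemotron-PII categories, in the alphabetical order used by config.json.
CATEGORIES = [
    "account_number", "age", "api_key", "bank_routing_number",
    "biometric_identifier", "blood_type", "certificate_license_number",
    "city", "company_name", "coordinate", "country", "county",
    "credit_debit_card", "customer_id", "cvv", "date", "date_of_birth",
    "date_time", "device_identifier", "education_level", "email",
    "employee_id", "employment_status", "fax_number", "first_name",
    "gender", "health_plan_beneficiary_number", "http_cookie", "ipv4",
    "ipv6", "language", "last_name", "license_plate", "mac_address",
    "medical_record_number", "national_id", "occupation", "password",
    "phone_number", "pin", "political_view", "postcode", "race_ethnicity",
    "religious_belief", "sexuality", "ssn", "state", "street_address",
    "swift_bic", "tax_id", "time", "unique_id", "url", "user_name",
    "vehicle_identifier",
]

# BIOES: Begin / Inside / End / Single spans, plus the outside tag "O".
LABELS = ["O"] + [f"{p}-{c}" for c in CATEGORIES for p in ("B", "I", "E", "S")]

assert len(CATEGORIES) == 55
assert len(LABELS) == 1 + 4 * 55 == 221
assert LABELS[97] == "B-first_name"  # agrees with id2label in config.json
```
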
The base model ships with 8 coarse PII categories (`private_person`,
`private_email`, etc.). This model trades that coarse vocabulary for one
nearly **7× more granular** — `first_name`, `last_name`, `medical_record_number`,
`credit_debit_card`, `ssn`, and so on — matching what downstream redaction
and masking pipelines typically need.

## Quick start

### With `opf` (official CLI from OpenAI, recommended)

```bash
pip install 'opf @ git+https://github.com/openai/privacy-filter.git'

opf redact \
  --checkpoint OpenMed/privacy-filter-nemotron-55 \
  --text "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, phone 415-555-0123."
```

### With `transformers`

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "OpenMed/privacy-filter-nemotron-55"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(
    model_id, trust_remote_code=True, dtype=torch.bfloat16
).to("cuda")
model.eval()

text = "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, phone 415-555-0123."
enc = tok(text, return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model(**enc).logits.argmax(-1).cpu()[0].tolist()

id2label = {int(k): v for k, v in model.config.id2label.items()}
tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].cpu().tolist())
for t, l in zip(tokens, out):
    if l != 0:
        print(f"{t}\t{id2label[l]}")
```

76
+ For best results use Viterbi decoding (not argmax) — the `opf` CLI does this by
77
+ default. If you're doing argmax with the HF transformers API, you'll see
78
+ slightly more boundary errors but still excellent label accuracy.
79
+
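With plain argmax, the per-token BIOES tags still need to be merged into contiguous spans before redaction. A minimal sketch of that step (a hypothetical helper, not part of `opf`; it closes malformed sequences greedily, since over-masking is the safer failure mode for redaction):

```python
def bioes_to_spans(tags):
    """Group per-token BIOES tags into (start, end, category) token spans.

    `end` is exclusive. Malformed sequences (e.g. a dangling I- run) are
    closed greedily rather than dropped.
    """
    spans = []
    start = cat = None

    def close(end):
        nonlocal start, cat
        if start is not None:
            spans.append((start, end, cat))
        start = cat = None

    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "O":
            close(i)                       # entity (if any) ends before this token
        elif prefix in ("B", "S") or label != cat:
            close(i)                       # flush any open span, start a new one
            start, cat = i, label
            if prefix == "S":
                close(i + 1)               # single-token entity
        elif prefix == "E":
            close(i + 1)                   # explicit end of the open span
        # plain "I" with a matching label: the span simply continues
    close(len(tags))                       # flush a span left open at the end
    return spans

tags = ["O", "B-first_name", "E-first_name", "O", "S-ssn"]
print(bioes_to_spans(tags))  # [(1, 3, 'first_name'), (4, 5, 'ssn')]
```

Token spans can then be mapped back to character offsets (e.g. via `return_offsets_mapping=True` on a fast tokenizer) and masked in the original string.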
## Performance

Evaluated with `opf eval --decode-mode viterbi --eval-mode typed --span-metrics-space char`
on the 10K label-stratified held-out val from `nvidia/Nemotron-PII:test`.

### Headline

| Metric | Value |
|---|---:|
| **Macro B-F1** (across 55 labels) | **0.9533** |
| **Token accuracy** | **0.9910** |
| Strong labels (F1 ≥ 0.90) | 46 / 55 |
| Acceptable (F1 0.70–0.89) | 7 / 55 |
| Weak (F1 < 0.70) | 0 / 55 |

### Per-label F1 (B-tag, sorted)

| Label | Precision | Recall | F1 |
|---|---:|---:|---:|
| 🟢 `mac_address` | 1.000 | 1.000 | **1.000** |
| 🟢 `biometric_identifier` | 0.999 | 0.998 | **0.999** |
| 🟢 `bank_routing_number` | 0.995 | 0.999 | **0.997** |
| 🟢 `credit_debit_card` | 0.999 | 0.993 | **0.996** |
| 🟢 `ipv6` | 0.992 | 1.000 | **0.996** |
| 🟢 `health_plan_beneficiary_number` | 1.000 | 0.990 | **0.995** |
| 🟢 `coordinate` | 0.994 | 0.996 | **0.995** |
| 🟢 `ipv4` | 0.993 | 0.996 | **0.994** |
| 🟢 `url` | 0.989 | 0.999 | **0.994** |
| 🟢 `email` | 0.994 | 0.993 | **0.994** |
| 🟢 `date_of_birth` | 0.992 | 0.994 | **0.993** |
| 🟢 `medical_record_number` | 0.997 | 0.989 | **0.993** |
| 🟢 `street_address` | 0.996 | 0.989 | **0.993** |
| 🟢 `vehicle_identifier` | 0.986 | 0.996 | **0.991** |
| 🟢 `license_plate` | 0.987 | 0.993 | **0.990** |
| 🟢 `customer_id` | 0.995 | 0.984 | **0.990** |
| 🟢 `http_cookie` | 0.992 | 0.983 | **0.988** |
| 🟢 `employee_id` | 0.987 | 0.988 | **0.988** |
| 🟢 `account_number` | 0.992 | 0.982 | **0.987** |
| 🟢 `certificate_license_number` | 0.989 | 0.984 | **0.987** |
| 🟢 `swift_bic` | 0.975 | 0.998 | **0.987** |
| 🟢 `postcode` | 0.991 | 0.981 | **0.986** |
| 🟢 `api_key` | 0.980 | 0.990 | **0.985** |
| 🟢 `password` | 0.999 | 0.968 | **0.983** |
| 🟢 `tax_id` | 1.000 | 0.965 | **0.982** |
| 🟢 `device_identifier` | 0.974 | 0.988 | **0.981** |
| 🟢 `national_id` | 0.991 | 0.961 | **0.976** |
| 🟢 `last_name` | 0.977 | 0.975 | **0.976** |
| 🟢 `date_time` | 0.982 | 0.967 | **0.974** |
| 🟢 `first_name` | 0.962 | 0.978 | **0.970** |
| 🟢 `pin` | 0.973 | 0.967 | **0.970** |
| 🟢 `phone_number` | 0.948 | 0.992 | **0.970** |
| 🟢 `county` | 0.986 | 0.946 | **0.965** |
| 🟢 `employment_status` | 0.960 | 0.968 | **0.964** |
| 🟢 `user_name` | 0.959 | 0.964 | **0.961** |
| 🟢 `date` | 0.967 | 0.955 | **0.961** |
| 🟢 `blood_type` | 0.922 | 0.954 | **0.938** |
| 🟢 `country` | 0.955 | 0.918 | **0.936** |
| 🟢 `ssn` | 0.926 | 0.945 | **0.935** |
| 🟢 `education_level` | 0.961 | 0.908 | **0.934** |
| 🟢 `sexuality` | 0.908 | 0.956 | **0.931** |
| 🟢 `company_name` | 0.967 | 0.894 | **0.929** |
| 🟢 `religious_belief` | 0.912 | 0.941 | **0.926** |
| 🟢 `unique_id` | 0.910 | 0.922 | **0.916** |
| 🟢 `political_view` | 0.939 | 0.872 | **0.905** |
| 🟢 `fax_number` | 0.978 | 0.841 | **0.904** |
| 🟡 `city` | 0.917 | 0.876 | **0.896** |
| 🟡 `time` | 0.933 | 0.802 | **0.863** |
| 🟡 `race_ethnicity` | 0.821 | 0.906 | **0.861** |
| 🟡 `gender` | 0.967 | 0.744 | **0.841** |
| 🟡 `state` | 0.878 | 0.785 | **0.829** |
| 🟡 `language` | 0.889 | 0.735 | **0.804** |
| 🟡 `occupation` | 0.799 | 0.667 | **0.727** |

## Label space (55 categories)

| Category | Typical examples |
|---|---|
| **Identity** | `first_name`, `last_name`, `user_name`, `age`, `gender`, `race_ethnicity`, `sexuality`, `religious_belief`, `political_view`, `education_level`, `occupation`, `employment_status`, `language`, `blood_type`, `biometric_identifier` |
| **Contact** | `email`, `phone_number`, `fax_number`, `url` |
| **Address** | `street_address`, `city`, `county`, `state`, `country`, `postcode`, `coordinate` |
| **Dates** | `date`, `date_of_birth`, `date_time`, `time` |
| **Government IDs** | `ssn`, `national_id`, `tax_id` |
| **Financial** | `account_number`, `bank_routing_number`, `swift_bic`, `credit_debit_card`, `cvv`, `pin`, `password` |
| **Healthcare** | `medical_record_number`, `health_plan_beneficiary_number` |
| **Enterprise IDs** | `customer_id`, `employee_id`, `unique_id`, `certificate_license_number`, `company_name` |
| **Vehicle** | `license_plate`, `vehicle_identifier` |
| **Digital** | `ipv4`, `ipv6`, `mac_address`, `device_identifier`, `api_key`, `http_cookie` |

## Training details

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| Weight decay | 0.0 |
| Batch size (per GPU) | 4 |
| Gradient accumulation | 1 |
| Max grad norm | 1.0 |
| Epochs | 5 |
| Precision | bf16 |
| Context length | 128K (YaRN RoPE, 128-token sliding window) |
| Hardware | 1× NVIDIA A100 80GB |
| Total optimizer steps | 125,000 |
| Framework | `opf train` v0.1.0 |

**Head initialization**: `opf`'s default "copy-from-matching-base" head init.
Of the 221 new BIOES classes, 5 had exact matches in the base
(`O`, `B/I/E/S-account_number`); the other 216 were copied from
semantically adjacent coarse rows and fine-tuned end-to-end.

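The copy-from-matching-base idea can be illustrated with a toy classifier head. This is a sketch, not the `opf` implementation: label names, the neighbor map, and the head shapes here are illustrative assumptions.

```python
import random

random.seed(0)
HIDDEN = 4  # toy hidden size; the real head is (num_labels, hidden_size)

# Toy base head: one weight row per base BIOES class. Only "O" and the
# account_number tags actually overlap with the fine-grained label set.
base_labels = ["O", "B-account_number", "I-account_number",
               "E-account_number", "S-account_number",
               "B-private_person", "I-private_person"]
base_head = {lbl: [random.random() for _ in range(HIDDEN)] for lbl in base_labels}

# Hypothetical semantic-neighbor map for fine classes with no exact match.
neighbor = {"B-first_name": "B-private_person", "I-first_name": "I-private_person"}

new_labels = ["O", "B-account_number", "I-account_number",
              "E-account_number", "S-account_number",
              "B-first_name", "I-first_name"]

# Copy-from-matching-base: exact-match row if it exists, else the neighbor row.
new_head = {lbl: list(base_head[lbl if lbl in base_head else neighbor[lbl]])
            for lbl in new_labels}

assert new_head["O"] == base_head["O"]                            # exact match
assert new_head["B-first_name"] == base_head["B-private_person"]  # neighbor copy
```

Every copied row then gets updated during the full fine-tune, so the init only has to be a reasonable starting point, not correct.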
**Router**: base model has 128 MoE experts per layer with top-4 routing.
Routers were kept trainable during full fine-tuning; no collapse was
observed.

## Limitations & intended use

- **English-only training data.** Nemotron-PII is predominantly English
  with a 50/50 US/international locale split. Performance on non-English
  text is not guaranteed.
- **`occupation`, `language`, `gender`, `state`, `race_ethnicity`,
  `time`, and `city` are fuzzier categories** than the
  strict identifiers — F1 lands in 0.70–0.89 vs 0.95+ for formatted
  identifiers. If your downstream only cares about strict PII, you can
  ignore low-confidence predictions on these.
- **Synthetic training data.** Nemotron-PII is a synthesized dataset; real
  clinical notes, legal documents, and web text may show different
  surface forms. For high-stakes deployments, collect a domain-specific
  eval set and re-calibrate thresholds.
- **Not a substitute for legal compliance review.** Use alongside a
  governance layer (human review, deterministic regex pre-filters, etc.).

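One way to act on the "ignore low-confidence predictions" suggestion is a per-token confidence gate on the fuzzier categories. A sketch under illustrative assumptions — the fuzzy set and the 0.9 threshold are examples, not calibrated values:

```python
import math

# Illustrative fuzzy set (the 0.70–0.89 band in the per-label table).
FUZZY = {"occupation", "language", "gender", "state", "race_ethnicity",
         "time", "city"}

def filter_low_confidence(token_logits, id2label, threshold=0.9):
    """Per token: keep the argmax label, unless it is a fuzzy category
    predicted with softmax confidence below `threshold`, in which case
    fall back to "O" (no entity)."""
    out = []
    for logits in token_logits:
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]         # stable softmax
        total = sum(exps)
        best = max(range(len(exps)), key=exps.__getitem__)
        confidence = exps[best] / total
        label = id2label[best]
        category = label.partition("-")[2]               # "" for the "O" tag
        if category in FUZZY and confidence < threshold:
            label = "O"
        out.append(label)
    return out

id2label = {0: "O", 1: "B-gender", 2: "B-ssn"}
logits = [[0.0, 5.0, 0.0],   # confident B-gender -> kept
          [0.0, 0.5, 0.4]]   # low-confidence B-gender -> dropped to "O"
print(filter_low_confidence(logits, id2label))  # ['B-gender', 'O']
```

Strict identifiers (`ssn`, `credit_debit_card`, …) pass through untouched regardless of confidence, so the gate only trims the noisy demographic categories.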
## License

Apache 2.0, same as the base model.

## Citation

If you use this model, please cite both the base model and the dataset:

```bibtex
@misc{openai_privacy_filter_2025,
  author       = {OpenAI},
  title        = {{openai/privacy-filter}},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/openai/privacy-filter}}
}

@misc{nemotron_pii_2025,
  author       = {NVIDIA},
  title        = {{Nemotron-PII}},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/datasets/nvidia/Nemotron-PII}}
}
```
config.json ADDED
@@ -0,0 +1,783 @@
{
  "architectures": [
    "OpenAIPrivacyFilterForTokenClassification"
  ],
  "attention_bias": true,
  "attention_dropout": 0.0,
  "bos_token_id": null,
  "classifier_dropout": 0.0,
  "default_n_ctx": 128000,
  "dtype": "bfloat16",
  "eos_token_id": 199999,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 640,
  "id2label": {
    "0": "O",
    "1": "B-account_number",
    "2": "I-account_number",
    "3": "E-account_number",
    "4": "S-account_number",
    "5": "B-age",
    "6": "I-age",
    "7": "E-age",
    "8": "S-age",
    "9": "B-api_key",
    "10": "I-api_key",
    "11": "E-api_key",
    "12": "S-api_key",
    "13": "B-bank_routing_number",
    "14": "I-bank_routing_number",
    "15": "E-bank_routing_number",
    "16": "S-bank_routing_number",
    "17": "B-biometric_identifier",
    "18": "I-biometric_identifier",
    "19": "E-biometric_identifier",
    "20": "S-biometric_identifier",
    "21": "B-blood_type",
    "22": "I-blood_type",
    "23": "E-blood_type",
    "24": "S-blood_type",
    "25": "B-certificate_license_number",
    "26": "I-certificate_license_number",
    "27": "E-certificate_license_number",
    "28": "S-certificate_license_number",
    "29": "B-city",
    "30": "I-city",
    "31": "E-city",
    "32": "S-city",
    "33": "B-company_name",
    "34": "I-company_name",
    "35": "E-company_name",
    "36": "S-company_name",
    "37": "B-coordinate",
    "38": "I-coordinate",
    "39": "E-coordinate",
    "40": "S-coordinate",
    "41": "B-country",
    "42": "I-country",
    "43": "E-country",
    "44": "S-country",
    "45": "B-county",
    "46": "I-county",
    "47": "E-county",
    "48": "S-county",
    "49": "B-credit_debit_card",
    "50": "I-credit_debit_card",
    "51": "E-credit_debit_card",
    "52": "S-credit_debit_card",
    "53": "B-customer_id",
    "54": "I-customer_id",
    "55": "E-customer_id",
    "56": "S-customer_id",
    "57": "B-cvv",
    "58": "I-cvv",
    "59": "E-cvv",
    "60": "S-cvv",
    "61": "B-date",
    "62": "I-date",
    "63": "E-date",
    "64": "S-date",
    "65": "B-date_of_birth",
    "66": "I-date_of_birth",
    "67": "E-date_of_birth",
    "68": "S-date_of_birth",
    "69": "B-date_time",
    "70": "I-date_time",
    "71": "E-date_time",
    "72": "S-date_time",
    "73": "B-device_identifier",
    "74": "I-device_identifier",
    "75": "E-device_identifier",
    "76": "S-device_identifier",
    "77": "B-education_level",
    "78": "I-education_level",
    "79": "E-education_level",
    "80": "S-education_level",
    "81": "B-email",
    "82": "I-email",
    "83": "E-email",
    "84": "S-email",
    "85": "B-employee_id",
    "86": "I-employee_id",
    "87": "E-employee_id",
    "88": "S-employee_id",
    "89": "B-employment_status",
    "90": "I-employment_status",
    "91": "E-employment_status",
    "92": "S-employment_status",
    "93": "B-fax_number",
    "94": "I-fax_number",
    "95": "E-fax_number",
    "96": "S-fax_number",
    "97": "B-first_name",
    "98": "I-first_name",
    "99": "E-first_name",
    "100": "S-first_name",
    "101": "B-gender",
    "102": "I-gender",
    "103": "E-gender",
    "104": "S-gender",
    "105": "B-health_plan_beneficiary_number",
    "106": "I-health_plan_beneficiary_number",
    "107": "E-health_plan_beneficiary_number",
    "108": "S-health_plan_beneficiary_number",
    "109": "B-http_cookie",
    "110": "I-http_cookie",
    "111": "E-http_cookie",
    "112": "S-http_cookie",
    "113": "B-ipv4",
    "114": "I-ipv4",
    "115": "E-ipv4",
    "116": "S-ipv4",
    "117": "B-ipv6",
    "118": "I-ipv6",
    "119": "E-ipv6",
    "120": "S-ipv6",
    "121": "B-language",
    "122": "I-language",
    "123": "E-language",
    "124": "S-language",
    "125": "B-last_name",
    "126": "I-last_name",
    "127": "E-last_name",
    "128": "S-last_name",
    "129": "B-license_plate",
    "130": "I-license_plate",
    "131": "E-license_plate",
    "132": "S-license_plate",
    "133": "B-mac_address",
    "134": "I-mac_address",
    "135": "E-mac_address",
    "136": "S-mac_address",
    "137": "B-medical_record_number",
    "138": "I-medical_record_number",
    "139": "E-medical_record_number",
    "140": "S-medical_record_number",
    "141": "B-national_id",
    "142": "I-national_id",
    "143": "E-national_id",
    "144": "S-national_id",
    "145": "B-occupation",
    "146": "I-occupation",
    "147": "E-occupation",
    "148": "S-occupation",
    "149": "B-password",
    "150": "I-password",
    "151": "E-password",
    "152": "S-password",
    "153": "B-phone_number",
    "154": "I-phone_number",
    "155": "E-phone_number",
    "156": "S-phone_number",
    "157": "B-pin",
    "158": "I-pin",
    "159": "E-pin",
    "160": "S-pin",
    "161": "B-political_view",
    "162": "I-political_view",
    "163": "E-political_view",
    "164": "S-political_view",
    "165": "B-postcode",
    "166": "I-postcode",
    "167": "E-postcode",
    "168": "S-postcode",
    "169": "B-race_ethnicity",
    "170": "I-race_ethnicity",
    "171": "E-race_ethnicity",
    "172": "S-race_ethnicity",
    "173": "B-religious_belief",
    "174": "I-religious_belief",
    "175": "E-religious_belief",
    "176": "S-religious_belief",
    "177": "B-sexuality",
    "178": "I-sexuality",
    "179": "E-sexuality",
    "180": "S-sexuality",
    "181": "B-ssn",
    "182": "I-ssn",
    "183": "E-ssn",
    "184": "S-ssn",
    "185": "B-state",
    "186": "I-state",
    "187": "E-state",
    "188": "S-state",
    "189": "B-street_address",
    "190": "I-street_address",
    "191": "E-street_address",
    "192": "S-street_address",
    "193": "B-swift_bic",
    "194": "I-swift_bic",
    "195": "E-swift_bic",
    "196": "S-swift_bic",
    "197": "B-tax_id",
    "198": "I-tax_id",
    "199": "E-tax_id",
    "200": "S-tax_id",
    "201": "B-time",
    "202": "I-time",
    "203": "E-time",
    "204": "S-time",
    "205": "B-unique_id",
    "206": "I-unique_id",
    "207": "E-unique_id",
    "208": "S-unique_id",
    "209": "B-url",
    "210": "I-url",
    "211": "E-url",
    "212": "S-url",
    "213": "B-user_name",
    "214": "I-user_name",
    "215": "E-user_name",
    "216": "S-user_name",
    "217": "B-vehicle_identifier",
    "218": "I-vehicle_identifier",
    "219": "E-vehicle_identifier",
    "220": "S-vehicle_identifier"
  },
  "initial_context_length": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 640,
  "label2id": {
    "O": 0,
    "B-account_number": 1,
    "I-account_number": 2,
    "E-account_number": 3,
    "S-account_number": 4,
    "B-age": 5,
    "I-age": 6,
    "E-age": 7,
    "S-age": 8,
    "B-api_key": 9,
    "I-api_key": 10,
    "E-api_key": 11,
    "S-api_key": 12,
    "B-bank_routing_number": 13,
    "I-bank_routing_number": 14,
    "E-bank_routing_number": 15,
    "S-bank_routing_number": 16,
    "B-biometric_identifier": 17,
    "I-biometric_identifier": 18,
    "E-biometric_identifier": 19,
    "S-biometric_identifier": 20,
    "B-blood_type": 21,
    "I-blood_type": 22,
    "E-blood_type": 23,
    "S-blood_type": 24,
    "B-certificate_license_number": 25,
    "I-certificate_license_number": 26,
    "E-certificate_license_number": 27,
    "S-certificate_license_number": 28,
    "B-city": 29,
    "I-city": 30,
    "E-city": 31,
    "S-city": 32,
    "B-company_name": 33,
    "I-company_name": 34,
    "E-company_name": 35,
    "S-company_name": 36,
    "B-coordinate": 37,
    "I-coordinate": 38,
    "E-coordinate": 39,
    "S-coordinate": 40,
    "B-country": 41,
    "I-country": 42,
    "E-country": 43,
    "S-country": 44,
    "B-county": 45,
    "I-county": 46,
    "E-county": 47,
    "S-county": 48,
    "B-credit_debit_card": 49,
    "I-credit_debit_card": 50,
    "E-credit_debit_card": 51,
    "S-credit_debit_card": 52,
    "B-customer_id": 53,
    "I-customer_id": 54,
    "E-customer_id": 55,
    "S-customer_id": 56,
    "B-cvv": 57,
    "I-cvv": 58,
    "E-cvv": 59,
    "S-cvv": 60,
    "B-date": 61,
    "I-date": 62,
    "E-date": 63,
    "S-date": 64,
    "B-date_of_birth": 65,
    "I-date_of_birth": 66,
    "E-date_of_birth": 67,
    "S-date_of_birth": 68,
    "B-date_time": 69,
    "I-date_time": 70,
    "E-date_time": 71,
    "S-date_time": 72,
    "B-device_identifier": 73,
    "I-device_identifier": 74,
    "E-device_identifier": 75,
    "S-device_identifier": 76,
    "B-education_level": 77,
    "I-education_level": 78,
    "E-education_level": 79,
    "S-education_level": 80,
    "B-email": 81,
    "I-email": 82,
    "E-email": 83,
    "S-email": 84,
    "B-employee_id": 85,
    "I-employee_id": 86,
    "E-employee_id": 87,
    "S-employee_id": 88,
    "B-employment_status": 89,
    "I-employment_status": 90,
    "E-employment_status": 91,
    "S-employment_status": 92,
    "B-fax_number": 93,
    "I-fax_number": 94,
    "E-fax_number": 95,
    "S-fax_number": 96,
    "B-first_name": 97,
    "I-first_name": 98,
    "E-first_name": 99,
    "S-first_name": 100,
    "B-gender": 101,
    "I-gender": 102,
    "E-gender": 103,
    "S-gender": 104,
    "B-health_plan_beneficiary_number": 105,
    "I-health_plan_beneficiary_number": 106,
    "E-health_plan_beneficiary_number": 107,
    "S-health_plan_beneficiary_number": 108,
    "B-http_cookie": 109,
    "I-http_cookie": 110,
    "E-http_cookie": 111,
    "S-http_cookie": 112,
    "B-ipv4": 113,
    "I-ipv4": 114,
    "E-ipv4": 115,
    "S-ipv4": 116,
    "B-ipv6": 117,
    "I-ipv6": 118,
    "E-ipv6": 119,
    "S-ipv6": 120,
    "B-language": 121,
    "I-language": 122,
    "E-language": 123,
    "S-language": 124,
    "B-last_name": 125,
    "I-last_name": 126,
    "E-last_name": 127,
    "S-last_name": 128,
    "B-license_plate": 129,
    "I-license_plate": 130,
    "E-license_plate": 131,
    "S-license_plate": 132,
    "B-mac_address": 133,
    "I-mac_address": 134,
    "E-mac_address": 135,
    "S-mac_address": 136,
    "B-medical_record_number": 137,
    "I-medical_record_number": 138,
    "E-medical_record_number": 139,
    "S-medical_record_number": 140,
    "B-national_id": 141,
    "I-national_id": 142,
    "E-national_id": 143,
    "S-national_id": 144,
    "B-occupation": 145,
    "I-occupation": 146,
    "E-occupation": 147,
    "S-occupation": 148,
    "B-password": 149,
    "I-password": 150,
    "E-password": 151,
    "S-password": 152,
    "B-phone_number": 153,
    "I-phone_number": 154,
    "E-phone_number": 155,
    "S-phone_number": 156,
    "B-pin": 157,
    "I-pin": 158,
    "E-pin": 159,
    "S-pin": 160,
    "B-political_view": 161,
    "I-political_view": 162,
    "E-political_view": 163,
    "S-political_view": 164,
    "B-postcode": 165,
    "I-postcode": 166,
    "E-postcode": 167,
    "S-postcode": 168,
    "B-race_ethnicity": 169,
    "I-race_ethnicity": 170,
    "E-race_ethnicity": 171,
    "S-race_ethnicity": 172,
    "B-religious_belief": 173,
    "I-religious_belief": 174,
    "E-religious_belief": 175,
    "S-religious_belief": 176,
    "B-sexuality": 177,
    "I-sexuality": 178,
    "E-sexuality": 179,
    "S-sexuality": 180,
    "B-ssn": 181,
    "I-ssn": 182,
    "E-ssn": 183,
    "S-ssn": 184,
    "B-state": 185,
    "I-state": 186,
    "E-state": 187,
    "S-state": 188,
    "B-street_address": 189,
    "I-street_address": 190,
    "E-street_address": 191,
    "S-street_address": 192,
    "B-swift_bic": 193,
    "I-swift_bic": 194,
    "E-swift_bic": 195,
    "S-swift_bic": 196,
    "B-tax_id": 197,
    "I-tax_id": 198,
    "E-tax_id": 199,
    "S-tax_id": 200,
    "B-time": 201,
    "I-time": 202,
    "E-time": 203,
    "S-time": 204,
    "B-unique_id": 205,
    "I-unique_id": 206,
    "E-unique_id": 207,
    "S-unique_id": 208,
    "B-url": 209,
    "I-url": 210,
    "E-url": 211,
    "S-url": 212,
    "B-user_name": 213,
    "I-user_name": 214,
    "E-user_name": 215,
    "S-user_name": 216,
    "B-vehicle_identifier": 217,
    "I-vehicle_identifier": 218,
    "E-vehicle_identifier": 219,
    "S-vehicle_identifier": 220
  },
  "max_position_embeddings": 131072,
  "model_type": "openai_privacy_filter",
  "num_attention_heads": 14,
  "num_experts_per_tok": 4,
  "num_hidden_layers": 8,
  "num_key_value_heads": 2,
  "num_local_experts": 128,
  "output_router_logits": false,
  "pad_token_id": 199999,
  "rms_norm_eps": 1e-05,
  "rope_parameters": {
    "beta_fast": 32.0,
    "beta_slow": 1.0,
    "factor": 32.0,
    "original_max_position_embeddings": 4096,
    "rope_theta": 150000.0,
    "rope_type": "yarn",
    "truncate": false
  },
  "router_aux_loss_coef": 0.001,
  "sliding_window": 128,
  "tie_word_embeddings": false,
  "transformers_version": "5.6.0.dev0",
  "use_cache": true,
  "vocab_size": 200064,
  "transformers.js_config": {
    "use_external_data_format": {
      "model.onnx": 3,
      "model_fp16.onnx": 2,
      "model": 1
    }
  },
  "num_labels": 221,
  "opf_metadata": {
    "category_version": "nemotron_fine_v1",
    "encoding": "o200k_base",
    "span_class_names": [
      "O",
      "account_number",
      "age",
      "api_key",
      "bank_routing_number",
      "biometric_identifier",
      "blood_type",
      "certificate_license_number",
      "city",
      "company_name",
      "coordinate",
      "country",
      "county",
      "credit_debit_card",
      "customer_id",
      "cvv",
      "date",
      "date_of_birth",
      "date_time",
      "device_identifier",
      "education_level",
      "email",
      "employee_id",
      "employment_status",
      "fax_number",
      "first_name",
      "gender",
      "health_plan_beneficiary_number",
      "http_cookie",
      "ipv4",
      "ipv6",
      "language",
      "last_name",
      "license_plate",
      "mac_address",
      "medical_record_number",
      "national_id",
      "occupation",
      "password",
      "phone_number",
      "pin",
      "political_view",
      "postcode",
      "race_ethnicity",
      "religious_belief",
      "sexuality",
      "ssn",
      "state",
      "street_address",
      "swift_bic",
      "tax_id",
      "time",
      "unique_id",
      "url",
      "user_name",
      "vehicle_identifier"
    ],
    "ner_class_names": [
      "O",
      "B-account_number",
      "I-account_number",
      "E-account_number",
      "S-account_number",
      "B-age",
      "I-age",
      "E-age",
      "S-age",
      "B-api_key",
      "I-api_key",
      "E-api_key",
      "S-api_key",
      "B-bank_routing_number",
      "I-bank_routing_number",
      "E-bank_routing_number",
      "S-bank_routing_number",
      "B-biometric_identifier",
      "I-biometric_identifier",
      "E-biometric_identifier",
      "S-biometric_identifier",
      "B-blood_type",
      "I-blood_type",
      "E-blood_type",
      "S-blood_type",
      "B-certificate_license_number",
      "I-certificate_license_number",
      "E-certificate_license_number",
      "S-certificate_license_number",
      "B-city",
      "I-city",
      "E-city",
      "S-city",
      "B-company_name",
      "I-company_name",
      "E-company_name",
      "S-company_name",
      "B-coordinate",
      "I-coordinate",
      "E-coordinate",
      "S-coordinate",
      "B-country",
      "I-country",
      "E-country",
      "S-country",
      "B-county",
      "I-county",
      "E-county",
      "S-county",
      "B-credit_debit_card",
      "I-credit_debit_card",
      "E-credit_debit_card",
      "S-credit_debit_card",
      "B-customer_id",
      "I-customer_id",
      "E-customer_id",
      "S-customer_id",
      "B-cvv",
      "I-cvv",
      "E-cvv",
      "S-cvv",
      "B-date",
      "I-date",
      "E-date",
      "S-date",
      "B-date_of_birth",
      "I-date_of_birth",
      "E-date_of_birth",
      "S-date_of_birth",
      "B-date_time",
      "I-date_time",
      "E-date_time",
      "S-date_time",
      "B-device_identifier",
      "I-device_identifier",
      "E-device_identifier",
      "S-device_identifier",
      "B-education_level",
      "I-education_level",
      "E-education_level",
      "S-education_level",
      "B-email",
      "I-email",
      "E-email",
      "S-email",
      "B-employee_id",
      "I-employee_id",
      "E-employee_id",
      "S-employee_id",
      "B-employment_status",
      "I-employment_status",
      "E-employment_status",
      "S-employment_status",
      "B-fax_number",
      "I-fax_number",
      "E-fax_number",
      "S-fax_number",
      "B-first_name",
      "I-first_name",
      "E-first_name",
      "S-first_name",
      "B-gender",
      "I-gender",
      "E-gender",
      "S-gender",
      "B-health_plan_beneficiary_number",
      "I-health_plan_beneficiary_number",
      "E-health_plan_beneficiary_number",
      "S-health_plan_beneficiary_number",
      "B-http_cookie",
      "I-http_cookie",
      "E-http_cookie",
      "S-http_cookie",
      "B-ipv4",
      "I-ipv4",
      "E-ipv4",
      "S-ipv4",
      "B-ipv6",
      "I-ipv6",
      "E-ipv6",
      "S-ipv6",
      "B-language",
      "I-language",
      "E-language",
      "S-language",
      "B-last_name",
      "I-last_name",
      "E-last_name",
      "S-last_name",
      "B-license_plate",
      "I-license_plate",
      "E-license_plate",
      "S-license_plate",
      "B-mac_address",
      "I-mac_address",
      "E-mac_address",
      "S-mac_address",
      "B-medical_record_number",
      "I-medical_record_number",
      "E-medical_record_number",
      "S-medical_record_number",
      "B-national_id",
      "I-national_id",
      "E-national_id",
      "S-national_id",
      "B-occupation",
      "I-occupation",
      "E-occupation",
      "S-occupation",
      "B-password",
      "I-password",
      "E-password",
      "S-password",
      "B-phone_number",
      "I-phone_number",
      "E-phone_number",
      "S-phone_number",
      "B-pin",
      "I-pin",
      "E-pin",
      "S-pin",
      "B-political_view",
      "I-political_view",
      "E-political_view",
      "S-political_view",
      "B-postcode",
      "I-postcode",
      "E-postcode",
      "S-postcode",
      "B-race_ethnicity",
      "I-race_ethnicity",
      "E-race_ethnicity",
      "S-race_ethnicity",
      "B-religious_belief",
      "I-religious_belief",
      "E-religious_belief",
      "S-religious_belief",
      "B-sexuality",
      "I-sexuality",
      "E-sexuality",
      "S-sexuality",
      "B-ssn",
      "I-ssn",
      "E-ssn",
      "S-ssn",
      "B-state",
      "I-state",
      "E-state",
      "S-state",
      "B-street_address",
      "I-street_address",
      "E-street_address",
      "S-street_address",
      "B-swift_bic",
      "I-swift_bic",
      "E-swift_bic",
      "S-swift_bic",
      "B-tax_id",
      "I-tax_id",
      "E-tax_id",
      "S-tax_id",
      "B-time",
      "I-time",
      "E-time",
      "S-time",
      "B-unique_id",
      "I-unique_id",
      "E-unique_id",
      "S-unique_id",
      "B-url",
      "I-url",
      "E-url",
      "S-url",
      "B-user_name",
      "I-user_name",
      "E-user_name",
      "S-user_name",
      "B-vehicle_identifier",
      "I-vehicle_identifier",
      "E-vehicle_identifier",
      "S-vehicle_identifier"
    ],
    "inference_contract_version": 1
  }
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8acd5011fb4ee1634aac00c565342626875a82ece2301f15bf72db7db690c542
size 2799230378
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0614fe83cadab421296e664e1f48f4261fa8fef6e03e63bb75c20f38e37d07d3
size 27868174
tokenizer_config.json ADDED
@@ -0,0 +1,11 @@
{
  "backend": "tokenizers",
  "eos_token": "<|endoftext|>",
  "model_input_names": [
    "input_ids",
    "attention_mask"
  ],
  "model_max_length": 128000,
  "pad_token": "<|endoftext|>",
  "tokenizer_class": "TokenizersBackend"
}