---
license: apache-2.0
library_name: transformers
base_model: openai/privacy-filter
datasets:
  - nvidia/Nemotron-PII
pipeline_tag: token-classification
tags:
  - token-classification
  - pii
  - ner
  - privacy
  - redaction
  - nemotron
  - privacy-filter
  - openmed
language:
  - en
---

# privacy-filter-nemotron

Fine-tuned [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)
for **fine-grained PII extraction** across **55 categories** from
[`nvidia/Nemotron-PII`](https://huggingface.co/datasets/nvidia/Nemotron-PII).

- **Base model**: [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter), a 1.4B-parameter MoE (50M active per token) with a BIOES token-classification head
- **Task**: Token classification for PII detection (BIOES scheme)
- **Training data**: Full 100K rows of `nvidia/Nemotron-PII` train split
- **Held-out validation**: 10K label-stratified rows from the Nemotron `test` split (every label has ≥229 entities)
- **Recipe**: full fine-tune with `opf train` (OpenAI's official fine-tuning CLI); AdamW, lr=1e-4, 5 epochs, bf16, weight decay 0.0
- **Labels**: 55 fine-grained PII categories → 221 BIOES classes (1 `O` + 55 × B/I/E/S)
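
The 221-class arithmetic is just the BIOES expansion of the 55 categories plus
`O`. A toy sketch (category list truncated; not the repo's actual label map):

```python
# Toy sketch of the BIOES expansion; the real label map has all 55 categories.
categories = ["first_name", "last_name", "ssn"]  # ... 55 in the real map
labels = ["O"] + [f"{prefix}-{cat}" for cat in categories for prefix in "BIES"]
assert len(labels) == 1 + 4 * len(categories)    # 1 + 55 * 4 = 221 with all 55
```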

The base model ships with 8 coarse PII categories (`private_person`,
`private_email`, etc.). This model trades that coarse vocabulary for a
**roughly 7× more granular one** (`first_name`, `last_name`,
`medical_record_number`, `credit_debit_card`, `ssn`, and so on), matching
what downstream redaction and masking pipelines typically need.

> **Family at a glance.** Same architecture, three runtimes:
> - **PyTorch (this repo)**: CPU + CUDA, anywhere transformers runs.
> - **MLX BF16**: [`OpenMed/privacy-filter-nemotron-mlx`](https://huggingface.co/OpenMed/privacy-filter-nemotron-mlx), Apple Silicon, full precision.
> - **MLX 8-bit**: [`OpenMed/privacy-filter-nemotron-mlx-8bit`](https://huggingface.co/OpenMed/privacy-filter-nemotron-mlx-8bit), Apple Silicon, ~1.7× faster.

## Quick start

### With [OpenMed](https://github.com/maziyarpanahi/openmed) (recommended)

OpenMed gives you `extract_pii()` / `deidentify()` with built-in BIOES Viterbi
decoding, span refinement, and a Faker-backed obfuscation engine. Same call
on every host: Apple Silicon picks up MLX automatically; everywhere else uses
this PyTorch checkpoint.

```bash
pip install -U "openmed[hf]"
```

```python
from openmed import extract_pii, deidentify

text = (
    "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
    "phone 415-555-0123, email sarah.johnson@example.com."
)

# Extract grouped entity spans
result = extract_pii(text, model_name="OpenMed/privacy-filter-nemotron")
for ent in result.entities:
    print(f"{ent.label:30s} {ent.text!r}  conf={ent.confidence:.2f}")

# De-identify with any of the supported methods
masked   = deidentify(text, method="mask",   model_name="OpenMed/privacy-filter-nemotron")
removed  = deidentify(text, method="remove", model_name="OpenMed/privacy-filter-nemotron")
hashed   = deidentify(text, method="hash",   model_name="OpenMed/privacy-filter-nemotron")

# Faker-backed locale-aware obfuscation, deterministic with consistent=True+seed
fake = deidentify(
    text,
    method="replace",
    model_name="OpenMed/privacy-filter-nemotron",
    consistent=True,
    seed=42,
)
print(fake.deidentified_text)
```

`OpenMed/privacy-filter-nemotron-mlx*` model names also work in the same
`extract_pii()` / `deidentify()` calls; on a non-Apple-Silicon host they
automatically fall back to **this PyTorch checkpoint** with a one-time
warning. So you can ship MLX names in code and still run on Linux/Windows.

The OpenMed wrapper passes `trust_remote_code=True` for you, runs the
model's own BIOES Viterbi decoder, and skips OpenMed's regex
smart-merging (the model already produces clean spans).

### With `opf` β€” OpenAI's official CLI

```bash
pip install 'opf @ git+https://github.com/openai/privacy-filter.git'

opf redact \
  --checkpoint OpenMed/privacy-filter-nemotron \
  --text "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, phone 415-555-0123."
```

### With `transformers` directly

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "OpenMed/privacy-filter-nemotron"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(
    model_id, trust_remote_code=True, dtype=torch.bfloat16
).to("cuda")
model.eval()

text = "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, phone 415-555-0123."
enc = tok(text, return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model(**enc).logits.argmax(-1).cpu()[0].tolist()

id2label = {int(k): v for k, v in model.config.id2label.items()}
tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].cpu().tolist())
for t, l in zip(tokens, out):
    if id2label[l] != "O":  # skip non-entity tokens; don't assume "O" is id 0
        print(f"{t}\t{id2label[l]}")
```

For best results use Viterbi decoding (not argmax); both `opf` and OpenMed
do this by default. If you're doing argmax with the HF transformers API, you'll
see slightly more boundary errors but still excellent label accuracy.
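
If you only have per-token logits from the `transformers` snippet above, a
transition-constrained Viterbi pass is straightforward to bolt on. Below is a
minimal NumPy sketch of the idea, reusing the `id2label` map from that snippet;
it is **not** the actual `opf`/OpenMed decoder:

```python
import numpy as np

def bioes_viterbi(logits: np.ndarray, id2label: dict) -> list:
    """Best label path under hard BIOES transition constraints."""
    n_tokens, n_labels = logits.shape
    tags = [id2label[i].split("-")[0] for i in range(n_labels)]   # B/I/E/S/O
    typs = [id2label[i].split("-")[-1] for i in range(n_labels)]  # entity type

    def ok(i, j):
        if tags[i] in ("O", "E", "S"):         # no span open after label i
            return tags[j] in ("O", "B", "S")  # only O or a fresh span may follow
        return tags[j] in ("I", "E") and typs[j] == typs[i]  # continue same type

    trans = np.where(
        [[ok(i, j) for j in range(n_labels)] for i in range(n_labels)],
        0.0, -np.inf,
    )
    start_ok = np.array([t in ("O", "B", "S") for t in tags])  # no mid-span start
    end_ok = np.array([t in ("O", "E", "S") for t in tags])    # no dangling span

    score = np.where(start_ok, logits[0], -np.inf)
    back = np.zeros((n_tokens, n_labels), dtype=int)
    for t in range(1, n_tokens):
        cand = score[:, None] + trans          # cand[i, j]: prev i -> next j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + logits[t]

    path = [int(np.where(end_ok, score, -np.inf).argmax())]
    for t in range(n_tokens - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Feed it `model(**enc).logits[0].float().cpu().numpy()` in place of the argmax
step to get spans that are always well-formed.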

## Performance

Evaluated with `opf eval --decode-mode viterbi --eval-mode typed --span-metrics-space char`
on the 10K label-stratified held-out validation set from `nvidia/Nemotron-PII:test`.

### Headline

| Metric | Value |
|---|---:|
| **Macro B-F1** (across 55 labels) | **0.9533** |
| **Token accuracy** | **0.9910** |
| Strong labels (F1 ≥ 0.90) | 46 / 55 |
| Acceptable (F1 0.70–0.89) | 7 / 55 |
| Weak (F1 < 0.70) | 0 / 55 |
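
The headline macro number is the unweighted mean of the 55 per-label B-tag F1
scores. As a definition sketch (not the `opf eval` implementation):

```python
# Definition sketch: per-label F1 from precision/recall, then unweighted mean.
def f1(precision: float, recall: float) -> float:
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def macro_b_f1(per_label_f1: dict) -> float:
    return sum(per_label_f1.values()) / len(per_label_f1)
```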

### Per-label F1 (B-tag, sorted)

| Label | Precision | Recall | F1 |
|---|---:|---:|---:|
| 🟢 `mac_address` | 1.000 | 1.000 | **1.000** |
| 🟢 `biometric_identifier` | 0.999 | 0.998 | **0.999** |
| 🟢 `bank_routing_number` | 0.995 | 0.999 | **0.997** |
| 🟢 `credit_debit_card` | 0.999 | 0.993 | **0.996** |
| 🟢 `ipv6` | 0.992 | 1.000 | **0.996** |
| 🟢 `health_plan_beneficiary_number` | 1.000 | 0.990 | **0.995** |
| 🟢 `coordinate` | 0.994 | 0.996 | **0.995** |
| 🟢 `ipv4` | 0.993 | 0.996 | **0.994** |
| 🟢 `url` | 0.989 | 0.999 | **0.994** |
| 🟢 `email` | 0.994 | 0.993 | **0.994** |
| 🟢 `date_of_birth` | 0.992 | 0.994 | **0.993** |
| 🟢 `medical_record_number` | 0.997 | 0.989 | **0.993** |
| 🟢 `street_address` | 0.996 | 0.989 | **0.993** |
| 🟢 `vehicle_identifier` | 0.986 | 0.996 | **0.991** |
| 🟢 `license_plate` | 0.987 | 0.993 | **0.990** |
| 🟢 `customer_id` | 0.995 | 0.984 | **0.990** |
| 🟢 `http_cookie` | 0.992 | 0.983 | **0.988** |
| 🟢 `employee_id` | 0.987 | 0.988 | **0.988** |
| 🟢 `account_number` | 0.992 | 0.982 | **0.987** |
| 🟢 `certificate_license_number` | 0.989 | 0.984 | **0.987** |
| 🟢 `swift_bic` | 0.975 | 0.998 | **0.987** |
| 🟢 `postcode` | 0.991 | 0.981 | **0.986** |
| 🟢 `api_key` | 0.980 | 0.990 | **0.985** |
| 🟢 `password` | 0.999 | 0.968 | **0.983** |
| 🟢 `tax_id` | 1.000 | 0.965 | **0.982** |
| 🟢 `device_identifier` | 0.974 | 0.988 | **0.981** |
| 🟢 `national_id` | 0.991 | 0.961 | **0.976** |
| 🟢 `last_name` | 0.977 | 0.975 | **0.976** |
| 🟢 `date_time` | 0.982 | 0.967 | **0.974** |
| 🟢 `first_name` | 0.962 | 0.978 | **0.970** |
| 🟢 `pin` | 0.973 | 0.967 | **0.970** |
| 🟢 `phone_number` | 0.948 | 0.992 | **0.970** |
| 🟢 `county` | 0.986 | 0.946 | **0.965** |
| 🟢 `employment_status` | 0.960 | 0.968 | **0.964** |
| 🟢 `user_name` | 0.959 | 0.964 | **0.961** |
| 🟢 `date` | 0.967 | 0.955 | **0.961** |
| 🟢 `blood_type` | 0.922 | 0.954 | **0.938** |
| 🟢 `country` | 0.955 | 0.918 | **0.936** |
| 🟢 `ssn` | 0.926 | 0.945 | **0.935** |
| 🟢 `education_level` | 0.961 | 0.908 | **0.934** |
| 🟢 `sexuality` | 0.908 | 0.956 | **0.931** |
| 🟢 `company_name` | 0.967 | 0.894 | **0.929** |
| 🟢 `religious_belief` | 0.912 | 0.941 | **0.926** |
| 🟢 `unique_id` | 0.910 | 0.922 | **0.916** |
| 🟢 `political_view` | 0.939 | 0.872 | **0.905** |
| 🟢 `fax_number` | 0.978 | 0.841 | **0.904** |
| 🟡 `city` | 0.917 | 0.876 | **0.896** |
| 🟡 `time` | 0.933 | 0.802 | **0.863** |
| 🟡 `race_ethnicity` | 0.821 | 0.906 | **0.861** |
| 🟡 `gender` | 0.967 | 0.744 | **0.841** |
| 🟡 `state` | 0.878 | 0.785 | **0.829** |
| 🟡 `language` | 0.889 | 0.735 | **0.804** |
| 🟡 `occupation` | 0.799 | 0.667 | **0.727** |

## Label space (55 categories)

| Category | Typical examples |
|---|---|
| **Identity** | `first_name`, `last_name`, `user_name`, `age`, `gender`, `race_ethnicity`, `sexuality`, `religious_belief`, `political_view`, `marital_status`, `nationality`, `education_level`, `occupation`, `employment_status`, `language`, `blood_type`, `biometric_identifier` |
| **Contact** | `email`, `phone_number`, `fax_number`, `url` |
| **Address** | `street_address`, `city`, `county`, `state`, `country`, `postcode`, `coordinate` |
| **Dates** | `date`, `date_of_birth`, `date_time`, `time` |
| **Government IDs** | `ssn`, `national_id`, `tax_id` |
| **Financial** | `account_number`, `bank_routing_number`, `swift_bic`, `credit_debit_card`, `cvv`, `pin`, `password` |
| **Healthcare** | `medical_record_number`, `health_plan_beneficiary_number` |
| **Enterprise IDs** | `customer_id`, `employee_id`, `unique_id`, `certificate_license_number` |
| **Vehicle** | `license_plate`, `vehicle_identifier` |
| **Digital** | `ipv4`, `ipv6`, `mac_address`, `device_identifier`, `api_key`, `http_cookie` |


## Training details

**Head initialization**: `opf`'s default "copy-from-matching-base" head init.
Of the 221 new BIOES classes, 5 had exact matches in the base
(`O`, `B/I/E/S-account_number`); the other 216 were copied from
semantically adjacent coarse rows and fine-tuned end-to-end.
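
In sketch form, the copy step looks roughly like this (a hypothetical helper,
not the actual `opf` logic):

```python
import torch

@torch.no_grad()
def init_classifier_head(new_head, base_head, row_map):
    """Copy base classifier rows into the new head before fine-tuning.

    row_map: new class id -> base class id (an exact match where one exists,
    otherwise a semantically adjacent coarse row). Hypothetical sketch.
    """
    for new_id, base_id in row_map.items():
        new_head.weight[new_id].copy_(base_head.weight[base_id])
        new_head.bias[new_id].copy_(base_head.bias[base_id])
```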

**Router**: the base model has 128 MoE experts per layer with top-4 routing.
Routers were kept trainable during full fine-tuning; no collapse was
observed.
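
For intuition, top-4 routing over 128 experts looks roughly like this
(illustrative PyTorch sketch, not the model's actual routing code):

```python
import torch

def top4_route(hidden: torch.Tensor, router_weight: torch.Tensor, k: int = 4):
    # hidden: (tokens, d_model); router_weight: (n_experts=128, d_model)
    scores = hidden @ router_weight.T           # one score per expert per token
    top_scores, top_experts = scores.topk(k, dim=-1)
    gates = torch.softmax(top_scores, dim=-1)   # mix weights over the chosen 4
    return top_experts, gates                   # each token uses 4 of 128 experts
```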

## Limitations & intended use

- **Predominantly English training data.** Nemotron-PII is mostly English
  with a 50/50 US/international locale split. Performance on non-English
  text is not guaranteed.
- **`occupation`, `language`, `gender`, `state`, `race_ethnicity`,
  `political_view`, and `education_level` are fuzzier categories** than the
  strict identifiers; their F1 lands in roughly the 0.73–0.93 range vs 0.95+
  for formatted identifiers. If your downstream only cares about strict PII,
  you can ignore low-confidence predictions on these (see the sketch after
  this list).
- **Synthetic training data.** Nemotron-PII is a synthesized dataset; real
  clinical notes, legal documents, and web text may show different
  surface forms. For high-stakes deployments, collect a domain-specific
  eval set and re-calibrate thresholds.
- **Not a substitute for legal compliance review.** Use alongside a
  governance layer (human review, deterministic regex pre-filters, etc.).
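
For the point above about ignoring low-confidence predictions on the fuzzier
categories, a hypothetical post-filter over `extract_pii()` output might look
like this (label set and threshold are illustrative):

```python
# Hypothetical post-filter: demand higher confidence on fuzzy categories.
FUZZY = {"occupation", "language", "gender", "state", "race_ethnicity"}

def strict_entities(result, min_conf: float = 0.8):
    return [
        e for e in result.entities
        if e.label not in FUZZY or e.confidence >= min_conf
    ]
```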

## Credits & Acknowledgements

This model wouldn't exist without two open-source releases. Sincere thanks
to both teams:

- **OpenAI** for [open-sourcing the Privacy Filter](https://huggingface.co/openai/privacy-filter)
  (architecture, modeling code, and `opf` training/eval CLI). Everything in
  this repo is a fine-tune on top of that release.
- **NVIDIA** for releasing the [Nemotron-PII dataset](https://huggingface.co/datasets/nvidia/Nemotron-PII)
  with its 100K-row train split and 55 fine-grained PII labels.

Additional thanks to the **Hugging Face** team for the `transformers` /
`huggingface_hub` ecosystem this model ships through.

## License

Apache 2.0, same as the base model.

## Citation

If you use this model, please cite **this model**, the organization behind
it (**OpenMed**), and the upstream base model + dataset:

```bibtex
@misc{openmed_privacy_filter_nemotron_2026,
  author       = {OpenMed},
  title        = {{OpenMed/privacy-filter-nemotron}: fine-grained PII extraction with 55 categories},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/OpenMed/privacy-filter-nemotron}}
}

@misc{openmed_2026,
  author       = {OpenMed},
  title        = {{OpenMed}: open models and resources for healthcare NLP},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/OpenMed}}
}

@misc{openai_privacy_filter_2025,
  author       = {OpenAI},
  title        = {{openai/privacy-filter}},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/openai/privacy-filter}}
}

@misc{nemotron_pii_2025,
  author       = {NVIDIA},
  title        = {{Nemotron-PII}},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/datasets/nvidia/Nemotron-PII}}
}
```