---
license: apache-2.0
library_name: transformers
base_model: openai/privacy-filter
datasets:
  - ai4privacy/pii-masking-200k
  - ai4privacy/pii-masking-400k
  - ai4privacy/open-pii-masking-500k-ai4privacy
pipeline_tag: token-classification
tags:
  - token-classification
  - pii
  - ner
  - privacy
  - redaction
  - multilingual
  - openmed
  - openai-privacy-filter
language:
  - ar
  - bn
  - de
  - en
  - es
  - fr
  - hi
  - it
  - ja
  - ko
  - nl
  - pt
  - te
  - tr
  - vi
  - zh
---

# privacy-filter-multilingual

Fine-tuned [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)
for **fine-grained PII extraction** across **54 categories** in **16 languages**.

- **Base model**: [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) – 1.4B-parameter MoE (50M active per token), BIOES token-classification head
- **Task**: Token classification for PII detection (BIOES scheme)
- **Languages (16)**: Arabic, Bengali, Chinese, Dutch, English, French, German, Hindi, Italian, Japanese, Korean, Portuguese, Spanish, Telugu, Turkish, Vietnamese
- **Training data**: Multilingual mix from [AI4Privacy](https://huggingface.co/ai4privacy) – `pii-masking-200k`, `pii-masking-400k`, and `open-pii-masking-500k-ai4privacy`, language-balanced
- **Recipe**: `opf train` (OpenAI's official fine-tuning CLI) – full fine-tune, AdamW, balanced language sampling, 5 epochs, bf16
- **Labels**: 54 PII categories → 217 BIOES classes (1 `O` + 54 × B/I/E/S)

The base model ships with 8 coarse PII categories and English-only training. This
model trades that for a **6.75× more granular vocabulary** spanning identity,
contact, address, financial, vehicle, digital, and crypto labels, all evaluated
across 16 languages.

> **Family at a glance.** Same architecture, three runtimes:
> - **PyTorch (this repo)** – CPU + CUDA, anywhere transformers runs.
> - **MLX BF16** – [`OpenMed/privacy-filter-multilingual-mlx`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx) – Apple Silicon, full precision.
> - **MLX 8-bit** – [`OpenMed/privacy-filter-multilingual-mlx-8bit`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx-8bit) – Apple Silicon, smaller + faster.

## Quick start

### With [OpenMed](https://github.com/maziyarpanahi/openmed) (recommended)

OpenMed gives you `extract_pii()` / `deidentify()` with built-in BIOES Viterbi
decoding, span refinement, and a Faker-backed obfuscation engine. The same call
works on every host: Apple Silicon picks up MLX automatically; everywhere else uses
this PyTorch checkpoint.

```bash
pip install -U "openmed[hf]"
```

```python
from openmed import extract_pii, deidentify

text = (
    "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
    "phone 415-555-0123, email sarah.johnson@example.com."
)

# Extract grouped entity spans
result = extract_pii(text, model_name="OpenMed/privacy-filter-multilingual")
for ent in result.entities:
    print(f"{ent.label:30s} {ent.text!r}  conf={ent.confidence:.2f}")

# De-identify with any of the supported methods
masked   = deidentify(text, method="mask",   model_name="OpenMed/privacy-filter-multilingual")
removed  = deidentify(text, method="remove", model_name="OpenMed/privacy-filter-multilingual")
hashed   = deidentify(text, method="hash",   model_name="OpenMed/privacy-filter-multilingual")

# Faker-backed locale-aware obfuscation, deterministic with consistent=True+seed
fake = deidentify(
    text,
    method="replace",
    model_name="OpenMed/privacy-filter-multilingual",
    consistent=True,
    seed=42,
)
print(fake.deidentified_text)
```

`OpenMed/privacy-filter-multilingual-mlx*` model names also work in the same
`extract_pii()` / `deidentify()` calls; on a non-Apple-Silicon host they
automatically fall back to **this PyTorch checkpoint** with a one-time warning.
So you can ship MLX names in code and still run on Linux/Windows.

The OpenMed wrapper passes `trust_remote_code=True` for you, runs the model's
own BIOES Viterbi decoder, and skips OpenMed's regex smart-merging (the model
already produces clean spans).
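
If you prefer to call the checkpoint directly through `transformers`, the sketch
below is one way to do it. It assumes the repo's remote code exposes a standard
`AutoTokenizer` / `AutoModelForTokenClassification` interface and uses a plain
greedy argmax over the logits instead of the model's own BIOES Viterbi decoder,
so spans may be rougher than what the OpenMed wrapper returns.

```python
# Minimal sketch of calling the checkpoint directly through transformers,
# assuming the remote code registers a standard token-classification model.
# Greedy argmax only, no BIOES Viterbi decoding; requires a fast tokenizer
# for offset mapping.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "OpenMed/privacy-filter-multilingual"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(model_id, trust_remote_code=True)
model.eval()

text = "Patient Sarah Johnson, phone 415-555-0123."
enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0]

with torch.no_grad():
    logits = model(**enc).logits[0]            # (seq_len, 217)

for (start, end), pred in zip(offsets.tolist(), logits.argmax(dim=-1).tolist()):
    label = model.config.id2label[pred]
    if label != "O" and end > start:           # skip special tokens and O
        print(f"{label:20s} {text[start:end]!r}")
```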

## Label space (54 categories)

| Category | Typical examples |
|---|---|
| **Identity** | `FIRSTNAME`, `MIDDLENAME`, `LASTNAME`, `PREFIX`, `AGE`, `GENDER`, `SEX`, `EYECOLOR`, `HEIGHT`, `USERNAME`, `OCCUPATION`, `JOBTITLE`, `JOBDEPARTMENT`, `ORGANIZATION`, `USERAGENT` |
| **Contact** | `EMAIL`, `PHONE`, `URL` |
| **Address** | `STREET`, `BUILDINGNUMBER`, `SECONDARYADDRESS`, `CITY`, `COUNTY`, `STATE`, `ZIPCODE`, `GPSCOORDINATES`, `ORDINALDIRECTION` |
| **Dates & time** | `DATE`, `DATEOFBIRTH`, `TIME` |
| **Government IDs** | `SSN` |
| **Financial** | `ACCOUNTNAME`, `BANKACCOUNT`, `IBAN`, `BIC`, `CREDITCARD`, `CREDITCARDISSUER`, `CVV`, `PIN`, `MASKEDNUMBER`, `AMOUNT`, `CURRENCY`, `CURRENCYCODE`, `CURRENCYNAME`, `CURRENCYSYMBOL` |
| **Crypto** | `BITCOINADDRESS`, `ETHEREUMADDRESS`, `LITECOINADDRESS` |
| **Vehicle** | `VIN`, `VRM` |
| **Digital** | `IPADDRESS`, `MACADDRESS`, `IMEI` |
| **Auth** | `PASSWORD` |

The output space is `O` plus `B-`, `I-`, `E-`, `S-` for each of the 54 categories
(4 × 54 + 1 = 217). The `id2label` mapping is shipped with the model.
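
If you want the tag list programmatically (for example, to sanity-check the head
size), you can read it back from the shipped config. A minimal sketch, assuming
the tags follow the `PREFIX-CATEGORY` convention described above:

```python
# Sanity-check the 217-class BIOES label space from the shipped config.
# Assumes tags look like "B-FIRSTNAME"; adjust the split if the shipped
# id2label uses different casing or separators.
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "OpenMed/privacy-filter-multilingual", trust_remote_code=True
)

labels = list(config.id2label.values())
categories = sorted({lab.split("-", 1)[1] for lab in labels if lab != "O"})

assert len(labels) == 4 * len(categories) + 1 == 217
print(f"{len(categories)} categories, {len(labels)} BIOES classes")
```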

## Limitations & intended use

- **Multilingual but uneven.** Strongest on languages with rich PII training
  data (German, Spanish, French, Italian, Hindi, Telugu, English). CJK languages
  (Japanese, Korean, Chinese) and some morphologically marked low-resource
  languages remain the main bottleneck on the current training mix.
- **Synthetic training data.** The AI4Privacy datasets are template-synthesized;
  real clinical notes, legal documents, and web text may show different
  surface forms. For high-stakes deployments, collect a domain-specific eval
  set and re-calibrate thresholds.
- **Not a substitute for legal compliance review.** Use alongside a governance
  layer (human review, deterministic regex pre-filters, etc.); a minimal
  pre-filter sketch follows this list.
- **Not a clinical PHI model.** Healthcare-specific PHI and clinical entity
  training is planned as a separate branch.
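
As a concrete example of the deterministic regex pre-filters mentioned above,
here is a minimal sketch; the patterns and labels are illustrative placeholders,
not part of this model:

```python
# Illustrative deterministic regex pre-filter to layer in front of the model.
# Tune patterns for your own domain, locale, and compliance requirements.
import re

PRE_FILTERS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def regex_prefilter(text: str) -> list[dict]:
    """Spans the governance layer should always redact,
    regardless of what the model predicts."""
    spans = []
    for label, pattern in PRE_FILTERS.items():
        for m in pattern.finditer(text):
            spans.append({"label": label, "start": m.start(),
                          "end": m.end(), "text": m.group(0)})
    return spans

print(regex_prefilter("Reach me at jane.doe@example.com, SSN 123-45-6789."))
```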

**Head initialization**: `opf`'s default "copy-from-matching-base" head init.
Of the 217 new BIOES classes, the few with exact base-vocabulary matches
(`O`, `B/I/E/S-account_name`, etc.) were copied directly; the rest were copied
from semantically adjacent coarse rows and fine-tuned end-to-end.
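
For illustration only, the copy-from-matching-base idea reads roughly like the
sketch below. The tensor shapes, label strings, and `fine_to_coarse` mapping are
hypothetical placeholders, not the actual `opf` implementation:

```python
# Illustrative-only sketch of copy-from-matching-base head initialization.
import torch

def init_head(base_weight, base_label2id, new_labels, fine_to_coarse):
    """base_weight: (n_base_classes, hidden) classifier matrix of the base model.
    new_labels: the 217 fine-grained BIOES tags.
    fine_to_coarse: e.g. {"CREDITCARD": "account_number"} (hypothetical)."""
    new_weight = torch.empty(len(new_labels), base_weight.shape[1])
    torch.nn.init.normal_(new_weight, std=0.02)        # fallback for unmapped rows
    for i, tag in enumerate(new_labels):
        if tag in base_label2id:                        # exact match: O, B-account_name, ...
            new_weight[i] = base_weight[base_label2id[tag]]
            continue
        prefix, _, category = tag.partition("-")
        coarse = fine_to_coarse.get(category)           # semantically adjacent coarse row
        if coarse and f"{prefix}-{coarse}" in base_label2id:
            new_weight[i] = base_weight[base_label2id[f"{prefix}-{coarse}"]]
    return new_weight
```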

**Router**: the base model has 128 MoE experts per layer with top-4 routing.
Routers were kept trainable during full fine-tuning; no collapse was observed.
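
For readers unfamiliar with MoE routing, a generic top-4-of-128 gating step looks
roughly like this; it is a toy sketch of the mechanism, not the base model's
router code:

```python
# Toy sketch of top-k expert routing (k=4 of 128 experts).
import torch
import torch.nn.functional as F

def route_top4(hidden, router_weight, k=4):
    """hidden: (tokens, d_model); router_weight: (num_experts, d_model)."""
    logits = hidden @ router_weight.T                  # (tokens, 128)
    top_vals, top_idx = logits.topk(k, dim=-1)         # choose 4 experts per token
    gates = F.softmax(top_vals, dim=-1)                # mixing weights over chosen experts
    return top_idx, gates

idx, gates = route_top4(torch.randn(5, 512), torch.randn(128, 512))
print(idx.shape, gates.sum(dim=-1))                    # torch.Size([5, 4]), rows sum to 1
```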

## Credits & Acknowledgements

This model wouldn't exist without two open-source releases; sincere thanks
to both teams:

- **OpenAI** for [open-sourcing the Privacy Filter](https://huggingface.co/openai/privacy-filter)
  (architecture, modeling code, and `opf` training/eval CLI). Everything in
  this repo is a fine-tune on top of that release.
- **AI4Privacy** for releasing the multilingual PII masking datasets used as
  training data:
  [`pii-masking-200k`](https://huggingface.co/datasets/ai4privacy/pii-masking-200k),
  [`pii-masking-400k`](https://huggingface.co/datasets/ai4privacy/pii-masking-400k),
  [`open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy).

Additional thanks to the **Hugging Face** team for the `transformers` /
`huggingface_hub` ecosystem this model ships through.

## License

Apache 2.0.


## Citation

If you use this model, please cite **this model**, the organization behind it
(**OpenMed**), and the upstream base model + datasets:

```bibtex
@misc{openmed_privacy_filter_multilingual_2026,
  author       = {OpenMed},
  title        = {{OpenMed/privacy-filter-multilingual}: multilingual fine-grained PII extraction across 16 languages and 54 categories},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/OpenMed/privacy-filter-multilingual}}
}

@misc{openmed_2026,
  author       = {OpenMed},
  title        = {{OpenMed}: open models and resources for healthcare NLP},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/OpenMed}}
}

@misc{openai_privacy_filter_2025,
  author       = {OpenAI},
  title        = {{openai/privacy-filter}},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/openai/privacy-filter}}
}

@misc{ai4privacy_pii_masking,
  author       = {AI4Privacy},
  title        = {{AI4Privacy PII Masking Datasets}},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ai4privacy}}
}
```