---
license: apache-2.0
library_name: transformers
base_model: openai/privacy-filter
datasets:
- ai4privacy/pii-masking-200k
- ai4privacy/pii-masking-400k
- ai4privacy/open-pii-masking-500k-ai4privacy
pipeline_tag: token-classification
tags:
- token-classification
- pii
- ner
- privacy
- redaction
- multilingual
- openmed
- openai-privacy-filter
language:
- ar
- bn
- de
- en
- es
- fr
- hi
- it
- ja
- ko
- nl
- pt
- te
- tr
- vi
- zh
---
# privacy-filter-multilingual
Fine-tuned [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)
for **fine-grained PII extraction** across **54 categories** in **16 languages**.
- **Base model**: [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter), a 1.4B-parameter MoE (50M active per token) with a BIOES token-classification head
- **Task**: Token classification for PII detection (BIOES scheme)
- **Languages (16)**: Arabic, Bengali, Chinese, Dutch, English, French, German, Hindi, Italian, Japanese, Korean, Portuguese, Spanish, Telugu, Turkish, Vietnamese
- **Training data**: multilingual mix from [AI4Privacy](https://huggingface.co/ai4privacy) (`pii-masking-200k`, `pii-masking-400k`, and `open-pii-masking-500k-ai4privacy`), language-balanced
- **Recipe**: `opf train` (OpenAI's official fine-tuning CLI); full fine-tune, AdamW, balanced language sampling, 5 epochs, bf16
- **Labels**: 54 PII categories → 217 BIOES classes (1 `O` + 54 × B/I/E/S)
The base model ships with 8 coarse PII categories and English-only training. This
model trades that for a **6.75× more granular label vocabulary** spanning identity,
contact, address, financial, vehicle, digital, and crypto categories, all evaluated
across 16 languages.
> **Family at a glance.** Same architecture, three runtimes:
> - **PyTorch (this repo)**: CPU + CUDA, anywhere transformers runs.
> - **MLX BF16**: [`OpenMed/privacy-filter-multilingual-mlx`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx), Apple Silicon, full precision.
> - **MLX 8-bit**: [`OpenMed/privacy-filter-multilingual-mlx-8bit`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx-8bit), Apple Silicon, smaller and faster.
## Quick start
### With [OpenMed](https://github.com/maziyarpanahi/openmed) (recommended)
OpenMed gives you `extract_pii()` / `deidentify()` with built-in BIOES Viterbi
decoding, span refinement, and a Faker-backed obfuscation engine. Same call
on every host: Apple Silicon picks up MLX automatically; everywhere else uses
this PyTorch checkpoint.
```bash
pip install -U "openmed[hf]"
```
```python
from openmed import extract_pii, deidentify
text = (
    "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
    "phone 415-555-0123, email sarah.johnson@example.com."
)

# Extract grouped entity spans
result = extract_pii(text, model_name="OpenMed/privacy-filter-multilingual")
for ent in result.entities:
    print(f"{ent.label:30s} {ent.text!r} conf={ent.confidence:.2f}")

# De-identify with any of the supported methods
masked = deidentify(text, method="mask", model_name="OpenMed/privacy-filter-multilingual")
removed = deidentify(text, method="remove", model_name="OpenMed/privacy-filter-multilingual")
hashed = deidentify(text, method="hash", model_name="OpenMed/privacy-filter-multilingual")

# Faker-backed locale-aware obfuscation, deterministic with consistent=True + seed
fake = deidentify(
    text,
    method="replace",
    model_name="OpenMed/privacy-filter-multilingual",
    consistent=True,
    seed=42,
)
print(fake.deidentified_text)
```
`OpenMed/privacy-filter-multilingual-mlx*` model names also work in the same
`extract_pii()` / `deidentify()` calls; on a non-Apple-Silicon host they
automatically fall back to **this PyTorch checkpoint** with a one-time warning,
so you can ship MLX names in code and still run on Linux/Windows.
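For example (a sketch reusing the quick-start API; only the model name changes):
```python
from openmed import extract_pii

# Same call everywhere: resolves to the MLX runtime on Apple Silicon and
# falls back to this PyTorch checkpoint (one-time warning) on other hosts.
result = extract_pii(
    "Call Sarah Johnson at 415-555-0123.",
    model_name="OpenMed/privacy-filter-multilingual-mlx",
)
```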
The OpenMed wrapper passes `trust_remote_code=True` for you, runs the model's
own BIOES Viterbi decoder, and skips OpenMed's regex smart-merging (the model
already produces clean spans).
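If you'd rather avoid the wrapper, the checkpoint also loads directly with
`transformers`. The sketch below is illustrative, not the official decoding path:
it uses a greedy per-token argmax plus a hand-rolled BIOES grouper instead of
the model's Viterbi decoder, so spans can be slightly rougher than what
`extract_pii()` returns, and it assumes a fast tokenizer (for
`return_offsets_mapping`).
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_id = "OpenMed/privacy-filter-multilingual"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForTokenClassification.from_pretrained(model_id, trust_remote_code=True)
model.eval()

text = "Patient Sarah Johnson, phone 415-555-0123."
enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0].tolist()

with torch.no_grad():
    pred_ids = model(**enc).logits[0].argmax(dim=-1).tolist()

# Greedy BIOES grouping -- illustrative only; the OpenMed wrapper
# runs the model's own Viterbi decoder instead.
spans, start, end, label = [], None, None, None
for (s, e), pid in zip(offsets, pred_ids):
    tag = model.config.id2label[pid]
    if tag == "O" or s == e:              # outside label or special token
        if start is not None:
            spans.append((label, text[start:end]))
        start = None
        continue
    prefix, cat = tag.split("-", 1)
    if start is None or prefix in ("B", "S") or cat != label:
        if start is not None:
            spans.append((label, text[start:end]))
        start, label = s, cat
    end = e
    if prefix in ("E", "S"):              # close the span on End/Single tags
        spans.append((label, text[start:end]))
        start = None
if start is not None:
    spans.append((label, text[start:end]))

print(spans)
```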
## Label space (54 categories)
| Category | Labels |
|---|---|
| **Identity** | `FIRSTNAME`, `MIDDLENAME`, `LASTNAME`, `PREFIX`, `AGE`, `GENDER`, `SEX`, `EYECOLOR`, `HEIGHT`, `USERNAME`, `OCCUPATION`, `JOBTITLE`, `JOBDEPARTMENT`, `ORGANIZATION`, `USERAGENT` |
| **Contact** | `EMAIL`, `PHONE`, `URL` |
| **Address** | `STREET`, `BUILDINGNUMBER`, `SECONDARYADDRESS`, `CITY`, `COUNTY`, `STATE`, `ZIPCODE`, `GPSCOORDINATES`, `ORDINALDIRECTION` |
| **Dates & time** | `DATE`, `DATEOFBIRTH`, `TIME` |
| **Government IDs** | `SSN` |
| **Financial** | `ACCOUNTNAME`, `BANKACCOUNT`, `IBAN`, `BIC`, `CREDITCARD`, `CREDITCARDISSUER`, `CVV`, `PIN`, `MASKEDNUMBER`, `AMOUNT`, `CURRENCY`, `CURRENCYCODE`, `CURRENCYNAME`, `CURRENCYSYMBOL` |
| **Crypto** | `BITCOINADDRESS`, `ETHEREUMADDRESS`, `LITECOINADDRESS` |
| **Vehicle** | `VIN`, `VRM` |
| **Digital** | `IPADDRESS`, `MACADDRESS`, `IMEI` |
| **Auth** | `PASSWORD` |
The output space is `O` plus `B-`, `I-`, `E-`, `S-` for each of the 54 categories
(4 × 54 + 1 = 217). The `id2label` mapping is shipped with the model.
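A quick sanity check of that arithmetic against the shipped mapping (a sketch;
it assumes the `PREFIX-CATEGORY` naming convention described above):
```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "OpenMed/privacy-filter-multilingual", trust_remote_code=True
)
labels = list(config.id2label.values())
categories = {lab.split("-", 1)[1] for lab in labels if lab != "O"}

assert len(categories) == 54
assert len(labels) == 1 + 4 * len(categories)  # 217 = 1 O + 4 x 54
```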
## Limitations & intended use
- **Multilingual but uneven.** Strongest on languages with rich PII training
data (German, Spanish, French, Italian, Hindi, Telugu, English). CJK languages
(Japanese, Korean, Chinese) and some morphologically marked low-resource
languages remain the main bottleneck of the current training mix.
- **Synthetic training data.** The AI4Privacy datasets are template-synthesized;
real clinical notes, legal documents, and web text may show different
surface forms. For high-stakes deployments, collect a domain-specific eval
set and re-calibrate thresholds (see the sketch after this list).
- **Not a substitute for legal compliance review.** Use alongside a governance
layer (human review, deterministic regex pre-filters, etc.).
- **Not a clinical PHI model.** Healthcare-specific PHI and clinical entity
training is planned as a separate branch.
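As a starting point for the threshold re-calibration mentioned above, the
confidence scores returned by `extract_pii()` can be filtered per category.
A minimal sketch; the threshold values are placeholders, not recommendations:
```python
from openmed import extract_pii

# Placeholder thresholds -- tune on a domain-specific labeled eval set.
PER_LABEL = {"DATEOFBIRTH": 0.40, "PHONE": 0.55}  # hypothetical values
DEFAULT = 0.65

result = extract_pii(
    "Patient Sarah Johnson, DOB 03/15/1985, phone 415-555-0123.",
    model_name="OpenMed/privacy-filter-multilingual",
)
kept = [
    ent for ent in result.entities
    if ent.confidence >= PER_LABEL.get(ent.label, DEFAULT)
]
```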
## Training notes
**Head initialization**: `opf`'s default "copy-from-matching-base" head init.
Of the 217 new BIOES classes, the few with exact base-vocabulary matches
(`O`, `B/I/E/S-account_name`, etc.) were copied directly; the rest were copied
from semantically adjacent coarse rows and fine-tuned end-to-end.
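In code terms, the init amounts to copying rows of the base classifier into a
larger head. A minimal sketch, assuming a plain linear head; the function and
the fallback map are illustrative, not `opf` internals:
```python
import torch

def init_fine_head(base_head, base_labels, fine_labels, fallback):
    """Copy the matching base row for each fine-grained class when one
    exists; otherwise copy the row of a semantically adjacent coarse
    label named in `fallback`. Unmatched rows keep their random init."""
    base_idx = {lab: i for i, lab in enumerate(base_labels)}
    head = torch.nn.Linear(base_head.in_features, len(fine_labels))
    with torch.no_grad():
        for j, lab in enumerate(fine_labels):
            src = base_idx.get(lab, base_idx.get(fallback.get(lab)))
            if src is not None:
                head.weight[j].copy_(base_head.weight[src])
                head.bias[j].copy_(base_head.bias[src])
    return head
```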
**Router**: the base model has 128 MoE experts per layer with top-4 routing.
Routers were kept trainable during full fine-tuning; no expert collapse was observed.
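For readers unfamiliar with MoE routing, a toy version of what a top-4-of-128
router computes per token (illustrative only; not the model's actual routing code):
```python
import torch

def top4_route(hidden, router_weight, k=4):
    # hidden: (tokens, d_model); router_weight: (128, d_model)
    logits = hidden @ router_weight.T                 # (tokens, 128)
    gate_logits, expert_ids = logits.topk(k, dim=-1)  # 4 experts per token
    gates = torch.softmax(gate_logits, dim=-1)        # renormalize over the 4
    return gates, expert_ids                          # weights + expert indices
```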
## Credits & Acknowledgements
This model wouldn't exist without two open-source releases; sincere thanks
to both teams:
- **OpenAI** for [open-sourcing the Privacy Filter](https://huggingface.co/openai/privacy-filter)
(architecture, modeling code, and `opf` training/eval CLI). Everything in
this repo is a fine-tune on top of that release.
- **AI4Privacy** for releasing the multilingual PII masking datasets used as
training data:
[`pii-masking-200k`](https://huggingface.co/datasets/ai4privacy/pii-masking-200k),
[`pii-masking-400k`](https://huggingface.co/datasets/ai4privacy/pii-masking-400k),
[`open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy).
Additional thanks to the **Hugging Face** team for the `transformers` /
`huggingface_hub` ecosystem this model ships through.
## License
Apache 2.0.
## Citation
If you use this model, please cite **this model**, the organization behind it
(**OpenMed**), and the upstream base model + datasets:
```bibtex
@misc{openmed_privacy_filter_multilingual_2026,
author = {OpenMed},
title = {{OpenMed/privacy-filter-multilingual}: multilingual fine-grained PII extraction across 16 languages and 54 categories},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/OpenMed/privacy-filter-multilingual}}
}
@misc{openmed_2026,
author = {OpenMed},
title = {{OpenMed}: open models and resources for healthcare NLP},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/OpenMed}}
}
@misc{openai_privacy_filter_2025,
author = {OpenAI},
title = {{openai/privacy-filter}},
year = {2025},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/openai/privacy-filter}}
}
@misc{ai4privacy_pii_masking,
author = {AI4Privacy},
title = {{AI4Privacy PII Masking Datasets}},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/ai4privacy}}
}
```