---
license: apache-2.0
base_model: OpenMed/privacy-filter-multilingual
datasets:
- ai4privacy/pii-masking-200k
- ai4privacy/pii-masking-400k
- ai4privacy/open-pii-masking-500k-ai4privacy
pipeline_tag: token-classification
library_name: openmed
tags:
- openmed
- mlx
- apple-silicon
- token-classification
- pii
- de-identification
- privacy-filter
- multilingual
language:
- ar
- bn
- de
- en
- es
- fr
- hi
- it
- ja
- ko
- nl
- pt
- te
- tr
- vi
- zh
---
# OpenMed Privacy Filter (Multilingual) - MLX 8-bit
A native [MLX](https://github.com/ml-explore/mlx) port of
[`OpenMed/privacy-filter-multilingual`](https://huggingface.co/OpenMed/privacy-filter-multilingual)
for fast, on-device fine-grained PII detection across **54 categories**
and **16 languages** on Apple Silicon.
This 8-bit affine-quantized artifact reduces download size and resident memory; for the full-precision sibling see [`OpenMed/privacy-filter-multilingual-mlx`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx).
> **Family at a glance.** Same architecture and training data, three runtimes:
> - **PyTorch**: [`OpenMed/privacy-filter-multilingual`](https://huggingface.co/OpenMed/privacy-filter-multilingual) (CPU + CUDA).
> - **MLX BF16**: [`OpenMed/privacy-filter-multilingual-mlx`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx) (Apple Silicon, full precision, ~2.6 GB).
> - **MLX 8-bit** (this repo): [`OpenMed/privacy-filter-multilingual-mlx-8bit`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx-8bit) (Apple Silicon, ~1.4 GB).
## What it does
The model is a token classifier built on the OpenAI Privacy Filter
architecture (`openai_privacy_filter`). It tags each token with a BIOES
label across **54 PII span classes**, then a Viterbi pass over the BIOES
grammar yields clean entity spans. Languages covered: Arabic, Bengali,
Chinese, Dutch, English, French, German, Hindi, Italian, Japanese,
Korean, Portuguese, Spanish, Telugu, Turkish, Vietnamese.
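For intuition, the sketch below shows how BIOES tags collapse into character spans. It is a simplified greedy walk, not the Viterbi decoder the pipeline actually uses, and the tags and offsets are invented for illustration.
```python
# Simplified sketch of BIOES span grouping (illustration only; the real
# pipeline decodes with Viterbi over the BIOES grammar). The tags and
# offsets below are hypothetical.
def bioes_to_spans(tags, offsets):
    """tags: per-token BIOES labels; offsets: (start, end) per token."""
    spans, start, open_label = [], None, None
    for (s, e), tag in zip(offsets, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":                           # single-token entity
            spans.append((label, s, e))
        elif prefix == "B":                         # entity opens
            start, open_label = s, label
        elif prefix == "E" and start is not None:   # entity closes
            spans.append((open_label, start, e))
            start = None
        # "I" tokens extend the open span; the end offset comes from E
    return spans

tags = ["O", "O", "B-FIRSTNAME", "E-FIRSTNAME", "S-EMAIL"]
offsets = [(0, 5), (6, 8), (9, 14), (14, 19), (20, 43)]
print(bioes_to_spans(tags, offsets))
# [('FIRSTNAME', 9, 19), ('EMAIL', 20, 43)]
```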
<details>
<summary>Full label schema (217 BIOES labels)</summary>
The output space is `O` plus `B-`, `I-`, `E-`, `S-` for each of the 54
span classes (4 × 54 + 1 = 217). The runtime `PrivacyFilterMLXPipeline`
runs Viterbi over this BIOES grammar, so the consumer sees clean grouped
entities rather than raw token tags. The full `id2label` mapping is
shipped alongside the weights in this repo.
</details>
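The arithmetic behind the 217-label space is easy to reproduce; a minimal sketch (the two class names are examples, and the authoritative ordering lives in `id2label.json`):
```python
# Sketch of the BIOES label-space layout; the real class order is
# defined by id2label.json in this repo.
span_classes = ["FIRSTNAME", "EMAIL"]  # ... 54 classes in total
labels = ["O"] + [f"{p}-{c}" for c in span_classes for p in ("B", "I", "E", "S")]
# With all 54 classes: 1 + 4 * 54 == 217
```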
For per-label accuracy, training recipe, and dataset details, see the
[base PyTorch checkpoint](https://huggingface.co/OpenMed/privacy-filter-multilingual).
## Architecture
| Field | Value |
| --- | --- |
| Source model type | `openai_privacy_filter` |
| Source architecture | `OpenAIPrivacyFilterForTokenClassification` |
| Hidden size | 640 |
| Transformer layers | 8 |
| Attention | Grouped-Query (14 query heads / 2 KV heads, head_dim=64) with attention sinks |
| FFN | Sparse Mixture-of-Experts: 128 experts, top-4 routing, SwiGLU |
| Position encoding | YARN-scaled RoPE (`rope_theta=150_000`, factor=32) |
| Context length | 131,072 tokens (initial 4,096) |
| Tokenizer | `o200k_base` (tiktoken), vocab 200,064 |
| Output head | Linear(640 → 217) with bias |
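To verify these values against the shipped artifact, you can read `config.json` directly. A sketch, where the key names are assumptions to check against the actual file:
```python
# Sketch: inspect the shipped config. The key names here are
# assumptions; open config.json in this repo for the real field names.
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("OpenMed/privacy-filter-multilingual-mlx-8bit", "config.json")
with open(cfg_path) as f:
    cfg = json.load(f)
for key in ("hidden_size", "num_hidden_layers", "num_attention_heads",
            "num_key_value_heads", "rope_theta"):
    print(key, cfg.get(key))
```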
## File set
| File | Size | Purpose |
| --- | --- | --- |
| `weights.safetensors` | ~1.4 GB | Model weights in OpenMed-MLX layout |
| `config.json` | ~19 KB | Model + MLX runtime config |
| `id2label.json` | ~5 KB | Numeric ID → BIOES label string |
| `openmed-mlx.json` | ~1 KB | OpenMed MLX manifest (task, family, runtime hints) |
| `tokenizer.json`, `tokenizer_config.json` | ~28 MB | Source tokenizer files (kept for reference) |
The MLX runtime uses `tiktoken` `o200k_base` directly for tokenization;
the `tokenizer.json` is kept so consumers can inspect or re-tokenize via
`transformers` if desired.
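For reference, encoding text with the same `o200k_base` encoding takes two lines with `tiktoken` (the runtime does this for you; the snippet is purely illustrative):
```python
# Illustrative: tokenize with tiktoken's o200k_base encoding, the same
# encoding the MLX runtime uses internally.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
ids = enc.encode("Email me at alice.smith@example.com")
print(len(ids), ids[:5])
print(enc.decode(ids))  # round-trips to the original string
```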
## Label space (54 categories)
| Category | Typical examples |
|---|---|
| **Identity** | `FIRSTNAME`, `MIDDLENAME`, `LASTNAME`, `PREFIX`, `AGE`, `GENDER`, `SEX`, `EYECOLOR`, `HEIGHT`, `USERNAME`, `OCCUPATION`, `JOBTITLE`, `JOBDEPARTMENT`, `ORGANIZATION`, `USERAGENT` |
| **Contact** | `EMAIL`, `PHONE`, `URL` |
| **Address** | `STREET`, `BUILDINGNUMBER`, `SECONDARYADDRESS`, `CITY`, `COUNTY`, `STATE`, `ZIPCODE`, `GPSCOORDINATES`, `ORDINALDIRECTION` |
| **Dates & time** | `DATE`, `DATEOFBIRTH`, `TIME` |
| **Government IDs** | `SSN` |
| **Financial** | `ACCOUNTNAME`, `BANKACCOUNT`, `IBAN`, `BIC`, `CREDITCARD`, `CREDITCARDISSUER`, `CVV`, `PIN`, `MASKEDNUMBER`, `AMOUNT`, `CURRENCY`, `CURRENCYCODE`, `CURRENCYNAME`, `CURRENCYSYMBOL` |
| **Crypto** | `BITCOINADDRESS`, `ETHEREUMADDRESS`, `LITECOINADDRESS` |
| **Vehicle** | `VIN`, `VRM` |
| **Digital** | `IPADDRESS`, `MACADDRESS`, `IMEI` |
| **Auth** | `PASSWORD` |
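If you only need one of these groups, a post-filter over the grouped entities is enough. A sketch assuming the entity-dict format documented under "Direct MLX usage" below:
```python
# Sketch: keep only financial entities, using the category rows above
# and the pipeline's entity-dict format (see "Direct MLX usage").
FINANCIAL = {"ACCOUNTNAME", "BANKACCOUNT", "IBAN", "BIC", "CREDITCARD",
             "CREDITCARDISSUER", "CVV", "PIN", "MASKEDNUMBER", "AMOUNT",
             "CURRENCY", "CURRENCYCODE", "CURRENCYNAME", "CURRENCYSYMBOL"}

def keep_financial(entities):
    return [e for e in entities if e["entity_group"] in FINANCIAL]
```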
## Quick start
### With [OpenMed](https://github.com/maziyarpanahi/openmed) (recommended)
OpenMed gives you a single `extract_pii()` / `deidentify()` API that
auto-selects MLX on Apple Silicon and PyTorch elsewhere; the same code
runs on every host.
```bash
pip install -U "openmed[mlx]"
```
```python
from openmed import extract_pii, deidentify

text = (
    "Patient Sarah Johnson (DOB 03/15/1985), phone 415-555-0123, email sarah.johnson@example.com."
)

# Extract grouped entity spans (runs on MLX here, PyTorch fallback elsewhere)
result = extract_pii(text, model_name="OpenMed/privacy-filter-multilingual-mlx-8bit")
for ent in result.entities:
    print(f"{ent.label:30s} {ent.text!r} conf={ent.confidence:.2f}")

# De-identify
masked = deidentify(text, method="mask",
                    model_name="OpenMed/privacy-filter-multilingual-mlx-8bit")
fake = deidentify(
    text,
    method="replace",
    model_name="OpenMed/privacy-filter-multilingual-mlx-8bit",
    consistent=True,
    seed=42,  # deterministic locale-aware Faker surrogates
)
```
When MLX isn't available (Linux, Windows, Intel Macs, or a missing `mlx` package),
the same call automatically falls back to the PyTorch checkpoint
[`OpenMed/privacy-filter-multilingual`](https://huggingface.co/OpenMed/privacy-filter-multilingual) and emits a one-time warning. The fallback is
family-aware: a multilingual MLX request never substitutes an unrelated model.
### Direct MLX usage (lower-level)
```python
from huggingface_hub import snapshot_download
from openmed.mlx.inference import PrivacyFilterMLXPipeline
model_path = snapshot_download("OpenMed/privacy-filter-multilingual-mlx-8bit")
pipe = PrivacyFilterMLXPipeline(model_path)
print(pipe("Email me at alice.smith@example.com after 5pm."))
# [{'entity_group': 'EMAIL',
#   'score': 0.92,
#   'word': 'alice.smith@example.com',
#   'start': 12,
#   'end': 35}]
```
The pipeline returns a list of dicts with `entity_group`, `score`, `word`,
`start`, and `end` (character offsets into the input string).
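Those offsets make custom redaction straightforward. A minimal sketch that masks each detected span in place (processing right-to-left so earlier offsets stay valid):
```python
# Sketch: mask detected spans using the returned character offsets.
def mask_spans(text, entities):
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

text = "Email me at alice.smith@example.com after 5pm."
print(mask_spans(text, pipe(text)))
# Email me at [EMAIL] after 5pm.
```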
## Hardware notes
- Designed for Apple Silicon (M-series GPUs); CPU inference works but is slower. A quick host check is sketched after this list.
- Tested on macOS with `mlx>=0.18`. The MLX runtime in this repo is
independent of `mlx_lm` (token classification, not causal LM).
- Lower latency / smaller memory than the BF16 sibling.
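The `openmed` dispatcher performs runtime selection for you, but a manual host check is a few lines (a sketch; the library's internal logic may differ):
```python
# Sketch: detect whether the MLX path is usable on this host.
import platform
import sys
from importlib.util import find_spec

on_apple_silicon = sys.platform == "darwin" and platform.machine() == "arm64"
has_mlx = find_spec("mlx") is not None
print("MLX path available:", on_apple_silicon and has_mlx)
```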
## Credits & Acknowledgements
This artifact wouldn't exist without two open-source releases; sincere
thanks to both teams:
- **OpenAI** for [open-sourcing the Privacy Filter](https://huggingface.co/openai/privacy-filter)
(architecture, modeling code, and `opf` training/eval CLI). The MLX
port in this repo runs that same architecture under Apple's MLX
framework.
- **AI4Privacy** for releasing the multilingual PII masking datasets
used to fine-tune the source PyTorch checkpoint:
[`pii-masking-200k`](https://huggingface.co/datasets/ai4privacy/pii-masking-200k),
[`pii-masking-400k`](https://huggingface.co/datasets/ai4privacy/pii-masking-400k),
and [`open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy).
Additional thanks to **Apple** for [MLX](https://github.com/ml-explore/mlx)
and the **Hugging Face** team for the model-distribution ecosystem.
## License
Apache 2.0.