---
license: apache-2.0
base_model: OpenMed/privacy-filter-multilingual
datasets:
  - ai4privacy/pii-masking-200k
  - ai4privacy/pii-masking-400k
  - ai4privacy/open-pii-masking-500k-ai4privacy
pipeline_tag: token-classification
library_name: openmed
tags:
  - openmed
  - mlx
  - apple-silicon
  - token-classification
  - pii
  - de-identification
  - privacy-filter
  - multilingual
language:
  - ar
  - bn
  - de
  - en
  - es
  - fr
  - hi
  - it
  - ja
  - ko
  - nl
  - pt
  - te
  - tr
  - vi
  - zh
---

# OpenMed Privacy Filter (Multilingual) — MLX 8-bit

A native [MLX](https://github.com/ml-explore/mlx) port of
[`OpenMed/privacy-filter-multilingual`](https://huggingface.co/OpenMed/privacy-filter-multilingual)
for fast, on-device, fine-grained PII detection across **54 categories**
and **16 languages** on Apple Silicon.
This 8-bit affine-quantized artifact reduces download size and resident memory; for the full-precision sibling, see [`OpenMed/privacy-filter-multilingual-mlx`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx).

> **Family at a glance.** Same architecture and training data, three runtimes:
> - **PyTorch** — [`OpenMed/privacy-filter-multilingual`](https://huggingface.co/OpenMed/privacy-filter-multilingual) — CPU + CUDA.
> - **MLX BF16** — [`OpenMed/privacy-filter-multilingual-mlx`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx) — Apple Silicon, full precision (~2.6 GB).
> - **MLX 8-bit** (this repo) — [`OpenMed/privacy-filter-multilingual-mlx-8bit`](https://huggingface.co/OpenMed/privacy-filter-multilingual-mlx-8bit) — Apple Silicon, ~1.4 GB.

## What it does

The model is a token classifier built on the OpenAI Privacy Filter
architecture (`openai_privacy_filter`). It tags each token with a BIOES
label across **54 PII span classes**, then a Viterbi pass over the BIOES
grammar yields clean entity spans. Languages covered: Arabic, Bengali,
Chinese, Dutch, English, French, German, Hindi, Italian, Japanese,
Korean, Portuguese, Spanish, Telugu, Turkish, Vietnamese.

<details>
<summary>Full label schema (217 BIOES labels)</summary>

The output space is `O` plus `B-`, `I-`, `E-`, `S-` for each of the 54
span classes (4 × 54 + 1 = 217). The runtime `PrivacyFilterMLXPipeline`
runs Viterbi over this BIOES grammar, so the consumer sees clean grouped
entities rather than raw token tags. The full `id2label` mapping is
shipped alongside the weights in this repo.
</details>
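As a sanity check on the arithmetic, the expansion from span classes to BIOES labels can be sketched in a few lines of Python (the three class names below are a small illustrative subset of the 54; the helper is not part of the shipped runtime):

```python
def build_label_space(span_classes):
    """Expand span classes into a BIOES output space: O plus B/I/E/S per class."""
    labels = ["O"]
    for cls in span_classes:
        labels.extend(f"{prefix}-{cls}" for prefix in ("B", "I", "E", "S"))
    return labels

# Three of the model's 54 classes, for brevity.
print(len(build_label_space(["FIRSTNAME", "EMAIL", "PHONE"])))  # 13 = 4 * 3 + 1
# With all 54 classes, the head size works out to 4 * 54 + 1 = 217.
```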

For per-label accuracy, training recipe, and dataset details, see the
[base PyTorch checkpoint](https://huggingface.co/OpenMed/privacy-filter-multilingual).

## Architecture

| Field | Value |
| --- | --- |
| Source model type | `openai_privacy_filter` |
| Source architecture | `OpenAIPrivacyFilterForTokenClassification` |
| Hidden size | 640 |
| Transformer layers | 8 |
| Attention | Grouped-Query (14 query heads / 2 KV heads, head_dim=64) with attention sinks |
| FFN | Sparse Mixture-of-Experts — 128 experts, top-4 routing, SwiGLU |
| Position encoding | YARN-scaled RoPE (`rope_theta=150_000`, factor=32) |
| Context length | 131,072 tokens (initial 4,096) |
| Tokenizer | `o200k_base` (tiktoken) — vocab 200,064 |
| Output head | Linear(640 → 217) with bias |

## File set

| File | Size | Purpose |
| --- | --- | --- |
| `weights.safetensors` | ~1.4 GB | Model weights in OpenMed-MLX layout |
| `config.json` | ~19 KB | Model + MLX runtime config |
| `id2label.json` | ~5 KB | Numeric ID โ†’ BIOES label string |
| `openmed-mlx.json` | ~1 KB | OpenMed MLX manifest (task, family, runtime hints) |
| `tokenizer.json`, `tokenizer_config.json` | ~28 MB | Source tokenizer files (kept for reference) |

The MLX runtime uses `tiktoken` `o200k_base` directly for tokenization;
the `tokenizer.json` is kept so consumers can inspect or re-tokenize via
`transformers` if desired.

## Label space (54 categories)

| Category | Typical examples |
|---|---|
| **Identity** | `FIRSTNAME`, `MIDDLENAME`, `LASTNAME`, `PREFIX`, `AGE`, `GENDER`, `SEX`, `EYECOLOR`, `HEIGHT`, `USERNAME`, `OCCUPATION`, `JOBTITLE`, `JOBDEPARTMENT`, `ORGANIZATION`, `USERAGENT` |
| **Contact** | `EMAIL`, `PHONE`, `URL` |
| **Address** | `STREET`, `BUILDINGNUMBER`, `SECONDARYADDRESS`, `CITY`, `COUNTY`, `STATE`, `ZIPCODE`, `GPSCOORDINATES`, `ORDINALDIRECTION` |
| **Dates & time** | `DATE`, `DATEOFBIRTH`, `TIME` |
| **Government IDs** | `SSN` |
| **Financial** | `ACCOUNTNAME`, `BANKACCOUNT`, `IBAN`, `BIC`, `CREDITCARD`, `CREDITCARDISSUER`, `CVV`, `PIN`, `MASKEDNUMBER`, `AMOUNT`, `CURRENCY`, `CURRENCYCODE`, `CURRENCYNAME`, `CURRENCYSYMBOL` |
| **Crypto** | `BITCOINADDRESS`, `ETHEREUMADDRESS`, `LITECOINADDRESS` |
| **Vehicle** | `VIN`, `VRM` |
| **Digital** | `IPADDRESS`, `MACADDRESS`, `IMEI` |
| **Auth** | `PASSWORD` |


## Quick start

### With [OpenMed](https://github.com/maziyarpanahi/openmed) — recommended

OpenMed gives you a single `extract_pii()` / `deidentify()` API that
auto-selects MLX on Apple Silicon and PyTorch elsewhere — same code on
every host.

```bash
pip install -U "openmed[mlx]"
```

```python
from openmed import extract_pii, deidentify

text = (
    "Patient Sarah Johnson (DOB 03/15/1985), phone 415-555-0123, email sarah.johnson@example.com."
)

# Extract grouped entity spans (runs on MLX here, PyTorch fallback elsewhere)
result = extract_pii(text, model_name="OpenMed/privacy-filter-multilingual-mlx-8bit")
for ent in result.entities:
    print(f"{ent.label:30s} {ent.text!r}  conf={ent.confidence:.2f}")

# De-identify
masked = deidentify(text, method="mask",
                    model_name="OpenMed/privacy-filter-multilingual-mlx-8bit")
fake   = deidentify(
    text,
    method="replace",
    model_name="OpenMed/privacy-filter-multilingual-mlx-8bit",
    consistent=True,
    seed=42,   # deterministic locale-aware Faker surrogates
)
```

When MLX isn't available (Linux, Windows, Intel Mac, or a missing `mlx` package),
the exact same call automatically falls back to the PyTorch checkpoint
[`OpenMed/privacy-filter-multilingual`](https://huggingface.co/OpenMed/privacy-filter-multilingual) with a one-time warning. The fallback is family-aware: a multilingual
MLX request never substitutes an unrelated baseline.

### Direct MLX usage (lower-level)

```python
from huggingface_hub import snapshot_download
from openmed.mlx.inference import PrivacyFilterMLXPipeline

model_path = snapshot_download("OpenMed/privacy-filter-multilingual-mlx-8bit")
pipe = PrivacyFilterMLXPipeline(model_path)

print(pipe("Email me at alice.smith@example.com after 5pm."))
# [{'entity_group': 'EMAIL',
#   'score': 0.92,
#   'word': 'alice.smith@example.com',
#   'start': 12,
#   'end': 35}]
```

The pipeline returns a list of dicts with `entity_group`, `score`, `word`,
`start`, and `end` (character offsets into the input string).
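Because `start`/`end` are character offsets, downstream redaction needs only a few lines of plain Python (the `entities` list below is a hand-written example in the format just described, not captured pipeline output):

```python
def mask_entities(text, entities):
    """Replace each detected span with a [LABEL] placeholder.

    Spans are applied right-to-left so earlier offsets stay valid
    while the string changes length.
    """
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

text = "Email me at alice.smith@example.com after 5pm."
entities = [{"entity_group": "EMAIL", "score": 0.92,
             "word": "alice.smith@example.com", "start": 12, "end": 35}]
print(mask_entities(text, entities))  # Email me at [EMAIL] after 5pm.
```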

## Hardware notes

- Designed for Apple Silicon (M-series GPUs); CPU inference works but is slower.
- Tested on macOS with `mlx>=0.18`. The MLX runtime in this repo is
  independent of `mlx_lm` (token classification, not causal LM).
- Lower latency / smaller memory than the BF16 sibling.

## Credits & Acknowledgements

This artifact wouldn't exist without two open-source releases — sincere
thanks to both teams:

- **OpenAI** for [open-sourcing the Privacy Filter](https://huggingface.co/openai/privacy-filter)
  (architecture, modeling code, and `opf` training/eval CLI). The MLX
  port in this repo runs that same architecture under Apple's MLX
  framework.
- **AI4Privacy** for releasing the multilingual PII masking datasets
  used to fine-tune the source PyTorch checkpoint:
  [`pii-masking-200k`](https://huggingface.co/datasets/ai4privacy/pii-masking-200k),
  [`pii-masking-400k`](https://huggingface.co/datasets/ai4privacy/pii-masking-400k),
  and [`open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy).

Additional thanks to **Apple** for [MLX](https://github.com/ml-explore/mlx)
and the **HuggingFace** team for the model-distribution ecosystem.

## License

Apache 2.0.