---
license: apache-2.0
base_model: OpenMed/privacy-filter-nemotron
datasets:
  - nvidia/Nemotron-PII
pipeline_tag: token-classification
library_name: openmed
tags:
  - openmed
  - mlx
  - apple-silicon
  - token-classification
  - pii
  - de-identification
  - medical
  - clinical
  - privacy-filter
  - nemotron
  - quantized
  - 8bit
language:
  - en
---

# OpenMed Privacy Filter (Nemotron) — MLX 8-bit

A native [MLX](https://github.com/ml-explore/mlx) port of
[`OpenMed/privacy-filter-nemotron`](https://huggingface.co/OpenMed/privacy-filter-nemotron),
affine-quantized to **8-bit** for fast on-device PII detection on Apple
Silicon. For the unquantized BF16 reference, see
[`OpenMed/privacy-filter-nemotron-mlx`](https://huggingface.co/OpenMed/privacy-filter-nemotron-mlx).

> **Family at a glance.** Same architecture and training data, three runtimes:
> - **PyTorch** — [`OpenMed/privacy-filter-nemotron`](https://huggingface.co/OpenMed/privacy-filter-nemotron) — CPU + CUDA.
> - **MLX BF16** — [`OpenMed/privacy-filter-nemotron-mlx`](https://huggingface.co/OpenMed/privacy-filter-nemotron-mlx) — Apple Silicon, full precision (~2.6 GB).
> - **MLX 8-bit (this repo)** — Apple Silicon, ~1.4 GB, ~1.7× faster than BF16.

## Why 8-bit?

| | BF16 sibling | This repo (Q8) |
| --- | --- | --- |
| `weights.safetensors` size | **2.6 GB** | **1.4 GB** (-47%) |
| Forward pass (10-token PII sample) | ~14 ms | ~8 ms (~1.7× faster) |
| Argmax agreement vs. BF16 | (reference) | **100%** on every test sample |
| Entity-group preservation | (reference) | **identical** on every test sample |

Numbers above are from `scripts/export/verify_privacy_filter_nemotron_mlx.py`
over 10 golden PII samples (email, phone, ssn, credit card, name, ipv4,
address, date_of_birth, url, mixed). Q8 with `group_size=64` was validated
against BF16; argmax matched on 100% of tokens, all entity-group sets
matched exactly.
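
For intuition, per-token agreement between the two checkpoints can be measured by comparing argmax labels over the same inputs. A minimal sketch (the local snapshot paths are placeholders, and this is not the shipped verification script):

```python
import mlx.core as mx
from openmed.mlx.models import load_model

# Placeholder paths to local snapshots of the two checkpoints.
bf16 = load_model("/path/to/privacy-filter-nemotron-mlx")
q8 = load_model("/path/to/privacy-filter-nemotron-mlx-8bit")

ids = mx.array([[1, 100, 200, 300]], dtype=mx.int32)
mask = mx.ones((1, 4), dtype=mx.bool_)

# Per-token argmax labels from each model, then the fraction that match.
ref = mx.argmax(bf16(ids, attention_mask=mask), axis=-1)
hyp = mx.argmax(q8(ids, attention_mask=mask), axis=-1)
agreement = mx.mean((ref == hyp).astype(mx.float32)).item()
print(f"argmax agreement: {agreement:.1%}")
```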

## What it does

The model is a token classifier built on OpenAI's open Privacy Filter
architecture (the same `openai_privacy_filter` model type used by
[`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)).
It tags each token with a BIOES label across **55 PII span classes**, then
a Viterbi pass over the BIOES grammar yields clean entity spans. Detected
categories include:

- Personal identifiers — `first_name`, `last_name`, `user_name`, `gender`, `age`, `date_of_birth`
- Contact — `email`, `phone_number`, `fax_number`, `street_address`, `city`, `state`, `country`, `county`, `postcode`, `coordinate`
- Government / legal IDs — `ssn`, `national_id`, `tax_id`, `certificate_license_number`
- Financial — `account_number`, `bank_routing_number`, `credit_debit_card`, `cvv`, `pin`, `swift_bic`
- Medical — `medical_record_number`, `health_plan_beneficiary_number`, `blood_type`
- Workplace — `company_name`, `occupation`, `employee_id`, `customer_id`, `employment_status`, `education_level`
- Online — `url`, `ipv4`, `ipv6`, `mac_address`, `http_cookie`, `api_key`, `password`, `device_identifier`
- Demographic — `race_ethnicity`, `religious_belief`, `political_view`, `sexuality`, `language`
- Vehicles — `license_plate`, `vehicle_identifier`
- Time — `date`, `date_time`, `time`
- Misc — `biometric_identifier`, `unique_id`

<details>
<summary>Full label schema (221 labels)</summary>

The output space is `O` plus `B-`, `I-`, `E-`, `S-` for each of the 55
span classes (4 × 55 + 1 = 221). The runtime `PrivacyFilterMLXPipeline`
runs Viterbi over this BIOES grammar, so the consumer sees clean grouped
entities rather than raw token tags.
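
As a concrete illustration, the label space can be enumerated like this (the authoritative ordering lives in `id2label.json`, which may differ):

```python
# Illustrative only: id2label.json in this repo is the authoritative mapping.
span_classes = ["first_name", "last_name", "email"]  # ... 55 classes in total
labels = ["O"] + [f"{p}-{c}" for c in span_classes for p in ("B", "I", "E", "S")]
# With all 55 classes: 4 * 55 + 1 = 221 labels.
```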

The full `id2label.json` is shipped alongside the weights in this repo.
</details>

For per-label accuracy, training recipe, and dataset details, see the
[base PyTorch checkpoint](https://huggingface.co/OpenMed/privacy-filter-nemotron).

## Architecture

| Field | Value |
| --- | --- |
| Source model type | `openai_privacy_filter` |
| Source architecture | `OpenAIPrivacyFilterForTokenClassification` |
| Hidden size | 640 |
| Transformer layers | 8 |
| Attention | Grouped-Query (14 query heads / 2 KV heads, head_dim=64) with attention sinks |
| FFN | Sparse Mixture-of-Experts — 128 experts, top-4 routing, SwiGLU |
| Position encoding | YARN-scaled RoPE (`rope_theta=150_000`, factor=32) |
| Context length | 131,072 tokens (initial 4,096) |
| Tokenizer | `o200k_base` (tiktoken) — vocab 200,064 |
| Output head | Linear(640 → 221) with bias |
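
These values can be read back from the shipped `config.json`. A quick sketch (the key names follow common Hugging Face conventions and are assumptions; inspect the file for the authoritative schema):

```python
import json
from huggingface_hub import hf_hub_download

# Key names are assumed to follow common HF conventions; check the file itself.
cfg_path = hf_hub_download("OpenMed/privacy-filter-nemotron-mlx-8bit", "config.json")
with open(cfg_path) as f:
    cfg = json.load(f)

for key in ("model_type", "hidden_size", "num_hidden_layers"):
    print(key, "=", cfg.get(key))
```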

## Quantization

| Field | Value |
| --- | --- |
| Bits | **8** |
| Group size | **64** |
| Mode | **affine** (MLX `mx.quantize`, weight-only) |
| Quantized modules | `embedding`, attention `qkv` & `out`, MoE `gate`, expert `swiglu` & `out`, `unembedding` |
| Non-quantized modules | RMSNorms, attention sinks (kept in BF16) |

Expert tensors are stored in MLX's packed transposed layout and run through
`mx.gather_qmm` at inference time. RMSNorm scales and attention sinks
remain BF16 because their parameter count is negligible relative to the
rest of the model.
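
For reference, MLX's weight-only affine quantization round-trips like this at the settings used here (a sketch, not the actual export script):

```python
import mlx.core as mx

# Affine weight-only quantization at bits=8, group_size=64, as used in this repo.
w = mx.random.normal((640, 640))
w_q, scales, biases = mx.quantize(w, group_size=64, bits=8)
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=8)

# At 8 bits the reconstruction error is small.
print(f"mean abs error: {mx.mean(mx.abs(w - w_hat)).item():.5f}")
```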

## File set

| File | Size | Purpose |
| --- | --- | --- |
| `weights.safetensors` | 1.4 GB | Q8 packed weights + scales/biases (uint32 packed for quantized modules, BF16 for norms/sinks) |
| `config.json` | 20 KB | Model + MLX runtime config (with `_mlx_quantization` block) |
| `id2label.json` | 5.4 KB | Numeric ID → BIOES label string |
| `openmed-mlx.json` | 0.8 KB | OpenMed MLX manifest with `quantization: {bits: 8, group_size: 64, mode: affine}` |
| `tokenizer.json`, `tokenizer_config.json` | 27 MB | Source tokenizer files (kept for reference) |

The MLX runtime uses `tiktoken` `o200k_base` directly for tokenization;
the `tokenizer.json` is kept so consumers can inspect or re-tokenize via
`transformers` if desired.
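
For example, the runtime's token IDs match what `tiktoken` produces directly:

```python
import tiktoken

# The MLX runtime tokenizes with tiktoken's o200k_base encoding.
enc = tiktoken.get_encoding("o200k_base")
ids = enc.encode("Email me at alice.smith@example.com after 5pm.")
print(len(ids), ids[:5])
print(enc.decode(ids))  # round-trips to the original string
```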

## Quick start

### With [OpenMed](https://github.com/maziyarpanahi/openmed) — recommended

OpenMed gives you a single `extract_pii()` / `deidentify()` API that
auto-selects MLX on Apple Silicon and PyTorch elsewhere — same code on
every host.

```bash
pip install -U "openmed[mlx]"
```

```python
from openmed import extract_pii, deidentify

text = (
    "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
    "phone 415-555-0123, email sarah.johnson@example.com."
)

# Extract grouped entity spans (runs on MLX 8-bit here, PyTorch fallback elsewhere)
result = extract_pii(text, model_name="OpenMed/privacy-filter-nemotron-mlx-8bit")
for ent in result.entities:
    print(f"{ent.label:30s} {ent.text!r}  conf={ent.confidence:.2f}")

# De-identify
masked = deidentify(text, method="mask",
                    model_name="OpenMed/privacy-filter-nemotron-mlx-8bit")
fake   = deidentify(
    text,
    method="replace",
    model_name="OpenMed/privacy-filter-nemotron-mlx-8bit",
    consistent=True,
    seed=42,   # deterministic locale-aware Faker surrogates
)
```

When MLX isn't available (Linux, Windows, Intel Mac, missing `mlx` package),
this exact same call automatically falls back to the PyTorch checkpoint
[`OpenMed/privacy-filter-nemotron`](https://huggingface.co/OpenMed/privacy-filter-nemotron)
with a one-time warning. The fallback is family-aware: a request for this
Nemotron model never silently substitutes the unrelated `openai/privacy-filter` baseline.

### Direct MLX usage (lower-level)

```python
from huggingface_hub import snapshot_download
from openmed.mlx.inference import PrivacyFilterMLXPipeline

model_path = snapshot_download("OpenMed/privacy-filter-nemotron-mlx-8bit")
pipe = PrivacyFilterMLXPipeline(model_path)

print(pipe("Email me at alice.smith@example.com after 5pm."))
# [{'entity_group': 'email',
#   'score': 0.92,
#   'word': 'alice.smith@example.com',
#   'start': 12,
#   'end': 35}]
```

The pipeline returns a list of dicts with `entity_group`, `score`, `word`,
`start`, and `end` (character offsets into the input string).
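
Those offsets make manual redaction straightforward. A minimal sketch building on the pipeline output above (for production use, prefer `deidentify()` from the OpenMed API):

```python
# Minimal redaction using the character offsets returned by the pipeline.
text = "Email me at alice.smith@example.com after 5pm."
entities = pipe(text)

# Replace spans right-to-left so earlier offsets stay valid.
for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
    text = text[: ent["start"]] + f"[{ent['entity_group'].upper()}]" + text[ent["end"]:]

print(text)  # Email me at [EMAIL] after 5pm.
```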

### Loading from a local snapshot

```python
from openmed.mlx.models import load_model
import mlx.core as mx

model = load_model("/path/to/privacy-filter-nemotron-mlx-8bit")
ids = mx.array([[1, 100, 200, 300]], dtype=mx.int32)
mask = mx.ones((1, 4), dtype=mx.bool_)
logits = model(ids, attention_mask=mask)   # shape (1, 4, 221)
```

## Hardware notes

- Designed for Apple Silicon (M-series GPUs); CPU inference works but is slower.
- Tested on macOS with `mlx>=0.18`.
- Q8 inference is ~1.7× faster than the BF16 sibling on the same hardware
  while preserving 100% argmax agreement on the test set.

## Credits & Acknowledgements

This model wouldn't exist without two open-source releases — sincere
thanks to both teams:

- **OpenAI** for [open-sourcing the Privacy Filter](https://huggingface.co/openai/privacy-filter)
  (architecture, modeling code, and `opf` training/eval CLI). The 8-bit
  MLX port in this repo runs that same architecture under Apple's MLX
  framework with affine weight-only quantization.
- **NVIDIA** for releasing the [Nemotron-PII dataset](https://huggingface.co/datasets/nvidia/Nemotron-PII)
  used to fine-tune the source PyTorch checkpoint.

Additional thanks to **Apple** for [MLX](https://github.com/ml-explore/mlx)
and the **Hugging Face** team for the model-distribution ecosystem.

## License

Apache 2.0 (matches the source checkpoint).