---
license: apache-2.0
base_model: OpenMed/privacy-filter-nemotron
datasets:
- nvidia/Nemotron-PII
pipeline_tag: token-classification
library_name: openmed
tags:
- openmed
- mlx
- apple-silicon
- token-classification
- pii
- de-identification
- medical
- clinical
- privacy-filter
- nemotron
- quantized
- 8bit
language:
- en
---
# OpenMed Privacy Filter (Nemotron) – MLX 8-bit
A native [MLX](https://github.com/ml-explore/mlx) port of
[`OpenMed/privacy-filter-nemotron`](https://huggingface.co/OpenMed/privacy-filter-nemotron),
affine-quantized to **8-bit** for fast on-device PII detection on Apple
Silicon. For the unquantized BF16 reference, see
[`OpenMed/privacy-filter-nemotron-mlx`](https://huggingface.co/OpenMed/privacy-filter-nemotron-mlx).
> **Family at a glance.** Same architecture and training data, three runtimes:
> - **PyTorch** – [`OpenMed/privacy-filter-nemotron`](https://huggingface.co/OpenMed/privacy-filter-nemotron) – CPU + CUDA.
> - **MLX BF16** – [`OpenMed/privacy-filter-nemotron-mlx`](https://huggingface.co/OpenMed/privacy-filter-nemotron-mlx) – Apple Silicon, full precision (~2.6 GB).
> - **MLX 8-bit (this repo)** – Apple Silicon, ~1.4 GB, ~1.7× faster than BF16.
## Why 8-bit?
| Metric | BF16 sibling | This repo (Q8) |
| --- | --- | --- |
| `weights.safetensors` size | **2.6 GB** | **1.4 GB** (-47%) |
| Forward pass (10-token PII sample) | ~14 ms | ~8 ms (~1.7× faster) |
| Argmax agreement vs. BF16 | (reference) | **100%** on every test sample |
| Entity-group preservation | (reference) | **identical** on every test sample |
Numbers above are from `scripts/export/verify_privacy_filter_nemotron_mlx.py`
over 10 golden PII samples (email, phone, ssn, credit card, name, ipv4,
address, date_of_birth, url, mixed). Q8 with `group_size=64` was validated
against BF16; argmax matched on 100% of tokens, all entity-group sets
matched exactly.
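For reference, a minimal sketch of that agreement check (not the verification
script itself; it reuses the loader and call signature shown later in this
card, with dummy token ids standing in for the golden samples):

```python
import mlx.core as mx
from huggingface_hub import snapshot_download
from openmed.mlx.models import load_model

# Load the BF16 reference and this Q8 checkpoint side by side
bf16 = load_model(snapshot_download("OpenMed/privacy-filter-nemotron-mlx"))
q8 = load_model(snapshot_download("OpenMed/privacy-filter-nemotron-mlx-8bit"))

ids = mx.array([[1, 100, 200, 300]], dtype=mx.int32)  # dummy token ids
mask = mx.ones(ids.shape, dtype=mx.bool_)

# Fraction of tokens where both checkpoints pick the same label
pred_bf16 = mx.argmax(bf16(ids, attention_mask=mask), axis=-1)
pred_q8 = mx.argmax(q8(ids, attention_mask=mask), axis=-1)
agreement = mx.mean((pred_bf16 == pred_q8).astype(mx.float32)).item()
print(f"argmax agreement: {agreement:.1%}")
```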
## What it does
The model is a token classifier built on OpenAI's open Privacy Filter
architecture (the same `openai_privacy_filter` model type used by
[`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)).
It tags each token with a BIOES label across **55 PII span classes**, then
a Viterbi pass over the BIOES grammar yields clean entity spans. Detected
categories include:
- Personal identifiers – `first_name`, `last_name`, `user_name`, `gender`, `age`, `date_of_birth`
- Contact – `email`, `phone_number`, `fax_number`, `street_address`, `city`, `state`, `country`, `county`, `postcode`, `coordinate`
- Government / legal IDs – `ssn`, `national_id`, `tax_id`, `certificate_license_number`
- Financial – `account_number`, `bank_routing_number`, `credit_debit_card`, `cvv`, `pin`, `swift_bic`
- Medical – `medical_record_number`, `health_plan_beneficiary_number`, `blood_type`
- Workplace – `company_name`, `occupation`, `employee_id`, `customer_id`, `employment_status`, `education_level`
- Online – `url`, `ipv4`, `ipv6`, `mac_address`, `http_cookie`, `api_key`, `password`, `device_identifier`
- Demographic – `race_ethnicity`, `religious_belief`, `political_view`, `sexuality`, `language`
- Vehicles – `license_plate`, `vehicle_identifier`
- Time – `date`, `date_time`, `time`
- Misc – `biometric_identifier`, `unique_id`
<details>
<summary>Full label schema (221 labels)</summary>
The output space is `O` plus `B-`, `I-`, `E-`, `S-` for each of the 55
span classes (4 × 55 + 1 = 221). The runtime `PrivacyFilterMLXPipeline`
runs Viterbi over this BIOES grammar, so the consumer sees clean grouped
entities rather than raw token tags.
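A minimal sketch of that grammar (the three class names below are placeholders
for the full 55; the canonical id-to-label order lives in `id2label.json`):

```python
# The 221-label output space: "O" plus B/I/E/S per span class.
span_classes = ["email", "phone_number", "ssn"]  # placeholder for all 55
labels = ["O"] + [f"{p}-{c}" for c in span_classes for p in "BIES"]

def can_follow(prev: str, nxt: str) -> bool:
    """Standard BIOES transition rule that the Viterbi pass enforces."""
    if prev == "O" or prev[0] in "ES":      # no open span: start fresh or stay outside
        return nxt == "O" or nxt[0] in "BS"
    cls = prev[2:]                          # prev is B-x or I-x: the span must
    return nxt in (f"I-{cls}", f"E-{cls}")  # continue (I-x) or close (E-x)
```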
The full `id2label.json` is shipped alongside the weights in this repo.
</details>
For per-label accuracy, training recipe, and dataset details, see the
[base PyTorch checkpoint](https://huggingface.co/OpenMed/privacy-filter-nemotron).
## Architecture
| Field | Value |
| --- | --- |
| Source model type | `openai_privacy_filter` |
| Source architecture | `OpenAIPrivacyFilterForTokenClassification` |
| Hidden size | 640 |
| Transformer layers | 8 |
| Attention | Grouped-Query (14 query heads / 2 KV heads, head_dim=64) with attention sinks |
| FFN | Sparse Mixture-of-Experts: 128 experts, top-4 routing, SwiGLU |
| Position encoding | YARN-scaled RoPE (`rope_theta=150_000`, factor=32) |
| Context length | 131,072 tokens (initial 4,096) |
| Tokenizer | `o200k_base` (tiktoken), vocab 200,064 |
| Output head | Linear(640 → 221) with bias |
## Quantization
| Field | Value |
| --- | --- |
| Bits | **8** |
| Group size | **64** |
| Mode | **affine** (MLX `mx.quantize`, weight-only) |
| Quantized modules | `embedding`, attention `qkv` & `out`, MoE `gate`, expert `swiglu` & `out`, `unembedding` |
| Non-quantized modules | RMSNorms, attention sinks (kept in BF16) |
Expert tensors are stored in MLX's packed transposed layout and run through
`mx.gather_qmm` at inference time. RMSNorm scales and attention sinks
remain BF16 because their parameter count is negligible relative to the
rest of the model.
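A minimal round-trip through MLX's quantization primitives illustrates the
scheme (the shipped weights are already packed; the random matrix here is
just a stand-in for one weight tensor):

```python
import mlx.core as mx

w = mx.random.normal((640, 640))  # stand-in weight matrix

# Affine 8-bit weight-only quantization, 64 weights per scale/bias group
w_q, scales, biases = mx.quantize(w, group_size=64, bits=8)

# Dequantize to inspect the error Q8 introduces
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=8)
print(f"max abs error: {mx.max(mx.abs(w - w_hat)).item():.5f}")
```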
## File set
| File | Size | Purpose |
| --- | --- | --- |
| `weights.safetensors` | 1.4 GB | Q8 packed weights + scales/biases (uint32 packed for quantized modules, BF16 for norms/sinks) |
| `config.json` | 20 KB | Model + MLX runtime config (with `_mlx_quantization` block) |
| `id2label.json` | 5.4 KB | Numeric ID → BIOES label string |
| `openmed-mlx.json` | 0.8 KB | OpenMed MLX manifest with `quantization: {bits: 8, group_size: 64, mode: affine}` |
| `tokenizer.json`, `tokenizer_config.json` | 27 MB | Source tokenizer files (kept for reference) |
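For orientation, the manifest's quantization block mirrors the settings above;
an illustrative excerpt (surrounding keys omitted):

```json
{
  "quantization": {
    "bits": 8,
    "group_size": 64,
    "mode": "affine"
  }
}
```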
The MLX runtime uses `tiktoken` `o200k_base` directly for tokenization;
the `tokenizer.json` is kept so consumers can inspect or re-tokenize via
`transformers` if desired.
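Tokenizing with `tiktoken` directly looks like this (the encoding named above;
mapping ids back to character offsets is the pipeline's job):

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
ids = enc.encode("Email me at alice.smith@example.com after 5pm.")
print(len(ids), ids[:5])  # token count and the first few ids
print(enc.decode(ids))    # round-trips to the original string
```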
## Quick start
### With [OpenMed](https://github.com/maziyarpanahi/openmed) (recommended)
OpenMed gives you a single `extract_pii()` / `deidentify()` API that
auto-selects MLX on Apple Silicon and PyTorch elsewhere – same code on
every host.
```bash
pip install -U "openmed[mlx]"
```
```python
from openmed import extract_pii, deidentify
text = (
    "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
    "phone 415-555-0123, email sarah.johnson@example.com."
)

# Extract grouped entity spans (runs on MLX 8-bit here, PyTorch fallback elsewhere)
result = extract_pii(text, model_name="OpenMed/privacy-filter-nemotron-mlx-8bit")
for ent in result.entities:
    print(f"{ent.label:30s} {ent.text!r} conf={ent.confidence:.2f}")

# De-identify
masked = deidentify(text, method="mask",
                    model_name="OpenMed/privacy-filter-nemotron-mlx-8bit")
fake = deidentify(
    text,
    method="replace",
    model_name="OpenMed/privacy-filter-nemotron-mlx-8bit",
    consistent=True,
    seed=42,  # deterministic locale-aware Faker surrogates
)
```
When MLX isn't available (Linux, Windows, Intel Mac, missing `mlx` package),
the same call automatically falls back to the PyTorch checkpoint
[`OpenMed/privacy-filter-nemotron`](https://huggingface.co/OpenMed/privacy-filter-nemotron)
with a one-time warning. The fallback is family-aware: a Nemotron MLX request
never silently substitutes the separate `openai/privacy-filter` baseline.
### Direct MLX usage (lower-level)
```python
from huggingface_hub import snapshot_download
from openmed.mlx.inference import PrivacyFilterMLXPipeline
model_path = snapshot_download("OpenMed/privacy-filter-nemotron-mlx-8bit")
pipe = PrivacyFilterMLXPipeline(model_path)
print(pipe("Email me at alice.smith@example.com after 5pm."))
# [{'entity_group': 'email',
#   'score': 0.92,
#   'word': 'alice.smith@example.com',
#   'start': 12,
#   'end': 35}]
```
The pipeline returns a list of dicts with `entity_group`, `score`, `word`,
`start`, and `end` (character offsets into the input string).
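Those offsets make custom redaction straightforward on top of the raw output;
a small sketch reusing `pipe` from above:

```python
text = "Email me at alice.smith@example.com after 5pm."
entities = pipe(text)

# Replace each detected span with a [LABEL] placeholder, right to left so
# earlier character offsets stay valid while we splice.
for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
    text = text[: ent["start"]] + f"[{ent['entity_group'].upper()}]" + text[ent["end"] :]

print(text)  # Email me at [EMAIL] after 5pm.
```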
### Loading from a local snapshot
```python
from openmed.mlx.models import load_model
import mlx.core as mx
model = load_model("/path/to/privacy-filter-nemotron-mlx-8bit")

# Dummy batch: one sequence of four o200k_base token ids, fully attended
ids = mx.array([[1, 100, 200, 300]], dtype=mx.int32)
mask = mx.ones((1, 4), dtype=mx.bool_)
logits = model(ids, attention_mask=mask)  # shape (1, 4, 221)
```
## Hardware notes
- Designed for Apple Silicon (M-series GPUs); CPU inference works but is slower.
- Tested on macOS with `mlx>=0.18`.
- Q8 inference is ~1.7× faster than the BF16 sibling on the same hardware
while preserving 100% argmax agreement on the test set.
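To get a rough latency number on your own machine, an informal sketch (not the
verification script; the first call is a warm-up to amortize one-time setup):

```python
import time

from huggingface_hub import snapshot_download
from openmed.mlx.inference import PrivacyFilterMLXPipeline

pipe = PrivacyFilterMLXPipeline(
    snapshot_download("OpenMed/privacy-filter-nemotron-mlx-8bit")
)
sample = "Call 415-555-0123 or email sarah.johnson@example.com."
pipe(sample)  # warm-up

t0 = time.perf_counter()
for _ in range(50):
    pipe(sample)
print(f"avg latency: {(time.perf_counter() - t0) / 50 * 1e3:.1f} ms")
```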
## Credits & Acknowledgements
This model wouldn't exist without two open-source releases; sincere
thanks to both teams:
- **OpenAI** for [open-sourcing the Privacy Filter](https://huggingface.co/openai/privacy-filter)
(architecture, modeling code, and `opf` training/eval CLI). The 8-bit
MLX port in this repo runs that same architecture under Apple's MLX
framework with affine weight-only quantization.
- **NVIDIA** for releasing the [Nemotron-PII dataset](https://huggingface.co/datasets/nvidia/Nemotron-PII)
used to fine-tune the source PyTorch checkpoint.
Additional thanks to **Apple** for [MLX](https://github.com/ml-explore/mlx)
and the **HuggingFace** team for the model-distribution ecosystem.
## License
Apache 2.0 (matches the source checkpoint).