---
license: apache-2.0
base_model: OpenMed/privacy-filter-nemotron
datasets:
  - nvidia/Nemotron-PII
pipeline_tag: token-classification
library_name: openmed
tags:
  - openmed
  - mlx
  - apple-silicon
  - token-classification
  - pii
  - de-identification
  - medical
  - clinical
  - privacy-filter
  - nemotron
  - quantized
  - 8bit
language:
  - en
---

# OpenMed Privacy Filter (Nemotron) — MLX 8-bit

A native [MLX](https://github.com/ml-explore/mlx) port of
[`OpenMed/privacy-filter-nemotron`](https://huggingface.co/OpenMed/privacy-filter-nemotron),
affine-quantized to **8-bit** for fast on-device PII detection on Apple
Silicon. For the unquantized BF16 reference, see
[`OpenMed/privacy-filter-nemotron-mlx`](https://huggingface.co/OpenMed/privacy-filter-nemotron-mlx).

> **Family at a glance.** Same architecture and training data, three runtimes:
> - **PyTorch** — [`OpenMed/privacy-filter-nemotron`](https://huggingface.co/OpenMed/privacy-filter-nemotron) — CPU + CUDA.
> - **MLX BF16** — [`OpenMed/privacy-filter-nemotron-mlx`](https://huggingface.co/OpenMed/privacy-filter-nemotron-mlx) — Apple Silicon, full precision (~2.6 GB).
> - **MLX 8-bit (this repo)** — Apple Silicon, ~1.4 GB, ~1.7× faster than BF16.

## Why 8-bit?

| | BF16 sibling | This repo (Q8) |
| --- | --- | --- |
| `weights.safetensors` size | **2.6 GB** | **1.4 GB** (-47%) |
| Forward pass (10-token PII sample) | ~14 ms | ~8 ms (~1.7× faster) |
| Argmax agreement vs. BF16 | (reference) | **100%** on every test sample |
| Entity-group preservation | (reference) | **identical** on every test sample |

Numbers above are from `scripts/export/verify_privacy_filter_nemotron_mlx.py`
over 10 golden PII samples (email, phone, ssn, credit card, name, ipv4,
address, date_of_birth, url, mixed). Q8 with `group_size=64` was validated
against BF16; argmax matched on 100% of tokens, all entity-group sets
matched exactly.
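
For intuition, per-token agreement between the two checkpoints can be measured by comparing argmax labels over the same inputs. A minimal sketch (the local snapshot paths are placeholders, and this is not the shipped verification script):

```python
import mlx.core as mx
from openmed.mlx.models import load_model

# Placeholder paths to local snapshots of the two checkpoints.
bf16 = load_model("/path/to/privacy-filter-nemotron-mlx")
q8 = load_model("/path/to/privacy-filter-nemotron-mlx-8bit")

ids = mx.array([[1, 100, 200, 300]], dtype=mx.int32)
mask = mx.ones((1, 4), dtype=mx.bool_)

# Per-token argmax labels from each model, then the fraction that match.
ref = mx.argmax(bf16(ids, attention_mask=mask), axis=-1)
hyp = mx.argmax(q8(ids, attention_mask=mask), axis=-1)
agreement = mx.mean((ref == hyp).astype(mx.float32)).item()
print(f"argmax agreement: {agreement:.1%}")
```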

## What it does

The model is a token classifier built on OpenAI's open Privacy Filter
architecture (the same `openai_privacy_filter` model type used by
[`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)).
It tags each token with a BIOES label across **55 PII span classes**, then
a Viterbi pass over the BIOES grammar yields clean entity spans. Detected
categories include:

- Personal identifiers — `first_name`, `last_name`, `user_name`, `gender`, `age`, `date_of_birth`
- Contact — `email`, `phone_number`, `fax_number`, `street_address`, `city`, `state`, `country`, `county`, `postcode`, `coordinate`
- Government / legal IDs — `ssn`, `national_id`, `tax_id`, `certificate_license_number`
- Financial — `account_number`, `bank_routing_number`, `credit_debit_card`, `cvv`, `pin`, `swift_bic`
- Medical — `medical_record_number`, `health_plan_beneficiary_number`, `blood_type`
- Workplace — `company_name`, `occupation`, `employee_id`, `customer_id`, `employment_status`, `education_level`
- Online — `url`, `ipv4`, `ipv6`, `mac_address`, `http_cookie`, `api_key`, `password`, `device_identifier`
- Demographic — `race_ethnicity`, `religious_belief`, `political_view`, `sexuality`, `language`
- Vehicles — `license_plate`, `vehicle_identifier`
- Time — `date`, `date_time`, `time`
- Misc — `biometric_identifier`, `unique_id`

<details>
<summary>Full label schema (221 labels)</summary>

The output space is `O` plus `B-`, `I-`, `E-`, `S-` for each of the 55
span classes (4 × 55 + 1 = 221). The runtime `PrivacyFilterMLXPipeline`
runs Viterbi over this BIOES grammar, so the consumer sees clean grouped
entities rather than raw token tags.
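
As a concrete illustration, the label space can be enumerated like this (the authoritative ordering lives in `id2label.json`, which may differ):

```python
# Illustrative only: id2label.json in this repo is the authoritative mapping.
span_classes = ["first_name", "last_name", "email"]  # ... 55 classes in total
labels = ["O"] + [f"{p}-{c}" for c in span_classes for p in ("B", "I", "E", "S")]
# With all 55 classes: 4 * 55 + 1 = 221 labels.
```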

The full `id2label.json` is shipped alongside the weights in this repo.
</details>

For per-label accuracy, training recipe, and dataset details, see the
[base PyTorch checkpoint](https://huggingface.co/OpenMed/privacy-filter-nemotron).

## Architecture

| Field | Value |
| --- | --- |
| Source model type | `openai_privacy_filter` |
| Source architecture | `OpenAIPrivacyFilterForTokenClassification` |
| Hidden size | 640 |
| Transformer layers | 8 |
| Attention | Grouped-Query (14 query heads / 2 KV heads, head_dim=64) with attention sinks |
| FFN | Sparse Mixture-of-Experts — 128 experts, top-4 routing, SwiGLU |
| Position encoding | YARN-scaled RoPE (`rope_theta=150_000`, factor=32) |
| Context length | 131,072 tokens (initial 4,096) |
| Tokenizer | `o200k_base` (tiktoken) — vocab 200,064 |
| Output head | Linear(640 → 221) with bias |
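
These values can be read back from the shipped `config.json`. A quick sketch (the key names follow common Hugging Face conventions and are assumptions; inspect the file for the authoritative schema):

```python
import json
from huggingface_hub import hf_hub_download

# Key names are assumed to follow common HF conventions; check the file itself.
cfg_path = hf_hub_download("OpenMed/privacy-filter-nemotron-mlx-8bit", "config.json")
with open(cfg_path) as f:
    cfg = json.load(f)

for key in ("model_type", "hidden_size", "num_hidden_layers"):
    print(key, "=", cfg.get(key))
```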

## Quantization

| Field | Value |
| --- | --- |
| Bits | **8** |
| Group size | **64** |
| Mode | **affine** (MLX `mx.quantize`, weight-only) |
| Quantized modules | `embedding`, attention `qkv` & `out`, MoE `gate`, expert `swiglu` & `out`, `unembedding` |
| Non-quantized modules | RMSNorms, attention sinks (kept in BF16) |

Expert tensors are stored in MLX's packed transposed layout and run through
`mx.gather_qmm` at inference time. RMSNorm scales and attention sinks
remain BF16 because their parameter count is negligible relative to the
rest of the model.
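
For reference, MLX's weight-only affine quantization round-trips like this at the settings used here (a sketch, not the actual export script):

```python
import mlx.core as mx

# Affine weight-only quantization at bits=8, group_size=64, as used in this repo.
w = mx.random.normal((640, 640))
w_q, scales, biases = mx.quantize(w, group_size=64, bits=8)
w_hat = mx.dequantize(w_q, scales, biases, group_size=64, bits=8)

# At 8 bits the reconstruction error is small.
print(f"mean abs error: {mx.mean(mx.abs(w - w_hat)).item():.5f}")
```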

## File set

| File | Size | Purpose |
| --- | --- | --- |
| `weights.safetensors` | 1.4 GB | Q8 packed weights + scales/biases (uint32 packed for quantized modules, BF16 for norms/sinks) |
| `config.json` | 20 KB | Model + MLX runtime config (with `_mlx_quantization` block) |
| `id2label.json` | 5.4 KB | Numeric ID → BIOES label string |
| `openmed-mlx.json` | 0.8 KB | OpenMed MLX manifest with `quantization: {bits: 8, group_size: 64, mode: affine}` |
| `tokenizer.json`, `tokenizer_config.json` | 27 MB | Source tokenizer files (kept for reference) |

The MLX runtime uses `tiktoken` `o200k_base` directly for tokenization;
the `tokenizer.json` is kept so consumers can inspect or re-tokenize via
`transformers` if desired.
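
For example, the runtime's token IDs match what `tiktoken` produces directly:

```python
import tiktoken

# The MLX runtime tokenizes with tiktoken's o200k_base encoding.
enc = tiktoken.get_encoding("o200k_base")
ids = enc.encode("Email me at alice.smith@example.com after 5pm.")
print(len(ids), ids[:5])
print(enc.decode(ids))  # round-trips to the original string
```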

## Quick start

### With [OpenMed](https://github.com/maziyarpanahi/openmed) — recommended

OpenMed gives you a single `extract_pii()` / `deidentify()` API that
auto-selects MLX on Apple Silicon and PyTorch elsewhere — same code on
every host.

```bash
pip install -U "openmed[mlx]"
```

```python
from openmed import extract_pii, deidentify

text = (
    "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
    "phone 415-555-0123, email sarah.johnson@example.com."
)

# Extract grouped entity spans (runs on MLX 8-bit here, PyTorch fallback elsewhere)
result = extract_pii(text, model_name="OpenMed/privacy-filter-nemotron-mlx-8bit")
for ent in result.entities:
    print(f"{ent.label:30s} {ent.text!r}  conf={ent.confidence:.2f}")

# De-identify
masked = deidentify(text, method="mask",
                    model_name="OpenMed/privacy-filter-nemotron-mlx-8bit")
fake   = deidentify(
    text,
    method="replace",
    model_name="OpenMed/privacy-filter-nemotron-mlx-8bit",
    consistent=True,
    seed=42,   # deterministic locale-aware Faker surrogates
)
```

When MLX isn't available (Linux, Windows, Intel Mac, missing `mlx` package),
this exact same call automatically falls back to the PyTorch checkpoint
[`OpenMed/privacy-filter-nemotron`](https://huggingface.co/OpenMed/privacy-filter-nemotron)
with a one-time warning. The fallback is family-aware: a request for this
Nemotron model never silently substitutes the unrelated `openai/privacy-filter` baseline.

### Direct MLX usage (lower-level)

```python
from huggingface_hub import snapshot_download
from openmed.mlx.inference import PrivacyFilterMLXPipeline

model_path = snapshot_download("OpenMed/privacy-filter-nemotron-mlx-8bit")
pipe = PrivacyFilterMLXPipeline(model_path)

print(pipe("Email me at alice.smith@example.com after 5pm."))
# [{'entity_group': 'email',
#   'score': 0.92,
#   'word': 'alice.smith@example.com',
#   'start': 12,
#   'end': 35}]
```

The pipeline returns a list of dicts with `entity_group`, `score`, `word`,
`start`, and `end` (character offsets into the input string).
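
Those offsets make manual redaction straightforward. A minimal sketch building on the pipeline output above (for production use, prefer `deidentify()` from the OpenMed API):

```python
# Minimal redaction using the character offsets returned by the pipeline.
text = "Email me at alice.smith@example.com after 5pm."
entities = pipe(text)

# Replace spans right-to-left so earlier offsets stay valid.
for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
    text = text[: ent["start"]] + f"[{ent['entity_group'].upper()}]" + text[ent["end"]:]

print(text)  # Email me at [EMAIL] after 5pm.
```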

### Loading from a local snapshot

```python
from openmed.mlx.models import load_model
import mlx.core as mx

model = load_model("/path/to/privacy-filter-nemotron-mlx-8bit")
ids = mx.array([[1, 100, 200, 300]], dtype=mx.int32)
mask = mx.ones((1, 4), dtype=mx.bool_)
logits = model(ids, attention_mask=mask)   # shape (1, 4, 221)
```

## Hardware notes

- Designed for Apple Silicon (M-series GPUs); CPU inference works but is slower.
- Tested on macOS with `mlx>=0.18`.
- Q8 inference is ~1.7× faster than the BF16 sibling on the same hardware
  while preserving 100% argmax agreement on the test set.

## Credits & Acknowledgements

This model wouldn't exist without two open-source releases — sincere
thanks to both teams:

- **OpenAI** for [open-sourcing the Privacy Filter](https://huggingface.co/openai/privacy-filter)
  (architecture, modeling code, and `opf` training/eval CLI). The 8-bit
  MLX port in this repo runs that same architecture under Apple's MLX
  framework with affine weight-only quantization.
- **NVIDIA** for releasing the [Nemotron-PII dataset](https://huggingface.co/datasets/nvidia/Nemotron-PII)
  used to fine-tune the source PyTorch checkpoint.

Additional thanks to **Apple** for [MLX](https://github.com/ml-explore/mlx)
and the **Hugging Face** team for the model-distribution ecosystem.

## License

Apache 2.0 (matches the source checkpoint).