| --- |
| license: apache-2.0 |
| base_model: OpenMed/privacy-filter-nemotron |
| datasets: |
| - nvidia/Nemotron-PII |
| pipeline_tag: token-classification |
| library_name: openmed |
| tags: |
| - openmed |
| - mlx |
| - apple-silicon |
| - token-classification |
| - pii |
| - de-identification |
| - medical |
| - clinical |
| - privacy-filter |
| - nemotron |
| - quantized |
| - 8bit |
| language: |
| - en |
| --- |
| |
| # OpenMed Privacy Filter (Nemotron): MLX 8-bit |
|
|
| A native [MLX](https://github.com/ml-explore/mlx) port of |
| [`OpenMed/privacy-filter-nemotron`](https://huggingface.co/OpenMed/privacy-filter-nemotron), |
| affine-quantized to **8-bit** for fast on-device PII detection on Apple |
| Silicon. For the unquantized BF16 reference, see |
| [`OpenMed/privacy-filter-nemotron-mlx`](https://huggingface.co/OpenMed/privacy-filter-nemotron-mlx). |
|
|
| > **Family at a glance.** Same architecture and training data, three runtimes: |
| > - **PyTorch**: [`OpenMed/privacy-filter-nemotron`](https://huggingface.co/OpenMed/privacy-filter-nemotron) (CPU + CUDA). |
| > - **MLX BF16**: [`OpenMed/privacy-filter-nemotron-mlx`](https://huggingface.co/OpenMed/privacy-filter-nemotron-mlx) (Apple Silicon, full precision, ~2.6 GB). |
| > - **MLX 8-bit (this repo)**: Apple Silicon, ~1.4 GB, ~1.7× faster than BF16. |
|
|
| ## Why 8-bit? |
|
|
| | | BF16 sibling | This repo (Q8) | |
| | --- | --- | --- | |
| | `weights.safetensors` size | **2.6 GB** | **1.4 GB** (-47%) | |
| | Forward pass (10-token PII sample) | ~14 ms | ~8 ms (~1.7× faster) | |
| | Argmax agreement vs. BF16 | (reference) | **100%** on every test sample | |
| | Entity-group preservation | (reference) | **identical** on every test sample | |
|
|
| Numbers above are from `scripts/export/verify_privacy_filter_nemotron_mlx.py` |
| over 10 golden PII samples (email, phone, ssn, credit card, name, ipv4, |
| address, date_of_birth, url, mixed). Q8 with `group_size=64` was validated |
| against BF16; argmax matched on 100% of tokens, all entity-group sets |
| matched exactly. |
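
The "argmax agreement" metric in the table can be illustrated with a small NumPy sketch. This is not the verification script itself; the function name `argmax_agreement` and the toy logits are assumptions made for illustration. It compares, token by token, which label each model ranks highest.

```python
import numpy as np


def argmax_agreement(ref_logits: np.ndarray, test_logits: np.ndarray) -> float:
    """Fraction of token positions where both models pick the same label."""
    if ref_logits.shape != test_logits.shape:
        raise ValueError("logit arrays must have the same shape")
    ref_ids = ref_logits.argmax(axis=-1)
    test_ids = test_logits.argmax(axis=-1)
    return float((ref_ids == test_ids).mean())


# Toy stand-ins for BF16 and Q8 logits with shape (batch, tokens, labels).
rng = np.random.default_rng(0)
bf16_logits = np.zeros((1, 10, 221))
bf16_logits[0, np.arange(10), np.arange(10)] = 5.0  # one clear winner per token
# Small perturbation standing in for quantization noise (hypothetical magnitude).
q8_logits = bf16_logits + rng.normal(scale=1e-3, size=bf16_logits.shape)

print(f"argmax agreement: {argmax_agreement(bf16_logits, q8_logits):.0%}")
```

Agreement stays at 100% as long as quantization noise is smaller than the gap between the top two logits at every token, which is why a weight-only 8-bit model can match BF16 label for label even though the raw logits differ slightly.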
|
|
| ## What it does |
|
|
| The model is a token classifier built on OpenAI's open Privacy Filter |
| architecture (the same `openai_privacy_filter` model type used by |
| [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)). |
| It tags each token with a BIOES label across **55 PII span classes**, then |
| a Viterbi pass over the BIOES grammar yields clean entity spans. Detected |
| categories include: |
|
|
| - Personal identifiers: `first_name`, `last_name`, `user_name`, `gender`, `age`, `date_of_birth` |
| - Contact: `email`, `phone_number`, `fax_number`, `street_address`, `city`, `state`, `country`, `county`, `postcode`, `coordinate` |
| - Government / legal IDs: `ssn`, `national_id`, `tax_id`, `certificate_license_number` |
| - Financial: `account_number`, `bank_routing_number`, `credit_debit_card`, `cvv`, `pin`, `swift_bic` |
| - Medical: `medical_record_number`, `health_plan_beneficiary_number`, `blood_type` |
| - Workplace: `company_name`, `occupation`, `employee_id`, `customer_id`, `employment_status`, `education_level` |
| - Online: `url`, `ipv4`, `ipv6`, `mac_address`, `http_cookie`, `api_key`, `password`, `device_identifier` |
| - Demographic: `race_ethnicity`, `religious_belief`, `political_view`, `sexuality`, `language` |
| - Vehicles: `license_plate`, `vehicle_identifier` |
| - Time: `date`, `date_time`, `time` |
| - Misc: `biometric_identifier`, `unique_id` |
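
To make the "Viterbi pass over the BIOES grammar" concrete, here is a minimal pure-Python sketch of constrained decoding. It assumes hard transition constraints (a transition is either allowed or forbidden) and per-token log-scores; the function names and the toy scores are illustrative, and the real `PrivacyFilterMLXPipeline` may implement this differently.

```python
import math


def allowed(prev: str, cur: str) -> bool:
    """BIOES grammar: B/I must continue with I/E of the same class; O/E/S start fresh."""
    prev_tag, _, prev_cls = prev.partition("-")
    cur_tag, _, cur_cls = cur.partition("-")
    if prev_tag in ("B", "I"):
        return cur_tag in ("I", "E") and cur_cls == prev_cls
    return cur_tag in ("O", "B", "S")


def viterbi(scores, labels):
    """scores[t][j] is the log-score of labels[j] at token t; returns the best valid path."""
    neg_inf = -math.inf
    n = len(labels)
    # A sequence cannot start inside an entity (no leading I- or E-).
    best = [s if labels[j][0] in "OBS" else neg_inf for j, s in enumerate(scores[0])]
    back = []
    for row in scores[1:]:
        new_best, pointers = [], []
        for j in range(n):
            prev_score, prev_idx = max(
                (best[i], i) for i in range(n) if allowed(labels[i], labels[j])
            )
            new_best.append(prev_score + row[j])
            pointers.append(prev_idx)
        best = new_best
        back.append(pointers)
    # A sequence cannot end inside an entity (no trailing B- or I-).
    _, j = max((b if labels[k][0] in "OES" else neg_inf, k) for k, b in enumerate(best))
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    return [labels[k] for k in reversed(path)]


labels = ["O", "B-email", "I-email", "E-email", "S-email"]
scores = [
    [0.0, -5.0, -5.0, -5.0, -5.0],
    [-3.0, -0.1, -4.0, -4.0, -2.0],
    [-0.6, -5.0, -0.8, -3.0, -5.0],  # raw argmax says O, which would break the span
    [-3.0, -5.0, -4.0, -0.1, -3.0],
]
print(viterbi(scores, labels))
# ['O', 'B-email', 'I-email', 'E-email']
```

In this toy input a plain per-token argmax would yield `O, B-email, O, E-email`, which is not a valid BIOES sequence; the constrained pass repairs the third token so the span stays contiguous.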
|
|
| <details> |
| <summary>Full label schema (221 labels)</summary> |
|
|
| The output space is `O` plus `B-`, `I-`, `E-`, `S-` for each of the 55 |
| span classes (4 × 55 + 1 = 221). The runtime `PrivacyFilterMLXPipeline` |
| runs Viterbi over this BIOES grammar, so the consumer sees clean grouped |
| entities rather than raw token tags. |
|
|
| The full `id2label.json` is shipped alongside the weights in this repo. |
| </details> |
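
The label count can be reproduced from the category list above. The sketch below builds a BIOES label space from the 55 span classes; the ordering of labels is an assumption made for illustration, and the shipped `id2label.json` remains the authoritative mapping.

```python
SPAN_CLASSES = """
first_name last_name user_name gender age date_of_birth
email phone_number fax_number street_address city state country county postcode coordinate
ssn national_id tax_id certificate_license_number
account_number bank_routing_number credit_debit_card cvv pin swift_bic
medical_record_number health_plan_beneficiary_number blood_type
company_name occupation employee_id customer_id employment_status education_level
url ipv4 ipv6 mac_address http_cookie api_key password device_identifier
race_ethnicity religious_belief political_view sexuality language
license_plate vehicle_identifier
date date_time time
biometric_identifier unique_id
""".split()


def build_labels(span_classes):
    """Return ['O', 'B-x', 'I-x', 'E-x', 'S-x', ...] for every span class x."""
    labels = ["O"]
    for cls in span_classes:
        labels.extend(f"{prefix}-{cls}" for prefix in "BIES")
    return labels


labels = build_labels(SPAN_CLASSES)
print(len(SPAN_CLASSES), len(labels))
# 55 221
```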
|
|
| For per-label accuracy, training recipe, and dataset details, see the |
| [base PyTorch checkpoint](https://huggingface.co/OpenMed/privacy-filter-nemotron). |
|
|
| ## Architecture |
|
|
| | Field | Value | |
| | --- | --- | |
| | Source model type | `openai_privacy_filter` | |
| | Source architecture | `OpenAIPrivacyFilterForTokenClassification` | |
| | Hidden size | 640 | |
| | Transformer layers | 8 | |
| | Attention | Grouped-Query (14 query heads / 2 KV heads, head_dim=64) with attention sinks | |
| | FFN | Sparse Mixture-of-Experts: 128 experts, top-4 routing, SwiGLU | |
| | Position encoding | YaRN-scaled RoPE (`rope_theta=150_000`, factor=32) | |
| | Context length | 131,072 tokens (initial 4,096) | |
| | Tokenizer | `o200k_base` (tiktoken), vocab 200,064 | |
| | Output head | Linear(640 → 221) with bias | |
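
As a rough illustration of the "128 experts, top-4 routing" row, the NumPy sketch below picks four experts per token from router logits and normalizes their weights. The random router weights and the choice to apply softmax over only the selected experts are assumptions; the actual model code may normalize differently.

```python
import numpy as np


def top_k_route(router_logits: np.ndarray, k: int = 4):
    """Return (expert indices, mixing weights) per token, softmaxed over the k selected experts."""
    idx = np.argsort(router_logits, axis=-1)[..., -k:]      # ids of the k largest logits
    top = np.take_along_axis(router_logits, idx, axis=-1)
    top = top - top.max(axis=-1, keepdims=True)             # numerical stability
    weights = np.exp(top)
    weights /= weights.sum(axis=-1, keepdims=True)
    return idx, weights


rng = np.random.default_rng(0)
hidden = rng.normal(size=(3, 640))            # 3 tokens, hidden size 640
gate = rng.normal(size=(640, 128)) * 0.02     # hypothetical router weights, 128 experts
idx, weights = top_k_route(hidden @ gate, k=4)
print(idx.shape, weights.sum(axis=-1))
```

Only the four selected experts run for each token, so the per-token compute stays small even though the total parameter count covers all 128 experts.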
|
|
| ## Quantization |
|
|
| | Field | Value | |
| | --- | --- | |
| | Bits | **8** | |
| | Group size | **64** | |
| | Mode | **affine** (MLX `mx.quantize`, weight-only) | |
| | Quantized modules | `embedding`, attention `qkv` & `out`, MoE `gate`, expert `swiglu` & `out`, `unembedding` | |
| | Non-quantized modules | RMSNorms, attention sinks (kept in BF16) | |
|
|
| Expert tensors are stored in MLX's packed transposed layout and run through |
| `mx.gather_qmm` at inference time. RMSNorm scales and attention sinks |
| remain BF16 because their parameter count is negligible relative to the |
| rest of the model. |
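
The idea behind group-wise affine quantization can be sketched in NumPy. This is an illustration of the general technique, not MLX's kernel or packed storage layout: each group of 64 weights gets its own scale and bias, and the sketch assumes both are stored as 16-bit values when estimating storage.

```python
import numpy as np


def affine_quantize(w: np.ndarray, bits: int = 8, group_size: int = 64):
    """Quantize each group of `group_size` values to integer codes plus a scale and bias."""
    groups = w.reshape(-1, group_size).astype(np.float32)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    levels = 2**bits - 1
    scale = np.maximum((hi - lo) / levels, 1e-12)  # guard against constant groups
    codes = np.clip(np.round((groups - lo) / scale), 0, levels).astype(np.uint8)  # bits <= 8
    return codes, scale, lo


def affine_dequantize(codes, scale, bias, shape):
    """Reconstruct approximate weights: w ~ scale * code + bias."""
    return (codes.astype(np.float32) * scale + bias).reshape(shape)


rng = np.random.default_rng(0)
w = rng.normal(size=(640, 640)).astype(np.float32)
codes, scale, bias = affine_quantize(w)
w_hat = affine_dequantize(codes, scale, bias, w.shape)
max_err = float(np.abs(w - w_hat).max())

# Storage estimate: 8-bit codes plus one 16-bit scale and one 16-bit bias per 64 weights.
bits_per_weight = 8 + (16 + 16) / 64
print(f"max abs error: {max_err:.5f}, bits per weight: {bits_per_weight}")
```

Under these assumptions each weight costs 8.5 bits, about 53% of a 16-bit weight, which is consistent with the file shrinking from roughly 2.6 GB to 1.4 GB. The rounding error per weight is bounded by half the group's scale.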
|
|
| ## File set |
|
|
| | File | Size | Purpose | |
| | --- | --- | --- | |
| | `weights.safetensors` | 1.4 GB | Q8 packed weights + scales/biases (uint32 packed for quantized modules, BF16 for norms/sinks) | |
| | `config.json` | 20 KB | Model + MLX runtime config (with `_mlx_quantization` block) | |
| | `id2label.json` | 5.4 KB | Numeric ID → BIOES label string | |
| | `openmed-mlx.json` | 0.8 KB | OpenMed MLX manifest with `quantization: {bits: 8, group_size: 64, mode: affine}` | |
| | `tokenizer.json`, `tokenizer_config.json` | 27 MB | Source tokenizer files (kept for reference) | |
|
|
| The MLX runtime uses `tiktoken` `o200k_base` directly for tokenization; |
| the `tokenizer.json` is kept so consumers can inspect or re-tokenize via |
| `transformers` if desired. |
|
|
| ## Quick start |
|
|
| ### With [OpenMed](https://github.com/maziyarpanahi/openmed) (recommended) |
|
|
| OpenMed gives you a single `extract_pii()` / `deidentify()` API that |
| auto-selects MLX on Apple Silicon and PyTorch elsewhere, so the same code |
| runs on every host. |
|
|
| ```bash |
| pip install -U "openmed[mlx]" |
| ``` |
|
|
| ```python |
| from openmed import extract_pii, deidentify |
| |
| text = ( |
|     "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, " |
|     "phone 415-555-0123, email sarah.johnson@example.com." |
| ) |
| |
| # Extract grouped entity spans (runs on MLX 8-bit here, PyTorch fallback elsewhere) |
| result = extract_pii(text, model_name="OpenMed/privacy-filter-nemotron-mlx-8bit") |
| for ent in result.entities: |
|     print(f"{ent.label:30s} {ent.text!r} conf={ent.confidence:.2f}") |
| |
| # De-identify by masking each detected span |
| masked = deidentify( |
|     text, |
|     method="mask", |
|     model_name="OpenMed/privacy-filter-nemotron-mlx-8bit", |
| ) |
| # De-identify by replacing each span with a surrogate value |
| fake = deidentify( |
|     text, |
|     method="replace", |
|     model_name="OpenMed/privacy-filter-nemotron-mlx-8bit", |
|     consistent=True, |
|     seed=42,  # deterministic locale-aware Faker surrogates |
| ) |
| ``` |
|
|
| When MLX isn't available (Linux, Windows, Intel Mac, or a missing `mlx` package), |
| the same call automatically falls back to the PyTorch checkpoint |
| [`OpenMed/privacy-filter-nemotron`](https://huggingface.co/OpenMed/privacy-filter-nemotron) |
| with a one-time warning. The fallback is family-aware: a Nemotron MLX request is |
| never served by the unrelated `openai/privacy-filter` baseline. |
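
The fallback behaviour described above could be approximated as follows. This is a hypothetical sketch, not OpenMed's actual implementation: `mlx_available`, `resolve_model`, and the `FALLBACKS` table are names invented for illustration, and the platform check is only a heuristic.

```python
import importlib.util
import platform
import warnings

# Hypothetical mapping from MLX repos to their PyTorch sibling in the same family.
FALLBACKS = {
    "OpenMed/privacy-filter-nemotron-mlx-8bit": "OpenMed/privacy-filter-nemotron",
    "OpenMed/privacy-filter-nemotron-mlx": "OpenMed/privacy-filter-nemotron",
}


def mlx_available(system=None, machine=None, has_mlx=None) -> bool:
    """Heuristic: MLX needs macOS on arm64 plus an importable `mlx` package."""
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    if has_mlx is None:
        has_mlx = importlib.util.find_spec("mlx") is not None
    return system == "Darwin" and machine == "arm64" and has_mlx


def resolve_model(requested: str, use_mlx=None) -> str:
    """Return the requested repo when MLX is usable, else its same-family PyTorch sibling."""
    if use_mlx is None:
        use_mlx = mlx_available()
    if use_mlx or requested not in FALLBACKS:
        return requested
    warnings.warn(f"MLX unavailable; falling back to {FALLBACKS[requested]}")
    return FALLBACKS[requested]


print(resolve_model("OpenMed/privacy-filter-nemotron-mlx-8bit", use_mlx=False))
# OpenMed/privacy-filter-nemotron
```

Keeping the mapping explicit per family is one way to guarantee that a Nemotron request can only ever resolve to a Nemotron checkpoint.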
|
|
| ### Direct MLX usage (lower-level) |
|
|
| ```python |
| from huggingface_hub import snapshot_download |
| from openmed.mlx.inference import PrivacyFilterMLXPipeline |
| |
| model_path = snapshot_download("OpenMed/privacy-filter-nemotron-mlx-8bit") |
| pipe = PrivacyFilterMLXPipeline(model_path) |
| |
| print(pipe("Email me at alice.smith@example.com after 5pm.")) |
| # [{'entity_group': 'email', |
| #   'score': 0.92, |
| #   'word': 'alice.smith@example.com', |
| #   'start': 12, |
| #   'end': 35}] |
| ``` |
|
|
| The pipeline returns a list of dicts with `entity_group`, `score`, `word`, |
| `start`, and `end` (character offsets into the input string). |
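
Because `start` and `end` are character offsets, the returned spans can be used directly to redact the input. The helper below is a small sketch (the name `mask_entities` and the `[LABEL]` placeholder format are assumptions, not OpenMed's `deidentify` output); it assumes the spans do not overlap.

```python
def mask_entities(text: str, entities: list) -> str:
    """Replace each span with a [LABEL] placeholder, right to left so offsets stay valid."""
    out = text
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group'].upper()}]"
        out = out[: ent["start"]] + placeholder + out[ent["end"] :]
    return out


text = "Email me at alice.smith@example.com after 5pm."
entities = [
    {
        "entity_group": "email",
        "score": 0.92,
        "word": "alice.smith@example.com",
        "start": 12,
        "end": 35,
    }
]
print(mask_entities(text, entities))
# Email me at [EMAIL] after 5pm.
```

Processing spans from the end of the string backwards avoids recomputing offsets after each replacement changes the string length.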
|
|
| ### Loading from a local snapshot |
|
|
| ```python |
| from openmed.mlx.models import load_model |
| import mlx.core as mx |
| |
| model = load_model("/path/to/privacy-filter-nemotron-mlx-8bit") |
| ids = mx.array([[1, 100, 200, 300]], dtype=mx.int32) |
| mask = mx.ones((1, 4), dtype=mx.bool_) |
| logits = model(ids, attention_mask=mask) # shape (1, 4, 221) |
| ``` |
|
|
| ## Hardware notes |
|
|
| - Designed for Apple Silicon (M-series GPUs); CPU inference works but is slower. |
| - Tested on macOS with `mlx>=0.18`. |
| - Q8 inference is ~1.7× faster than the BF16 sibling on the same hardware |
| while preserving 100% argmax agreement on the test set. |
|
|
| ## Credits & Acknowledgements |
|
|
| This model wouldn't exist without two open-source releases. Sincere |
| thanks to both teams: |
|
|
| - **OpenAI** for [open-sourcing the Privacy Filter](https://huggingface.co/openai/privacy-filter) |
| (architecture, modeling code, and `opf` training/eval CLI). The 8-bit |
| MLX port in this repo runs that same architecture under Apple's MLX |
| framework with affine weight-only quantization. |
| - **NVIDIA** for releasing the [Nemotron-PII dataset](https://huggingface.co/datasets/nvidia/Nemotron-PII) |
| used to fine-tune the source PyTorch checkpoint. |
|
|
| Additional thanks to **Apple** for [MLX](https://github.com/ml-explore/mlx) |
| and the **Hugging Face** team for the model-distribution ecosystem. |
|
|
| ## License |
|
|
| Apache 2.0 (matches the source checkpoint). |
|
|