	---
	license: apache-2.0
	base_model: OpenMed/privacy-filter-nemotron
	datasets:
	- nvidia/Nemotron-PII
	pipeline_tag: token-classification
	library_name: openmed
	tags:
	- openmed
	- mlx
	- apple-silicon
	- token-classification
	- pii
	- de-identification
	- medical
	- clinical
	- privacy-filter
	- nemotron
	- quantized
	- 8bit
	language:
	- en
	---

	# OpenMed Privacy Filter (Nemotron) — MLX 8-bit

	A native [MLX](https://github.com/ml-explore/mlx) port of
	[`OpenMed/privacy-filter-nemotron`](https://huggingface.co/OpenMed/privacy-filter-nemotron),
	affine-quantized to 8-bit for fast on-device PII detection on Apple
	Silicon. For the unquantized BF16 reference, see
	[`OpenMed/privacy-filter-nemotron-mlx`](https://huggingface.co/OpenMed/privacy-filter-nemotron-mlx).

	> Family at a glance. Same architecture and training data, three runtimes:
	> - PyTorch — [`OpenMed/privacy-filter-nemotron`](https://huggingface.co/OpenMed/privacy-filter-nemotron) — CPU + CUDA.
	> - MLX BF16 — [`OpenMed/privacy-filter-nemotron-mlx`](https://huggingface.co/OpenMed/privacy-filter-nemotron-mlx) — Apple Silicon, full precision (~2.6 GB).
	> - MLX 8-bit (this repo) — Apple Silicon, ~1.4 GB, ~1.7× faster than BF16.

	## Why 8-bit?

	| Metric | BF16 sibling | This repo (Q8) |
	| --- | --- | --- |
	| `weights.safetensors` size | 2.6 GB | 1.4 GB (-47%) |
	| Forward pass (10-token PII sample) | ~14 ms | ~8 ms (~1.7× faster) |
	| Argmax agreement vs. BF16 | (reference) | 100% on every test sample |
	| Entity-group preservation | (reference) | identical on every test sample |

	The numbers above come from `scripts/export/verify_privacy_filter_nemotron_mlx.py`,
	run over 10 golden PII samples (email, phone, ssn, credit card, name, ipv4,
	address, date_of_birth, url, mixed). The Q8 weights (`group_size=64`) were
	compared against BF16: the per-token argmax matched on 100% of tokens, and
	the decoded entity-group sets matched exactly on every sample.
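
The agreement metric can be sketched in a few lines of plain Python. This is an illustration only, not the contents of the verification script: `argmax_agreement` is a hypothetical helper and the logits are toy values.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def argmax_agreement(ref_logits, test_logits):
    """Fraction of token positions where both models predict the same label ID."""
    if len(ref_logits) != len(test_logits):
        raise ValueError("both models must score the same number of tokens")
    if not ref_logits:
        return 1.0
    matches = sum(argmax(r) == argmax(t) for r, t in zip(ref_logits, test_logits))
    return matches / len(ref_logits)


# Toy logits for 3 tokens over 4 labels (hypothetical values, not model output).
bf16 = [[2.0, 0.1, 0.3, 0.0], [0.2, 3.1, 0.4, 0.1], [0.0, 0.2, 0.1, 1.9]]
q8 = [[1.9, 0.2, 0.3, 0.1], [0.3, 3.0, 0.5, 0.1], [0.1, 0.2, 0.2, 1.8]]

agreement = argmax_agreement(bf16, q8)
print(f"argmax agreement: {agreement:.0%}")
```

Small numeric drift in the logits is expected after quantization; the metric only asks whether the winning label per token changes.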

	## What it does

	The model is a token classifier built on OpenAI's open Privacy Filter
	architecture (the same `openai_privacy_filter` model type used by
	[`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)).
	It tags each token with a BIOES label across 55 PII span classes, then
	a Viterbi pass over the BIOES grammar yields clean entity spans. Detected
	categories include:

	- Personal identifiers — `first_name`, `last_name`, `user_name`, `gender`, `age`, `date_of_birth`
	- Contact — `email`, `phone_number`, `fax_number`, `street_address`, `city`, `state`, `country`, `county`, `postcode`, `coordinate`
	- Government / legal IDs — `ssn`, `national_id`, `tax_id`, `certificate_license_number`
	- Financial — `account_number`, `bank_routing_number`, `credit_debit_card`, `cvv`, `pin`, `swift_bic`
	- Medical — `medical_record_number`, `health_plan_beneficiary_number`, `blood_type`
	- Workplace — `company_name`, `occupation`, `employee_id`, `customer_id`, `employment_status`, `education_level`
	- Online — `url`, `ipv4`, `ipv6`, `mac_address`, `http_cookie`, `api_key`, `password`, `device_identifier`
	- Demographic — `race_ethnicity`, `religious_belief`, `political_view`, `sexuality`, `language`
	- Vehicles — `license_plate`, `vehicle_identifier`
	- Time — `date`, `date_time`, `time`
	- Misc — `biometric_identifier`, `unique_id`
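
To illustrate why a Viterbi pass yields cleaner spans than per-token argmax, here is a minimal sketch of constrained decoding over a BIOES grammar. It is not the code used by the OpenMed runtime: `allowed` and `viterbi_bioes` are hypothetical names, the label set is cut down to one class, and the scores are made up.

```python
import math


def allowed(prev, curr):
    """BIOES grammar check: may label `curr` directly follow label `prev`?"""
    p_tag, _, p_cls = prev.partition("-")
    c_tag, _, c_cls = curr.partition("-")
    if p_tag in ("B", "I"):  # a span is open: it must continue or end
        return c_tag in ("I", "E") and c_cls == p_cls
    return c_tag in ("O", "B", "S")  # after O, E- or S-: stay outside or open a span


def viterbi_bioes(scores, labels):
    """scores[t][k] is the log-score of labels[k] at token t (at least one token).

    Returns the highest-scoring label sequence that obeys the BIOES grammar.
    """
    tags = [label.partition("-")[0] for label in labels]
    n = len(labels)
    # A sequence cannot start in the middle of a span (I-, E-) ...
    best = [s if tags[k] in ("O", "B", "S") else -math.inf
            for k, s in enumerate(scores[0])]
    backpointers = []
    for row in scores[1:]:
        new_best, pointers = [], []
        for k in range(n):
            prev_score, j = max(
                (best[j], j) for j in range(n) if allowed(labels[j], labels[k])
            )
            new_best.append(prev_score + row[k])
            pointers.append(j)
        best = new_best
        backpointers.append(pointers)
    # ... and cannot end with a span still open (B-, I-).
    last = max((best[k], k) for k in range(n) if tags[k] in ("O", "E", "S"))[1]
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return [labels[k] for k in reversed(path)]


labels = ["O", "B-email", "I-email", "E-email", "S-email"]
scores = [  # hypothetical log-scores for three tokens
    [-2.0, -0.1, -5.0, -5.0, -3.0],
    [-0.6, -5.0, -0.8, -5.0, -5.0],
    [-3.0, -5.0, -5.0, -0.1, -4.0],
]
greedy = [labels[max(range(len(row)), key=row.__getitem__)] for row in scores]
decoded = viterbi_bioes(scores, labels)
print("greedy :", greedy)
print("viterbi:", decoded)
```

In this toy input the per-token argmax picks `O` for the middle token, which leaves a `B-` with no matching `E-`. The constrained decoder instead returns a single well-formed `B-`/`I-`/`E-` span.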

	<details>
	<summary>Full label schema (221 labels)</summary>

	The output space is `O` plus `B-`, `I-`, `E-`, `S-` for each of the 55
	span classes (4 × 55 + 1 = 221). The runtime `PrivacyFilterMLXPipeline`
	runs Viterbi over this BIOES grammar, so the consumer sees clean grouped
	entities rather than raw token tags.

	The full `id2label.json` is shipped alongside the weights in this repo.
	</details>
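
The label arithmetic can be checked with a short sketch. The class names are the ones listed above; the ordering of IDs and the exact label strings (for example `B-email`) are assumptions made for illustration, so treat the shipped `id2label.json` as authoritative.

```python
SPAN_CLASSES = [
    # Personal identifiers
    "first_name", "last_name", "user_name", "gender", "age", "date_of_birth",
    # Contact
    "email", "phone_number", "fax_number", "street_address", "city", "state",
    "country", "county", "postcode", "coordinate",
    # Government / legal IDs
    "ssn", "national_id", "tax_id", "certificate_license_number",
    # Financial
    "account_number", "bank_routing_number", "credit_debit_card", "cvv", "pin",
    "swift_bic",
    # Medical
    "medical_record_number", "health_plan_beneficiary_number", "blood_type",
    # Workplace
    "company_name", "occupation", "employee_id", "customer_id",
    "employment_status", "education_level",
    # Online
    "url", "ipv4", "ipv6", "mac_address", "http_cookie", "api_key", "password",
    "device_identifier",
    # Demographic
    "race_ethnicity", "religious_belief", "political_view", "sexuality", "language",
    # Vehicles
    "license_plate", "vehicle_identifier",
    # Time
    "date", "date_time", "time",
    # Misc
    "biometric_identifier", "unique_id",
]


def build_labels(span_classes):
    """Return "O" followed by B-/I-/E-/S- variants of every span class."""
    labels = ["O"]
    for cls in span_classes:
        labels.extend(f"{prefix}-{cls}" for prefix in "BIES")
    return labels


labels = build_labels(SPAN_CLASSES)
# Hypothetical ID assignment; the real mapping lives in id2label.json.
id2label = dict(enumerate(labels))
print(len(SPAN_CLASSES), "classes ->", len(labels), "labels")
```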

	For per-label accuracy, training recipe, and dataset details, see the
	[base PyTorch checkpoint](https://huggingface.co/OpenMed/privacy-filter-nemotron).

	## Architecture

	| Field | Value |
	| --- | --- |
	| Source model type | `openai_privacy_filter` |
	| Source architecture | `OpenAIPrivacyFilterForTokenClassification` |
	| Hidden size | 640 |
	| Transformer layers | 8 |
	| Attention | Grouped-Query (14 query heads / 2 KV heads, head_dim=64) with attention sinks |
	| FFN | Sparse Mixture-of-Experts: 128 experts, top-4 routing, SwiGLU |
	| Position encoding | YARN-scaled RoPE (`rope_theta=150_000`, factor=32) |
	| Context length | 131,072 tokens (initial 4,096) |
	| Tokenizer | `o200k_base` (tiktoken), vocab 200,064 |
	| Output head | Linear(640 → 221) with bias |

	## Quantization

	| Field | Value |
	| --- | --- |
	| Bits | 8 |
	| Group size | 64 |
	| Mode | affine (MLX `mx.quantize`, weight-only) |
	| Quantized modules | `embedding`, attention `qkv` & `out`, MoE `gate`, expert `swiglu` & `out`, `unembedding` |
	| Non-quantized modules | RMSNorms, attention sinks (kept in BF16) |

	Expert tensors are stored in MLX's packed transposed layout and run through
	`mx.gather_qmm` at inference time. RMSNorm scales and attention sinks
	remain BF16 because their parameter count is negligible relative to the
	rest of the model.
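
As a rough sketch of what affine, group-wise quantization means, the pure-Python example below quantizes one group of 64 weights to 8 bits and reconstructs it. It illustrates the general min/max affine scheme rather than MLX's packed implementation. The helper names are hypothetical, and `effective_bits` assumes one 16-bit scale and one 16-bit bias per group.

```python
import math


def quantize_group(weights, bits=8):
    """Affine-quantize one group so that w ≈ scale * q + bias, q in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights from integer codes."""
    return [scale * q + bias for q in codes]


def effective_bits(bits=8, group_size=64, meta_bits=16):
    """Bits per weight, counting one scale and one bias per group (assumed sizes)."""
    return bits + 2 * meta_bits / group_size


group = [math.sin(i / 5.0) for i in range(64)]  # toy weights, one group of 64
codes, scale, bias = quantize_group(group)
restored = dequantize_group(codes, scale, bias)
max_err = max(abs(a - b) for a, b in zip(group, restored))

print(f"max reconstruction error: {max_err:.6f} (bound: {scale / 2:.6f})")
print(f"effective bits per weight: {effective_bits()}")
```

Under those assumptions each weight costs 8.5 bits instead of 16, about 53% of the BF16 footprint, which is in line with the size reduction reported above.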

	## File set

	| File | Size | Purpose |
	| --- | --- | --- |
	| `weights.safetensors` | 1.4 GB | Q8 packed weights + scales/biases (uint32 packed for quantized modules, BF16 for norms/sinks) |
	| `config.json` | 20 KB | Model + MLX runtime config (with `_mlx_quantization` block) |
	| `id2label.json` | 5.4 KB | Numeric ID → BIOES label string |
	| `openmed-mlx.json` | 0.8 KB | OpenMed MLX manifest with `quantization: {bits: 8, group_size: 64, mode: affine}` |
	| `tokenizer.json`, `tokenizer_config.json` | 27 MB | Source tokenizer files (kept for reference) |

	The MLX runtime uses `tiktoken` `o200k_base` directly for tokenization;
	the `tokenizer.json` is kept so consumers can inspect or re-tokenize via
	`transformers` if desired.

	## Quick start

	### With [OpenMed](https://github.com/maziyarpanahi/openmed) — recommended

	OpenMed gives you a single `extract_pii()` / `deidentify()` API that
	auto-selects MLX on Apple Silicon and PyTorch elsewhere — same code on
	every host.

	```bash
	pip install -U "openmed[mlx]"
	```

	```python
	from openmed import extract_pii, deidentify

	text = (
	    "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
	    "phone 415-555-0123, email sarah.johnson@example.com."
	)

	# Extract grouped entity spans (runs on MLX 8-bit here, PyTorch fallback elsewhere)
	result = extract_pii(text, model_name="OpenMed/privacy-filter-nemotron-mlx-8bit")
	for ent in result.entities:
	    print(f"{ent.label:30s} {ent.text!r} conf={ent.confidence:.2f}")

	# De-identify: mask the detected spans
	masked = deidentify(
	    text,
	    method="mask",
	    model_name="OpenMed/privacy-filter-nemotron-mlx-8bit",
	)

	# De-identify: replace the detected spans with surrogates
	fake = deidentify(
	    text,
	    method="replace",
	    model_name="OpenMed/privacy-filter-nemotron-mlx-8bit",
	    consistent=True,
	    seed=42,  # deterministic locale-aware Faker surrogates
	)
	```

	When MLX isn't available (Linux, Windows, an Intel Mac, or a missing `mlx`
	package), the same call falls back to the PyTorch checkpoint
	[`OpenMed/privacy-filter-nemotron`](https://huggingface.co/OpenMed/privacy-filter-nemotron)
	and emits a one-time warning. The fallback is family-aware: a Nemotron MLX
	request is never served by the unrelated `openai/privacy-filter` baseline.
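
If you want to know ahead of time which backend a host is likely to get, a check along the following lines may help. It is a guess at the kind of test involved, not OpenMed's actual selection logic, and `pick_backend` is a hypothetical helper that uses only the standard library.

```python
import importlib.util
import platform


def pick_backend(system=None, machine=None, has_mlx=None):
    """Return "mlx" on Apple Silicon with the `mlx` package installed, else "pytorch".

    Arguments default to values detected on the current host; pass them
    explicitly to see what another host would get.
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    if has_mlx is None:
        has_mlx = importlib.util.find_spec("mlx") is not None
    on_apple_silicon = system == "Darwin" and machine == "arm64"
    return "mlx" if on_apple_silicon and has_mlx else "pytorch"


print("this host:", pick_backend())
print("Intel Mac:", pick_backend("Darwin", "x86_64", True))
```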

	### Direct MLX usage (lower-level)

	```python
	from huggingface_hub import snapshot_download
	from openmed.mlx.inference import PrivacyFilterMLXPipeline

	model_path = snapshot_download("OpenMed/privacy-filter-nemotron-mlx-8bit")
	pipe = PrivacyFilterMLXPipeline(model_path)

	print(pipe("Email me at alice.smith@example.com after 5pm."))
	# [{'entity_group': 'email',
	#   'score': 0.92,
	#   'word': 'alice.smith@example.com',
	#   'start': 12,
	#   'end': 35}]
	```

	The pipeline returns a list of dicts with `entity_group`, `score`, `word`,
	`start`, and `end` (character offsets into the input string).
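
Because the offsets index into the original string, the output can be turned into a redacted string with a few lines of plain Python. The sketch below is a hypothetical helper (`mask_spans` is not part of OpenMed) and assumes the spans do not overlap. It reuses the example output shown above.

```python
def mask_spans(text, entities):
    """Replace each detected span with a [LABEL] placeholder."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group'].upper()}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


text = "Email me at alice.smith@example.com after 5pm."
entities = [
    {
        "entity_group": "email",
        "score": 0.92,
        "word": "alice.smith@example.com",
        "start": 12,
        "end": 35,
    }
]

# The offsets slice out exactly the reported word.
assert text[entities[0]["start"]:entities[0]["end"]] == entities[0]["word"]

masked = mask_spans(text, entities)
print(masked)  # Email me at [EMAIL] after 5pm.
```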

	### Loading from a local snapshot

	```python
	from openmed.mlx.models import load_model
	import mlx.core as mx

	model = load_model("/path/to/privacy-filter-nemotron-mlx-8bit")
	ids = mx.array([[1, 100, 200, 300]], dtype=mx.int32)  # arbitrary token IDs, for illustration
	mask = mx.ones((1, 4), dtype=mx.bool_)  # attend to all four positions
	logits = model(ids, attention_mask=mask)  # shape (1, 4, 221): one score per BIOES label
	```

	## Hardware notes

	- Designed for Apple Silicon (M-series GPUs); CPU inference works but is slower.
	- Tested on macOS with `mlx>=0.18`.
	- Q8 inference is ~1.7× faster than the BF16 sibling on the same hardware
	while preserving 100% argmax agreement on the test set.

	## Credits & Acknowledgements

	This model wouldn't exist without two open-source releases — sincere
	thanks to both teams:

	- OpenAI for [open-sourcing the Privacy Filter](https://huggingface.co/openai/privacy-filter)
	(architecture, modeling code, and `opf` training/eval CLI). The 8-bit
	MLX port in this repo runs that same architecture under Apple's MLX
	framework with affine weight-only quantization.
	- NVIDIA for releasing the [Nemotron-PII dataset](https://huggingface.co/datasets/nvidia/Nemotron-PII)
	used to fine-tune the source PyTorch checkpoint.

	Additional thanks to Apple for [MLX](https://github.com/ml-explore/mlx)
	and the Hugging Face team for the model-distribution ecosystem.

	## License

	Apache 2.0 (matches the source checkpoint).