	---
	license: apache-2.0
	library_name: transformers
	base_model: openai/privacy-filter
	datasets:
	- nvidia/Nemotron-PII
	pipeline_tag: token-classification
	tags:
	- token-classification
	- pii
	- ner
	- privacy
	- redaction
	- nemotron
	- privacy-filter
	- openmed
	language:
	- en
	---

	# privacy-filter-nemotron

	Fine-tuned [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)
	for fine-grained PII extraction across 55 categories from
	[`nvidia/Nemotron-PII`](https://huggingface.co/datasets/nvidia/Nemotron-PII).

	- Base model: [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) — 1.4B-parameter MoE (50M active per token), BIOES token-classification head
	- Task: Token classification for PII detection (BIOES scheme)
	- Training data: Full 100K rows of `nvidia/Nemotron-PII` train split
	- Held-out val: 10K label-stratified rows from the Nemotron `test` split (every label has ≥229 entities)
	- Recipe: `opf train` (OpenAI's official fine-tuning CLI) — full fine-tune, AdamW, lr=1e-4, 5 epochs, bf16, weight decay 0.0
	- Labels: 55 fine-grained PII categories → 221 BIOES classes (1 `O` + 55 × B/I/E/S)
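
The 55 → 221 class arithmetic in the last bullet can be sketched as below. The three category names are a small subset picked for illustration, and the label order shown here is an assumption; the authoritative mapping is the `id2label` table in the checkpoint's config.

```python
# Illustrative subset of the 55 category names.
categories = ["first_name", "last_name", "ssn"]

def build_bioes_labels(categories):
    """Return ["O", "B-x", "I-x", "E-x", "S-x", ...] for the given categories."""
    labels = ["O"]
    for cat in categories:
        # One Begin / Inside / End / Single tag per category.
        labels.extend(f"{prefix}-{cat}" for prefix in "BIES")
    return labels

labels = build_bioes_labels(categories)
print(len(labels))  # 1 + 3 * 4 = 13

# The same formula with all 55 categories gives the 221 classes quoted above.
assert 1 + 55 * 4 == 221
```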

	The base model ships with 8 coarse PII categories (`private_person`,
	`private_email`, etc.). This model trades that coarse vocabulary for a
	roughly 7× more granular one (55 labels instead of 8), with `first_name`,
	`last_name`, `medical_record_number`, `credit_debit_card`, `ssn`, and so on,
	matching what downstream redaction and masking pipelines typically need.

	> Family at a glance. Same architecture, three runtimes:
	> - PyTorch (this repo) — CPU + CUDA, anywhere transformers runs.
	> - MLX BF16 — [`OpenMed/privacy-filter-nemotron-mlx`](https://huggingface.co/OpenMed/privacy-filter-nemotron-mlx) — Apple Silicon, full precision.
	> - MLX 8-bit — [`OpenMed/privacy-filter-nemotron-mlx-8bit`](https://huggingface.co/OpenMed/privacy-filter-nemotron-mlx-8bit) — Apple Silicon, ~1.7× faster.

	## Quick start

	### With [OpenMed](https://github.com/maziyarpanahi/openmed) — recommended

	OpenMed gives you `extract_pii()` / `deidentify()` with built-in BIOES Viterbi
	decoding, span refinement, and a Faker-backed obfuscation engine. Same call
	on every host — Apple Silicon picks up MLX automatically; everywhere else uses
	this PyTorch checkpoint.

	```bash
	pip install -U "openmed[hf]"
	```

	```python
	from openmed import extract_pii, deidentify

	text = (
	    "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
	    "phone 415-555-0123, email sarah.johnson@example.com."
	)

	# Extract grouped entity spans
	result = extract_pii(text, model_name="OpenMed/privacy-filter-nemotron")
	for ent in result.entities:
	    print(f"{ent.label:30s} {ent.text!r} conf={ent.confidence:.2f}")

	# De-identify with any of the supported methods
	masked = deidentify(text, method="mask", model_name="OpenMed/privacy-filter-nemotron")
	removed = deidentify(text, method="remove", model_name="OpenMed/privacy-filter-nemotron")
	hashed = deidentify(text, method="hash", model_name="OpenMed/privacy-filter-nemotron")

	# Faker-backed locale-aware obfuscation, deterministic with consistent=True+seed
	fake = deidentify(
	    text,
	    method="replace",
	    model_name="OpenMed/privacy-filter-nemotron",
	    consistent=True,
	    seed=42,
	)
	print(fake.deidentified_text)
	```

	`OpenMed/privacy-filter-nemotron-mlx*` model names also work in the same
	`extract_pii()` / `deidentify()` calls — on a non-Apple-Silicon host they
	automatically fall back to this PyTorch checkpoint with a one-time
	warning. So you can ship MLX names in code and still run on Linux/Windows.

	The OpenMed wrapper passes `trust_remote_code=True` for you, runs the
	model's own BIOES Viterbi decoder, and skips OpenMed's regex
	smart-merging (the model already produces clean spans).

	### With `opf` — OpenAI's official CLI

	```bash
	pip install 'opf @ git+https://github.com/openai/privacy-filter.git'

	opf redact \
	  --checkpoint OpenMed/privacy-filter-nemotron \
	  --text "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, phone 415-555-0123."
	```

	### With `transformers` directly

	```python
	import torch
	from transformers import AutoModelForTokenClassification, AutoTokenizer

	model_id = "OpenMed/privacy-filter-nemotron"
	# Use the GPU when one is available; the checkpoint also runs on CPU.
	device = "cuda" if torch.cuda.is_available() else "cpu"

	tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
	model = AutoModelForTokenClassification.from_pretrained(
	    model_id, trust_remote_code=True, dtype=torch.bfloat16
	).to(device)
	model.eval()

	text = "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, phone 415-555-0123."
	enc = tok(text, return_tensors="pt").to(device)
	with torch.no_grad():
	    # Greedy decoding: take the highest-scoring class for each token.
	    pred_ids = model(**enc).logits.argmax(-1).cpu()[0].tolist()

	id2label = {int(k): v for k, v in model.config.id2label.items()}
	tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].cpu().tolist())
	for token, pred_id in zip(tokens, pred_ids):
	    label = id2label[pred_id]
	    if label != "O":  # skip tokens that are not PII
	        print(f"{token}\t{label}")
	```
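
The loop above prints one tag per token. To turn those tags into entity spans, the BIOES prefixes can be grouped as in the minimal sketch below. `bioes_to_spans` is a hypothetical helper, not part of `transformers`, `opf`, or OpenMed; it returns token indices, and mapping them to character offsets would additionally need the tokenizer's offset information.

```python
def bioes_to_spans(tags):
    """Group per-token BIOES tags into (label, start, end) spans, end exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            # Single-token entity.
            spans.append((label, i, i + 1))
            start, current = None, None
        elif prefix == "B":
            start, current = i, label
        elif prefix == "I" and current == label:
            continue  # still inside the open span
        elif prefix == "E" and current == label:
            spans.append((label, start, i + 1))
            start, current = None, None
        else:
            # "O" or an inconsistent tag: drop any open span.
            start, current = None, None
    return spans

tags = ["O", "B-first_name", "E-first_name", "O", "S-ssn", "I-city"]
print(bioes_to_spans(tags))
# [('first_name', 1, 3), ('ssn', 4, 5)]
```

The stray `I-city` at the end has no opening `B-city`, so this sketch discards it. That is one simple policy for inconsistent argmax output; other pipelines may choose to repair such tags instead.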

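	For best results, use Viterbi decoding rather than argmax; both `opf` and
	OpenMed do this by default. Plain argmax through the HF transformers API
	decodes each token independently, so it can emit invalid tag sequences (for
	example an `E-` tag with no preceding `B-`). Expect slightly more boundary
	errors, but label accuracy remains excellent.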
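
To illustrate why constrained decoding helps at span boundaries, here is a minimal pure-Python sketch of Viterbi decoding with hard BIOES transition rules. It is not the decoder used by `opf` or OpenMed, whose details (for example learned transition scores) may differ; `allowed` and `viterbi_bioes` are hypothetical names, and the label list is assumed to contain the full B/I/E/S set for every category.

```python
NEG_INF = float("-inf")

def allowed(prev, nxt):
    """BIOES rule: inside a span (B/I) only I/E of the same label may follow."""
    p_pre, _, p_lab = prev.partition("-")
    n_pre, _, n_lab = nxt.partition("-")
    if p_pre in ("B", "I"):
        return n_pre in ("I", "E") and n_lab == p_lab
    # After O, E, or S the next tag must be O or open a new span (B/S).
    return n_pre in ("O", "B", "S")

def viterbi_bioes(logits, labels):
    """Return the highest-scoring tag sequence that respects BIOES transitions."""
    n = len(labels)
    prefix = [label.partition("-")[0] for label in labels]
    # A sequence may only start with O, B, or S.
    score = [logits[0][j] if prefix[j] in ("O", "B", "S") else NEG_INF for j in range(n)]
    back = []
    for row in logits[1:]:
        new_score, pointers = [], []
        for j in range(n):
            # Best allowed predecessor i for label j.
            best_i = max(
                (i for i in range(n) if allowed(labels[i], labels[j])),
                key=lambda i: score[i],
            )
            new_score.append(score[best_i] + row[j])
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # A sequence may only end with O, E, or S.
    last = max((j for j in range(n) if prefix[j] in ("O", "E", "S")), key=lambda j: score[j])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return [labels[j] for j in reversed(path)]

labels = ["O", "B-name", "I-name", "E-name", "S-name"]
logits = [
    [0.0, 2.0, 0.0, 0.0, 0.0],  # token 0 favours B-name
    [1.0, 0.0, 0.9, 0.0, 0.0],  # token 1: O narrowly beats I-name
    [0.0, 0.0, 0.0, 2.0, 0.0],  # token 2 favours E-name
]
argmax_tags = [labels[max(range(len(row)), key=row.__getitem__)] for row in logits]
print(argmax_tags)                    # ['B-name', 'O', 'E-name'], not a valid BIOES sequence
print(viterbi_bioes(logits, labels))  # ['B-name', 'I-name', 'E-name']
```

In practice the `logits` argument would come from something like `model(**enc).logits[0].float().tolist()`.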

	## Performance

	Evaluated with `opf eval --decode-mode viterbi --eval-mode typed --span-metrics-space char`
	on the 10K label-stratified held-out val from `nvidia/Nemotron-PII:test`.

	### Headline

	| Metric | Value |
	|---|---:|
	| Macro B-F1 (across 55 labels) | 0.9533 |
	| Token accuracy | 0.9910 |
	| Strong labels (F1 ≥ 0.90) | 46 / 55 |
	| Acceptable (F1 0.70–0.89) | 7 / 55 |
	| Weak (F1 < 0.70) | 0 / 55 |

	### Per-label F1 (B-tag, sorted)

	| Label | Precision | Recall | F1 |
	|---|---:|---:|---:|
	| 🟢 `mac_address` | 1.000 | 1.000 | 1.000 |
	| 🟢 `biometric_identifier` | 0.999 | 0.998 | 0.999 |
	| 🟢 `bank_routing_number` | 0.995 | 0.999 | 0.997 |
	| 🟢 `credit_debit_card` | 0.999 | 0.993 | 0.996 |
	| 🟢 `ipv6` | 0.992 | 1.000 | 0.996 |
	| 🟢 `health_plan_beneficiary_number` | 1.000 | 0.990 | 0.995 |
	| 🟢 `coordinate` | 0.994 | 0.996 | 0.995 |
	| 🟢 `ipv4` | 0.993 | 0.996 | 0.994 |
	| 🟢 `url` | 0.989 | 0.999 | 0.994 |
	| 🟢 `email` | 0.994 | 0.993 | 0.994 |
	| 🟢 `date_of_birth` | 0.992 | 0.994 | 0.993 |
	| 🟢 `medical_record_number` | 0.997 | 0.989 | 0.993 |
	| 🟢 `street_address` | 0.996 | 0.989 | 0.993 |
	| 🟢 `vehicle_identifier` | 0.986 | 0.996 | 0.991 |
	| 🟢 `license_plate` | 0.987 | 0.993 | 0.990 |
	| 🟢 `customer_id` | 0.995 | 0.984 | 0.990 |
	| 🟢 `http_cookie` | 0.992 | 0.983 | 0.988 |
	| 🟢 `employee_id` | 0.987 | 0.988 | 0.988 |
	| 🟢 `account_number` | 0.992 | 0.982 | 0.987 |
	| 🟢 `certificate_license_number` | 0.989 | 0.984 | 0.987 |
	| 🟢 `swift_bic` | 0.975 | 0.998 | 0.987 |
	| 🟢 `postcode` | 0.991 | 0.981 | 0.986 |
	| 🟢 `api_key` | 0.980 | 0.990 | 0.985 |
	| 🟢 `password` | 0.999 | 0.968 | 0.983 |
	| 🟢 `tax_id` | 1.000 | 0.965 | 0.982 |
	| 🟢 `device_identifier` | 0.974 | 0.988 | 0.981 |
	| 🟢 `national_id` | 0.991 | 0.961 | 0.976 |
	| 🟢 `last_name` | 0.977 | 0.975 | 0.976 |
	| 🟢 `date_time` | 0.982 | 0.967 | 0.974 |
	| 🟢 `first_name` | 0.962 | 0.978 | 0.970 |
	| 🟢 `pin` | 0.973 | 0.967 | 0.970 |
	| 🟢 `phone_number` | 0.948 | 0.992 | 0.970 |
	| 🟢 `county` | 0.986 | 0.946 | 0.965 |
	| 🟢 `employment_status` | 0.960 | 0.968 | 0.964 |
	| 🟢 `user_name` | 0.959 | 0.964 | 0.961 |
	| 🟢 `date` | 0.967 | 0.955 | 0.961 |
	| 🟢 `blood_type` | 0.922 | 0.954 | 0.938 |
	| 🟢 `country` | 0.955 | 0.918 | 0.936 |
	| 🟢 `ssn` | 0.926 | 0.945 | 0.935 |
	| 🟢 `education_level` | 0.961 | 0.908 | 0.934 |
	| 🟢 `sexuality` | 0.908 | 0.956 | 0.931 |
	| 🟢 `company_name` | 0.967 | 0.894 | 0.929 |
	| 🟢 `religious_belief` | 0.912 | 0.941 | 0.926 |
	| 🟢 `unique_id` | 0.910 | 0.922 | 0.916 |
	| 🟢 `political_view` | 0.939 | 0.872 | 0.905 |
	| 🟢 `fax_number` | 0.978 | 0.841 | 0.904 |
	| 🟡 `city` | 0.917 | 0.876 | 0.896 |
	| 🟡 `time` | 0.933 | 0.802 | 0.863 |
	| 🟡 `race_ethnicity` | 0.821 | 0.906 | 0.861 |
	| 🟡 `gender` | 0.967 | 0.744 | 0.841 |
	| 🟡 `state` | 0.878 | 0.785 | 0.829 |
	| 🟡 `language` | 0.889 | 0.735 | 0.804 |
	| 🟡 `occupation` | 0.799 | 0.667 | 0.727 |
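
For readers who want to relate the columns, the sketch below recomputes F1 from precision and recall for three rows copied from the table and assigns the strong / acceptable / weak buckets used in the headline table. `f1` and `bucket` are hypothetical helpers, and the small macro average at the end covers only these three rows, so it is not the model's macro F1.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def bucket(score):
    """Thresholds from the headline table."""
    if score >= 0.90:
        return "strong"
    if score >= 0.70:
        return "acceptable"
    return "weak"

# (precision, recall) pairs copied from three rows of the table above.
rows = {
    "mac_address": (1.000, 1.000),
    "fax_number": (0.978, 0.841),
    "occupation": (0.799, 0.667),
}

scores = {label: f1(p, r) for label, (p, r) in rows.items()}
for label, score in scores.items():
    print(f"{label:15s} F1={score:.3f} {bucket(score)}")

# Macro averaging weights every label equally, regardless of entity count.
macro_subset = sum(scores.values()) / len(scores)
```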

	## Label space (55 categories)

	| Category | Typical examples |
	|---|---|
	| Identity | `first_name`, `last_name`, `user_name`, `age`, `gender`, `race_ethnicity`, `sexuality`, `religious_belief`, `political_view`, `marital_status`, `nationality`, `education_level`, `occupation`, `employment_status`, `language`, `blood_type`, `biometric_identifier` |
	| Contact | `email`, `phone_number`, `fax_number`, `url` |
	| Address | `street_address`, `city`, `county`, `state`, `country`, `postcode`, `coordinate` |
	| Dates | `date`, `date_of_birth`, `date_time`, `time` |
	| Government IDs | `ssn`, `national_id`, `tax_id` |
	| Financial | `account_number`, `bank_routing_number`, `swift_bic`, `credit_debit_card`, `cvv`, `pin`, `password` |
	| Healthcare | `medical_record_number`, `health_plan_beneficiary_number` |
	| Enterprise IDs | `customer_id`, `employee_id`, `unique_id`, `certificate_license_number` |
	| Vehicle | `license_plate`, `vehicle_identifier` |
	| Digital | `ipv4`, `ipv6`, `mac_address`, `device_identifier`, `api_key`, `http_cookie` |


	Head initialization: `opf`'s default "copy-from-matching-base" head init.
	Of the 221 new BIOES classes, 5 had exact matches in the base
	(`O`, `B/I/E/S-account_number`); the other 216 were copied from
	semantically-adjacent coarse rows and fine-tuned end-to-end.
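
The idea behind this kind of head initialization can be sketched as follows. This is not `opf`'s implementation: `init_head`, the `fallback` mapping, and the `B-private_person` style of base label names are assumptions made for illustration, and plain lists stand in for the classifier weight rows.

```python
def init_head(base_rows, base_labels, new_labels, fallback):
    """Build new classifier rows by copying rows from the base head.

    base_rows: one weight vector per base label.
    fallback:  hypothetical map from a new category to a related base category.
    """
    base_index = {label: i for i, label in enumerate(base_labels)}
    new_rows = []
    for label in new_labels:
        if label in base_index:
            source = label  # exact match, e.g. "O"
        else:
            # Reuse the row of a related coarse label with the same B/I/E/S prefix.
            prefix, _, category = label.partition("-")
            source = f"{prefix}-{fallback[category]}"
        new_rows.append(list(base_rows[base_index[source]]))  # copy, do not alias
    return new_rows

base_labels = ["O", "B-private_person", "S-private_person"]
base_rows = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
new_labels = ["O", "B-first_name", "S-first_name", "B-last_name"]
fallback = {"first_name": "private_person", "last_name": "private_person"}

new_rows = init_head(base_rows, base_labels, new_labels, fallback)
print(new_rows)
```

Rows that start as copies of the same coarse row are identical at step zero; fine-tuning is what separates them, which is why the head is trained end-to-end afterwards.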

	Router: base model has 128 MoE experts per layer with top-4 routing.
	Routers were kept trainable during full fine-tuning; no collapse was
	observed.
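
As a rough illustration of top-4 routing over 128 experts, the sketch below selects the four highest-scoring experts for one token and normalizes their weights. The softmax over only the selected logits is an assumption; the base model's router may normalize differently, and `route_top_k` is a hypothetical name.

```python
import math
import random

def route_top_k(router_logits, k=4):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    top = sorted(range(len(router_logits)), key=lambda e: router_logits[e], reverse=True)[:k]
    peak = max(router_logits[e] for e in top)  # subtract the max for numerical stability
    exp = [math.exp(router_logits[e] - peak) for e in top]
    total = sum(exp)
    return [(e, w / total) for e, w in zip(top, exp)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(128)]  # made-up router scores for one token
chosen = route_top_k(logits)
for expert, weight in chosen:
    print(f"expert {expert:3d} weight {weight:.3f}")
```

Router collapse would show up as the same few expert indices being chosen for nearly every token, so counting how often each expert is selected over a batch is one simple way to monitor it.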

	## Limitations & intended use

	- English-only training data. Nemotron-PII is predominantly English
	with a 50/50 US/international locale split. Performance on non-English
	text is not guaranteed.
	- **`occupation`, `language`, `gender`, `state`, `race_ethnicity`,
	`political_view`, `education_level` are fuzzier categories** than the
	strict identifiers: their F1 ranges from 0.73 to 0.93 in the table above,
	vs. 0.95+ for most formatted identifiers. If your downstream pipeline only
	cares about strict PII, you can ignore low-confidence predictions on these.
	- Synthetic training data. Nemotron-PII is a synthesized dataset; real
	clinical notes, legal documents, and web text may show different
	surface forms. For high-stakes deployments, collect a domain-specific
	eval set and re-calibrate thresholds.
	- Not a substitute for legal compliance review. Use alongside a
	governance layer (human review, deterministic regex pre-filters, etc.).
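
One way to act on the advice about low-confidence predictions is a per-label confidence threshold, sketched below. The `Entity` class is a hypothetical stand-in that mirrors the `label`, `text`, and `confidence` attributes used in the OpenMed example; the threshold values and the sample entities are made up and should be replaced by values calibrated on your own evaluation set.

```python
from dataclasses import dataclass

@dataclass
class Entity:  # hypothetical stand-in for a detected span
    label: str
    text: str
    confidence: float

# Made-up thresholds: stricter for the fuzzier categories.
FUZZY_THRESHOLDS = {"occupation": 0.90, "language": 0.90, "gender": 0.85}
DEFAULT_THRESHOLD = 0.50

def keep(entity):
    """Keep an entity only if it clears the threshold for its label."""
    return entity.confidence >= FUZZY_THRESHOLDS.get(entity.label, DEFAULT_THRESHOLD)

entities = [
    Entity("ssn", "123-45-6789", 0.62),
    Entity("occupation", "nurse", 0.71),
    Entity("occupation", "cardiologist", 0.97),
]
filtered = [e for e in entities if keep(e)]
print([e.text for e in filtered])  # ['123-45-6789', 'cardiologist']
```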

	## Credits & Acknowledgements

	This model wouldn't exist without two open-source releases — sincere thanks
	to both teams:

	- OpenAI for [open-sourcing the Privacy Filter](https://huggingface.co/openai/privacy-filter)
	(architecture, modeling code, and `opf` training/eval CLI). Everything in
	this repo is a fine-tune on top of that release.
	- NVIDIA for releasing the [Nemotron-PII dataset](https://huggingface.co/datasets/nvidia/Nemotron-PII)
	with its 100K-row train split and 55 fine-grained PII labels.

	Additional thanks to the HuggingFace team for the `transformers` /
	`huggingface_hub` ecosystem this model ships through.

	## License

	Apache 2.0, same as the base model.

	## Citation

	If you use this model, please cite this model, the organization behind
	it (OpenMed), and the upstream base model + dataset:

	```bibtex
	@misc{openmed_privacy_filter_nemotron_2026,
	author = {OpenMed},
	title = {{OpenMed/privacy-filter-nemotron}: fine-grained PII extraction with 55 categories},
	year = {2026},
	publisher = {Hugging Face},
	howpublished = {\url{https://huggingface.co/OpenMed/privacy-filter-nemotron}}
	}

	@misc{openmed_2026,
	author = {OpenMed},
	title = {{OpenMed}: open models and resources for healthcare NLP},
	year = {2026},
	publisher = {Hugging Face},
	howpublished = {\url{https://huggingface.co/OpenMed}}
	}

	@misc{openai_privacy_filter_2025,
	author = {OpenAI},
	title = {{openai/privacy-filter}},
	year = {2025},
	publisher = {Hugging Face},
	howpublished = {\url{https://huggingface.co/openai/privacy-filter}}
	}

	@misc{nemotron_pii_2025,
	author = {NVIDIA},
	title = {{Nemotron-PII}},
	year = {2025},
	publisher = {Hugging Face},
	howpublished = {\url{https://huggingface.co/datasets/nvidia/Nemotron-PII}}
	}
	```