docs: refresh README — openmed examples, credits, public release

README.md CHANGED

@@ -13,15 +13,16 @@ tags:
 - redaction
 - nemotron
 - privacy-filter
+- openmed
 language:
 - en
 ---
 
 # privacy-filter-nemotron
 
-Fine-tuned [openai/privacy-filter](https://huggingface.co/openai/privacy-filter)
+Fine-tuned [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)
 for **fine-grained PII extraction** across **55 categories** from
-[nvidia/Nemotron-PII](https://huggingface.co/datasets/nvidia/Nemotron-PII).
+[`nvidia/Nemotron-PII`](https://huggingface.co/datasets/nvidia/Nemotron-PII).
 
 - **Base model**: [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) — 1.4B-parameter MoE (50M active per token), BIOES token-classification head
 - **Task**: Token classification for PII detection (BIOES scheme)
@@ -36,9 +37,63 @@ The base model ships with 8 coarse PII categories (`private_person`,
 `credit_debit_card`, `ssn`, and so on — matching what downstream redaction
 and masking pipelines typically need.
 
+> **Family at a glance.** Same architecture, three runtimes:
+> - **PyTorch (this repo)** — CPU + CUDA, anywhere transformers runs.
+> - **MLX BF16** — [`OpenMed/privacy-filter-nemotron-mlx`](https://huggingface.co/OpenMed/privacy-filter-nemotron-mlx) — Apple Silicon, full precision.
+> - **MLX 8-bit** — [`OpenMed/privacy-filter-nemotron-mlx-8bit`](https://huggingface.co/OpenMed/privacy-filter-nemotron-mlx-8bit) — Apple Silicon, ~1.7× faster.
+
 ## Quick start
 
-### With `opf`
+### With [OpenMed](https://github.com/maziyarpanahi/openmed) — recommended
+
+OpenMed gives you `extract_pii()` / `deidentify()` with built-in BIOES Viterbi
+decoding, span refinement, and a Faker-backed obfuscation engine. Same call
+on every host — Apple Silicon picks up MLX automatically; everywhere else uses
+this PyTorch checkpoint.
+
+```bash
+pip install -U "openmed[hf]"
+```
+
+```python
+from openmed import extract_pii, deidentify
+
+text = (
+    "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
+    "phone 415-555-0123, email sarah.johnson@example.com."
+)
+
+# Extract grouped entity spans
+result = extract_pii(text, model_name="OpenMed/privacy-filter-nemotron")
+for ent in result.entities:
+    print(f"{ent.label:30s} {ent.text!r} conf={ent.confidence:.2f}")
+
+# De-identify with any of the supported methods
+masked = deidentify(text, method="mask", model_name="OpenMed/privacy-filter-nemotron")
+removed = deidentify(text, method="remove", model_name="OpenMed/privacy-filter-nemotron")
+hashed = deidentify(text, method="hash", model_name="OpenMed/privacy-filter-nemotron")
+
+# Faker-backed, locale-aware obfuscation; deterministic with consistent=True + a seed
+fake = deidentify(
+    text,
+    method="replace",
+    model_name="OpenMed/privacy-filter-nemotron",
+    consistent=True,
+    seed=42,
+)
+print(fake.deidentified_text)
+```
+
+`OpenMed/privacy-filter-nemotron-mlx*` model names also work in the same
+`extract_pii()` / `deidentify()` calls — on a non-Apple-Silicon host they
+automatically fall back to **this PyTorch checkpoint** with a one-time
+warning. So you can ship MLX names in code and still run on Linux/Windows.
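+
+For example, a call written against the MLX checkpoint runs unchanged on a
+Linux box (illustrative snippet, reusing `text` and `extract_pii` from the
+example above):
+
+```python
+# Off Apple Silicon this resolves to the PyTorch checkpoint in this repo
+# (per the fallback note above).
+result = extract_pii(text, model_name="OpenMed/privacy-filter-nemotron-mlx-8bit")
+```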
+
+The OpenMed wrapper passes `trust_remote_code=True` for you, runs the
+model's own BIOES Viterbi decoder, and skips OpenMed's regex
+smart-merging (the model already produces clean spans).
+
+### With `opf` — OpenAI's official CLI
 
 ```bash
 pip install 'opf @ git+https://github.com/openai/privacy-filter.git'
@@ -48,7 +103,7 @@ opf redact \
 --text "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, phone 415-555-0123."
 ```
 
-### With `transformers`
+### With `transformers` directly
 
 ```python
 import torch
@@ -73,9 +128,9 @@ for t, l in zip(tokens, out):
     print(f"{t}\t{id2label[l]}")
 ```
 
-For best results use Viterbi decoding (not argmax) — `opf` does this by
-default. If you're doing argmax with the HF transformers API, you'll
-slightly more boundary errors but still excellent label accuracy.
+For best results use Viterbi decoding (not argmax) — both `opf` and OpenMed
+do this by default. If you're doing argmax with the HF transformers API, you'll
+see slightly more boundary errors but still excellent label accuracy.
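+
+If you want constrained decoding without either wrapper, here is a minimal
+NumPy sketch of the idea: a hard-constraint Viterbi pass over per-token
+logits that only permits valid BIOES transitions. It is an illustration, not
+the decoder that `opf` or OpenMed actually ship; the label set, the 0 / -1e9
+transition scores, and the `bioes_viterbi` name are assumptions made for the
+example.
+
+```python
+import numpy as np
+
+def bioes_viterbi(logits, labels):
+    """Best label path under hard BIOES transition constraints.
+
+    logits: (seq_len, num_labels) per-token label scores.
+    labels: strings like "O", "B-name", "I-name", "E-name", "S-name".
+    """
+    def split(lab):
+        tag, _, typ = lab.partition("-")
+        return tag, typ
+
+    def allowed(prev, curr):
+        p_tag, p_typ = split(prev)
+        c_tag, c_typ = split(curr)
+        if p_tag in ("O", "E", "S"):        # no span open: stay outside or open one
+            return c_tag in ("O", "B", "S")
+        # span open (B/I): must continue or close it with the same entity type
+        return c_tag in ("I", "E") and c_typ == p_typ
+
+    n, k = logits.shape
+    NEG = -1e9
+    trans = np.array([[0.0 if allowed(p, c) else NEG for c in labels] for p in labels])
+    score = logits[0].copy()
+    # a sequence cannot start mid-span (no I/E on the first token)
+    score[[j for j, l in enumerate(labels) if split(l)[0] in ("I", "E")]] = NEG
+    back = np.zeros((n, k), dtype=int)
+    for t in range(1, n):
+        cand = score[:, None] + trans       # (prev_label, curr_label)
+        back[t] = cand.argmax(axis=0)
+        score = cand.max(axis=0) + logits[t]
+    # a sequence cannot end mid-span (no B/I on the last token)
+    score[[j for j, l in enumerate(labels) if split(l)[0] in ("B", "I")]] = NEG
+    path = [int(score.argmax())]
+    for t in range(n - 1, 0, -1):
+        path.append(int(back[t, path[-1]]))
+    return [labels[i] for i in reversed(path)]
+
+labels = ["O", "B-name", "I-name", "E-name", "S-name"]
+logits = np.random.randn(6, len(labels))
+print(bioes_viterbi(logits, labels))  # always a structurally valid BIOES path
+```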
 
 ## Performance
 
@@ -208,6 +263,20 @@ observed.
 - **Not a substitute for legal compliance review.** Use alongside a
 governance layer (human review, deterministic regex pre-filters, etc.).
 
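+For illustration, a deterministic regex pre-filter of the kind mentioned in
+the last point might look like the minimal sketch below (the `PREFILTERS`
+table and its US-centric patterns are hypothetical examples, not a
+production rule set):
+
+```python
+import re
+
+# Deterministic pre-filter: catch fixed-format identifiers with regexes
+# before (or alongside) the model. Patterns are illustrative only.
+PREFILTERS = {
+    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
+    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
+    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
+}
+
+def regex_prefilter(text):
+    """Yield (label, start, end, matched_text) for every deterministic hit."""
+    for label, pattern in PREFILTERS.items():
+        for m in pattern.finditer(text):
+            yield label, m.start(), m.end(), m.group()
+
+text = "Reach me at 415-555-0123 or jane@example.com, SSN 123-45-6789."
+for label, start, end, match in regex_prefilter(text):
+    print(f"{label:6s} [{start}:{end}] {match}")
+```
+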
+## Credits & Acknowledgements
+
+This model wouldn't exist without two open-source releases — sincere thanks
+to both teams:
+
+- **OpenAI** for [open-sourcing the Privacy Filter](https://huggingface.co/openai/privacy-filter)
+  (architecture, modeling code, and `opf` training/eval CLI). Everything in
+  this repo is a fine-tune on top of that release.
+- **NVIDIA** for releasing the [Nemotron-PII dataset](https://huggingface.co/datasets/nvidia/Nemotron-PII)
+  with its 100K-row train split and 55 fine-grained PII labels.
+
+Additional thanks to the **Hugging Face** team for the `transformers` /
+`huggingface_hub` ecosystem this model ships through.
+
 ## License
 
 Apache 2.0, same as the base model.