docs: refresh README — openmed examples, credits, public release

README.md CHANGED

@@ -13,15 +13,16 @@ tags:
 - redaction
 - nemotron
 - privacy-filter
+- openmed
 language:
 - en
 ---
 
 # privacy-filter-nemotron
 
-Fine-tuned [openai/privacy-filter](https://huggingface.co/openai/privacy-filter)
+Fine-tuned [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)
 for **fine-grained PII extraction** across **55 categories** from
-[nvidia/Nemotron-PII](https://huggingface.co/datasets/nvidia/Nemotron-PII).
+[`nvidia/Nemotron-PII`](https://huggingface.co/datasets/nvidia/Nemotron-PII).
 
 - **Base model**: [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) — 1.4B-parameter MoE (50M active per token), BIOES token-classification head
 - **Task**: Token classification for PII detection (BIOES scheme)
@@ -36,9 +37,63 @@ The base model ships with 8 coarse PII categories (`private_person`,
 `credit_debit_card`, `ssn`, and so on — matching what downstream redaction
 and masking pipelines typically need.
 
+> **Family at a glance.** Same architecture, three runtimes:
+> - **PyTorch (this repo)** — CPU + CUDA, anywhere transformers runs.
+> - **MLX BF16** — [`OpenMed/privacy-filter-nemotron-mlx`](https://huggingface.co/OpenMed/privacy-filter-nemotron-mlx) — Apple Silicon, full precision.
+> - **MLX 8-bit** — [`OpenMed/privacy-filter-nemotron-mlx-8bit`](https://huggingface.co/OpenMed/privacy-filter-nemotron-mlx-8bit) — Apple Silicon, ~1.7× faster.
+
 ## Quick start
 
-### With `opf`
+### With [OpenMed](https://github.com/maziyarpanahi/openmed) — recommended
+
+OpenMed gives you `extract_pii()` / `deidentify()` with built-in BIOES Viterbi
+decoding, span refinement, and a Faker-backed obfuscation engine. Same call
+on every host — Apple Silicon picks up MLX automatically; everywhere else uses
+this PyTorch checkpoint.
+
+```bash
+pip install -U "openmed[hf]"
+```
+
+```python
+from openmed import extract_pii, deidentify
+
+text = (
+    "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
+    "phone 415-555-0123, email sarah.johnson@example.com."
+)
+
+# Extract grouped entity spans
+result = extract_pii(text, model_name="OpenMed/privacy-filter-nemotron")
+for ent in result.entities:
+    print(f"{ent.label:30s} {ent.text!r} conf={ent.confidence:.2f}")
+
+# De-identify with any of the supported methods
+masked = deidentify(text, method="mask", model_name="OpenMed/privacy-filter-nemotron")
+removed = deidentify(text, method="remove", model_name="OpenMed/privacy-filter-nemotron")
+hashed = deidentify(text, method="hash", model_name="OpenMed/privacy-filter-nemotron")
+
+# Faker-backed, locale-aware obfuscation; deterministic with consistent=True + a seed
+fake = deidentify(
+    text,
+    method="replace",
+    model_name="OpenMed/privacy-filter-nemotron",
+    consistent=True,
+    seed=42,
+)
+print(fake.deidentified_text)
+```
+
+`OpenMed/privacy-filter-nemotron-mlx*` model names also work in the same
+`extract_pii()` / `deidentify()` calls — on a non-Apple-Silicon host they
+automatically fall back to **this PyTorch checkpoint** with a one-time
+warning. So you can ship MLX names in code and still run on Linux/Windows.
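+
+For example, a call written against the MLX checkpoint runs unchanged on a
+Linux box (illustrative snippet, reusing `text` and `extract_pii` from the
+example above):
+
+```python
+# Off Apple Silicon this resolves to the PyTorch checkpoint in this repo
+# (per the fallback note above).
+result = extract_pii(text, model_name="OpenMed/privacy-filter-nemotron-mlx-8bit")
+```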
+
+The OpenMed wrapper passes `trust_remote_code=True` for you, runs the
+model's own BIOES Viterbi decoder, and skips OpenMed's regex
+smart-merging (the model already produces clean spans).
+
+### With `opf` — OpenAI's official CLI
 
 ```bash
 pip install 'opf @ git+https://github.com/openai/privacy-filter.git'
@@ -48,7 +103,7 @@ opf redact \
 --text "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, phone 415-555-0123."
 ```
 
-### With `transformers`
+### With `transformers` directly
 
 ```python
 import torch
@@ -73,9 +128,9 @@ for t, l in zip(tokens, out):
     print(f"{t}\t{id2label[l]}")
 ```
 
-For best results use Viterbi decoding (not argmax) — `opf` does this by
-default. If you're doing argmax with the HF transformers API, you'll
-slightly more boundary errors but still excellent label accuracy.
+For best results use Viterbi decoding (not argmax) — both `opf` and OpenMed
+do this by default. If you're doing argmax with the HF transformers API, you'll
+see slightly more boundary errors but still excellent label accuracy.
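+
+If you want constrained decoding without either wrapper, here is a minimal
+NumPy sketch of the idea: a hard-constraint Viterbi pass over per-token
+logits that only permits valid BIOES transitions. It is an illustration, not
+the decoder that `opf` or OpenMed actually ship; the label set, the 0 / -1e9
+transition scores, and the `bioes_viterbi` name are assumptions made for the
+example.
+
+```python
+import numpy as np
+
+def bioes_viterbi(logits, labels):
+    """Best label path under hard BIOES transition constraints.
+
+    logits: (seq_len, num_labels) per-token label scores.
+    labels: strings like "O", "B-name", "I-name", "E-name", "S-name".
+    """
+    def split(lab):
+        tag, _, typ = lab.partition("-")
+        return tag, typ
+
+    def allowed(prev, curr):
+        p_tag, p_typ = split(prev)
+        c_tag, c_typ = split(curr)
+        if p_tag in ("O", "E", "S"):        # no span open: stay outside or open one
+            return c_tag in ("O", "B", "S")
+        # span open (B/I): must continue or close it with the same entity type
+        return c_tag in ("I", "E") and c_typ == p_typ
+
+    n, k = logits.shape
+    NEG = -1e9
+    trans = np.array([[0.0 if allowed(p, c) else NEG for c in labels] for p in labels])
+    score = logits[0].copy()
+    # a sequence cannot start mid-span (no I/E on the first token)
+    score[[j for j, l in enumerate(labels) if split(l)[0] in ("I", "E")]] = NEG
+    back = np.zeros((n, k), dtype=int)
+    for t in range(1, n):
+        cand = score[:, None] + trans       # (prev_label, curr_label)
+        back[t] = cand.argmax(axis=0)
+        score = cand.max(axis=0) + logits[t]
+    # a sequence cannot end mid-span (no B/I on the last token)
+    score[[j for j, l in enumerate(labels) if split(l)[0] in ("B", "I")]] = NEG
+    path = [int(score.argmax())]
+    for t in range(n - 1, 0, -1):
+        path.append(int(back[t, path[-1]]))
+    return [labels[i] for i in reversed(path)]
+
+labels = ["O", "B-name", "I-name", "E-name", "S-name"]
+logits = np.random.randn(6, len(labels))
+print(bioes_viterbi(logits, labels))  # always a structurally valid BIOES path
+```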
 
 ## Performance
 
@@ -208,6 +263,20 @@ observed.
 - **Not a substitute for legal compliance review.** Use alongside a
 governance layer (human review, deterministic regex pre-filters, etc.).
 
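+For illustration, a deterministic regex pre-filter of the kind mentioned in
+the last point might look like the minimal sketch below (the `PREFILTERS`
+table and its US-centric patterns are hypothetical examples, not a
+production rule set):
+
+```python
+import re
+
+# Deterministic pre-filter: catch fixed-format identifiers with regexes
+# before (or alongside) the model. Patterns are illustrative only.
+PREFILTERS = {
+    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
+    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
+    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
+}
+
+def regex_prefilter(text):
+    """Yield (label, start, end, matched_text) for every deterministic hit."""
+    for label, pattern in PREFILTERS.items():
+        for m in pattern.finditer(text):
+            yield label, m.start(), m.end(), m.group()
+
+text = "Reach me at 415-555-0123 or jane@example.com, SSN 123-45-6789."
+for label, start, end, match in regex_prefilter(text):
+    print(f"{label:6s} [{start}:{end}] {match}")
+```
+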
+## Credits & Acknowledgements
+
+This model wouldn't exist without two open-source releases — sincere thanks
+to both teams:
+
+- **OpenAI** for [open-sourcing the Privacy Filter](https://huggingface.co/openai/privacy-filter)
+  (architecture, modeling code, and `opf` training/eval CLI). Everything in
+  this repo is a fine-tune on top of that release.
+- **NVIDIA** for releasing the [Nemotron-PII dataset](https://huggingface.co/datasets/nvidia/Nemotron-PII)
+  with its 100K-row train split and 55 fine-grained PII labels.
+
+Additional thanks to the **Hugging Face** team for the `transformers` /
+`huggingface_hub` ecosystem this model ships through.
+
 ## License
 
 Apache 2.0, same as the base model.