MaziyarPanahi committed
Commit d4a9d9d · verified · 1 Parent(s): 8b9cf96

docs: refresh README — openmed examples, credits, public release

Files changed (1):
  1. README.md +76 -7
README.md CHANGED
@@ -13,15 +13,16 @@ tags:
13
  - redaction
14
  - nemotron
15
  - privacy-filter
16
  language:
17
  - en
18
  ---
19
 
20
  # privacy-filter-nemotron
21
 
22
- Fine-tuned [openai/privacy-filter](https://huggingface.co/openai/privacy-filter)
23
  for **fine-grained PII extraction** across **55 categories** from
24
- [nvidia/Nemotron-PII](https://huggingface.co/datasets/nvidia/Nemotron-PII).
25
 
26
  - **Base model**: [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) — 1.4B-parameter MoE (50M active per token), BIOES token-classification head
27
  - **Task**: Token classification for PII detection (BIOES scheme)
@@ -36,9 +37,63 @@ The base model ships with 8 coarse PII categories (`private_person`,
36
  `credit_debit_card`, `ssn`, and so on — matching what downstream redaction
37
  and masking pipelines typically need.
38
 
39
  ## Quick start
40
 
41
- ### With `opf` (official CLI from OpenAI, recommended)
42
 
43
  ```bash
44
  pip install 'opf @ git+https://github.com/openai/privacy-filter.git'
@@ -48,7 +103,7 @@ opf redact \
48
  --text "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, phone 415-555-0123."
49
  ```
50
 
51
- ### With `transformers`
52
 
53
  ```python
54
  import torch
@@ -73,9 +128,9 @@ for t, l in zip(tokens, out):
73
    print(f"{t}\t{id2label[l]}")
74
  ```
75
 
76
- For best results use Viterbi decoding (not argmax) — the `opf` CLI does this by
77
- default. If you're doing argmax with the HF transformers API, you'll see
78
- slightly more boundary errors but still excellent label accuracy.
79
 
80
  ## Performance
81
 
@@ -208,6 +263,20 @@ observed.
208
  - **Not a substitute for legal compliance review.** Use alongside a
209
  governance layer (human review, deterministic regex pre-filters, etc.).
210
 
211
  ## License
212
 
213
  Apache 2.0, same as the base model.
 
13
  - redaction
14
  - nemotron
15
  - privacy-filter
16
+ - openmed
17
  language:
18
  - en
19
  ---
20
 
21
  # privacy-filter-nemotron
22
 
23
+ Fine-tuned [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)
24
  for **fine-grained PII extraction** across **55 categories** from
25
+ [`nvidia/Nemotron-PII`](https://huggingface.co/datasets/nvidia/Nemotron-PII).
26
 
27
  - **Base model**: [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) — 1.4B-parameter MoE (50M active per token), BIOES token-classification head
28
  - **Task**: Token classification for PII detection (BIOES scheme)
 
37
  `credit_debit_card`, `ssn`, and so on — matching what downstream redaction
38
  and masking pipelines typically need.
39
 
40
+ > **Family at a glance.** Same architecture, three runtimes:
41
+ > - **PyTorch (this repo)** — CPU + CUDA, anywhere transformers runs.
42
+ > - **MLX BF16** — [`OpenMed/privacy-filter-nemotron-mlx`](https://huggingface.co/OpenMed/privacy-filter-nemotron-mlx) — Apple Silicon, full precision.
43
+ > - **MLX 8-bit** — [`OpenMed/privacy-filter-nemotron-mlx-8bit`](https://huggingface.co/OpenMed/privacy-filter-nemotron-mlx-8bit) — Apple Silicon, ~1.7× faster.
44
+
45
  ## Quick start
46
 
47
+ ### With [OpenMed](https://github.com/maziyarpanahi/openmed) (recommended)
48
+
49
+ OpenMed gives you `extract_pii()` / `deidentify()` with built-in BIOES Viterbi
50
+ decoding, span refinement, and a Faker-backed obfuscation engine. Same call
51
+ on every host — Apple Silicon picks up MLX automatically; everywhere else uses
52
+ this PyTorch checkpoint.
53
+
54
+ ```bash
55
+ pip install -U "openmed[hf]"
56
+ ```
57
+
58
+ ```python
59
+ from openmed import extract_pii, deidentify
60
+
61
+ text = (
62
+ "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
63
+ "phone 415-555-0123, email sarah.johnson@example.com."
64
+ )
65
+
66
+ # Extract grouped entity spans
67
+ result = extract_pii(text, model_name="OpenMed/privacy-filter-nemotron")
68
+ for ent in result.entities:
69
+     print(f"{ent.label:30s} {ent.text!r} conf={ent.confidence:.2f}")
70
+
71
+ # De-identify with any of the supported methods
72
+ masked = deidentify(text, method="mask", model_name="OpenMed/privacy-filter-nemotron")
73
+ removed = deidentify(text, method="remove", model_name="OpenMed/privacy-filter-nemotron")
74
+ hashed = deidentify(text, method="hash", model_name="OpenMed/privacy-filter-nemotron")
75
+
76
+ # Faker-backed, locale-aware obfuscation; deterministic with consistent=True and a seed
77
+ fake = deidentify(
78
+     text,
79
+     method="replace",
80
+     model_name="OpenMed/privacy-filter-nemotron",
81
+     consistent=True,
82
+     seed=42,
83
+ )
84
+ print(fake.deidentified_text)
85
+ ```
86
+
87
+ `OpenMed/privacy-filter-nemotron-mlx*` model names also work in the same
88
+ `extract_pii()` / `deidentify()` calls — on a non-Apple-Silicon host they
89
+ automatically fall back to **this PyTorch checkpoint** with a one-time
90
+ warning. So you can ship MLX names in code and still run on Linux/Windows.
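+
+ For example, a minimal sketch of that portability (the sample text is made
+ up; `extract_pii` and the model name are as documented above):
+
+ ```python
+ from openmed import extract_pii
+
+ # Ship the MLX 8-bit repo name everywhere; on a non-Apple-Silicon host
+ # OpenMed falls back to this PyTorch checkpoint with a one-time warning.
+ result = extract_pii(
+     "Call Dr. Adams at 415-555-0199.",
+     model_name="OpenMed/privacy-filter-nemotron-mlx-8bit",
+ )
+ ```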
91
+
92
+ The OpenMed wrapper passes `trust_remote_code=True` for you, runs the
93
+ model's own BIOES Viterbi decoder, and skips OpenMed's regex
94
+ smart-merging (the model already produces clean spans).
95
+
96
+ ### With `opf` — OpenAI's official CLI
97
 
98
  ```bash
99
  pip install 'opf @ git+https://github.com/openai/privacy-filter.git'
 
103
  --text "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, phone 415-555-0123."
104
  ```
105
 
106
+ ### With `transformers` directly
107
 
108
  ```python
109
  import torch
 
128
    print(f"{t}\t{id2label[l]}")
129
  ```
130
 
131
+ For best results use Viterbi decoding (not argmax) — both `opf` and OpenMed
132
+ do this by default. If you're doing argmax with the HF transformers API, you'll
133
+ see slightly more boundary errors but still excellent label accuracy.
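+
+ If you want Viterbi-quality spans without pulling in `opf` or OpenMed, the
+ sketch below shows the idea (an illustration, not the `opf` implementation;
+ it assumes you already have per-token log-probabilities and the BIOES label
+ list from the model's `id2label`):
+
+ ```python
+ import numpy as np
+
+ def _split(label):
+     tag, _, etype = label.partition("-")
+     return tag, etype
+
+ def _allowed(prev, cur):
+     # BIOES grammar: B-X and I-X must be followed by I-X or E-X of the
+     # same type; O, E-X, and S-X may be followed by O, B-Y, or S-Y.
+     p_tag, p_type = _split(prev)
+     c_tag, c_type = _split(cur)
+     if p_tag in ("B", "I"):
+         return c_tag in ("I", "E") and c_type == p_type
+     return c_tag in ("O", "B", "S")
+
+ def viterbi(log_probs, labels):
+     """log_probs: (seq_len, n_labels) array; returns one label per token."""
+     T, L = log_probs.shape
+     trans = np.array([[0.0 if _allowed(p, c) else -np.inf
+                        for c in labels] for p in labels])
+     # A span can't open mid-entity: forbid I-/E- as the first label.
+     start_ok = np.array([0.0 if l[0] not in ("I", "E") else -np.inf
+                          for l in labels])
+     score = start_ok + log_probs[0]
+     back = np.zeros((T, L), dtype=int)
+     for t in range(1, T):
+         cand = score[:, None] + trans        # scores for (prev, cur) pairs
+         back[t] = cand.argmax(axis=0)
+         score = cand.max(axis=0) + log_probs[t]
+     path = [int(score.argmax())]
+     for t in range(T - 1, 0, -1):
+         path.append(int(back[t, path[-1]]))
+     return [labels[i] for i in reversed(path)]
+ ```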
134
 
135
  ## Performance
136
 
 
263
  - **Not a substitute for legal compliance review.** Use alongside a
264
  governance layer (human review, deterministic regex pre-filters, etc.).
265
 
266
+ ## Credits & Acknowledgements
267
+
268
+ This model wouldn't exist without two open-source releases — sincere thanks
269
+ to both teams:
270
+
271
+ - **OpenAI** for [open-sourcing the Privacy Filter](https://huggingface.co/openai/privacy-filter)
272
+ (architecture, modeling code, and `opf` training/eval CLI). Everything in
273
+ this repo is a fine-tune on top of that release.
274
+ - **NVIDIA** for releasing the [Nemotron-PII dataset](https://huggingface.co/datasets/nvidia/Nemotron-PII)
275
+ with its 100K-row train split and 55 fine-grained PII labels.
276
+
277
+ Additional thanks to the **Hugging Face** team for the `transformers` /
278
+ `huggingface_hub` ecosystem this model ships through.
279
+
280
  ## License
281
 
282
  Apache 2.0, same as the base model.