---
license: apache-2.0
language:
- en
pipeline_tag: token-classification
library_name: transformers
tags:
- pii
- privacy
- token-classification
- bioes
- moe
- haremb
base_model:
- OpenMed/privacy-filter-nemotron
- openai/privacy-filter
datasets:
- nvidia/Nemotron-PII
---

# HarEmb · OpenMed-Nemotron PII

> A **single-layer** HarEmb model on the [`OpenMed/privacy-filter-nemotron`](https://huggingface.co/OpenMed/privacy-filter-nemotron) lineage. It has 287M total parameters and predicts the full **221-class BIOES** Nemotron-PII label space.

**Model**: [`fblgit/haremb-privacy-filter-opennemo`](https://huggingface.co/fblgit/haremb-privacy-filter-opennemo)

![model plot](eval_performance.png)

## Lineage

This model is the final step of a three-stage lineage:

1. **[`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)** — OpenAI's open release of the underlying 1.4B-parameter MoE backbone (8 transformer layers, ~50M active params/token, BIOES token-classifier head).
2. **[`OpenMed/privacy-filter-nemotron`](https://huggingface.co/OpenMed/privacy-filter-nemotron)** — OpenMed's full fine-tune of that backbone on `nvidia/Nemotron-PII`, expanding the head to 221 BIOES classes (55 fine-grained PII categories).
3. **`haremb-privacy-filter-opennemo`** *(this model)* — a one-layer surgical slice of the OpenMed teacher.

## What this model does

Token-level PII classification over **55 Nemotron-PII categories**. Every token receives either `O` or a `{B, I, E, S}-<category>` tag, covering identity, contact, address, date/time, government ID, financial, healthcare, enterprise ID, vehicle, and digital-identifier categories.
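
The 221-class head is simply the BIOES product of those categories plus `O`: 55 × 4 + 1 = 221. A minimal sketch of the decomposition (the category names below are illustrative; the authoritative list ships in `model.config.id2label`):

```python
# Illustrative only: the real 55-category list lives in the checkpoint config.
categories = ["email", "first_name", "license_plate"]  # ... 55 in total
labels = ["O"] + [f"{prefix}-{cat}" for cat in categories for prefix in "BIES"]
# With the full category set: 1 + 55 * 4 == 221 classes.
```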

In `eval()` mode the model can run constrained-BIOES Viterbi decoding internally, so `outputs.logits.argmax(-1)` is span-coherent by default. See [Output semantics](#output-semantics) for the exact fields and opt-out flags.

## Evaluation

Evaluated on a 1% slice of `nvidia/Nemotron-PII:test` (1,000 documents, ctx 1024, seed 42), Viterbi-decoded. The benchmark and app both use the convention **A = `OpenMed/privacy-filter-nemotron` (teacher / baseline)**, **B = this checkpoint** (`haremb`); ratios are reported as **B ÷ A**.
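
Outside the shipped script, a comparable slice can be pulled with the standard `datasets` split syntax (a sketch only; `benchmark.py` is the authoritative recipe, and plain slicing does not reproduce its seed-42 sampling):

```python
from datasets import load_dataset

# Sketch only: benchmark.py is the reference implementation of this eval.
eval_docs = load_dataset("nvidia/Nemotron-PII", split="test[:1%]")
print(len(eval_docs))  # roughly 1,000 documents
```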

### Quality (viterbi stream)

| metric | **A: OpenMed teacher** | **B: haremb** (this) | B − A |
|---|---:|---:|---:|
| span F1 | 0.9434 | **0.9288** | −0.0146 |
| span precision | 0.9531 | **0.9396** | −0.0135 |
| span recall | 0.9338 | **0.9182** | −0.0156 |
| token accuracy | 0.9900 | **0.9885** | −0.0015 |
| non-O recall | 0.9703 | **0.9637** | −0.0066 |

### Performance (same eval set, ctx 1024, bf16, single GPU)

| metric | **A: OpenMed teacher** | **B: haremb** | B vs A |
|---|---:|---:|---:|
| total params | 1,400M | **287M** | **4.87× smaller** |
| dense params | 139M | 130M | 1.07× smaller |
| MoE expert params | 1,260M | 158M | **7.97× smaller** |
| **active params / token** (memory) | 178.7M | **134.5M** | 1.33× smaller |
| **compute params / token** (FLOPs) | 50.7M | **6.5M** | **7.85× cheaper** |
| GFLOP / token (forward) | 0.101 | **0.013** | **7.85× cheaper** |
| weights on disk | (HF repo) | **548 MiB** | — |
| weights in RAM | 2,669 MiB | 548 MiB | **4.87× smaller** |
| peak GPU memory (eval) | 3.30 GiB | **1.22 GiB** | **2.70× less** |
| throughput | 3,275 tok/s | **6,361 tok/s** | **1.94× faster** |

`active params / token` estimates memory bandwidth pressure, while `compute params / token` estimates matmul FLOPs and excludes the embedding table row-gather. GFLOP/token is `2 × compute_params_per_token`. `infer.log` and `compare.log` contain the full breakdown, including peak GPU memory from `torch.cuda.max_memory_allocated`.
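
As a quick sanity check, the GFLOP/token row follows directly from the compute-params row:

```python
# GFLOP/token = 2 * compute_params_per_token (one multiply + one add per weight).
compute_params = {"A: OpenMed teacher": 50.7e6, "B: haremb": 6.5e6}
for name, p in compute_params.items():
    print(f"{name:20s} {2 * p / 1e9:.3f} GFLOP/token")
# A: OpenMed teacher   0.101 GFLOP/token
# B: haremb            0.013 GFLOP/token  (≈ 7.8× from these rounded figures)
```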

![performance summary](eval_performance.png)

### Quality breakdown

![eval summary](eval_summary.png)

### Per-category highlights (viterbi span F1)

**At or near 1.000 (B)** — `biometric_identifier`, `blood_type`, `coordinate`, `health_plan_beneficiary_number`, `ipv4`, `ipv6`, `license_plate`, `mac_address`, `national_id`, `postcode` (≥ 0.99 with ≥ 100 gold spans).

**Categories where B beats A** (B vs A) — `gender` (0.987 vs 0.841), `political_view` (0.872 vs 0.839), `religious_belief` (0.935 vs 0.926), `state` (0.908 vs 0.829), `language` (0.897 vs 0.804), `race_ethnicity` (0.864 vs 0.861), `country` (0.952 vs 0.936). These are "fuzzy" world-knowledge categories where the 1-layer student carries the right inductive bias.

**Categories where A leads** (A vs B) — `occupation` (0.727 vs 0.605), `company_name` (0.929 vs 0.776), `last_name` (0.976 vs 0.931), `first_name` (0.970 vs 0.930), `user_name` (0.961 vs 0.942). These are identity-noun categories where the teacher's deeper-layer mixing helps.

### Token-outcome breakdown — A: OpenMed teacher vs B: haremb (viterbi)

![token outcomes](eval_confusion.png)

## Quick start

### Recommended — via OpenMed

The OpenMed wrapper is the same UX the teacher card recommends and works on this checkpoint as a drop-in:

```bash
pip install -U "openmed[hf]"
```

```python
from openmed import extract_pii, deidentify

text = (
    "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
    "phone 415-555-0123, email sarah.johnson@example.com."
)

result = extract_pii(text, model_name="fblgit/haremb-privacy-filter-opennemo")
for ent in result.entities:
    print(f"{ent.label:30s} {ent.text!r} conf={ent.confidence:.2f}")

masked = deidentify(text, method="mask",
                    model_name="fblgit/haremb-privacy-filter-opennemo")
fake = deidentify(text, method="replace",
                  model_name="fblgit/haremb-privacy-filter-opennemo",
                  consistent=True, seed=42)
```

### HuggingFace `transformers` pipeline

```python
from transformers import pipeline

pipe = pipeline(
    "token-classification",
    model="fblgit/haremb-privacy-filter-opennemo",
    tokenizer="fblgit/haremb-privacy-filter-opennemo",
    trust_remote_code=True,
    aggregation_strategy="simple",
)

pipe("Send the invoice to billing@acmecorp.io, account 1234-5678.")
# → [{'entity_group': 'email', 'word': 'billing@acmecorp.io', ...},
#    {'entity_group': 'account_number', 'word': '1234-5678', ...}]
```

### Raw `transformers` API

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

repo = "fblgit/haremb-privacy-filter-opennemo"
model = AutoModelForTokenClassification.from_pretrained(
    repo, trust_remote_code=True, dtype=torch.bfloat16,
).to("cuda").eval()
tok = AutoTokenizer.from_pretrained(repo)

enc = tok("My email is foo@bar.com.", return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model(**enc)

# By default, `out.logits.argmax(-1)` follows the Viterbi-decoded path.
labels = out.logits.argmax(-1)[0]
```
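
Continuing that snippet, the predicted ids map back to BIOES tag strings through the standard `config.id2label` table:

```python
# Pair each token with its decoded BIOES tag and print the non-O hits.
tags = [model.config.id2label[int(i)] for i in labels]
tokens = tok.convert_ids_to_tokens(enc["input_ids"][0].tolist())
for token, tag in zip(tokens, tags):
    if tag != "O":
        print(f"{token:15s} {tag}")
```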

## Output semantics

In `eval()` mode, the forward pass runs constrained-BIOES Viterbi over the per-token logits and attaches three things to the output:

- `outputs.logits` — a tensor whose `argmax(-1)` equals the Viterbi prediction (so HF `pipeline()` and naive `argmax` consumers get span-coherent predictions automatically).
- `outputs.predicted_labels` — a `[B, T]` LongTensor of Viterbi-decoded label ids (`-1` at padded positions).
- `outputs.raw_logits` — the original per-token logits, preserved for callers that want raw confidences (see the sketch below).
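
A minimal sketch of turning those fields into scored spans, with `outputs` taken from a forward pass as in the raw-API example above. It assumes a batch of one with no padding, and scores each token by the raw-logit softmax mass on its Viterbi label:

```python
import torch

probs = torch.softmax(outputs.raw_logits[0].float(), dim=-1)
ids = outputs.predicted_labels[0]
tags = [model.config.id2label[int(i)] for i in ids]
conf = probs[torch.arange(len(ids), device=ids.device), ids].tolist()

spans, start = [], None
for i, tag in enumerate(tags):
    if tag.startswith("S-"):                           # single-token span
        spans.append((tag[2:], i, i, conf[i]))
    elif tag.startswith("B-"):                         # span opens
        start = i
    elif tag.startswith("E-") and start is not None:   # span closes
        spans.append((tag[2:], start, i, min(conf[start:i + 1])))
        start = None
print(spans)  # (category, first_token, last_token, min token confidence)
```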

To opt out:

```python
model.config.viterbi_replace_logits = False  # raw logits in outputs.logits
model.config.use_viterbi_decode = False      # also skip Viterbi entirely
```

The model supports the upstream context length (131,072 max position embeddings). Practical batch sizes depend on hardware; bf16 at batch size 1 with a full-length input is comfortable on 24 GB.
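
For inputs beyond what one forward pass can hold, a plain windowed loop is a reasonable fallback, reusing `tok`, `model`, and the `torch` import from the raw-API example. A sketch only: the window size is arbitrary, and spans that straddle a window boundary would need an overlap-and-stitch policy on top:

```python
def classify_long(text, window=4096):
    """Run the model window-by-window over a long document (no overlap)."""
    ids = tok(text, return_tensors="pt")["input_ids"][0]
    all_tags = []
    for s in range(0, len(ids), window):
        chunk = ids[s:s + window].unsqueeze(0).to(model.device)
        with torch.no_grad():
            out = model(input_ids=chunk)
        all_tags += [model.config.id2label[int(i)]
                     for i in out.logits.argmax(-1)[0]]
    return all_tags
```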

## Limitations & intended use

- **English-only training data.** Nemotron-PII is predominantly English; performance on non-English text is not guaranteed.
- **Synthetic training data.** Real clinical notes, legal documents, and live web text may show different surface forms. For high-stakes deployments, collect a domain-specific eval set and re-calibrate.
- **Fuzzier categories.** `occupation`, `company_name`, and identity nouns (`first_name`, `last_name`, `user_name`) carry more uncertainty than formatted identifiers; downstream pipelines that only need strict PII can ignore low-confidence predictions on these.
- **Not a substitute for legal compliance review.** Use alongside a governance layer (human review, deterministic regex pre-filters as sketched below, etc.).
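
To illustrate the last point, a deterministic pre-filter can run ahead of the model so that high-precision pattern hits are masked regardless of model confidence. The patterns below are illustrative, not production-grade:

```python
import re

# Illustrative patterns only; production filters need locale-aware variants.
PREFILTERS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def prefilter(text):
    """Return (label, start, end) character spans caught by the regexes."""
    return sorted(
        (name, m.start(), m.end())
        for name, rx in PREFILTERS.items()
        for m in rx.finditer(text)
    )

print(prefilter("Reach me at foo@bar.com from 10.0.0.1."))
# [('email', 12, 23), ('ipv4', 29, 37)]
```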

## Reproducibility

Every metric, log, and plot in this card is regenerated by the single-file [`benchmark.py`](benchmark.py) shipped alongside the weights:

```bash
python benchmark.py                 # full benchmark vs OpenMed teacher
python benchmark.py --no-base       # skip teacher download (logs only)
python benchmark.py --no-plots     # skip matplotlib (logs + JSON only)
python benchmark.py --eval-pct 0.1  # smaller slice for a quick check
```

It writes the following artifacts into the model folder:

- `infer.log`
- `compare.log`
- `eval_summary.png`
- `eval_confusion.png`
- `eval_performance.png`

Raw per-document eval data is held in memory only. Pass `--out` to write the artifacts somewhere else.

The Gradio demo in [`app.py`](app.py) supports **side-by-side A-vs-B comparison** between any two token-classification checkpoints with the same label space. Defaults match the report convention: **A = OpenMed/privacy-filter-nemotron** (teacher / baseline), **B = this checkpoint**. Disable either model to run single-model inference; both expose a runtime "active experts per token" slider so you can sweep MoE routing density. From inside the model folder:

```bash
python app.py                                  # A=OpenMed teacher, B=. (this)
python app.py --model-a /path/to/another/repo  # swap baseline A
python app.py --model-b /path/to/another/repo  # swap candidate B
python app.py --port 7860 --share              # public share link
```

## License

Apache-2.0, the same as the rest of the lineage. Use is also subject to the license terms of [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) and the dataset terms of [`nvidia/Nemotron-PII`](https://huggingface.co/datasets/nvidia/Nemotron-PII).

## Citation

```bibtex
@misc{haremb-privacy-filter-opennemo,
  title     = {HarEmb · OpenMed-Nemotron PII: a single-layer
               privacy-filter slice with span-coherent inference},
  author    = {fblgit},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/fblgit/haremb-privacy-filter-opennemo},
  note      = {Single-transformer-layer model on the openai/privacy-filter →
               OpenMed/privacy-filter-nemotron lineage; 287M total params,
               221 BIOES classes (55 fine-grained PII categories), with
               inlined constrained-BIOES Viterbi decoding so
               outputs.logits.argmax(-1) is span-coherent.}
}

@misc{openmed-privacy-filter-nemotron,
  title     = {OpenMed/privacy-filter-nemotron: fine-grained PII extraction
               with 55 categories},
  author    = {OpenMed},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/OpenMed/privacy-filter-nemotron}
}

@misc{openai-privacy-filter,
  title     = {Privacy Filter},
  author    = {OpenAI},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/openai/privacy-filter}
}

@misc{nvidia-nemotron-pii,
  title     = {Nemotron-PII},
  author    = {NVIDIA},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/nvidia/Nemotron-PII}
}
```