fblgit committed on
Commit f0f5785 · 1 Parent(s): 8fb7b0f

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ eval_confusion.png filter=lfs diff=lfs merge=lfs -text
37
+ eval_summary.png filter=lfs diff=lfs merge=lfs -text
38
+ haremb.png filter=lfs diff=lfs merge=lfs -text
39
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1 @@
1
+ __pycache__
README.md ADDED
@@ -0,0 +1,258 @@
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ pipeline_tag: token-classification
6
+ library_name: transformers
7
+ tags:
8
+ - pii
9
+ - privacy
10
+ - token-classification
11
+ - bioes
12
+ - moe
13
+ - haremb
14
+ base_model:
15
+ - OpenMed/privacy-filter-nemotron
16
+ datasets:
17
+ - nvidia/Nemotron-PII
18
+ ---
19
+
20
+ # HarEmb · OpenMed-Nemotron PII
21
+
22
+ > A **single-layer** HarEmb model on the [`OpenMed/privacy-filter-nemotron`](https://huggingface.co/OpenMed/privacy-filter-nemotron) lineage. It has 287M total parameters and predicts the full **221-class BIOES** Nemotron-PII label space.
23
+
24
+ **Model**: [`fblgit/haremb-privacy-filter-opennemo`](https://huggingface.co/fblgit/haremb-privacy-filter-opennemo)
25
+
26
+ ![HarEmb architecture](haremb.png)
27
+
28
+ ## Lineage
29
+
30
+ This model is the third step in a three-step lineage:
31
+
32
+ 1. **[`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter)** — OpenAI's open release of the underlying 1.4B-parameter MoE backbone (8 transformer layers, ~50M active params/token, BIOES token classifier head).
33
+ 2. **[`OpenMed/privacy-filter-nemotron`](https://huggingface.co/OpenMed/privacy-filter-nemotron)** — OpenMed's full fine-tune of that backbone on `nvidia/Nemotron-PII`, expanding the head to 221 BIOES classes (55 fine-grained PII categories).
34
+ 3. **`haremb-privacy-filter-opennemo`** *(this model)* — a one-layer surgical slice of the OpenMed teacher (idea sketched below).
35
+
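+ The slicing recipe itself is not published with this card, so the snippet below is purely an intuition sketch of what a "keep one layer" surgery on the teacher could look like (the `model.model.layers` path matches what `app.py` uses for this architecture); it is not the actual training procedure.
+
+ ```python
+ # HYPOTHETICAL sketch of a one-layer slice. The real surgery/distillation
+ # recipe behind this checkpoint is not documented here.
+ import torch
+ from transformers import AutoModelForTokenClassification
+
+ teacher = AutoModelForTokenClassification.from_pretrained(
+     "OpenMed/privacy-filter-nemotron", trust_remote_code=True,
+ )
+ teacher.model.layers = torch.nn.ModuleList([teacher.model.layers[0]])
+ teacher.config.num_hidden_layers = 1  # keep the config consistent
+ ```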
36
+ ## What this model does
37
+
38
+ Token-level PII classification over **55 Nemotron-PII categories**. Every token receives one of `O` or `{B, I, E, S}-<category>`, covering identity, contact, address, date/time, government ID, financial, healthcare, enterprise ID, vehicle, and digital identifier categories.
39
+
40
+ In `eval()` mode the model can run constrained-BIOES Viterbi decoding internally, so `outputs.logits.argmax(-1)` is span-coherent by default. See [Output semantics](#output-semantics) for the exact fields and opt-out flags.
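+ As a concrete example, `sarah.johnson@example.com` split into three tokens is tagged `B-email I-email E-email`, while a single-token email gets `S-email`. "Constrained" means the decoder only walks legal BIOES transitions; below is a minimal sketch of the legality rule (the shipped modeling file implements the real batched Viterbi):
+
+ ```python
+ # Sketch of the BIOES transition constraint the Viterbi decoder enforces.
+ def legal(prev: str, nxt: str) -> bool:
+     p, pc = prev.split("-", 1) if "-" in prev else (prev, None)
+     n, nc = nxt.split("-", 1) if "-" in nxt else (nxt, None)
+     if p in ("B", "I"):                      # mid-span: continue same category
+         return n in ("I", "E") and nc == pc
+     return n in ("O", "B", "S")              # span closed: start a fresh one
+
+ assert legal("B-email", "E-email")
+ assert not legal("B-email", "B-phone_number")  # cannot restart mid-span
+ ```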
41
+
42
+ ## Evaluation
43
+
44
+ Evaluated on a 1% slice of `nvidia/Nemotron-PII:test` (1,000 documents, ctx 1024, seed 42), Viterbi-decoded. The benchmark and app both use the convention **A = `OpenMed/privacy-filter-nemotron` (teacher / baseline)**, **B = this checkpoint** (`haremb`); ratios are reported as **B ÷ A**.
45
+
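+ The slice can be rebuilt from the samplers vendored in [`benchmark.py`](benchmark.py). A minimal sketch, assuming `benchmark.py` is importable from the model folder (the `chunk_size` value here is illustrative; `benchmark.py` holds the exact default):
+
+ ```python
+ # Rebuild the eval slice: 1,000 test documents via chunked uniform sampling.
+ from datasets import load_dataset
+ from benchmark import _build_eval_streaming
+
+ test = load_dataset("nvidia/Nemotron-PII", split="test")
+ idx = _build_eval_streaming(test, target_n=1_000, chunk_size=10_000, seed=42)
+ eval_split = test.select(idx)
+ ```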
46
+ ### Quality (viterbi stream)
47
+
48
+ | metric | **A: OpenMed teacher** | **B: haremb** (this) | B − A |
49
+ |---|---:|---:|---:|
50
+ | span F1 | 0.9434 | **0.9288** | −0.0146 |
51
+ | span precision | 0.9531 | **0.9396** | −0.0135 |
52
+ | span recall | 0.9338 | **0.9182** | −0.0156 |
53
+ | token accuracy | 0.9900 | **0.9885** | −0.0015 |
54
+ | non-O recall | 0.9703 | **0.9637** | −0.0066 |
55
+
56
+ ### Performance (same eval set, ctx 1024, bf16, single GPU)
57
+
58
+ | metric | **A: OpenMed teacher** | **B: haremb** | B vs A |
59
+ |---|---:|---:|---:|
60
+ | total params | 1,400M | **287M** | **4.87× smaller** |
61
+ | dense params | 139M | 130M | 1.07× smaller |
62
+ | MoE expert params | 1,260M | 158M | **7.97× smaller** |
63
+ | **active params / token** (memory) | 178.7M | **134.5M** | 1.33× smaller |
64
+ | **compute params / token** (FLOPs) | 50.7M | **6.5M** | **7.85× cheaper** |
65
+ | GFLOP / token (forward) | 0.101 | **0.013** | **7.85× cheaper** |
66
+ | weights on disk | (HF repo) | **548 MiB** | — |
67
+ | weights in RAM | 2,669 MiB | 548 MiB | **4.87× smaller** |
68
+ | peak GPU memory (eval) | 3.30 GiB | **1.22 GiB** | **2.70× less** |
69
+ | throughput | 3,275 tok/s | **6,361 tok/s** | **1.94× faster** |
70
+
71
+ `active params / token` estimates memory bandwidth pressure, while `compute params / token` estimates matmul FLOPs and excludes the embedding table row-gather. GFLOP/token is `2 × compute_params_per_token`. `infer.log` and `compare.log` contain the full breakdown, including peak GPU memory from `torch.cuda.max_memory_allocated`.
72
+
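+ A back-of-envelope check of the GFLOP rows above, using the stated 2-FLOPs-per-matmul-param rule:
+
+ ```python
+ # GFLOP/token = 2 × compute_params_per_token
+ for name, compute_params in [("A: teacher", 50.7e6), ("B: haremb", 6.5e6)]:
+     print(f"{name}: {2 * compute_params / 1e9:.3f} GFLOP/token")
+ # A: teacher: 0.101 GFLOP/token
+ # B: haremb: 0.013 GFLOP/token
+ ```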
73
+ ![Performance profile — absolute footprint and B/A ratio, A teacher vs B candidate](eval_performance.png)
74
+
75
+ ### Quality breakdown
76
+
77
+ ![Eval summary — headline metrics, raw-vs-viterbi span F1, and selected per-category deltas](eval_summary.png)
78
+
79
+ ### Per-category highlights (viterbi span F1)
80
+
81
+ **At or near 1.000 (B)** — `biometric_identifier`, `blood_type`, `coordinate`, `health_plan_beneficiary_number`, `ipv4`, `ipv6`, `license_plate`, `mac_address`, `national_id`, `postcode` (≥ 0.99 with ≥ 100 gold spans).
82
+
83
+ **Categories where B beats A** (B vs A) — `gender` (0.987 vs 0.841), `political_view` (0.872 vs 0.839), `religious_belief` (0.935 vs 0.926), `state` (0.908 vs 0.829), `language` (0.897 vs 0.804), `race_ethnicity` (0.864 vs 0.861), `country` (0.952 vs 0.936). These are mostly "fuzzy" world-knowledge categories where the one-layer student carries the right inductive bias.
84
+
85
+ **Categories where A leads** (A vs B) — `occupation` (0.727 vs 0.605), `company_name` (0.929 vs 0.776), `last_name` (0.976 vs 0.931), `first_name` (0.970 vs 0.930), `user_name` (0.961 vs 0.942). These are identity-noun categories where the teacher's deeper-layer mixing helps.
86
+
87
+ ### Token-outcome breakdown — A: OpenMed teacher vs B: haremb (viterbi)
88
+
89
+ ![Pairwise token outcome and net category wins on gold non-O tokens](eval_confusion.png)
90
+
91
+ ## Quick start
92
+
93
+ ### Recommended — via OpenMed
94
+
95
+ The OpenMed wrapper offers the same UX the teacher card recommends and works with this checkpoint as a drop-in:
96
+
97
+ ```bash
98
+ pip install -U "openmed[hf]"
99
+ ```
100
+
101
+ ```python
102
+ from openmed import extract_pii, deidentify
103
+
104
+ text = (
105
+ "Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
106
+ "phone 415-555-0123, email sarah.johnson@example.com."
107
+ )
108
+
109
+ result = extract_pii(text, model_name="fblgit/haremb-privacy-filter-opennemo")
110
+ for ent in result.entities:
111
+ print(f"{ent.label:30s} {ent.text!r} conf={ent.confidence:.2f}")
112
+
113
+ masked = deidentify(text, method="mask",
114
+ model_name="fblgit/haremb-privacy-filter-opennemo")
115
+ fake = deidentify(text, method="replace",
116
+ model_name="fblgit/haremb-privacy-filter-opennemo",
117
+ consistent=True, seed=42)
118
+ ```
119
+
120
+ ### Hugging Face `transformers` pipeline
121
+
122
+ ```python
123
+ from transformers import pipeline
124
+
125
+ pipe = pipeline(
126
+ "token-classification",
127
+ model="fblgit/haremb-privacy-filter-opennemo",
128
+ tokenizer="fblgit/haremb-privacy-filter-opennemo",
129
+ trust_remote_code=True,
130
+ aggregation_strategy="simple",
131
+ )
132
+
133
+ pipe("Send the invoice to billing@acmecorp.io, account 1234-5678.")
134
+ # → [{'entity_group': 'email', 'word': 'billing@acmecorp.io', ...},
135
+ # {'entity_group': 'account_number', 'word': '1234-5678', ...}]
136
+ ```
137
+
138
+ ### Raw `transformers` API
139
+
140
+ ```python
141
+ import torch
142
+ from transformers import AutoModelForTokenClassification, AutoTokenizer
143
+
144
+ repo = "fblgit/haremb-privacy-filter-opennemo"
145
+ model = AutoModelForTokenClassification.from_pretrained(
146
+ repo, trust_remote_code=True, dtype=torch.bfloat16,
147
+ ).to("cuda").eval()
148
+ tok = AutoTokenizer.from_pretrained(repo)
149
+
150
+ enc = tok("My email is foo@bar.com.", return_tensors="pt").to("cuda")
151
+ with torch.no_grad():
152
+ out = model(**enc)
153
+
154
+ # By default, `outputs.logits.argmax(-1)` follows the Viterbi-decoded path.
155
+ labels = out.logits.argmax(-1)[0]
156
+ ```
157
+
158
+ ## Output semantics
159
+
160
+ The forward pass — in `eval()` mode — runs constrained-BIOES Viterbi over the per-token logits and attaches three things to the output:
161
+
162
+ - `outputs.logits` — a tensor whose `argmax(-1)` equals the Viterbi prediction (so HF `pipeline()` and naive `argmax` consumers get span-coherent predictions automatically).
163
+ - `outputs.predicted_labels` — a `[B, T]` LongTensor of Viterbi-decoded label ids (`-1` at padded positions).
164
+ - `outputs.raw_logits` — the original per-token logits, preserved for callers that want raw confidences.
165
+
166
+ To opt out:
167
+
168
+ ```python
169
+ model.config.viterbi_replace_logits = False # raw logits in outputs.logits
170
+ model.config.use_viterbi_decode = False # also skip Viterbi entirely
171
+ ```
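+ For example, to keep span-coherent labels while reading a per-token confidence off the raw logits, here is a minimal sketch reusing `model` and `enc` from the snippet above (softmax at the Viterbi label is one reasonable confidence choice, not the only one):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ with torch.no_grad():
+     out = model(**enc)
+ labels = out.predicted_labels[0]                      # Viterbi path; -1 = pad
+ probs = F.softmax(out.raw_logits[0].float(), dim=-1)  # raw per-class probs
+ conf = probs.gather(-1, labels.clamp(min=0).unsqueeze(-1)).squeeze(-1)
+ ```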
172
+
173
+ The model supports the upstream context length (max position embeddings 131,072 tokens). Practical batch sizes depend on hardware; bf16 at batch size 1 with a full-length context is comfortable on a 24 GB GPU.
174
+
175
+ ## Limitations & intended use
176
+
177
+ - **English-only training data.** Nemotron-PII is predominantly English. Performance on non-English text is not guaranteed.
178
+ - **Synthetic training data.** Real clinical notes, legal documents, and live web text may show different surface forms. For high-stakes deployments, collect a domain-specific eval set and re-calibrate.
179
+ - **Fuzzier categories** — `occupation`, `company_name`, and identity nouns (`first_name`, `last_name`, `user_name`) carry more uncertainty than formatted identifiers; downstream pipelines that only need strict PII can ignore low-confidence predictions on these.
180
+ - **Not a substitute for legal compliance review.** Use alongside a governance layer (human review, deterministic regex pre-filters, etc.).
181
+
182
+ ## Reproducibility
183
+
184
+ Every metric, log, and plot in this card is regenerated by the single-file [`benchmark.py`](benchmark.py) shipped alongside the weights:
185
+
186
+ ```bash
187
+ python benchmark.py # full benchmark vs OpenMed teacher
188
+ python benchmark.py --no-base # skip teacher download (logs only)
189
+ python benchmark.py --no-plots # skip matplotlib (logs + JSON only)
190
+ python benchmark.py --eval-pct 0.1 # smaller slice for a quick check
191
+ ```
192
+
193
+ Outputs written to the model folder:
194
+
195
+ - `infer.log`
196
+ - `compare.log`
197
+ - `eval_summary.png`
198
+ - `eval_confusion.png`
199
+ - `eval_performance.png`
200
+
201
+ Raw per-doc eval data is held in memory only. Pass `--out` to write artifacts somewhere else.
202
+
203
+ The Gradio demo in [`app.py`](app.py) supports **side-by-side A-vs-B comparison** between any two token-classification checkpoints with the same label space. Defaults match the report convention: **A = OpenMed/privacy-filter-nemotron** (teacher / baseline), **B = this checkpoint**. Disable either model to run single-model inference; both expose a runtime "active experts per token" slider so you can sweep MoE routing density. From inside the model folder:
204
+
205
+ ```bash
206
+ python app.py # A=OpenMed teacher, B=. (this)
207
+ python app.py --model-a /path/to/another/repo # swap baseline A
208
+ python app.py --model-b /path/to/another/repo # swap candidate B
209
+ python app.py --port 7860 --share # public share link
210
+ ```
211
+
212
+ ## License
213
+
214
+ Apache-2.0, same as the lineage. Subject to the license terms of [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) and the dataset terms of [`nvidia/Nemotron-PII`](https://huggingface.co/datasets/nvidia/Nemotron-PII).
215
+
216
+ ## Citation
217
+
218
+ ```bibtex
219
+ @misc{haremb-privacy-filter-opennemo,
220
+ title = {HarEmb · OpenMed-Nemotron PII: a single-layer
221
+ privacy-filter slice with span-coherent inference},
222
+ author = {fblgit},
223
+ year = {2026},
224
+ publisher = {Hugging Face},
225
+ url = {https://huggingface.co/fblgit/haremb-privacy-filter-opennemo},
226
+ howpublished = {\url{https://huggingface.co/fblgit/haremb-privacy-filter-opennemo}},
227
+ note = {Single-transformer-layer model on the openai/privacy-filter →
228
+ OpenMed/privacy-filter-nemotron lineage; 287M total params,
229
+ 221 BIOES classes (55 fine-grained PII categories), with
230
+ inlined constrained-BIOES Viterbi decoding so
231
+ outputs.logits.argmax(-1) is span-coherent.}
232
+ }
233
+
234
+ @misc{openmed-privacy-filter-nemotron,
235
+ title = {OpenMed/privacy-filter-nemotron: fine-grained PII extraction
236
+ with 55 categories},
237
+ author = {OpenMed},
238
+ year = {2026},
239
+ publisher = {Hugging Face},
240
+ url = {https://huggingface.co/OpenMed/privacy-filter-nemotron}
241
+ }
242
+
243
+ @misc{openai-privacy-filter,
244
+ title = {Privacy Filter},
245
+ author = {OpenAI},
246
+ year = {2026},
247
+ publisher = {Hugging Face},
248
+ url = {https://huggingface.co/openai/privacy-filter}
249
+ }
250
+
251
+ @misc{nvidia-nemotron-pii,
252
+ title = {Nemotron-PII},
253
+ author = {NVIDIA},
254
+ year = {2025},
255
+ publisher = {Hugging Face},
256
+ url = {https://huggingface.co/datasets/nvidia/Nemotron-PII}
257
+ }
258
+ ```
app.py ADDED
@@ -0,0 +1,534 @@
1
+ """
2
+ HarEmb PII — local Gradio inference demo.
3
+
4
+ Upload a PDF (or paste text), pick a device (CPU / cuda:N), and the model
5
+ highlights detected PII spans across the 55-category Nemotron-PII taxonomy.
6
+
7
+ Install:
8
+ pip install "gradio>=4" "transformers>=4.45" torch pypdf accelerate
9
+
10
+ Run from inside this folder:
11
+ python app.py
12
+ """
13
+ from __future__ import annotations
14
+
15
+ import argparse
16
+ import re
17
+ from pathlib import Path
18
+ from typing import Dict, List, Optional, Tuple
19
+
20
+ import gradio as gr
21
+ import torch
22
+ from pypdf import PdfReader
23
+ from transformers import pipeline
24
+
25
+
26
+ # Default to loading from this folder so `python app.py` works in-place after
27
+ # downloading the repo. Override with --model-a / --model-b on the CLI.
28
+ DEFAULT_MODEL = "."
29
+
30
+ CHUNK_CHARS = 400_000 # ~100k tokens; well under the model's 131k window
31
+
32
+ # 55 Nemotron-PII categories grouped for visual coherence; one color per
33
+ # coarse "family" so the highlight legend stays readable.
34
+ PALETTE: Dict[str, str] = {
35
+ # Identity (red)
36
+ "first_name": "#ef4444",
37
+ "last_name": "#ef4444",
38
+ "user_name": "#ef4444",
39
+ "company_name": "#ef4444",
40
+ "age": "#fb7185",
41
+ "gender": "#fb7185",
42
+ "race_ethnicity": "#fb7185",
43
+ "sexuality": "#fb7185",
44
+ "religious_belief": "#fb7185",
45
+ "political_view": "#fb7185",
46
+ "language": "#fb7185",
47
+ "education_level": "#fb7185",
48
+ "occupation": "#fb7185",
49
+ "employment_status": "#fb7185",
50
+ "blood_type": "#fb7185",
51
+ "biometric_identifier":"#fb7185",
52
+ # Contact (purple)
53
+ "email": "#8b5cf6",
54
+ "phone_number": "#a78bfa",
55
+ "fax_number": "#a78bfa",
56
+ "url": "#7c3aed",
57
+ # Address (green)
58
+ "street_address": "#10b981",
59
+ "city": "#34d399",
60
+ "county": "#34d399",
61
+ "state": "#34d399",
62
+ "country": "#34d399",
63
+ "postcode": "#34d399",
64
+ "coordinate": "#059669",
65
+ # Dates (blue)
66
+ "date": "#3b82f6",
67
+ "date_of_birth": "#60a5fa",
68
+ "date_time": "#60a5fa",
69
+ "time": "#60a5fa",
70
+ # Government IDs (orange)
71
+ "ssn": "#f97316",
72
+ "national_id": "#fb923c",
73
+ "tax_id": "#fb923c",
74
+ # Financial (amber)
75
+ "account_number": "#f59e0b",
76
+ "bank_routing_number": "#fbbf24",
77
+ "swift_bic": "#fbbf24",
78
+ "credit_debit_card": "#fbbf24",
79
+ "cvv": "#fbbf24",
80
+ "pin": "#fbbf24",
81
+ "password": "#d97706",
82
+ # Healthcare (pink)
83
+ "medical_record_number": "#ec4899",
84
+ "health_plan_beneficiary_number": "#f472b6",
85
+ # Enterprise IDs (cyan)
86
+ "customer_id": "#06b6d4",
87
+ "employee_id": "#06b6d4",
88
+ "unique_id": "#22d3ee",
89
+ "certificate_license_number": "#22d3ee",
90
+ # Vehicle (lime)
91
+ "license_plate": "#84cc16",
92
+ "vehicle_identifier": "#84cc16",
93
+ # Digital (indigo)
94
+ "ipv4": "#6366f1",
95
+ "ipv6": "#6366f1",
96
+ "mac_address": "#818cf8",
97
+ "device_identifier": "#818cf8",
98
+ "api_key": "#4f46e5",
99
+ "http_cookie": "#4f46e5",
100
+ }
101
+
102
+
103
+ def list_devices() -> List[str]:
104
+ devs = ["cpu"]
105
+ if torch.cuda.is_available():
106
+ for i in range(torch.cuda.device_count()):
107
+ devs.append(f"cuda:{i}")
108
+ return devs
109
+
110
+
111
+ _pipe_cache: Dict[Tuple[str, str], object] = {}
112
+
113
+
114
+ def get_pipe(model_path: str, device: str):
115
+ key = (model_path, device)
116
+ if key in _pipe_cache:
117
+ return _pipe_cache[key]
118
+ dtype = torch.bfloat16 if device.startswith("cuda") else torch.float32
119
+ pipe = pipeline(
120
+ "token-classification",
121
+ model=model_path,
122
+ tokenizer=model_path,
123
+ trust_remote_code=True,
124
+ aggregation_strategy="simple",
125
+ device=device,
126
+ torch_dtype=dtype,
127
+ )
128
+ _pipe_cache[key] = pipe
129
+ return pipe
130
+
131
+
132
+ def apply_runtime_config(
133
+ pipe,
134
+ use_viterbi: bool,
135
+ viterbi_replace: bool,
136
+ top_k: Optional[int] = None,
137
+ ) -> None:
138
+ cfg = pipe.model.config
139
+ if hasattr(cfg, "use_viterbi_decode"):
140
+ cfg.use_viterbi_decode = bool(use_viterbi)
141
+ if hasattr(cfg, "viterbi_replace_logits"):
142
+ cfg.viterbi_replace_logits = bool(viterbi_replace)
143
+ # Override the per-layer MoE top-k at inference. Both fields need to be
144
+ # set: `mlp.router.top_k` is the actual router top-k, and the upstream
145
+ # `mlp.num_experts` is misnamed (it's also the per-token top_k, not
146
+ # num_local_experts). top_k=None leaves the trained config alone.
147
+ if top_k is not None:
148
+ n_local = int(getattr(cfg, "num_local_experts", 128))
149
+ k = max(1, min(int(top_k), n_local))
150
+ for layer in pipe.model.model.layers:
151
+ mlp = getattr(layer, "mlp", None)
152
+ if mlp is None:
153
+ continue
154
+ router = getattr(mlp, "router", None)
155
+ if router is not None and hasattr(router, "top_k"):
156
+ router.top_k = k
157
+ if hasattr(mlp, "num_experts"):
158
+ mlp.num_experts = k
159
+
160
+
161
+ def model_top_k_default(model_path: str) -> int:
162
+ """Read the trained `num_experts_per_tok` from the model's config without
163
+ loading the weights. Falls back to 4 if the field isn't present."""
164
+ try:
165
+ from transformers import AutoConfig
166
+ cfg = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
167
+ return int(getattr(cfg, "num_experts_per_tok", 4))
168
+ except Exception:
169
+ return 4
170
+
171
+
172
+ def model_num_experts(model_path: str) -> int:
173
+ """Read `num_local_experts` from the model's config without loading
174
+ weights. Falls back to 128 if the field isn't present."""
175
+ try:
176
+ from transformers import AutoConfig
177
+ cfg = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
178
+ return int(getattr(cfg, "num_local_experts", 128))
179
+ except Exception:
180
+ return 128
181
+
182
+
183
+ def clear_model_cache() -> str:
184
+ _pipe_cache.clear()
185
+ if torch.cuda.is_available():
186
+ torch.cuda.empty_cache()
187
+ return "Model cache cleared. Next run will reload weights."
188
+
189
+
190
+ def extract_text(file_obj) -> str:
191
+ if file_obj is None:
192
+ return ""
193
+ path = file_obj.name if hasattr(file_obj, "name") else file_obj
194
+ p = Path(path)
195
+ if p.suffix.lower() == ".pdf":
196
+ reader = PdfReader(str(p))
197
+ return "\n\n".join((page.extract_text() or "") for page in reader.pages)
198
+ return p.read_text(encoding="utf-8", errors="replace")
199
+
200
+
201
+ def chunk_text(text: str, max_chars: int = CHUNK_CHARS) -> List[Tuple[int, str]]:
202
+ if not text:
203
+ return []
204
+ if max_chars <= 0 or len(text) <= max_chars:
205
+ return [(0, text)]
206
+ pieces = re.split(r"(\n\s*\n)", text)
207
+ chunks: List[Tuple[int, str]] = []
208
+ cur, cur_off, pos = "", 0, 0
209
+ for piece in pieces:
210
+ if cur and len(cur) + len(piece) > max_chars and cur.strip():
211
+ chunks.append((cur_off, cur))
212
+ cur, cur_off = piece, pos
213
+ else:
214
+ if not cur:
215
+ cur_off = pos
216
+ cur += piece
217
+ pos += len(piece)
218
+ if cur.strip():
219
+ chunks.append((cur_off, cur))
220
+ return chunks
221
+
222
+
223
+ def category_of(label: str) -> str:
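+ """Strip the BIOES prefix ("B-email" -> "email"); plain labels pass through."""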
224
+ if len(label) > 2 and label[1] == "-":
225
+ return label[2:]
226
+ return label
227
+
228
+
229
+ def predict(
230
+ model_path: str,
231
+ device: str,
232
+ text: str,
233
+ aggregation: str,
234
+ use_viterbi: bool,
235
+ viterbi_replace: bool,
236
+ top_k: Optional[int] = None,
237
+ chunk_chars: int = CHUNK_CHARS,
238
+ ) -> List[Dict]:
239
+ if not text.strip():
240
+ return []
241
+ pipe = get_pipe(model_path, device)
242
+ apply_runtime_config(pipe, use_viterbi, viterbi_replace, top_k=top_k)
243
+ spans: List[Dict] = []
244
+ for offset, chunk in chunk_text(text, max_chars=chunk_chars):
245
+ for ent in pipe(chunk, aggregation_strategy=aggregation):
246
+ label = ent.get("entity_group") or ent.get("entity") or ""
247
+ cat = category_of(label)
248
+ if cat not in PALETTE:
249
+ continue
250
+ s = ent["start"] + offset
251
+ e = ent["end"] + offset
252
+ spans.append({
253
+ "start": s, "end": e, "label": cat,
254
+ "score": float(ent["score"]),
255
+ "text": text[s:e],
256
+ })
257
+ spans.sort(key=lambda s: (s["start"], s["end"]))
258
+ return spans
259
+
260
+
261
+ def to_highlight(text: str, spans: List[Dict]) -> List[Tuple[str, Optional[str]]]:
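+ """Convert sorted spans into (text, label) segments for gr.HighlightedText, skipping spans that overlap an earlier one."""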
262
+ if not text:
263
+ return []
264
+ out: List[Tuple[str, Optional[str]]] = []
265
+ cur = 0
266
+ for s in spans:
267
+ if s["start"] < cur:
268
+ continue
269
+ if s["start"] > cur:
270
+ out.append((text[cur:s["start"]], None))
271
+ out.append((text[s["start"]:s["end"]], s["label"]))
272
+ cur = s["end"]
273
+ if cur < len(text):
274
+ out.append((text[cur:], None))
275
+ return out
276
+
277
+
278
+ def fmt_spans(spans: List[Dict], max_rows: int = 60) -> str:
279
+ if not spans:
280
+ return "_No PII spans detected._"
281
+ rows = [
282
+ f"- `{s['label']}` &nbsp; `{s['text'][:80].replace('`', '')}` &nbsp; (score {s['score']:.2f})"
283
+ for s in spans[:max_rows]
284
+ ]
285
+ more = f"\n\n_…+{len(spans) - max_rows} more_" if len(spans) > max_rows else ""
286
+ return f"**Detected {len(spans)} span(s):**\n" + "\n".join(rows) + more
287
+
288
+
289
+ # Build legend HTML for the categories present in PALETTE — one row per family
290
+ # (we still want it readable; show one swatch per unique color).
291
+ def _legend_html() -> str:
292
+ seen = {}
293
+ for name, c in PALETTE.items():
294
+ seen.setdefault(c, []).append(name)
295
+ rows = []
296
+ for c, names in seen.items():
297
+ chip = (f"<span style='background:{c};color:#fff;padding:.15rem .55rem;"
298
+ f"border-radius:.3rem;font-family:monospace;'>"
299
+ f"{names[0]}{(' +'+str(len(names)-1)) if len(names)>1 else ''}</span>")
300
+ rows.append(chip)
301
+ return ("<div style='display:flex;flex-wrap:wrap;gap:.4rem;font-size:.85rem;"
302
+ "margin:.25rem 0;'>" + "".join(rows) + "</div>")
303
+
304
+
305
+ LEGEND_HTML = _legend_html()
306
+
307
+
308
+ def diff_spans(a: List[Dict], b: List[Dict]):
309
+ """Return (only_in_a, only_in_b, agreed) span-lists. Keys are the
310
+ (start, end, label) triple — agreement requires identical category."""
311
+ key = lambda s: (s["start"], s["end"], s["label"])
312
+ sa = {key(s): s for s in a}
313
+ sb = {key(s): s for s in b}
314
+ only_a = [sa[k] for k in sa if k not in sb]
315
+ only_b = [sb[k] for k in sb if k not in sa]
316
+ both = [sa[k] for k in sa if k in sb]
317
+ return only_a, only_b, both
318
+
319
+
320
+ def fmt_diff(label_a: str, label_b: str,
321
+ only_a: List[Dict], only_b: List[Dict], agreed: List[Dict]) -> str:
322
+ def fmt(name: str, lst: List[Dict]) -> str:
323
+ if not lst:
324
+ return f"**{name}:** none"
325
+ rows = [
326
+ f"- `{s['label']}` &nbsp; `{s['text'][:80].replace('`', '')}` "
327
+ f"&nbsp; (score {s['score']:.2f})"
328
+ for s in lst[:30]
329
+ ]
330
+ more = f"\n …+{len(lst) - 30} more" if len(lst) > 30 else ""
331
+ return f"**{name} ({len(lst)}):**\n" + "\n".join(rows) + more
332
+
333
+ return "\n\n".join([
334
+ fmt(f"Only {label_a}", only_a),
335
+ fmt(f"Only {label_b}", only_b),
336
+ fmt("Agreed by both", agreed),
337
+ ])
338
+
339
+
340
+ def run(
341
+ file_obj, pasted_text, device,
342
+ model_a_path, model_b_path,
343
+ use_a, use_b,
344
+ aggregation, use_viterbi, viterbi_replace,
345
+ top_k_a, top_k_b,
346
+ min_score, chunk_chars,
347
+ ):
348
+ text = extract_text(file_obj) if file_obj else (pasted_text or "")
349
+ if not text.strip():
350
+ return [], [], "_Provide a PDF, a text file, or pasted text._", ""
351
+ if not (use_a or use_b):
352
+ return [], [], "_Enable at least one model._", text
353
+
354
+ a_spans = (
355
+ predict(model_a_path, device, text, aggregation,
356
+ use_viterbi, viterbi_replace,
357
+ top_k=int(top_k_a), chunk_chars=int(chunk_chars))
358
+ if use_a else []
359
+ )
360
+ b_spans = (
361
+ predict(model_b_path, device, text, aggregation,
362
+ use_viterbi, viterbi_replace,
363
+ top_k=int(top_k_b), chunk_chars=int(chunk_chars))
364
+ if use_b else []
365
+ )
366
+ thr = float(min_score)
367
+ a_spans = [s for s in a_spans if s["score"] >= thr]
368
+ b_spans = [s for s in b_spans if s["score"] >= thr]
369
+
370
+ a_hl = to_highlight(text, a_spans) if use_a else []
371
+ b_hl = to_highlight(text, b_spans) if use_b else []
372
+
373
+ label_a = Path(model_a_path).name or model_a_path
374
+ label_b = Path(model_b_path).name or model_b_path
375
+
376
+ if use_a and use_b:
377
+ only_a, only_b, agreed = diff_spans(a_spans, b_spans)
378
+ diff_md = fmt_diff(label_a, label_b, only_a, only_b, agreed)
379
+ elif use_a:
380
+ diff_md = fmt_spans(a_spans)
381
+ elif use_b:
382
+ diff_md = fmt_spans(b_spans)
383
+ else:
384
+ diff_md = "_Enable a model._"
385
+
386
+ return a_hl, b_hl, diff_md, text
387
+
388
+
389
+ def build_ui(default_model_a: str, default_model_b: str) -> gr.Blocks:
390
+ a_default_k = model_top_k_default(default_model_a)
391
+ a_n_experts = model_num_experts(default_model_a)
392
+ b_default_k = model_top_k_default(default_model_b)
393
+ b_n_experts = model_num_experts(default_model_b)
394
+
395
+ with gr.Blocks(title="HarEmb PII", theme=gr.themes.Soft()) as demo:
396
+ gr.Markdown(
397
+ "# HarEmb · OpenMed-Nemotron PII\n"
398
+ "Detect PII across 55 categories of the Nemotron-PII taxonomy. "
399
+ "Run **two models side-by-side** to compare detections — by "
400
+ "default this checkpoint vs the OpenMed teacher it was distilled "
401
+ "from. Disable one model to view a single detection."
402
+ )
403
+ devices = list_devices()
404
+ with gr.Row():
405
+ device_dd = gr.Dropdown(devices, value=devices[0], label="Device", scale=1)
406
+ clear_btn = gr.Button("Clear model cache", variant="secondary", scale=1)
407
+
408
+ with gr.Row():
409
+ with gr.Column():
410
+ use_a = gr.Checkbox(value=True, label="Enable model A (teacher / baseline)")
411
+ model_a_tb = gr.Textbox(
412
+ value=default_model_a,
413
+ label="Model A — path / HF repo",
414
+ info="Default: OpenMed/privacy-filter-nemotron (teacher).",
415
+ )
416
+ top_k_a_sl = gr.Slider(
417
+ 1, a_n_experts, value=a_default_k, step=1,
418
+ label=f"Active experts per token (top-k of {a_n_experts})",
419
+ info=f"Trained value: {a_default_k}. Lower = faster + less "
420
+ f"capacity per token. Higher = more compute, denser "
421
+ f"routing. Bypassing the trained value can drop "
422
+ f"quality — useful for ablations.",
423
+ )
424
+ with gr.Column():
425
+ use_b = gr.Checkbox(value=True, label="Enable model B (this checkpoint)")
426
+ model_b_tb = gr.Textbox(
427
+ value=default_model_b,
428
+ label="Model B — path / HF repo",
429
+ info="Default: ./ (this checkpoint).",
430
+ )
431
+ top_k_b_sl = gr.Slider(
432
+ 1, b_n_experts, value=b_default_k, step=1,
433
+ label=f"Active experts per token (top-k of {b_n_experts})",
434
+ info=f"Trained value: {b_default_k}.",
435
+ )
436
+
437
+ with gr.Accordion("Inference settings", open=False):
438
+ with gr.Row():
439
+ aggregation_dd = gr.Dropdown(
440
+ ["simple", "first", "max", "average", "none"],
441
+ value="simple",
442
+ label="aggregation_strategy",
443
+ info="how token-level labels are merged into spans",
444
+ )
445
+ viterbi_cb = gr.Checkbox(
446
+ value=True,
447
+ label="use_viterbi_decode",
448
+ info="constrained BIOES decoding (off = raw argmax)",
449
+ )
450
+ viterbi_replace_cb = gr.Checkbox(
451
+ value=True,
452
+ label="viterbi_replace_logits",
453
+ info="when on, outputs.logits.argmax(-1) returns the Viterbi path",
454
+ )
455
+ min_score_sl = gr.Slider(
456
+ 0.0, 1.0, value=0.0, step=0.01,
457
+ label="min confidence",
458
+ info="filter out spans with score below this threshold",
459
+ )
460
+ chunk_sl = gr.Slider(
461
+ 0, 500_000, value=CHUNK_CHARS, step=10_000,
462
+ label="chunk size (chars)",
463
+ info="0 = single pass; otherwise split on paragraphs at this size. "
464
+ "Model window ≈131k tokens (~500k chars).",
465
+ )
466
+
467
+ with gr.Row():
468
+ file_in = gr.File(label="PDF / text file", file_types=[".pdf", ".txt", ".md"])
469
+ text_in = gr.Textbox(
470
+ label="…or paste text",
471
+ lines=6,
472
+ placeholder=("Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
473
+ "phone 415-555-0123, email sarah.johnson@example.com."),
474
+ )
475
+ run_btn = gr.Button("Detect PII", variant="primary")
476
+
477
+ gr.HTML(LEGEND_HTML)
478
+ with gr.Row():
479
+ a_out = gr.HighlightedText(
480
+ label="Model A detections",
481
+ color_map=PALETTE,
482
+ show_legend=False,
483
+ combine_adjacent=False,
484
+ )
485
+ b_out = gr.HighlightedText(
486
+ label="Model B detections",
487
+ color_map=PALETTE,
488
+ show_legend=False,
489
+ combine_adjacent=False,
490
+ )
491
+ diff_out = gr.Markdown("_Run a detection to see the diff / span list._")
492
+ extracted_out = gr.Textbox(
493
+ label="Extracted text (read-only)", lines=6, interactive=False,
494
+ )
495
+
496
+ run_btn.click(
497
+ run,
498
+ [file_in, text_in, device_dd,
499
+ model_a_tb, model_b_tb, use_a, use_b,
500
+ aggregation_dd, viterbi_cb, viterbi_replace_cb,
501
+ top_k_a_sl, top_k_b_sl,
502
+ min_score_sl, chunk_sl],
503
+ [a_out, b_out, diff_out, extracted_out],
504
+ )
505
+ clear_btn.click(clear_model_cache, None, diff_out)
506
+
507
+ return demo
508
+
509
+
510
+ def parse_args() -> argparse.Namespace:
511
+ p = argparse.ArgumentParser(description="HarEmb PII — Gradio demo")
512
+ p.add_argument("--host", default="127.0.0.1", help="Bind address (default: 127.0.0.1)")
513
+ p.add_argument("--port", type=int, default=7860, help="Port (default: 7860)")
514
+ p.add_argument("--share", action="store_true", help="Create a public Gradio share link")
515
+ p.add_argument("--model-a", default="OpenMed/privacy-filter-nemotron",
516
+ help="Model A path / HF repo "
517
+ "(default: OpenMed/privacy-filter-nemotron — teacher)")
518
+ p.add_argument("--model-b", default=DEFAULT_MODEL,
519
+ help="Model B path / HF repo "
520
+ "(default: . — this checkpoint)")
521
+ return p.parse_args()
522
+
523
+
524
+ if __name__ == "__main__":
525
+ args = parse_args()
526
+ build_ui(
527
+ default_model_a=args.model_a,
528
+ default_model_b=args.model_b,
529
+ ).launch(
530
+ server_name=args.host,
531
+ server_port=args.port,
532
+ share=args.share,
533
534
+ )
benchmark.py ADDED
@@ -0,0 +1,1351 @@
1
+ """
2
+ benchmark.py — single self-contained reproducibility script for
3
+ haremb-privacy-filter-opennemo.
4
+
5
+ Run from inside this folder:
6
+
7
+ python benchmark.py # default: cuda if available
8
+ python benchmark.py --device cpu # cpu fallback
9
+ python benchmark.py --eval-pct 0.5 # smaller slice
10
+ python benchmark.py --no-base # skip teacher download
11
+
12
+ Produces, in `--out` (default ./):
13
+ infer.log — sample inference timing + redaction example
14
+ compare.log — aggregate + per-category metrics, this model vs
15
+ OpenMed teacher (raw + viterbi streams), and
16
+ token-level pairwise breakdown.
17
+ eval_summary.png — bar charts of headline metrics + per-category
18
+ span-F1 (this vs teacher).
19
+ eval_confusion.png — token-level outcome breakdown on gold non-O
20
+ positions (this vs teacher).
21
+ eval_performance.png — model-size / compute / memory / throughput
22
+ comparison (this vs teacher), absolute + ratios.
23
+
24
+ This script does not import from training code. It vendors the small set
25
+ of helpers it needs (BIOES decoder, span builder, eval-set sampler,
26
+ metrics aggregator) so the model folder is self-contained.
27
+ """
28
+ from __future__ import annotations
29
+
30
+ import argparse
31
+ import ast
32
+ import math
33
+ import os
34
+ import sys
35
+ import time
36
+ from collections import Counter, defaultdict
37
+ from pathlib import Path
38
+ from typing import Dict, List, Tuple
39
+
40
+ import numpy as np
41
+ import torch
42
+ from datasets import load_dataset
43
+ from torch.utils.data import DataLoader, Dataset
44
+ from tqdm.auto import tqdm
45
+ from transformers import AutoModelForTokenClassification, AutoTokenizer
46
+
47
+
48
+ # ---------------------------------------------------------------------------
49
+ # Constants
50
+ # ---------------------------------------------------------------------------
51
+
52
+ SOURCE_DATASET = "nvidia/Nemotron-PII"
53
+ TEACHER = "OpenMed/privacy-filter-nemotron"
54
+
55
+ # 55 Nemotron-PII categories, alphabetically sorted (matches the order used
56
+ # when the model was trained, so id2label / label2id round-trip cleanly).
57
+ NEMOTRON_CATEGORIES: List[str] = sorted([
58
+ "account_number", "age", "api_key", "bank_routing_number",
59
+ "biometric_identifier", "blood_type", "certificate_license_number",
60
+ "city", "company_name", "coordinate", "country", "county",
61
+ "credit_debit_card", "customer_id", "cvv", "date", "date_of_birth",
62
+ "date_time", "device_identifier", "education_level", "email",
63
+ "employee_id", "employment_status", "fax_number", "first_name",
64
+ "gender", "health_plan_beneficiary_number", "http_cookie", "ipv4",
65
+ "ipv6", "language", "last_name", "license_plate", "mac_address",
66
+ "medical_record_number", "national_id", "occupation", "password",
67
+ "phone_number", "pin", "political_view", "postcode", "race_ethnicity",
68
+ "religious_belief", "sexuality", "ssn", "state", "street_address",
69
+ "swift_bic", "tax_id", "time", "unique_id", "url", "user_name",
70
+ "vehicle_identifier",
71
+ ])
72
+
73
+
74
+ def nemotron_native_label_space() -> Tuple[Dict[str, int], Dict[int, str]]:
75
+ """O at id 0, then {B, I, E, S}-{cat} for each cat in alphabetical order."""
76
+ label2id: Dict[str, int] = {"O": 0}
77
+ nxt = 1
78
+ for cat in NEMOTRON_CATEGORIES:
79
+ for prefix in ("B", "I", "E", "S"):
80
+ label2id[f"{prefix}-{cat}"] = nxt
81
+ nxt += 1
82
+ id2label: Dict[int, str] = {v: k for k, v in label2id.items()}
83
+ return label2id, id2label
84
+
85
+
86
+ # ---------------------------------------------------------------------------
87
+ # Span parsing + char-level token alignment (vendored from the training data
88
+ # pipeline; identical logic, no training imports)
89
+ # ---------------------------------------------------------------------------
90
+
91
+ def _trim_span(text: str, start: int, end: int) -> Tuple[int, int]:
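+ """Trim leading whitespace and trailing whitespace/closing punctuation from text[start:end]; returns the adjusted (start, end)."""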
92
+ raw = text[start:end]
93
+ i = 0
94
+ while i < len(raw) and raw[i].isspace():
95
+ i += 1
96
+ j = len(raw)
97
+ while j > i and (raw[j - 1].isspace() or raw[j - 1] in ".,;:)"):
98
+ j -= 1
99
+ return start + i, start + j
100
+
101
+
102
+ def _parse_spans(spans_str) -> List[dict]:
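+ """Spans may arrive as a list or as a Python-literal string; parse safely and return [] on malformed input."""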
103
+ if isinstance(spans_str, list):
104
+ return spans_str
105
+ try:
106
+ return ast.literal_eval(spans_str)
107
+ except (SyntaxError, ValueError):
108
+ return []
109
+
110
+
111
+ def _assign_native_bioes_labels(
112
+ text: str,
113
+ raw_spans: List[dict],
114
+ tokenizer,
115
+ max_length: int,
116
+ label2id: Dict[str, int],
117
+ min_overlap_frac: float = 0.5,
118
+ ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
119
+ pf_like: List[Tuple[int, int, str]] = []
120
+ for s in raw_spans:
121
+ cat = s.get("label")
122
+ if not cat:
123
+ continue
124
+ st, en = _trim_span(text, int(s["start"]), int(s["end"]))
125
+ if st >= en:
126
+ continue
127
+ pf_like.append((st, en, cat))
128
+ pf_like.sort(key=lambda x: (x[0], -x[1]))
129
+
130
+ enc = tokenizer(
131
+ text, truncation=True, max_length=max_length,
132
+ padding="max_length", return_offsets_mapping=True, return_tensors="pt",
133
+ )
134
+ input_ids = enc.input_ids[0]
135
+ attention_mask = enc.attention_mask[0]
136
+ offsets = enc.offset_mapping[0].tolist()
137
+
138
+ o_id = label2id["O"]
139
+ label_ids = [o_id] * len(input_ids)
140
+ locked = [False] * len(input_ids)
141
+
142
+ for span_start, span_end, cat in pf_like:
143
+ tok_indices: List[int] = []
144
+ for ti, (s, e) in enumerate(offsets):
145
+ if s == 0 and e == 0:
146
+ continue
147
+ if e <= span_start or s >= span_end:
148
+ continue
149
+ tok_len = e - s
150
+ if tok_len <= 0:
151
+ continue
152
+ overlap = min(e, span_end) - max(s, span_start)
153
+ if overlap / tok_len >= min_overlap_frac:
154
+ tok_indices.append(ti)
155
+
156
+ if not tok_indices:
157
+ continue
158
+ tok_indices = [ti for ti in tok_indices if not locked[ti]]
159
+ if not tok_indices:
160
+ continue
161
+
162
+ if len(tok_indices) == 1:
163
+ tag = f"S-{cat}"
164
+ if tag in label2id:
165
+ label_ids[tok_indices[0]] = label2id[tag]
166
+ locked[tok_indices[0]] = True
167
+ else:
168
+ b_tag, i_tag, e_tag = f"B-{cat}", f"I-{cat}", f"E-{cat}"
169
+ if b_tag in label2id:
170
+ label_ids[tok_indices[0]] = label2id[b_tag]
171
+ locked[tok_indices[0]] = True
172
+ for ti in tok_indices[1:-1]:
173
+ if i_tag in label2id:
174
+ label_ids[ti] = label2id[i_tag]
175
+ locked[ti] = True
176
+ if e_tag in label2id:
177
+ label_ids[tok_indices[-1]] = label2id[e_tag]
178
+ locked[tok_indices[-1]] = True
179
+
180
+ label_tensor = torch.tensor(label_ids, dtype=torch.long)
181
+ label_tensor[attention_mask == 0] = -100
182
+ return input_ids, attention_mask, label_tensor
183
+
184
+
185
+ class _NemotronEvalDataset(Dataset):
186
+ def __init__(self, hf_split, tokenizer, label2id, max_length):
187
+ self.hf = hf_split
188
+ self.tok = tokenizer
189
+ self.l2i = label2id
190
+ self.maxlen = max_length
191
+
192
+ def __len__(self):
193
+ return len(self.hf)
194
+
195
+ def __getitem__(self, idx):
196
+ ex = self.hf[idx]
197
+ ids, mask, labels = _assign_native_bioes_labels(
198
+ ex["text"], _parse_spans(ex["spans"]),
199
+ self.tok, self.maxlen, self.l2i,
200
+ )
201
+ L = int(mask.sum().item())
202
+ return {
203
+ "input_ids": ids[:L].tolist(),
204
+ "labels": labels[:L].tolist(),
205
+ "valid_len": L,
206
+ }
207
+
208
+
209
+ def _make_collate(pad_token_id, max_length):
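+ """Build a collate_fn that pads each batch to its longest (truncated) sequence; padding uses pad_token_id / mask 0 / label -100."""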
210
+ def _c(batch):
211
+ ids_list = [list(ex["input_ids"])[:max_length] for ex in batch]
212
+ labels_list = [list(ex["labels"])[:max_length] for ex in batch]
213
+ max_len = max(len(x) for x in ids_list)
214
+ B = len(batch)
215
+ input_ids = torch.full((B, max_len), pad_token_id, dtype=torch.long)
216
+ attention_mask = torch.zeros((B, max_len), dtype=torch.long)
217
+ labels = torch.full((B, max_len), -100, dtype=torch.long)
218
+ for i, (ids, lab) in enumerate(zip(ids_list, labels_list)):
219
+ L = len(ids)
220
+ input_ids[i, :L] = torch.tensor(ids, dtype=torch.long)
221
+ attention_mask[i, :L] = 1
222
+ labels[i, :L] = torch.tensor(lab, dtype=torch.long)
223
+ return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}
224
+ return _c
225
+
226
+
227
+ def _build_eval_streaming(test_split, target_n, chunk_size, seed) -> List[int]:
228
+ """Uniform chunked sampling, identical to the training-time eval split."""
229
+ n_total = len(test_split)
230
+ target_n = min(target_n, n_total)
231
+ if target_n <= 0:
232
+ return []
233
+ rng = np.random.RandomState(seed)
234
+ per_chunk = max(1, math.ceil(chunk_size * target_n / n_total))
235
+ selected: List[int] = []
236
+ for chunk_start in range(0, n_total, chunk_size):
237
+ if len(selected) >= target_n:
238
+ break
239
+ chunk_end = min(chunk_start + chunk_size, n_total)
240
+ n_in_chunk = chunk_end - chunk_start
241
+ n_to_pick = min(per_chunk, n_in_chunk, target_n - len(selected))
242
+ if n_to_pick <= 0:
243
+ break
244
+ offsets = rng.choice(n_in_chunk, size=n_to_pick, replace=False)
245
+ selected.extend(int(chunk_start + o) for o in offsets)
246
+ return sorted(selected[:target_n])
247
+
248
+
249
+ # ---------------------------------------------------------------------------
250
+ # BIOES → spans + metrics
251
+ # ---------------------------------------------------------------------------
252
+
253
+ def _bioes_to_spans(labels, id2label, o_id=0):
254
+ """Convert per-token BIOES label ids to a set of (start, end, cat)."""
255
+ spans = set()
256
+ cur_start = None
257
+ cur_cat = None
258
+ for i, lid in enumerate(labels):
259
+ lid = int(lid)
260
+ if lid == o_id or lid < 0:
261
+ if cur_start is not None:
262
+ spans.add((cur_start, i, cur_cat))
263
+ cur_start = None
264
+ cur_cat = None
265
+ continue
266
+ tag = id2label.get(lid, "O")
267
+ if tag == "O" or "-" not in tag:
268
+ if cur_start is not None:
269
+ spans.add((cur_start, i, cur_cat))
270
+ cur_start = None
271
+ cur_cat = None
272
+ continue
273
+ prefix, cat = tag.split("-", 1)
274
+ if prefix == "S":
275
+ if cur_start is not None:
276
+ spans.add((cur_start, i, cur_cat))
277
+ spans.add((i, i + 1, cat))
278
+ cur_start = None
279
+ cur_cat = None
280
+ elif prefix == "B":
281
+ if cur_start is not None:
282
+ spans.add((cur_start, i, cur_cat))
283
+ cur_start = i
284
+ cur_cat = cat
285
+ elif prefix == "I":
286
+ if cur_start is None or cur_cat != cat:
287
+ if cur_start is not None:
288
+ spans.add((cur_start, i, cur_cat))
289
+ cur_start = i
290
+ cur_cat = cat
291
+ elif prefix == "E":
292
+ if cur_start is None or cur_cat != cat:
293
+ if cur_start is not None:
294
+ spans.add((cur_start, i, cur_cat))
295
+ spans.add((i, i + 1, cat))
296
+ cur_start = None
297
+ cur_cat = None
298
+ else:
299
+ spans.add((cur_start, i + 1, cur_cat))
300
+ cur_start = None
301
+ cur_cat = None
302
+ if cur_start is not None:
303
+ spans.add((cur_start, len(labels), cur_cat))
304
+ return spans
305
+
306
+
307
+ def _aggregate_span_metrics(gold_spans, pred_spans):
308
+ correct = gold_spans & pred_spans
309
+ n_gold = len(gold_spans)
310
+ n_pred = len(pred_spans)
311
+ n_correct = len(correct)
312
+ p = n_correct / n_pred if n_pred else 0.0
313
+ r = n_correct / n_gold if n_gold else 0.0
314
+ f1 = (2 * p * r / (p + r)) if (p + r) else 0.0
315
+ per_cat: Dict[str, dict] = {}
316
+ cats = sorted({c for _, _, c in gold_spans} | {c for _, _, c in pred_spans})
317
+ for cat in cats:
318
+ g_c = {s for s in gold_spans if s[2] == cat}
319
+ p_c = {s for s in pred_spans if s[2] == cat}
320
+ c_c = g_c & p_c
321
+ pp = len(c_c) / len(p_c) if p_c else 0.0
322
+ rr = len(c_c) / len(g_c) if g_c else 0.0
323
+ ff = (2 * pp * rr / (pp + rr)) if (pp + rr) else 0.0
324
+ per_cat[cat] = {"precision": pp, "recall": rr, "f1": ff,
325
+ "n_gold": len(g_c), "n_pred": len(p_c), "n_correct": len(c_c)}
326
+ return {"precision": p, "recall": r, "f1": f1,
327
+ "n_gold": n_gold, "n_pred": n_pred, "n_correct": n_correct,
328
+ "per_cat": per_cat}
329
+
330
+
331
+ def _stream_metrics(docs, stream, id2label, o_id):
332
+ """Aggregate metrics over a list of {gold, raw, viterbi} per-doc dicts."""
333
+ n_tokens = correct = n_non_o = non_o_correct = 0
334
+ gold_spans_all: set = set()
335
+ pred_spans_all: set = set()
336
+ doc_offset = 0
337
+ for doc in docs:
338
+ gold = [int(x) for x in doc["gold"]]
339
+ pred = [int(x) for x in doc[stream]]
340
+ n = len(gold)
341
+ n_tokens += n
342
+ for g, p in zip(gold, pred):
343
+ n_non_o += int(g != o_id)
344
+ if g == p:
345
+ correct += 1
346
+ if g != o_id:
347
+ non_o_correct += 1
348
+ gs = _bioes_to_spans(gold, id2label, o_id)
349
+ ps = _bioes_to_spans(pred, id2label, o_id)
350
+ gold_spans_all.update((doc_offset + s, doc_offset + e, c) for s, e, c in gs)
351
+ pred_spans_all.update((doc_offset + s, doc_offset + e, c) for s, e, c in ps)
352
+ doc_offset += n
353
+ span_m = _aggregate_span_metrics(gold_spans_all, pred_spans_all)
354
+ return {
355
+ "n_tokens": n_tokens,
356
+ "n_non_o": n_non_o,
357
+ "token_acc": correct / n_tokens if n_tokens else 0.0,
358
+ "non_o_recall": non_o_correct / n_non_o if n_non_o else 0.0,
359
+ "span_precision": span_m["precision"],
360
+ "span_recall": span_m["recall"],
361
+ "span_f1": span_m["f1"],
362
+ "n_gold_spans": span_m["n_gold"],
363
+ "n_pred_spans": span_m["n_pred"],
364
+ "n_correct_spans": span_m["n_correct"],
365
+ "span_per_cat": span_m["per_cat"],
366
+ }
367
+
368
+
369
+ # ---------------------------------------------------------------------------
370
+ # Forward pass + viterbi (delegates to the released modeling)
371
+ # ---------------------------------------------------------------------------
372
+
373
+ def _model_perf_stats(model, dtype) -> dict:
374
+ """Total / active / compute / MoE breakdown + on-device byte size.
375
+
376
+ Three distinct param counts:
377
+
378
+ * total_params — every parameter the model has on disk / in RAM.
379
+ * active_params_per_tok — params *touched* during one token's forward
380
+ pass (memory footprint per token). Counts
381
+ the embedding because the embedding row
382
+ IS read per token; counts only top-k of
383
+ num_experts MoE experts because routing is
384
+ sparse.
385
+ * compute_params_per_tok — params that contribute matmul FLOPs per
386
+ token. EXCLUDES the embedding table:
387
+ `embed_tokens.weight` is a gather (one row
388
+ read), not a matmul, so its FLOP cost is
389
+ negligible (~hidden_size ops vs the table
390
+ having ~vocab × hidden params). Counting
391
+ it via the standard "2 × params" matmul
392
+ approximation hugely inflates the apparent
393
+ GFLOP/token and compresses the ratio between
394
+ deep and shallow models.
395
+
396
+ GFLOP/token is computed from `compute_params_per_tok`, not from
397
+ `active_params_per_tok`. This makes the metric reflect actual layer-wise
398
+ computational cost.
399
+ """
400
+ cfg = model.config
401
+ num_experts = int(getattr(cfg, "num_local_experts", 1))
402
+ top_k = int(getattr(cfg, "num_experts_per_tok", num_experts))
403
+ expert_frac = top_k / max(1, num_experts)
404
+
405
+ moe_total = 0
406
+ moe_active = 0
407
+ other_total = 0
408
+ embed_total = 0 # gather-only params; excluded from FLOP estimate
409
+ for name, p in model.named_parameters():
410
+ n = p.numel()
411
+ # MoE expert tensors are stacked along an experts axis. The upstream
412
+ # exposes them under `mlp.experts.*`. Only `top_k` of `num_experts`
413
+ # experts contribute per token.
414
+ if ".mlp.experts." in name:
415
+ moe_total += n
416
+ moe_active += int(round(n * expert_frac))
417
+ # `embed_tokens.weight` (and any other lookup-style table) is a
418
+ # gather: one row of [vocab, hidden] is read per token, costing
419
+ # ~hidden ops, not 2 × vocab × hidden FLOPs. Tracked separately
420
+ # so it doesn't pollute the FLOP estimate.
421
+ elif "embed_tokens" in name:
422
+ embed_total += n
423
+ other_total += n
424
+ else:
425
+ other_total += n
426
+
427
+ total = moe_total + other_total
428
+ active_per_tok = moe_active + other_total
429
+
430
+ # Compute-relevant params per token: matmuls only. Drop the embedding
431
+ # lookup, which contributes ~zero FLOPs.
432
+ compute_per_tok = active_per_tok - embed_total
433
+ bytes_per_param = {torch.bfloat16: 2, torch.float16: 2, torch.float32: 4}.get(dtype, 4)
434
+ storage_bytes = total * bytes_per_param # in-memory dense weights
435
+ # GFLOP/token over matmul params only. Matmul ≈ 2 FLOPs per param.
436
+ gflops_per_tok = 2 * compute_per_tok / 1e9
437
+ return {
438
+ "total_params": total,
439
+ "moe_total": moe_total,
440
+ "moe_active_per_tok": moe_active,
441
+ "other_total": other_total,
442
+ "embed_total": embed_total,
443
+ "active_params_per_tok": active_per_tok,
444
+ "compute_params_per_tok": compute_per_tok,
445
+ "num_experts": num_experts,
446
+ "experts_per_tok": top_k,
447
+ "expert_frac": expert_frac,
448
+ "weight_bytes": storage_bytes,
449
+ "gflops_per_tok": gflops_per_tok,
450
+ }
451
+
452
+
453
+ def _disk_size_bytes(model_path: str) -> int:
454
+ """Sum on-disk size of weight files at the given path. Falls back to 0
455
+ if the path is a HF repo id (not a local directory)."""
456
+ p = Path(model_path)
457
+ if not p.is_dir():
458
+ return 0
459
+ total = 0
460
+ for f in p.iterdir():
461
+ if f.is_file() and f.suffix in {".safetensors", ".bin", ".pt", ".pth"}:
462
+ total += f.stat().st_size
463
+ return total
464
+
465
+
466
+ def _eval_one_model(
467
+ model_path: str, tokenizer, eval_ds, label2id, id2label, o_id,
468
+ bioes_trans, bioes_init, batch_size, max_length, device, dtype,
469
+ label: str,
470
+ ):
471
+ print(f"[eval] loading {label} from {model_path} ...", flush=True)
472
+ if torch.cuda.is_available() and device.type == "cuda":
473
+ torch.cuda.reset_peak_memory_stats(device)
474
+ torch.cuda.empty_cache()
475
+ mem_before = (torch.cuda.memory_allocated(device)
476
+ if torch.cuda.is_available() and device.type == "cuda" else 0)
477
+ t_load = time.time()
478
+ model = AutoModelForTokenClassification.from_pretrained(
479
+ model_path, dtype=dtype, trust_remote_code=True,
480
+ ).to(device).eval()
481
+ if hasattr(model.config, "use_viterbi_decode"):
482
+ model.config.use_viterbi_decode = True
483
+ if hasattr(model.config, "viterbi_replace_logits"):
484
+ model.config.viterbi_replace_logits = False
485
+ load_s = time.time() - t_load
486
+ perf = _model_perf_stats(model, dtype)
487
+ perf["disk_size_bytes"] = _disk_size_bytes(model_path)
488
+ weights_resident_bytes = (
489
+ torch.cuda.memory_allocated(device) - mem_before
490
+ if torch.cuda.is_available() and device.type == "cuda" else perf["weight_bytes"]
491
+ )
492
+ perf["weights_resident_bytes"] = weights_resident_bytes
493
+
494
+ # Reuse the batched viterbi from the released modeling file.
495
+ from modeling_haremb_pii import _bioes_viterbi_batched
496
+
497
+ pad_token_id = tokenizer.pad_token_id or 199999  # fallback pad id if the tokenizer defines none
498
+ loader = DataLoader(
499
+ eval_ds, batch_size=batch_size, shuffle=False,
500
+ collate_fn=_make_collate(pad_token_id, max_length), num_workers=2,
501
+ )
502
+ docs: List[dict] = []
503
+ n_tok = 0
504
+ if torch.cuda.is_available() and device.type == "cuda":
505
+ torch.cuda.synchronize()
506
+ torch.cuda.reset_peak_memory_stats(device)
507
+ t0 = time.time()
508
+ for batch in tqdm(loader, desc=f"eval {label}", unit="batch", leave=False):
509
+ ids = batch["input_ids"].to(device, non_blocking=True)
510
+ mask = batch["attention_mask"].to(device, non_blocking=True)
511
+ gold = batch["labels"].to(device, non_blocking=True)
512
+ with torch.no_grad():
513
+ out = model(input_ids=ids, attention_mask=mask)
514
+ raw = out.logits.argmax(dim=-1)
515
+ vit = _bioes_viterbi_batched(out.logits.float(), mask, bioes_trans, bioes_init)
516
+ valid = (gold != -100) & mask.bool()
517
+ for b in range(gold.shape[0]):
518
+ keep = [i for i, ok in enumerate(valid[b].cpu().tolist()) if ok]
519
+ n_tok += len(keep)
520
+ docs.append({
521
+ "gold": [int(gold[b, i].item()) for i in keep],
522
+ "raw": [int(raw[b, i].item()) for i in keep],
523
+ "viterbi": [int(vit[b, i].item()) for i in keep],
524
+ })
525
+ if torch.cuda.is_available() and device.type == "cuda":
526
+ torch.cuda.synchronize()
527
+ peak_mem = torch.cuda.max_memory_allocated(device)
528
+ else:
529
+ peak_mem = 0
530
+ eval_s = time.time() - t0
531
+
532
+ raw_m = _stream_metrics(docs, "raw", id2label, o_id)
533
+ vit_m = _stream_metrics(docs, "viterbi", id2label, o_id)
534
+
535
+ perf["peak_eval_mem_bytes"] = peak_mem
536
+
537
+ del model
538
+ if torch.cuda.is_available():
539
+ torch.cuda.empty_cache()
540
+
541
+ return {
542
+ "label": label,
543
+ "n_total_M": perf["total_params"] / 1e6,
544
+ "load_s": load_s,
545
+ "eval_s": eval_s,
546
+ "n_tok": n_tok,
547
+ "throughput_tok_s": n_tok / eval_s if eval_s else 0.0,
548
+ "perf": perf,
549
+ "raw": raw_m,
550
+ "viterbi": vit_m,
551
+ "docs": docs,
552
+ }
553
+
554
+
555
+ # ---------------------------------------------------------------------------
556
+ # Pairwise token-level breakdown (this vs reference, viterbi stream)
557
+ # ---------------------------------------------------------------------------
558
+
559
+ def _pairwise(docs_cand, docs_ref, id2label, o_id):
560
+ both_correct = only_cand = only_ref = both_wrong = 0
561
+ by_cat = defaultdict(lambda: {"both_correct": 0, "only_cand": 0, "only_ref": 0, "both_wrong": 0})
562
+ for dc, dr in zip(docs_cand, docs_ref):
563
+ gold = dc["gold"]
564
+ cv = dc["viterbi"]
565
+ rv = dr["viterbi"]
566
+ for g, c, r in zip(gold, cv, rv):
567
+ cat = id2label.get(g, "O")
568
+ cat = cat.split("-", 1)[1] if "-" in cat else cat
569
+ cc = (c == g)
570
+ rc = (r == g)
571
+ if cc and rc:
572
+ both_correct += 1
573
+ by_cat[cat]["both_correct"] += 1
574
+ elif cc and not rc:
575
+ only_cand += 1
576
+ by_cat[cat]["only_cand"] += 1
577
+ elif rc and not cc:
578
+ only_ref += 1
579
+ by_cat[cat]["only_ref"] += 1
580
+ else:
581
+ both_wrong += 1
582
+ by_cat[cat]["both_wrong"] += 1
583
+ return {
584
+ "both_correct": both_correct,
585
+ "only_cand_correct": only_cand,
586
+ "only_ref_correct": only_ref,
587
+ "both_wrong": both_wrong,
588
+ "by_cat": dict(by_cat),
589
+ }
590
+
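+ # Sanity check: the four buckets partition every scored gold token; in the
+ # shipped compare.log, 209,830 + 958 + 633 + 1,488 = 212,909 tokens scored.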
591
+
592
+ # ---------------------------------------------------------------------------
593
+ # Plot rendering
594
+ # ---------------------------------------------------------------------------
595
+
596
+ def _render_plots(cand, ref, pair, out_dir: Path, cand_label, ref_label):
597
+ """Render benchmark plots.
598
+
599
+ Visual convention:
600
+ A = reference / teacher / baseline
601
+ B = candidate / this checkpoint
602
+
603
+ The charts avoid color-only "win" encoding: labels state the actual delta
604
+ or ratio, and horizontal layouts keep long metric/category names readable.
605
+ """
606
+ try:
607
+ import matplotlib
608
+ matplotlib.use("Agg")
609
+ import matplotlib.pyplot as plt
610
+ from matplotlib.ticker import FuncFormatter
611
+ except ImportError:
612
+ print("[plot] matplotlib not installed, skipping", flush=True)
613
+ return
614
+
615
+ plt.rcParams.update({
616
+ "figure.facecolor": "#ffffff",
617
+ "axes.facecolor": "#ffffff",
618
+ "axes.edgecolor": "#cbd5e1",
619
+ "axes.labelcolor": "#0f172a",
620
+ "xtick.color": "#334155",
621
+ "ytick.color": "#334155",
622
+ "grid.color": "#e2e8f0",
623
+ "font.size": 9,
624
+ "axes.titleweight": "bold",
625
+ "axes.titlesize": 11,
626
+ "legend.frameon": False,
627
+ })
628
+
629
+ C_REF = "#64748b" # slate
630
+ C_CAND = "#2563eb" # blue
631
+ C_GOOD = "#0f766e" # teal
632
+ C_BAD = "#b91c1c" # red
633
+ C_NEUTRAL = "#94a3b8" # light slate
634
+ C_BG = "#f8fafc"
635
+
636
+ def _pct_axis(ax):
637
+ ax.xaxis.set_major_formatter(FuncFormatter(lambda x, _pos: f"{x:.0%}"))
638
+
639
+ def _metric_delta_text(delta):
640
+ return f"{delta:+.4f}"
641
+
642
+ def _value_text(v):
643
+ if v >= 100:
644
+ return f"{v:,.0f}"
645
+ if v >= 10:
646
+ return f"{v:,.1f}"
647
+ if v >= 1:
648
+ return f"{v:,.2f}"
649
+ return f"{v:,.3f}"
650
+
651
+ def _ratio_text(r, lower_is_better):
652
+ if r <= 0:
653
+ return "n/a"
654
+ if lower_is_better:
655
+ return f"{1.0 / r:.2f}x lower" if r <= 1 else f"{r:.2f}x higher"
656
+ return f"{r:.2f}x higher" if r >= 1 else f"{1.0 / r:.2f}x lower"
657
+
658
+ # --- eval_summary.png: headline metrics + category-level deltas ---
659
+ fig, axes = plt.subplots(2, 2, figsize=(8, 5), constrained_layout=True)
660
+ ax_head = axes[0, 0]
661
+ ax_delta = axes[0, 1]
662
+ ax_raw_vit = axes[1, 0]
663
+ ax_cat = axes[1, 1]
664
+
665
+ metrics = ["span_f1", "span_precision", "span_recall", "token_acc", "non_o_recall"]
666
+ labels = ["Span F1", "Span P", "Span R", "Token acc", "Non-O recall"]
667
+ cand_v = [cand["viterbi"][m] for m in metrics]
668
+ ref_v = [ref["viterbi"][m] for m in metrics] if ref is not None else None
669
+
670
+ y = np.arange(len(metrics))
671
+ if ref_v is not None:
672
+ ax_head.hlines(y, ref_v, cand_v, color=C_NEUTRAL, linewidth=2, alpha=0.9)
673
+ ax_head.scatter(ref_v, y, s=55, color=C_REF, label=f"A: {ref_label}", zorder=3)
674
+ ax_head.scatter(cand_v, y, s=70, color=C_CAND, label=f"B: {cand_label}", zorder=4)
675
+ ax_head.set_yticks(y)
676
+ ax_head.set_yticklabels(labels)
677
+ ax_head.invert_yaxis()
678
+ ax_head.set_xlim(max(0.0, min(cand_v + (ref_v or cand_v)) - 0.02),
679
+ min(1.08, max(cand_v + (ref_v or cand_v)) + 0.05))
680
+ ax_head.set_title("Headline metrics, Viterbi stream")
681
+ ax_head.grid(axis="x", alpha=0.7)
682
+ _pct_axis(ax_head)
683
+ if ref_v is not None:
684
+ for i, v in enumerate(ref_v):
685
+ ax_head.text(v + 0.002, i, f"{v:.4f}", va="center", ha="left",
686
+ fontsize=7, color=C_REF)
687
+ for i, v in enumerate(cand_v):
688
+ ax_head.text(v - 0.002, i, f"{v:.4f}", va="center", ha="right",
689
+ fontsize=7, color=C_CAND)
690
+
691
+ if ref_v is not None:
692
+ deltas = [b - a for a, b in zip(ref_v, cand_v)]
693
+ colors = [C_GOOD if d >= 0 else C_BAD for d in deltas]
694
+ ax_delta.axvline(0, color="#0f172a", linewidth=0.9)
695
+ ax_delta.barh(y, deltas, color=colors, alpha=0.9)
696
+ ax_delta.set_yticks(y)
697
+ ax_delta.set_yticklabels(labels)
698
+ ax_delta.invert_yaxis()
699
+ ax_delta.set_title("Delta: B minus A")
700
+ ax_delta.grid(axis="x", alpha=0.7)
701
+ max_abs = max([abs(d) for d in deltas] + [0.002])
702
+ ax_delta.set_xlim(-max_abs * 1.45, max_abs * 1.45)
703
+ for i, d in enumerate(deltas):
704
+ ax_delta.text(d + (max_abs * 0.04 if d >= 0 else -max_abs * 0.04), i, _metric_delta_text(d),
705
+ ha="left" if d >= 0 else "right", va="center", fontsize=7)
706
+ else:
707
+ ax_delta.axis("off")
708
+
709
+ stream_rows = [
710
+ ("A raw", ref["raw"]["span_f1"] if ref is not None else None, C_REF),
711
+ ("A viterbi", ref["viterbi"]["span_f1"] if ref is not None else None, C_REF),
712
+ ("B raw", cand["raw"]["span_f1"], C_CAND),
713
+ ("B viterbi", cand["viterbi"]["span_f1"], C_CAND),
714
+ ]
715
+ stream_rows = [r for r in stream_rows if r[1] is not None]
716
+ sy = np.arange(len(stream_rows))
717
+ ax_raw_vit.barh(sy, [r[1] for r in stream_rows], color=[r[2] for r in stream_rows], alpha=0.88)
718
+ ax_raw_vit.set_yticks(sy)
719
+ ax_raw_vit.set_yticklabels([r[0] for r in stream_rows])
720
+ ax_raw_vit.invert_yaxis()
721
+ ax_raw_vit.set_xlim(0, 1.08)
722
+ ax_raw_vit.set_title("Raw vs Viterbi span F1")
723
+ ax_raw_vit.grid(axis="x", alpha=0.7)
724
+ _pct_axis(ax_raw_vit)
725
+ for i, (_, v, _) in enumerate(stream_rows):
726
+ ax_raw_vit.text(v + 0.008, i, f"{v:.4f}", va="center", fontsize=7)
727
+
728
+ cand_pc = cand["viterbi"]["span_per_cat"]
729
+ if ref is not None:
730
+ ref_pc = ref["viterbi"]["span_per_cat"]
731
+ cats = sorted(set(cand_pc) | set(ref_pc))
732
+ rows = []
733
+ for c in cats:
734
+ a = ref_pc.get(c, {}).get("f1", 0.0)
735
+ b = cand_pc.get(c, {}).get("f1", 0.0)
736
+ n = max(cand_pc.get(c, {}).get("n_gold", 0), ref_pc.get(c, {}).get("n_gold", 0))
737
+ rows.append((c, b - a, b, a, n))
738
+ # Keep the categories that explain the comparison: worst and best B deltas.
739
+ worst = sorted(rows, key=lambda r: r[1])[:8]
740
+ best = sorted(rows, key=lambda r: r[1], reverse=True)[:8]
741
+ picked = worst + [r for r in best if r[0] not in {x[0] for x in worst}]
742
+ picked = sorted(picked, key=lambda r: r[1])
743
+ cy = np.arange(len(picked))
744
+ deltas = [r[1] for r in picked]
745
+ ax_cat.axvline(0, color="#0f172a", linewidth=0.9)
746
+ ax_cat.barh(cy, deltas, color=[C_GOOD if d >= 0 else C_BAD for d in deltas])
747
+ ax_cat.set_yticks(cy)
748
+ ax_cat.set_yticklabels([r[0] for r in picked], fontsize=8)
749
+ ax_cat.set_title("Per-category span F1 delta, selected extremes")
750
+ ax_cat.grid(axis="x", alpha=0.7)
751
+ max_abs = max([abs(d) for d in deltas] + [0.05])
752
+ ax_cat.set_xlim(-max_abs * 1.55, max_abs * 1.55)
753
+ for i, r in enumerate(picked):
754
+ d = r[1]
755
+ ax_cat.text(d + (max_abs * 0.05 if d >= 0 else -max_abs * 0.05), i,
756
+ f"{d:+.3f} B={r[2]:.2f} A={r[3]:.2f}",
757
+ va="center", ha="left" if d >= 0 else "right", fontsize=6)
758
+ else:
759
+ cats_sorted = sorted(cand_pc.keys(), key=lambda c: cand_pc[c]["f1"])[:18]
760
+ vals = [cand_pc[c]["f1"] for c in cats_sorted]
761
+ cy = np.arange(len(cats_sorted))
762
+ ax_cat.barh(cy, vals, color=C_CAND)
763
+ ax_cat.set_yticks(cy)
764
+ ax_cat.set_yticklabels(cats_sorted, fontsize=8)
765
+ ax_cat.set_xlim(0, 1.0)
766
+ ax_cat.set_title("Lowest per-category span F1")
767
+ ax_cat.grid(axis="x", alpha=0.7)
768
+ _pct_axis(ax_cat)
769
+
770
+ fig.suptitle(f"Evaluation summary — A: {ref_label if ref else 'n/a'} | B: {cand_label}",
771
+ fontsize=9, fontweight="bold")
772
+ fig.savefig(out_dir / "eval_summary.png", dpi=160)
773
+ plt.close(fig)
774
+ print(f"[plot] wrote {out_dir / 'eval_summary.png'}", flush=True)
775
+
776
+ # --- eval_confusion.png: pairwise outcome on gold non-O tokens ---
777
+ # Display order matches A vs B: "Only A correct" (teacher) before
778
+ # "Only B correct" (student). Underlying buckets in `pair` are still
779
+ # named cand/ref; we just relabel for display.
780
+ if pair is not None:
781
+ fig, axes = plt.subplots(
782
+ 1, 2, figsize=(8, 3), constrained_layout=True,
783
+ gridspec_kw={"width_ratios": [0.9, 1.7]},
784
+ )
785
+ ax = axes[0]
786
+ non_o_buckets = {k: 0 for k in ["both_correct", "only_cand", "only_ref", "both_wrong"]}
787
+ for cat, d in pair["by_cat"].items():
788
+ if cat == "O":
789
+ continue
790
+ for k in non_o_buckets:
791
+ non_o_buckets[k] += d[k]
792
+ values = [non_o_buckets["both_correct"], non_o_buckets["only_ref"],
793
+ non_o_buckets["only_cand"], non_o_buckets["both_wrong"]]
794
+ labels_ = [
795
+ "Both\ncorrect",
796
+ "Only A\ncorrect",
797
+ "Only B\ncorrect",
798
+ "Both wrong",
799
+ ]
800
+ colors = [C_GOOD, C_REF, C_CAND, C_BAD]
801
+ total = max(1, sum(values))
802
+ ax.barh(np.arange(4), values, color=colors)
803
+ ax.set_yticks(np.arange(4))
804
+ ax.set_yticklabels(labels_)
805
+ ax.invert_yaxis()
806
+ ax.set_ylabel("Gold non-O tokens")
807
+ ax.set_title("Token outcome on gold non-O")
808
+ ax.grid(axis="x", alpha=0.7)
809
+ ax.set_xlim(0, max(values) * 1.32 if values else 1)
810
+ for i, v in enumerate(values):
811
+ ax.text(v + max(values) * 0.015, i, f"{v:,} ({v / total:.1%})",
812
+ va="center", fontsize=6)
813
+
814
+ rows = []
815
+ for cat, d in pair["by_cat"].items():
816
+ if cat == "O":
817
+ continue
818
+ net = d["only_cand"] - d["only_ref"]
819
+ active = d["only_cand"] + d["only_ref"] + d["both_wrong"]
820
+ if active:
821
+ rows.append((cat, net, d["only_cand"], d["only_ref"], d["both_wrong"]))
822
+ worst = sorted(rows, key=lambda r: r[1])[:8]
823
+ best = sorted(rows, key=lambda r: r[1], reverse=True)[:8]
824
+ picked = worst + [r for r in best if r[0] not in {x[0] for x in worst}]
825
+ picked = sorted(picked, key=lambda r: r[1])
826
+ ax2 = axes[1]
827
+ if picked:
828
+ py = np.arange(len(picked))
829
+ nets = [r[1] for r in picked]
830
+ ax2.axvline(0, color="#0f172a", linewidth=0.9)
831
+ ax2.barh(py, nets, color=[C_GOOD if n >= 0 else C_BAD for n in nets])
832
+ ax2.set_yticks(py)
833
+ ax2.set_yticklabels([r[0] for r in picked], fontsize=8)
834
+ ax2.set_title("Net token wins by category: B only-correct minus A only-correct")
835
+ ax2.grid(axis="x", alpha=0.7)
836
+ max_abs = max([abs(n) for n in nets] + [1])
837
+ ax2.set_xlim(-max_abs * 1.5, max_abs * 1.5)
838
+ for i, r in enumerate(picked):
839
+ n = r[1]
840
+ label_x = n + max_abs * 0.04 if n >= 0 else max_abs * 0.05
841
+ ax2.text(label_x, i,
842
+ f"{n:+d} B={r[2]} A={r[3]} W={r[4]}",
843
+ va="center", ha="left", fontsize=6)
844
+ else:
845
+ ax2.axis("off")
846
+
847
+ fig.suptitle(f"Pairwise correctness — A: {ref_label} | B: {cand_label}",
848
+ fontsize=9, fontweight="bold")
849
+ fig.savefig(out_dir / "eval_confusion.png", dpi=160)
850
+ plt.close(fig)
851
+ print(f"[plot] wrote {out_dir / 'eval_confusion.png'}", flush=True)
852
+
853
+ # --- eval_performance.png: model size, compute, throughput, memory ---
854
+ if ref is not None and "perf" in cand and "perf" in ref:
855
+ cp, rp = cand["perf"], ref["perf"]
856
+
857
+ fig, axes = plt.subplots(1, 2, figsize=(8.5, 5), constrained_layout=True)
858
+ ax_abs = axes[0]
859
+ ax_ratio = axes[1]
860
+
861
+ metrics = [
862
+ ("Total params (M)", cp["total_params"]/1e6, rp["total_params"]/1e6, True),
863
+ ("Active params/tok (M)", cp["active_params_per_tok"]/1e6, rp["active_params_per_tok"]/1e6, True),
864
+ ("MoE expert params (M)", cp["moe_total"]/1e6, rp["moe_total"]/1e6, True),
865
+ ("GFLOP/token", cp["gflops_per_tok"], rp["gflops_per_tok"], True),
866
+ ("Weights RAM (MiB)", cp["weight_bytes"]/(1<<20), rp["weight_bytes"]/(1<<20), True),
867
+ ("Peak eval mem (MiB)", cp["peak_eval_mem_bytes"]/(1<<20), rp["peak_eval_mem_bytes"]/(1<<20), True),
868
+ ("Throughput (tok/s)", cand["throughput_tok_s"], ref["throughput_tok_s"], False),
869
+ ]
870
+
871
+ y = np.arange(len(metrics))
872
+ w = 0.38
873
+ cand_v = [m[1] for m in metrics]
874
+ ref_v = [m[2] for m in metrics]
875
+ ax_abs.barh(y - w/2, ref_v, w, label=f"A: {ref_label}", color=C_REF)
876
+ ax_abs.barh(y + w/2, cand_v, w, label=f"B: {cand_label}", color=C_CAND)
877
+ ax_abs.set_yticks(y)
878
+ ax_abs.set_yticklabels([m[0] for m in metrics], fontsize=8)
879
+ ax_abs.invert_yaxis()
880
+ ax_abs.set_xscale("log")
881
+ ax_abs.set_title("Absolute footprint and speed, log scale")
882
+ ax_abs.grid(axis="x", which="both", alpha=0.7)
883
+ positive_vals = [v for v in (ref_v + cand_v) if v > 0]
884
+ if positive_vals:
885
+ ax_abs.set_xlim(min(positive_vals) * 0.55, max(positive_vals) * 3.8)
886
+ for yi, vals in enumerate(zip(ref_v, cand_v)):
887
+ for off, v, col in [(-w/2, vals[0], C_REF), (w/2, vals[1], C_CAND)]:
888
+ if v <= 0:
889
+ continue
890
+ ax_abs.text(v * 1.05, yi + off, _value_text(v), va="center", fontsize=6, color=col)
891
+
892
+ ratios = [(m[1] / max(1e-12, m[2])) for m in metrics]
893
+ lower_better = [m[3] for m in metrics]
894
+ colors = [
895
+ C_GOOD if ((lb and r <= 1.0) or ((not lb) and r >= 1.0)) else C_BAD
896
+ for r, lb in zip(ratios, lower_better)
897
+ ]
898
+ ax_ratio.axvline(1.0, color="#0f172a", linestyle="--", linewidth=0.9, alpha=0.65)
899
+ ax_ratio.barh(y, ratios, color=colors)
900
+ ax_ratio.set_yticks(y)
901
+ ax_ratio.set_yticklabels([m[0] for m in metrics], fontsize=8)
902
+ ax_ratio.invert_yaxis()
903
+ ax_ratio.set_xscale("log")
904
+ ax_ratio.set_title("B / A ratio with explicit direction")
905
+ ax_ratio.grid(axis="x", which="both", alpha=0.7)
906
+ positive_ratios = [r for r in ratios if r > 0]
907
+ if positive_ratios:
908
+ ax_ratio.set_xlim(min(positive_ratios) * 0.38, max(positive_ratios) * 2.8)
909
+ for i, (r, lb) in enumerate(zip(ratios, lower_better)):
910
+ ax_ratio.text(r * (1.05 if r >= 1 else 0.95), i, _ratio_text(r, lb),
911
+ va="center", ha="left" if r >= 1 else "right", fontsize=6)
912
+
913
+ fig.suptitle(f"Performance profile — A: {ref_label} | B: {cand_label}",
914
+ fontsize=9, fontweight="bold")
915
+ fig.savefig(out_dir / "eval_performance.png", dpi=160)
916
+ plt.close(fig)
917
+ print(f"[plot] wrote {out_dir / 'eval_performance.png'}", flush=True)
918
+
919
+
920
+ # ---------------------------------------------------------------------------
921
+ # Reporting
922
+ # ---------------------------------------------------------------------------
923
+
924
+ def _fmt_metrics(m):
925
+ return (f"span_F1={m['span_f1']:.4f} P={m['span_precision']:.4f} "
926
+ f"R={m['span_recall']:.4f} token_acc={m['token_acc']:.4f} "
927
+ f"non_o_recall={m['non_o_recall']:.4f} "
928
+ f"spans={m['n_gold_spans']}/{m['n_pred_spans']}/{m['n_correct_spans']}")
929
+
930
+
931
+ def _write_compare_log(path: Path, cand, ref, pair, args):
932
+ lines: List[str] = []
933
+ A = lines.append
934
+ A(f"Benchmark: A: {ref['label']} vs B: {cand['label']}"
935
+ if ref else f"Benchmark: {cand['label']}")
936
+ A(f"Dataset: {SOURCE_DATASET}, split=test, eval_pct={args.eval_pct}, "
937
+ f"ctx={args.max_length}, seed={args.seed}, n_docs={args.n_docs}")
938
+ A(f"Eval tokens scored: {cand['n_tok']:,}")
939
+ A("")
940
+ A("=== Aggregate ===")
941
+ if ref is not None:
942
+ A(f" A: {ref['label']:<25s} RAW {_fmt_metrics(ref['raw'])}")
943
+ A(f" A: {ref['label']:<25s} VITERBI {_fmt_metrics(ref['viterbi'])}")
944
+ A(f" B: {cand['label']:<25s} RAW {_fmt_metrics(cand['raw'])}")
945
+ A(f" B: {cand['label']:<25s} VITERBI {_fmt_metrics(cand['viterbi'])}")
946
+ if ref is not None:
947
+ d_f1 = ref["viterbi"]["span_f1"] - cand["viterbi"]["span_f1"]
948
+ A("")
949
+ A(f"Gap B vs A (viterbi span_F1): {-d_f1:+.4f}")
950
+ A("")
951
+ if ref is not None:
952
+ A(f"Throughput: A: {ref['label']} {ref['throughput_tok_s']:.0f} tok/s "
953
+ f"({ref['n_total_M']:.2f}M params)")
954
+ A(f" B: {cand['label']} {cand['throughput_tok_s']:.0f} tok/s "
955
+ f"({cand['n_total_M']:.2f}M params)")
956
+ else:
957
+ A(f"Throughput: {cand['label']} {cand['throughput_tok_s']:.0f} tok/s "
958
+ f"({cand['n_total_M']:.2f}M params)")
959
+ A("")
960
+ # ---- Performance summary table ----
961
+ # Column order: A (ref / teacher) first, then B (cand / student).
962
+ # The "B vs A" column is written in human-readable direction:
963
+ # - When B is smaller (size/compute/mem): "X.XX× smaller" / "X.XX× cheaper" / "X.XX× less".
964
+ # - When B is larger but lower-is-better: "X.XX× larger" / "X.XX× more".
965
+ # - When B is faster (throughput, higher-is-better): "X.XX× faster".
966
+ # - When B is slower: "X.XX× slower".
967
+ # Always uses the magnitude in the dominant direction so the reader
968
+ # doesn't need to mentally invert 0.21× into 4.87×.
969
+ def _fmt_vs(b: float, a: float, kind: str) -> str:
970
+ if a is None or a == 0 or b is None or b == 0:
971
+ return ""
972
+ ratio = b / a
973
+ if kind == "size": # lower is better; phrase as "smaller" or "larger"
974
+ if ratio <= 1.0:
975
+ return f"{1.0/ratio:.2f}× smaller"
976
+ return f"{ratio:.2f}× larger"
977
+ if kind == "compute":
978
+ if ratio <= 1.0:
979
+ return f"{1.0/ratio:.2f}× cheaper"
980
+ return f"{ratio:.2f}× more"
981
+ if kind == "memory":
982
+ if ratio <= 1.0:
983
+ return f"{1.0/ratio:.2f}× less"
984
+ return f"{ratio:.2f}× more"
985
+ if kind == "speed": # higher is better
986
+ if ratio >= 1.0:
987
+ return f"{ratio:.2f}× faster"
988
+ return f"{1.0/ratio:.2f}× slower"
989
+ return f"{ratio:.2f}×"
990
+
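+ # e.g. with the compare.log numbers below: _fmt_vs(287.11, 1399.61, "size")
+ # -> "4.87× smaller" and _fmt_vs(6343, 3293, "speed") -> "1.93× faster".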
991
+ cp = cand["perf"]
992
+ A("=== Performance ===")
993
+ headers = ["metric",
994
+ f"A: {ref['label']}" if ref else "",
995
+ f"B: {cand['label']}",
996
+ "B vs A"]
997
+ rp = ref["perf"] if ref else None
998
+ rows = [
999
+ ["total params (M)",
1000
+ (f"{rp['total_params']/1e6:.2f}" if ref else ""),
1001
+ f"{cp['total_params']/1e6:.2f}",
1002
+ (_fmt_vs(cp['total_params'], rp['total_params'], "size") if ref else "")],
1003
+ ["dense params (M)",
1004
+ (f"{rp['other_total']/1e6:.2f}" if ref else ""),
1005
+ f"{cp['other_total']/1e6:.2f}",
1006
+ (_fmt_vs(cp['other_total'], rp['other_total'], "size") if ref else "")],
1007
+ ["MoE expert params (M)",
1008
+ (f"{rp['moe_total']/1e6:.2f}" if ref else ""),
1009
+ f"{cp['moe_total']/1e6:.2f}",
1010
+ (_fmt_vs(cp['moe_total'], rp['moe_total'], "size") if ref else "")],
1011
+ [f"active params/token (M, mem)",
1012
+ (f"{rp['active_params_per_tok']/1e6:.2f}" if ref else ""),
1013
+ f"{cp['active_params_per_tok']/1e6:.2f}",
1014
+ (_fmt_vs(cp['active_params_per_tok'], rp['active_params_per_tok'], "memory") if ref else "")],
1015
+ [f"compute params/token (M, FLOPs)",
1016
+ (f"{rp['compute_params_per_tok']/1e6:.2f}" if ref else ""),
1017
+ f"{cp['compute_params_per_tok']/1e6:.2f}",
1018
+ (_fmt_vs(cp['compute_params_per_tok'], rp['compute_params_per_tok'], "compute") if ref else "")],
1019
+ ["GFLOP / token",
1020
+ (f"{rp['gflops_per_tok']:.4f}" if ref else ""),
1021
+ f"{cp['gflops_per_tok']:.4f}",
1022
+ (_fmt_vs(cp['gflops_per_tok'], rp['gflops_per_tok'], "compute") if ref else "")],
1023
+ ["disk size (MiB)",
1024
+ (f"{rp['disk_size_bytes']/(1<<20):.1f}" if ref and rp['disk_size_bytes'] else ""),
1025
+ f"{cp['disk_size_bytes']/(1<<20):.1f}",
1026
+ (_fmt_vs(cp['disk_size_bytes'], rp['disk_size_bytes'], "size") if ref and rp['disk_size_bytes'] else "")],
1027
+ ["weights in RAM (MiB)",
1028
+ (f"{rp['weight_bytes']/(1<<20):.1f}" if ref else ""),
1029
+ f"{cp['weight_bytes']/(1<<20):.1f}",
1030
+ (_fmt_vs(cp['weight_bytes'], rp['weight_bytes'], "size") if ref else "")],
1031
+ ["peak GPU mem eval (MiB)",
1032
+ (f"{rp['peak_eval_mem_bytes']/(1<<20):.1f}" if ref else ""),
1033
+ f"{cp['peak_eval_mem_bytes']/(1<<20):.1f}",
1034
+ (_fmt_vs(cp['peak_eval_mem_bytes'], rp['peak_eval_mem_bytes'], "memory") if ref else "")],
1035
+ ["throughput (tok/s)",
1036
+ (f"{ref['throughput_tok_s']:.0f}" if ref else ""),
1037
+ f"{cand['throughput_tok_s']:.0f}",
1038
+ (_fmt_vs(cand['throughput_tok_s'], ref['throughput_tok_s'], "speed") if ref else "")],
1039
+ ]
1040
+ widths = [max(len(r[i]) for r in [headers] + rows) for i in range(4)]
1041
+ sep = " " + " ".join("-" * w for w in widths)
1042
+ A(" " + " ".join(h.ljust(widths[i]) for i, h in enumerate(headers)))
1043
+ A(sep)
1044
+ for r in rows:
1045
+ A(" " + " ".join(r[i].ljust(widths[i]) for i in range(4)))
1046
+ A("")
1047
+ if pair is not None:
1048
+ # Display labels: A = ref (teacher), B = cand (student).
1049
+ a_lbl = ref["label"] if ref is not None else "A"
1050
+ b_lbl = cand["label"]
1051
+ A(f"=== Pairwise (viterbi, all gold tokens) — A: {a_lbl} vs B: {b_lbl} ===")
1052
+ total = (pair["both_correct"] + pair["only_cand_correct"]
1053
+ + pair["only_ref_correct"] + pair["both_wrong"])
1054
+ # Display order: agreement, A-only, B-only, both-wrong.
1055
+ rows_all = [
1056
+ ("both_correct", pair["both_correct"]),
1057
+ ("only_A_correct", pair["only_ref_correct"]),
1058
+ ("only_B_correct", pair["only_cand_correct"]),
1059
+ ("both_wrong", pair["both_wrong"]),
1060
+ ]
1061
+ for k, v in rows_all:
1062
+ A(f" {k:<26s} {v:8d} ({100.0*v/total:.2f}%)")
1063
+ A("")
1064
+ A(f"=== Pairwise (viterbi, gold non-O tokens) — A: {a_lbl} vs B: {b_lbl} ===")
1065
+ non_o = {k: 0 for k in ["both_correct", "only_cand", "only_ref", "both_wrong"]}
1066
+ for cat, d in pair["by_cat"].items():
1067
+ if cat == "O":
1068
+ continue
1069
+ for k in non_o:
1070
+ non_o[k] += d[k]
1071
+ total_non_o = sum(non_o.values())
1072
+ rows_non_o = [
1073
+ ("both_correct", non_o["both_correct"]),
1074
+ ("only_A_correct", non_o["only_ref"]),
1075
+ ("only_B_correct", non_o["only_cand"]),
1076
+ ("both_wrong", non_o["both_wrong"]),
1077
+ ]
1078
+ for k, v in rows_non_o:
1079
+ A(f" {k:<26s} {v:8d} ({100.0*v/total_non_o:.2f}%)" if total_non_o else f" {k}: 0")
1080
+ A("")
1081
+ # Per-cat net B-wins. Net = (only_B) - (only_A) = (only_cand) - (only_ref).
1082
+ # Negative net = A (teacher) wins more in this category.
1083
+ nets = []
1084
+ for cat, d in pair["by_cat"].items():
1085
+ if cat == "O":
1086
+ continue
1087
+ nets.append((cat, d["only_cand"] - d["only_ref"],
1088
+ d["only_cand"], d["only_ref"], d["both_wrong"]))
1089
+ nets.sort(key=lambda x: x[1])
1090
+ A(f"=== Worst B-net wins by gold category — A: {a_lbl} ahead (top 15) ===")
1091
+ for cat, net, ob, oa, bw in nets[:15]:
1092
+ A(f" {cat:<32s} net_B={net:+5d} A_only={oa:4d} B_only={ob:4d} both_wrong={bw:4d}")
1093
+ A("")
1094
+ A(f"=== Best B-net wins by gold category — B: {b_lbl} ahead (top 15) ===")
1095
+ for cat, net, ob, oa, bw in nets[::-1][:15]:
1096
+ A(f" {cat:<32s} net_B={net:+5d} A_only={oa:4d} B_only={ob:4d} both_wrong={bw:4d}")
1097
+ A("")
1098
+ A("=== Per-category span F1 (viterbi) ===")
1099
+ if ref is not None:
1100
+ A(f" -- A: {ref['label']} --")
1101
+ per_r = ref["viterbi"]["span_per_cat"]
1102
+ for cat in sorted(per_r):
1103
+ c = per_r[cat]
1104
+ A(f" {cat:<32s} F1={c['f1']:.4f} P={c['precision']:.4f} R={c['recall']:.4f} "
1105
+ f"({c['n_gold']}/{c['n_pred']}/{c['n_correct']})")
1106
+ A(f" -- B: {cand['label']} --")
1107
+ per = cand["viterbi"]["span_per_cat"]
1108
+ for cat in sorted(per):
1109
+ c = per[cat]
1110
+ A(f" {cat:<32s} F1={c['f1']:.4f} P={c['precision']:.4f} R={c['recall']:.4f} "
1111
+ f"({c['n_gold']}/{c['n_pred']}/{c['n_correct']})")
1112
+ path.write_text("\n".join(lines) + "\n")
1113
+ print(f"[log] wrote {path}", flush=True)
1114
+
1115
+
1116
+ def _fmt_bytes(n: int) -> str:
1117
+ if n <= 0:
1118
+ return "—"
1119
+ if n >= 1 << 30:
1120
+ return f"{n / (1 << 30):.2f} GiB"
1121
+ if n >= 1 << 20:
1122
+ return f"{n / (1 << 20):.1f} MiB"
1123
+ return f"{n / (1 << 10):.1f} KiB"
1124
+
1125
+
1126
+ def _perf_block(stream, ctx: int) -> List[str]:
1127
+ p = stream["perf"]
1128
+ out = [
1129
+ f" total params : {p['total_params']/1e6:>9.2f}M "
1130
+ f"({p['other_total']/1e6:.2f}M dense + {p['moe_total']/1e6:.2f}M MoE-experts)",
1131
+ f" active params / token : {p['active_params_per_tok']/1e6:>9.2f}M "
1132
+ f"(memory footprint — embed lookup + top_{p['experts_per_tok']}/{p['num_experts']} experts: "
1133
+ f"{p['embed_total']/1e6:.2f}M embed + "
1134
+ f"{p['moe_active_per_tok']/1e6:.2f}M MoE-active + "
1135
+ f"{(p['other_total']-p['embed_total'])/1e6:.2f}M attn/norm/head)",
1136
+ f" compute params / token : {p['compute_params_per_tok']/1e6:>9.2f}M "
1137
+ f"(matmul FLOPs only — embedding lookup excluded)",
1138
+ f" GFLOP / token (fwd, MAC×2): {p['gflops_per_tok']:>9.3f}",
1139
+ f" weights size (on disk) : {_fmt_bytes(p['disk_size_bytes']):>9s}",
1140
+ f" weights size (in RAM) : {_fmt_bytes(p['weight_bytes']):>9s}",
1141
+ f" weights resident (GPU) : {_fmt_bytes(p['weights_resident_bytes']):>9s}",
1142
+ f" peak GPU mem (eval, ctx={ctx}) : {_fmt_bytes(p['peak_eval_mem_bytes']):>9s}",
1143
+ ]
1144
+ return out
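+ # Note the MAC×2 convention: gflops_per_tok ≈ compute_params_per_tok × 2 / 1e9,
+ # consistent with compare.log (6.46M compute params/token → 0.0129 GFLOP/token).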
1145
+
1146
+
1147
+ def _write_infer_log(path: Path, cand, ref, args, sample_text: str, tokenizer, device, dtype):
1148
+ """Single-doc inference example + timing + performance metrics."""
1150
+
1151
+ lines: List[str] = []
1152
+ A = lines.append
1153
+ A(f"Inference benchmark: A: {ref['label']} vs B: {cand['label']}"
1154
+ if ref else f"Inference benchmark: {cand['label']}")
1155
+ A(f" device : {device} dtype: {dtype}")
1156
+ A(f" ctx : {args.max_length}")
1157
+ A("")
1158
+ if ref is not None:
1159
+ A(f"A: {ref['label']} (reference / teacher)")
1160
+ A(f" load : {ref['load_s']:.2f}s")
1161
+ A(f" eval : {ref['eval_s']:.2f}s on {ref['n_tok']:,} tokens "
1162
+ f"({ref['throughput_tok_s']:.0f} tok/s)")
1163
+ A("Performance:")
1164
+ for ln in _perf_block(ref, args.max_length):
1165
+ A(ln)
1166
+ A("")
1167
+ A(f"B: {cand['label']}" + (" (this checkpoint)" if ref else ""))
1168
+ A(f" load : {cand['load_s']:.2f}s")
1169
+ A(f" eval : {cand['eval_s']:.2f}s on {cand['n_tok']:,} tokens "
1170
+ f"({cand['throughput_tok_s']:.0f} tok/s)")
1171
+ A("Performance:")
1172
+ for ln in _perf_block(cand, args.max_length):
1173
+ A(ln)
1174
+ A("")
1175
+ if ref is not None:
1176
+ cp, rp = cand["perf"], ref["perf"]
1177
+
1178
+ def _fmt(b, a, kind):
1179
+ if a is None or a == 0 or b is None or b == 0:
1180
+ return "—"
1181
+ r = b / a
1182
+ if kind == "size":
1183
+ return f"{1.0/r:.2f}× smaller" if r <= 1.0 else f"{r:.2f}× larger"
1184
+ if kind == "compute":
1185
+ return f"{1.0/r:.2f}× cheaper" if r <= 1.0 else f"{r:.2f}× more"
1186
+ if kind == "memory":
1187
+ return f"{1.0/r:.2f}× less" if r <= 1.0 else f"{r:.2f}× more"
1188
+ if kind == "speed":
1189
+ return f"{r:.2f}× faster" if r >= 1.0 else f"{1.0/r:.2f}× slower"
1190
+ return f"{r:.2f}×"
1191
+
1192
+ A(f"B vs A ({cand['label']} vs {ref['label']}):")
1193
+ A(f" total params : {_fmt(cp['total_params'], rp['total_params'], 'size')}")
1194
+ A(f" active params / token : {_fmt(cp['active_params_per_tok'], rp['active_params_per_tok'], 'memory')} [memory]")
1195
+ A(f" compute params / token : {_fmt(cp['compute_params_per_tok'], rp['compute_params_per_tok'], 'compute')} [FLOPs]")
1196
+ A(f" GFLOP / token : {_fmt(cp['gflops_per_tok'], rp['gflops_per_tok'], 'compute')}")
1197
+ if rp['disk_size_bytes']:
1198
+ A(f" weights size (on disk) : {_fmt(cp['disk_size_bytes'], rp['disk_size_bytes'], 'size')}")
1199
+ else:
1200
+ A(f" weights size (on disk) : —")
1201
+ A(f" weights in RAM : {_fmt(cp['weight_bytes'], rp['weight_bytes'], 'size')}")
1202
+ A(f" peak GPU mem (eval) : {_fmt(cp['peak_eval_mem_bytes'], rp['peak_eval_mem_bytes'], 'memory')}")
1203
+ A(f" throughput : {_fmt(cand['throughput_tok_s'], ref['throughput_tok_s'], 'speed')}")
1204
+ A("")
1205
+
1206
+ A("Sample inference (load → tokenize → forward → viterbi-decode → spans):")
1207
+ A(f" text: {sample_text!r}")
1208
+ model = AutoModelForTokenClassification.from_pretrained(
1209
+ ".", dtype=dtype, trust_remote_code=True,
1210
+ ).to(device).eval()
1211
+ if hasattr(model.config, "viterbi_replace_logits"):
1212
+ model.config.viterbi_replace_logits = True
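+ # With viterbi_replace_logits=True the returned logits are BIOES-constrained,
+ # so the plain argmax below already yields span-coherent predictions.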
1213
+ enc = tokenizer(sample_text, return_tensors="pt", truncation=True,
1214
+ max_length=args.max_length).to(device)
1215
+ with torch.no_grad():
1216
+ if torch.cuda.is_available():
1217
+ torch.cuda.synchronize()
1218
+ t0 = time.time()
1219
+ out = model(**enc)
1220
+ if torch.cuda.is_available():
1221
+ torch.cuda.synchronize()
1222
+ dt = time.time() - t0
1223
+ label2id, id2label = nemotron_native_label_space()
1224
+ pred = out.logits.argmax(-1)[0].cpu().tolist()
1225
+ spans = _bioes_to_spans(pred, id2label, 0)
1226
+ A(f" forward latency: {dt*1000:.1f}ms ({enc.input_ids.shape[1]} tokens)")
1227
+ A(f" detected {len(spans)} spans:")
1228
+ tok_ids = enc.input_ids[0].cpu().tolist()
1229
+ for s, e, cat in sorted(spans):
1230
+ text = tokenizer.decode(tok_ids[s:e]).strip()
1231
+ A(f" [{s:3d}, {e:3d}) {cat:<28s} {text!r}")
1232
+ del model
1233
+ if torch.cuda.is_available():
1234
+ torch.cuda.empty_cache()
1235
+ path.write_text("\n".join(lines) + "\n")
1236
+ print(f"[log] wrote {path}", flush=True)
1237
+
1238
+
1239
+ # ---------------------------------------------------------------------------
1240
+ # Main
1241
+ # ---------------------------------------------------------------------------
1242
+
1243
+ def main():
1244
+ p = argparse.ArgumentParser(description="Benchmark haremb-privacy-filter-opennemo")
1245
+ p.add_argument("--device", default=None,
1246
+ help="cuda or cpu. Default: cuda if available.")
1247
+ p.add_argument("--dtype", default="bfloat16",
1248
+ choices=["bfloat16", "float16", "float32"])
1249
+ p.add_argument("--eval-pct", type=float, default=1.0,
1250
+ help="Percent of nvidia/Nemotron-PII test split to use. Default 1%%.")
1251
+ p.add_argument("--eval-chunk-size", type=int, default=10_000)
1252
+ p.add_argument("--seed", type=int, default=42)
1253
+ p.add_argument("--max-length", type=int, default=1024)
1254
+ p.add_argument("--batch-size", type=int, default=4)
1255
+ p.add_argument("--out", type=str, default=".")
1256
+ p.add_argument("--model-path", type=str, default=".",
1257
+ help="Path to this checkpoint. Default: ./ (this folder).")
1258
+ p.add_argument("--no-base", action="store_true",
1259
+ help="Skip the OpenMed teacher comparison.")
1260
+ p.add_argument("--no-plots", action="store_true",
1261
+ help="Skip rendering eval_summary.png / eval_confusion.png.")
1262
+ args = p.parse_args()
1263
+
1264
+ out_dir = Path(args.out).resolve()
1265
+ out_dir.mkdir(parents=True, exist_ok=True)
1266
+
1267
+ if args.device is None:
1268
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
1269
+ else:
1270
+ device = torch.device(args.device)
1271
+ dtype = {"bfloat16": torch.bfloat16, "float16": torch.float16,
1272
+ "float32": torch.float32}[args.dtype]
1273
+
1274
+ print(f"[setup] device={device} dtype={dtype} model={args.model_path} "
1275
+ f"out={out_dir}", flush=True)
1276
+
1277
+ label2id, id2label = nemotron_native_label_space()
1278
+ o_id = label2id["O"]
1279
+
1280
+ tokenizer = AutoTokenizer.from_pretrained(args.model_path)
1281
+ pad_token_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else 199999
1282
+
1283
+ # Build the eval set (same slice the README headline numbers reference)
1284
+ print(f"[data] loading {SOURCE_DATASET} ...", flush=True)
1285
+ ds = load_dataset(SOURCE_DATASET)
1286
+ target_eval = max(1, int(round(len(ds["test"]) * args.eval_pct / 100.0)))
1287
+ eval_indices = _build_eval_streaming(
1288
+ ds["test"], target_n=target_eval,
1289
+ chunk_size=args.eval_chunk_size, seed=args.seed,
1290
+ )
1291
+ eval_ds = _NemotronEvalDataset(
1292
+ ds["test"].select(eval_indices), tokenizer, label2id, args.max_length,
1293
+ )
1294
+ args.n_docs = len(eval_ds)
1295
+ print(f"[data] eval={args.n_docs:,} docs ({args.eval_pct:.2f}% of test split)",
1296
+ flush=True)
1297
+
1298
+ # BIOES decoding masks (used for the explicit RAW vs VITERBI streams).
1299
+ from modeling_haremb_pii import (
1300
+ _build_bioes_initial_mask as _bld_init,
1301
+ _build_bioes_transition_mask as _bld_trans,
1302
+ )
1303
+ bioes_trans = _bld_trans(id2label).to(device).float()
1304
+ bioes_init = _bld_init(id2label).to(device).float()
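+ # Standard BIOES constraints, which is what these masks encode: a sequence may
+ # start only with O, B-*, or S-*, and B-x/I-x may only continue as I-x or E-x.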
1305
+
1306
+ # Eval candidate
1307
+ cand = _eval_one_model(
1308
+ args.model_path, tokenizer, eval_ds, label2id, id2label, o_id,
1309
+ bioes_trans, bioes_init,
1310
+ args.batch_size, args.max_length, device, dtype,
1311
+ label="haremb",
1312
+ )
1313
+
1314
+ print(f"\n=== {cand['label']} ===")
1315
+ print(f"RAW {_fmt_metrics(cand['raw'])}")
1316
+ print(f"VITERBI {_fmt_metrics(cand['viterbi'])}")
1317
+ print(f"DELTA span_F1={cand['viterbi']['span_f1']-cand['raw']['span_f1']:+.4f} "
1318
+ f"P={cand['viterbi']['span_precision']-cand['raw']['span_precision']:+.4f} "
1319
+ f"R={cand['viterbi']['span_recall']-cand['raw']['span_recall']:+.4f}")
1320
+
1321
+ ref = None
1322
+ pair = None
1323
+ if not args.no_base:
1324
+ ref = _eval_one_model(
1325
+ TEACHER, tokenizer, eval_ds, label2id, id2label, o_id,
1326
+ bioes_trans, bioes_init,
1327
+ args.batch_size, args.max_length, device, dtype,
1328
+ label="openmed-base",
1329
+ )
1330
+ print(f"\n=== {ref['label']} (teacher) ===")
1331
+ print(f"VITERBI {_fmt_metrics(ref['viterbi'])}")
1332
+ pair = _pairwise(cand["docs"], ref["docs"], id2label, o_id)
1333
+ d = ref["viterbi"]["span_f1"] - cand["viterbi"]["span_f1"]
1334
+ print(f"\nGap to teacher (viterbi span_F1): {d:+.4f}")
1335
+
1336
+ # Reports
1337
+ sample_text = ("Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, "
1338
+ "phone 415-555-0123, email sarah.johnson@example.com, "
1339
+ "credit card 4111-1111-1111-1111.")
1340
+ _write_infer_log(out_dir / "infer.log", cand, ref, args, sample_text,
1341
+ tokenizer, device, dtype)
1342
+ _write_compare_log(out_dir / "compare.log", cand, ref, pair, args)
1343
+ if not args.no_plots:
1344
+ _render_plots(cand, ref, pair, out_dir,
1345
+ cand_label=cand["label"],
1346
+ ref_label=ref["label"] if ref else "")
1347
+ print("\n[done]")
1348
+
1349
+
1350
+ if __name__ == "__main__":
1351
+ main()
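+ # Example invocation (script filename assumed), run from the checkpoint folder:
+ #   python benchmark.py --eval-pct 1.0 --batch-size 4 --out .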
compare.log ADDED
@@ -0,0 +1,188 @@
1
+ Benchmark: A: openmed-base vs B: haremb
2
+ Dataset: nvidia/Nemotron-PII, split=test, eval_pct=1.0, ctx=1024, seed=42, n_docs=1000
3
+ Eval tokens scored: 212,909
4
+
5
+ === Aggregate ===
6
+ A: openmed-base RAW span_F1=0.9174 P=0.9125 R=0.9223 token_acc=0.9895 non_o_recall=0.9685 spans=8627/8720/7957
7
+ A: openmed-base VITERBI span_F1=0.9434 P=0.9531 R=0.9338 token_acc=0.9900 non_o_recall=0.9703 spans=8627/8452/8056
8
+ B: haremb RAW span_F1=0.7741 P=0.7186 R=0.8388 token_acc=0.9831 non_o_recall=0.9467 spans=8627/10069/7236
9
+ B: haremb VITERBI span_F1=0.9288 P=0.9396 R=0.9182 token_acc=0.9885 non_o_recall=0.9637 spans=8627/8430/7921
10
+
11
+ Gap B vs A (viterbi span_F1): -0.0146
12
+
13
+ Throughput: A: openmed-base 3293 tok/s (1399.61M params)
14
+ B: haremb 6343 tok/s (287.11M params)
15
+
16
+ === Performance ===
17
+ metric                          A: openmed-base B: haremb B vs A
18
+ ------------------------------- --------------- --------- -------------
19
+ total params (M)                1399.61         287.11    4.87× smaller
20
+ dense params (M)                139.35          129.58    1.08× smaller
21
+ MoE expert params (M)           1260.26         157.53    8.00× smaller
22
+ active params/token (M, mem)    178.73          134.50    1.33× less
23
+ compute params/token (M, FLOPs) 50.69           6.46      7.85× cheaper
24
+ GFLOP / token                   0.1014          0.0129    7.85× cheaper
25
+ disk size (MiB)                                 547.6
26
+ weights in RAM (MiB)            2669.5          547.6     4.87× smaller
27
+ peak GPU mem eval (MiB)         3376.2          1248.6    2.70× less
28
+ throughput (tok/s)              3293            6343      1.93× faster
29
+
30
+ === Pairwise (viterbi, all gold tokens) — A: openmed-base vs B: haremb ===
31
+ both_correct 209830 (98.55%)
32
+ only_A_correct 958 (0.45%)
33
+ only_B_correct 633 (0.30%)
34
+ both_wrong 1488 (0.70%)
35
+
36
+ === Pairwise (viterbi, gold non-O tokens) — A: openmed-base vs B: haremb ===
37
+ both_correct 43902 (95.47%)
38
+ only_A_correct 717 (1.56%)
39
+ only_B_correct 417 (0.91%)
40
+ both_wrong 951 (2.07%)
41
+
42
+ === Worst B-net wins by gold category — A: openmed-base ahead (top 15) ===
43
+ company_name net_B= -142 A_only= 216 B_only= 74 both_wrong= 119
44
+ first_name net_B= -75 A_only= 82 B_only= 7 both_wrong= 19
45
+ last_name net_B= -65 A_only= 67 B_only= 2 both_wrong= 38
46
+ occupation net_B= -55 A_only= 79 B_only= 24 both_wrong= 286
47
+ device_identifier net_B= -29 A_only= 29 B_only= 0 both_wrong= 0
48
+ user_name net_B= -26 A_only= 29 B_only= 3 both_wrong= 10
49
+ city net_B= -16 A_only= 32 B_only= 16 both_wrong= 36
50
+ street_address net_B= -13 A_only= 14 B_only= 1 both_wrong= 0
51
+ date_of_birth net_B= -12 A_only= 12 B_only= 0 both_wrong= 0
52
+ email net_B= -8 A_only= 9 B_only= 1 both_wrong= 0
53
+ medical_record_number net_B= -7 A_only= 7 B_only= 0 both_wrong= 0
54
+ phone_number net_B= -6 A_only= 6 B_only= 0 both_wrong= 0
55
+ account_number net_B= -6 A_only= 10 B_only= 4 both_wrong= 0
56
+ tax_id net_B= -6 A_only= 6 B_only= 0 both_wrong= 0
57
+ race_ethnicity net_B= -5 A_only= 15 B_only= 10 both_wrong= 11
58
+
59
+ === Best B-net wins by gold category — B: haremb ahead (top 15) ===
60
+ date net_B= +46 A_only= 15 B_only= 61 both_wrong= 145
61
+ fax_number net_B= +29 A_only= 1 B_only= 30 both_wrong= 44
62
+ unique_id net_B= +26 A_only= 4 B_only= 30 both_wrong= 0
63
+ ssn net_B= +18 A_only= 0 B_only= 18 both_wrong= 0
64
+ time net_B= +12 A_only= 9 B_only= 21 both_wrong= 87
65
+ political_view net_B= +11 A_only= 3 B_only= 14 both_wrong= 6
66
+ coordinate net_B= +11 A_only= 0 B_only= 11 both_wrong= 0
67
+ customer_id net_B= +8 A_only= 4 B_only= 12 both_wrong= 7
68
+ certificate_license_number net_B= +7 A_only= 0 B_only= 7 both_wrong= 0
69
+ education_level net_B= +6 A_only= 2 B_only= 8 both_wrong= 28
70
+ state net_B= +6 A_only= 13 B_only= 19 both_wrong= 20
71
+ blood_type net_B= +6 A_only= 0 B_only= 6 both_wrong= 0
72
+ gender net_B= +2 A_only= 0 B_only= 2 both_wrong= 2
73
+ http_cookie net_B= +2 A_only= 0 B_only= 2 both_wrong= 3
74
+ country net_B= +2 A_only= 0 B_only= 2 both_wrong= 19
75
+
76
+ === Per-category span F1 (viterbi) ===
77
+ -- A: openmed-base --
78
+ account_number F1=0.9929 P=0.9929 R=0.9929 (140/140/139)
79
+ age F1=0.8840 P=0.8511 R=0.9195 (87/94/80)
80
+ api_key F1=0.9921 P=0.9844 R=1.0000 (63/64/63)
81
+ bank_routing_number F1=0.9867 P=0.9867 R=0.9867 (75/75/74)
82
+ biometric_identifier F1=1.0000 P=1.0000 R=1.0000 (113/113/113)
83
+ blood_type F1=0.9032 P=0.9032 R=0.9032 (62/62/56)
84
+ certificate_license_number F1=0.9697 P=1.0000 R=0.9412 (34/32/32)
85
+ city F1=0.9154 P=0.9583 R=0.8762 (210/192/184)
86
+ company_name F1=0.8824 P=0.9143 R=0.8526 (563/525/480)
87
+ coordinate F1=0.8000 P=0.8000 R=0.8000 (55/55/44)
88
+ country F1=0.9431 P=0.9324 R=0.9539 (217/222/207)
89
+ county F1=0.9519 P=0.9612 R=0.9429 (105/103/99)
90
+ credit_debit_card F1=0.9967 P=0.9934 R=1.0000 (150/151/150)
91
+ customer_id F1=0.9849 P=1.0000 R=0.9703 (202/196/196)
92
+ cvv F1=0.9787 P=1.0000 R=0.9583 (48/46/46)
93
+ date F1=0.9440 P=0.9571 R=0.9312 (814/792/758)
94
+ date_of_birth F1=1.0000 P=1.0000 R=1.0000 (164/164/164)
95
+ date_time F1=0.9635 P=0.9429 R=0.9851 (134/140/132)
96
+ device_identifier F1=0.9714 P=0.9444 R=1.0000 (51/54/51)
97
+ education_level F1=0.9091 P=0.9524 R=0.8696 (92/84/80)
98
+ email F1=0.9971 P=0.9961 R=0.9980 (511/512/510)
99
+ employee_id F1=0.9948 P=1.0000 R=0.9896 (96/95/95)
100
+ employment_status F1=0.9478 P=0.9593 R=0.9365 (126/123/118)
101
+ fax_number F1=0.9091 P=0.9848 R=0.8442 (77/66/65)
102
+ first_name F1=0.9766 P=0.9716 R=0.9816 (871/880/855)
103
+ gender F1=0.9737 P=0.9867 R=0.9610 (77/75/74)
104
+ health_plan_beneficiary_number F1=1.0000 P=1.0000 R=1.0000 (103/103/103)
105
+ http_cookie F1=0.9307 P=0.9400 R=0.9216 (51/50/47)
106
+ ipv4 F1=1.0000 P=1.0000 R=1.0000 (59/59/59)
107
+ ipv6 F1=1.0000 P=1.0000 R=1.0000 (21/21/21)
108
+ language F1=0.9000 P=0.9000 R=0.9000 (90/90/81)
109
+ last_name F1=0.9744 P=0.9767 R=0.9721 (646/643/628)
110
+ license_plate F1=1.0000 P=1.0000 R=1.0000 (55/55/55)
111
+ mac_address F1=1.0000 P=1.0000 R=1.0000 (30/30/30)
112
+ medical_record_number F1=1.0000 P=1.0000 R=1.0000 (103/103/103)
113
+ national_id F1=1.0000 P=1.0000 R=1.0000 (28/28/28)
114
+ occupation F1=0.6522 P=0.7721 R=0.5645 (372/272/210)
115
+ password F1=0.9217 P=0.9636 R=0.8833 (60/55/53)
116
+ phone_number F1=0.9751 P=0.9514 R=1.0000 (235/247/235)
117
+ pin F1=0.9302 P=0.8955 R=0.9677 (62/67/60)
118
+ political_view F1=0.8387 P=0.8125 R=0.8667 (45/48/39)
119
+ postcode F1=0.9934 P=0.9868 R=1.0000 (75/76/75)
120
+ race_ethnicity F1=0.8889 P=0.8889 R=0.8889 (81/81/72)
121
+ religious_belief F1=0.8936 P=0.8750 R=0.9130 (46/48/42)
122
+ sexuality F1=0.9667 P=1.0000 R=0.9355 (31/29/29)
123
+ ssn F1=0.9440 P=0.9365 R=0.9516 (62/63/59)
124
+ state F1=0.9198 P=0.9399 R=0.9005 (191/183/172)
125
+ street_address F1=0.9894 P=0.9842 R=0.9947 (188/190/187)
126
+ swift_bic F1=0.9905 P=0.9811 R=1.0000 (52/53/52)
127
+ tax_id F1=1.0000 P=1.0000 R=1.0000 (15/15/15)
128
+ time F1=0.8209 P=0.8514 R=0.7926 (188/175/149)
129
+ unique_id F1=0.9600 P=1.0000 R=0.9231 (13/12/12)
130
+ url F1=0.9725 P=0.9687 R=0.9763 (380/383/371)
131
+ user_name F1=0.9497 P=0.9264 R=0.9742 (155/163/151)
132
+ vehicle_identifier F1=0.9815 P=0.9636 R=1.0000 (53/55/53)
133
+ -- B: haremb --
134
+ account_number F1=0.9751 P=0.9716 R=0.9786 (140/141/137)
135
+ age F1=0.8571 P=0.8211 R=0.8966 (87/95/78)
136
+ api_key F1=0.9921 P=0.9844 R=1.0000 (63/64/63)
137
+ bank_routing_number F1=0.9933 P=1.0000 R=0.9867 (75/74/74)
138
+ biometric_identifier F1=1.0000 P=1.0000 R=1.0000 (113/113/113)
139
+ blood_type F1=1.0000 P=1.0000 R=1.0000 (62/62/62)
140
+ certificate_license_number F1=0.9855 P=0.9714 R=1.0000 (34/35/34)
141
+ city F1=0.8932 P=0.9109 R=0.8762 (210/202/184)
142
+ company_name F1=0.7766 P=0.8120 R=0.7442 (563/516/419)
143
+ coordinate F1=1.0000 P=1.0000 R=1.0000 (55/55/55)
144
+ country F1=0.9543 P=0.9457 R=0.9631 (217/221/209)
145
+ county F1=0.9340 P=0.9252 R=0.9429 (105/107/99)
146
+ credit_debit_card F1=0.9934 P=0.9868 R=1.0000 (150/152/150)
147
+ customer_id F1=0.9779 P=0.9707 R=0.9851 (202/205/199)
148
+ cvv F1=0.9792 P=0.9792 R=0.9792 (48/48/47)
149
+ date F1=0.9510 P=0.9599 R=0.9423 (814/799/767)
150
+ date_of_birth F1=0.9939 P=1.0000 R=0.9878 (164/162/162)
151
+ date_time F1=0.9635 P=0.9429 R=0.9851 (134/140/132)
152
+ device_identifier F1=0.9515 P=0.9423 R=0.9608 (51/52/49)
153
+ education_level F1=0.9091 P=0.9524 R=0.8696 (92/84/80)
154
+ email F1=0.9912 P=0.9883 R=0.9941 (511/514/508)
155
+ employee_id F1=0.9895 P=1.0000 R=0.9792 (96/94/94)
156
+ employment_status F1=0.9562 P=0.9600 R=0.9524 (126/125/120)
157
+ fax_number F1=0.9396 P=0.9722 R=0.9091 (77/72/70)
158
+ first_name F1=0.9299 P=0.9231 R=0.9369 (871/884/816)
159
+ gender F1=0.9870 P=0.9870 R=0.9870 (77/77/76)
160
+ health_plan_beneficiary_number F1=1.0000 P=1.0000 R=1.0000 (103/103/103)
161
+ http_cookie F1=0.9608 P=0.9608 R=0.9608 (51/51/49)
162
+ ipv4 F1=1.0000 P=1.0000 R=1.0000 (59/59/59)
163
+ ipv6 F1=1.0000 P=1.0000 R=1.0000 (21/21/21)
164
+ language F1=0.8966 P=0.9286 R=0.8667 (90/84/78)
165
+ last_name F1=0.9308 P=0.9457 R=0.9164 (646/626/592)
166
+ license_plate F1=1.0000 P=1.0000 R=1.0000 (55/55/55)
167
+ mac_address F1=1.0000 P=1.0000 R=1.0000 (30/30/30)
168
+ medical_record_number F1=0.9903 P=0.9903 R=0.9903 (103/103/102)
169
+ national_id F1=0.9825 P=0.9655 R=1.0000 (28/29/28)
170
+ occupation F1=0.5981 P=0.7440 R=0.5000 (372/250/186)
171
+ password F1=0.9391 P=0.9818 R=0.9000 (60/55/54)
172
+ phone_number F1=0.9730 P=0.9512 R=0.9957 (235/246/234)
173
+ pin F1=0.9508 P=0.9667 R=0.9355 (62/60/58)
174
+ political_view F1=0.8723 P=0.8367 R=0.9111 (45/49/41)
175
+ postcode F1=0.9934 P=0.9868 R=1.0000 (75/76/75)
176
+ race_ethnicity F1=0.8590 P=0.8933 R=0.8272 (81/75/67)
177
+ religious_belief F1=0.9348 P=0.9348 R=0.9348 (46/46/43)
178
+ sexuality F1=0.9492 P=1.0000 R=0.9032 (31/28/28)
179
+ ssn F1=0.9688 P=0.9394 R=1.0000 (62/66/62)
180
+ state F1=0.9105 P=0.9153 R=0.9058 (191/189/173)
181
+ street_address F1=0.9894 P=0.9894 R=0.9894 (188/188/186)
182
+ swift_bic F1=0.9905 P=0.9811 R=1.0000 (52/53/52)
183
+ tax_id F1=0.9655 P=1.0000 R=0.9333 (15/14/14)
184
+ time F1=0.8421 P=0.8786 R=0.8085 (188/173/152)
185
+ unique_id F1=0.8571 P=0.8000 R=0.9231 (13/15/12)
186
+ url F1=0.9752 P=0.9688 R=0.9816 (380/385/373)
187
+ user_name F1=0.9416 P=0.9477 R=0.9355 (155/153/145)
188
+ vehicle_identifier F1=0.9630 P=0.9455 R=0.9811 (53/55/52)
config.json ADDED
@@ -0,0 +1,788 @@
1
+ {
2
+ "architectures": [
3
+ "HaremPiiForTokenClassification"
4
+ ],
5
+ "attention_bias": true,
6
+ "attention_dropout": 0.0,
7
+ "auto_map": {
8
+ "AutoConfig": "configuration_haremb_pii.HaremPiiConfig",
9
+ "AutoModelForTokenClassification": "modeling_haremb_pii.HaremPiiForTokenClassification"
10
+ },
11
+ "bos_token_id": null,
12
+ "classifier_dropout": 0.0,
13
+ "default_n_ctx": 128000,
14
+ "dtype": "bfloat16",
15
+ "eos_token_id": 199999,
16
+ "head_dim": 64,
17
+ "hidden_act": "silu",
18
+ "hidden_size": 640,
19
+ "id2label": {
20
+ "0": "O",
21
+ "1": "B-account_number",
22
+ "2": "I-account_number",
23
+ "3": "E-account_number",
24
+ "4": "S-account_number",
25
+ "5": "B-age",
26
+ "6": "I-age",
27
+ "7": "E-age",
28
+ "8": "S-age",
29
+ "9": "B-api_key",
30
+ "10": "I-api_key",
31
+ "11": "E-api_key",
32
+ "12": "S-api_key",
33
+ "13": "B-bank_routing_number",
34
+ "14": "I-bank_routing_number",
35
+ "15": "E-bank_routing_number",
36
+ "16": "S-bank_routing_number",
37
+ "17": "B-biometric_identifier",
38
+ "18": "I-biometric_identifier",
39
+ "19": "E-biometric_identifier",
40
+ "20": "S-biometric_identifier",
41
+ "21": "B-blood_type",
42
+ "22": "I-blood_type",
43
+ "23": "E-blood_type",
44
+ "24": "S-blood_type",
45
+ "25": "B-certificate_license_number",
46
+ "26": "I-certificate_license_number",
47
+ "27": "E-certificate_license_number",
48
+ "28": "S-certificate_license_number",
49
+ "29": "B-city",
50
+ "30": "I-city",
51
+ "31": "E-city",
52
+ "32": "S-city",
53
+ "33": "B-company_name",
54
+ "34": "I-company_name",
55
+ "35": "E-company_name",
56
+ "36": "S-company_name",
57
+ "37": "B-coordinate",
58
+ "38": "I-coordinate",
59
+ "39": "E-coordinate",
60
+ "40": "S-coordinate",
61
+ "41": "B-country",
62
+ "42": "I-country",
63
+ "43": "E-country",
64
+ "44": "S-country",
65
+ "45": "B-county",
66
+ "46": "I-county",
67
+ "47": "E-county",
68
+ "48": "S-county",
69
+ "49": "B-credit_debit_card",
70
+ "50": "I-credit_debit_card",
71
+ "51": "E-credit_debit_card",
72
+ "52": "S-credit_debit_card",
73
+ "53": "B-customer_id",
74
+ "54": "I-customer_id",
75
+ "55": "E-customer_id",
76
+ "56": "S-customer_id",
77
+ "57": "B-cvv",
78
+ "58": "I-cvv",
79
+ "59": "E-cvv",
80
+ "60": "S-cvv",
81
+ "61": "B-date",
82
+ "62": "I-date",
83
+ "63": "E-date",
84
+ "64": "S-date",
85
+ "65": "B-date_of_birth",
86
+ "66": "I-date_of_birth",
87
+ "67": "E-date_of_birth",
88
+ "68": "S-date_of_birth",
89
+ "69": "B-date_time",
90
+ "70": "I-date_time",
91
+ "71": "E-date_time",
92
+ "72": "S-date_time",
93
+ "73": "B-device_identifier",
94
+ "74": "I-device_identifier",
95
+ "75": "E-device_identifier",
96
+ "76": "S-device_identifier",
97
+ "77": "B-education_level",
98
+ "78": "I-education_level",
99
+ "79": "E-education_level",
100
+ "80": "S-education_level",
101
+ "81": "B-email",
102
+ "82": "I-email",
103
+ "83": "E-email",
104
+ "84": "S-email",
105
+ "85": "B-employee_id",
106
+ "86": "I-employee_id",
107
+ "87": "E-employee_id",
108
+ "88": "S-employee_id",
109
+ "89": "B-employment_status",
110
+ "90": "I-employment_status",
111
+ "91": "E-employment_status",
112
+ "92": "S-employment_status",
113
+ "93": "B-fax_number",
114
+ "94": "I-fax_number",
115
+ "95": "E-fax_number",
116
+ "96": "S-fax_number",
117
+ "97": "B-first_name",
118
+ "98": "I-first_name",
119
+ "99": "E-first_name",
120
+ "100": "S-first_name",
121
+ "101": "B-gender",
122
+ "102": "I-gender",
123
+ "103": "E-gender",
124
+ "104": "S-gender",
125
+ "105": "B-health_plan_beneficiary_number",
126
+ "106": "I-health_plan_beneficiary_number",
127
+ "107": "E-health_plan_beneficiary_number",
128
+ "108": "S-health_plan_beneficiary_number",
129
+ "109": "B-http_cookie",
130
+ "110": "I-http_cookie",
131
+ "111": "E-http_cookie",
132
+ "112": "S-http_cookie",
133
+ "113": "B-ipv4",
134
+ "114": "I-ipv4",
135
+ "115": "E-ipv4",
136
+ "116": "S-ipv4",
137
+ "117": "B-ipv6",
138
+ "118": "I-ipv6",
139
+ "119": "E-ipv6",
140
+ "120": "S-ipv6",
141
+ "121": "B-language",
142
+ "122": "I-language",
143
+ "123": "E-language",
144
+ "124": "S-language",
145
+ "125": "B-last_name",
146
+ "126": "I-last_name",
147
+ "127": "E-last_name",
148
+ "128": "S-last_name",
149
+ "129": "B-license_plate",
150
+ "130": "I-license_plate",
151
+ "131": "E-license_plate",
152
+ "132": "S-license_plate",
153
+ "133": "B-mac_address",
154
+ "134": "I-mac_address",
155
+ "135": "E-mac_address",
156
+ "136": "S-mac_address",
157
+ "137": "B-medical_record_number",
158
+ "138": "I-medical_record_number",
159
+ "139": "E-medical_record_number",
160
+ "140": "S-medical_record_number",
161
+ "141": "B-national_id",
162
+ "142": "I-national_id",
163
+ "143": "E-national_id",
164
+ "144": "S-national_id",
165
+ "145": "B-occupation",
166
+ "146": "I-occupation",
167
+ "147": "E-occupation",
168
+ "148": "S-occupation",
169
+ "149": "B-password",
170
+ "150": "I-password",
171
+ "151": "E-password",
172
+ "152": "S-password",
173
+ "153": "B-phone_number",
174
+ "154": "I-phone_number",
175
+ "155": "E-phone_number",
176
+ "156": "S-phone_number",
177
+ "157": "B-pin",
178
+ "158": "I-pin",
179
+ "159": "E-pin",
180
+ "160": "S-pin",
181
+ "161": "B-political_view",
182
+ "162": "I-political_view",
183
+ "163": "E-political_view",
184
+ "164": "S-political_view",
185
+ "165": "B-postcode",
186
+ "166": "I-postcode",
187
+ "167": "E-postcode",
188
+ "168": "S-postcode",
189
+ "169": "B-race_ethnicity",
190
+ "170": "I-race_ethnicity",
191
+ "171": "E-race_ethnicity",
192
+ "172": "S-race_ethnicity",
193
+ "173": "B-religious_belief",
194
+ "174": "I-religious_belief",
195
+ "175": "E-religious_belief",
196
+ "176": "S-religious_belief",
197
+ "177": "B-sexuality",
198
+ "178": "I-sexuality",
199
+ "179": "E-sexuality",
200
+ "180": "S-sexuality",
201
+ "181": "B-ssn",
202
+ "182": "I-ssn",
203
+ "183": "E-ssn",
204
+ "184": "S-ssn",
205
+ "185": "B-state",
206
+ "186": "I-state",
207
+ "187": "E-state",
208
+ "188": "S-state",
209
+ "189": "B-street_address",
210
+ "190": "I-street_address",
211
+ "191": "E-street_address",
212
+ "192": "S-street_address",
213
+ "193": "B-swift_bic",
214
+ "194": "I-swift_bic",
215
+ "195": "E-swift_bic",
216
+ "196": "S-swift_bic",
217
+ "197": "B-tax_id",
218
+ "198": "I-tax_id",
219
+ "199": "E-tax_id",
220
+ "200": "S-tax_id",
221
+ "201": "B-time",
222
+ "202": "I-time",
223
+ "203": "E-time",
224
+ "204": "S-time",
225
+ "205": "B-unique_id",
226
+ "206": "I-unique_id",
227
+ "207": "E-unique_id",
228
+ "208": "S-unique_id",
229
+ "209": "B-url",
230
+ "210": "I-url",
231
+ "211": "E-url",
232
+ "212": "S-url",
233
+ "213": "B-user_name",
234
+ "214": "I-user_name",
235
+ "215": "E-user_name",
236
+ "216": "S-user_name",
237
+ "217": "B-vehicle_identifier",
238
+ "218": "I-vehicle_identifier",
239
+ "219": "E-vehicle_identifier",
240
+ "220": "S-vehicle_identifier"
241
+ },
242
+ "initial_context_length": 4096,
243
+ "initializer_range": 0.02,
244
+ "intermediate_size": 640,
245
+ "label2id": {
246
+ "B-account_number": 1,
247
+ "B-age": 5,
248
+ "B-api_key": 9,
249
+ "B-bank_routing_number": 13,
250
+ "B-biometric_identifier": 17,
251
+ "B-blood_type": 21,
252
+ "B-certificate_license_number": 25,
253
+ "B-city": 29,
254
+ "B-company_name": 33,
255
+ "B-coordinate": 37,
256
+ "B-country": 41,
257
+ "B-county": 45,
258
+ "B-credit_debit_card": 49,
259
+ "B-customer_id": 53,
260
+ "B-cvv": 57,
261
+ "B-date": 61,
262
+ "B-date_of_birth": 65,
263
+ "B-date_time": 69,
264
+ "B-device_identifier": 73,
265
+ "B-education_level": 77,
266
+ "B-email": 81,
267
+ "B-employee_id": 85,
268
+ "B-employment_status": 89,
269
+ "B-fax_number": 93,
270
+ "B-first_name": 97,
271
+ "B-gender": 101,
272
+ "B-health_plan_beneficiary_number": 105,
273
+ "B-http_cookie": 109,
274
+ "B-ipv4": 113,
275
+ "B-ipv6": 117,
276
+ "B-language": 121,
277
+ "B-last_name": 125,
278
+ "B-license_plate": 129,
279
+ "B-mac_address": 133,
280
+ "B-medical_record_number": 137,
281
+ "B-national_id": 141,
282
+ "B-occupation": 145,
283
+ "B-password": 149,
284
+ "B-phone_number": 153,
285
+ "B-pin": 157,
286
+ "B-political_view": 161,
287
+ "B-postcode": 165,
288
+ "B-race_ethnicity": 169,
289
+ "B-religious_belief": 173,
290
+ "B-sexuality": 177,
291
+ "B-ssn": 181,
292
+ "B-state": 185,
293
+ "B-street_address": 189,
294
+ "B-swift_bic": 193,
295
+ "B-tax_id": 197,
296
+ "B-time": 201,
297
+ "B-unique_id": 205,
298
+ "B-url": 209,
299
+ "B-user_name": 213,
300
+ "B-vehicle_identifier": 217,
301
+ "E-account_number": 3,
302
+ "E-age": 7,
303
+ "E-api_key": 11,
304
+ "E-bank_routing_number": 15,
305
+ "E-biometric_identifier": 19,
306
+ "E-blood_type": 23,
307
+ "E-certificate_license_number": 27,
308
+ "E-city": 31,
309
+ "E-company_name": 35,
310
+ "E-coordinate": 39,
311
+ "E-country": 43,
312
+ "E-county": 47,
313
+ "E-credit_debit_card": 51,
314
+ "E-customer_id": 55,
315
+ "E-cvv": 59,
316
+ "E-date": 63,
317
+ "E-date_of_birth": 67,
318
+ "E-date_time": 71,
319
+ "E-device_identifier": 75,
320
+ "E-education_level": 79,
321
+ "E-email": 83,
322
+ "E-employee_id": 87,
323
+ "E-employment_status": 91,
324
+ "E-fax_number": 95,
325
+ "E-first_name": 99,
326
+ "E-gender": 103,
327
+ "E-health_plan_beneficiary_number": 107,
328
+ "E-http_cookie": 111,
329
+ "E-ipv4": 115,
330
+ "E-ipv6": 119,
331
+ "E-language": 123,
332
+ "E-last_name": 127,
333
+ "E-license_plate": 131,
334
+ "E-mac_address": 135,
335
+ "E-medical_record_number": 139,
336
+ "E-national_id": 143,
337
+ "E-occupation": 147,
338
+ "E-password": 151,
339
+ "E-phone_number": 155,
340
+ "E-pin": 159,
341
+ "E-political_view": 163,
342
+ "E-postcode": 167,
343
+ "E-race_ethnicity": 171,
344
+ "E-religious_belief": 175,
345
+ "E-sexuality": 179,
346
+ "E-ssn": 183,
347
+ "E-state": 187,
348
+ "E-street_address": 191,
349
+ "E-swift_bic": 195,
350
+ "E-tax_id": 199,
351
+ "E-time": 203,
352
+ "E-unique_id": 207,
353
+ "E-url": 211,
354
+ "E-user_name": 215,
355
+ "E-vehicle_identifier": 219,
356
+ "I-account_number": 2,
357
+ "I-age": 6,
358
+ "I-api_key": 10,
359
+ "I-bank_routing_number": 14,
360
+ "I-biometric_identifier": 18,
361
+ "I-blood_type": 22,
362
+ "I-certificate_license_number": 26,
363
+ "I-city": 30,
364
+ "I-company_name": 34,
365
+ "I-coordinate": 38,
366
+ "I-country": 42,
367
+ "I-county": 46,
368
+ "I-credit_debit_card": 50,
369
+ "I-customer_id": 54,
370
+ "I-cvv": 58,
371
+ "I-date": 62,
372
+ "I-date_of_birth": 66,
373
+ "I-date_time": 70,
374
+ "I-device_identifier": 74,
375
+ "I-education_level": 78,
376
+ "I-email": 82,
377
+ "I-employee_id": 86,
378
+ "I-employment_status": 90,
379
+ "I-fax_number": 94,
380
+ "I-first_name": 98,
381
+ "I-gender": 102,
382
+ "I-health_plan_beneficiary_number": 106,
383
+ "I-http_cookie": 110,
384
+ "I-ipv4": 114,
385
+ "I-ipv6": 118,
386
+ "I-language": 122,
387
+ "I-last_name": 126,
388
+ "I-license_plate": 130,
389
+ "I-mac_address": 134,
390
+ "I-medical_record_number": 138,
391
+ "I-national_id": 142,
392
+ "I-occupation": 146,
393
+ "I-password": 150,
394
+ "I-phone_number": 154,
395
+ "I-pin": 158,
396
+ "I-political_view": 162,
397
+ "I-postcode": 166,
398
+ "I-race_ethnicity": 170,
399
+ "I-religious_belief": 174,
400
+ "I-sexuality": 178,
401
+ "I-ssn": 182,
402
+ "I-state": 186,
403
+ "I-street_address": 190,
404
+ "I-swift_bic": 194,
405
+ "I-tax_id": 198,
406
+ "I-time": 202,
407
+ "I-unique_id": 206,
408
+ "I-url": 210,
409
+ "I-user_name": 214,
410
+ "I-vehicle_identifier": 218,
411
+ "O": 0,
412
+ "S-account_number": 4,
413
+ "S-age": 8,
414
+ "S-api_key": 12,
415
+ "S-bank_routing_number": 16,
416
+ "S-biometric_identifier": 20,
417
+ "S-blood_type": 24,
418
+ "S-certificate_license_number": 28,
419
+ "S-city": 32,
420
+ "S-company_name": 36,
421
+ "S-coordinate": 40,
422
+ "S-country": 44,
423
+ "S-county": 48,
424
+ "S-credit_debit_card": 52,
425
+ "S-customer_id": 56,
426
+ "S-cvv": 60,
427
+ "S-date": 64,
428
+ "S-date_of_birth": 68,
429
+ "S-date_time": 72,
430
+ "S-device_identifier": 76,
431
+ "S-education_level": 80,
432
+ "S-email": 84,
433
+ "S-employee_id": 88,
434
+ "S-employment_status": 92,
435
+ "S-fax_number": 96,
436
+ "S-first_name": 100,
437
+ "S-gender": 104,
438
+ "S-health_plan_beneficiary_number": 108,
439
+ "S-http_cookie": 112,
440
+ "S-ipv4": 116,
441
+ "S-ipv6": 120,
442
+ "S-language": 124,
443
+ "S-last_name": 128,
444
+ "S-license_plate": 132,
445
+ "S-mac_address": 136,
446
+ "S-medical_record_number": 140,
447
+ "S-national_id": 144,
448
+ "S-occupation": 148,
449
+ "S-password": 152,
450
+ "S-phone_number": 156,
451
+ "S-pin": 160,
452
+ "S-political_view": 164,
453
+ "S-postcode": 168,
454
+ "S-race_ethnicity": 172,
455
+ "S-religious_belief": 176,
456
+ "S-sexuality": 180,
457
+ "S-ssn": 184,
458
+ "S-state": 188,
459
+ "S-street_address": 192,
460
+ "S-swift_bic": 196,
461
+ "S-tax_id": 200,
462
+ "S-time": 204,
463
+ "S-unique_id": 208,
464
+ "S-url": 212,
465
+ "S-user_name": 216,
466
+ "S-vehicle_identifier": 220
467
+ },
468
+ "max_position_embeddings": 131072,
469
+ "model_type": "haremb_pii",
470
+ "num_attention_heads": 14,
471
+ "num_experts_per_tok": 4,
472
+ "num_hidden_layers": 1,
473
+ "num_key_value_heads": 2,
474
+ "num_local_experts": 128,
475
+ "opf_metadata": {
476
+ "category_version": "nemotron_fine_v1",
477
+ "encoding": "o200k_base",
478
+ "inference_contract_version": 1,
479
+ "ner_class_names": [
480
+ "O",
481
+ "B-account_number",
482
+ "I-account_number",
483
+ "E-account_number",
484
+ "S-account_number",
485
+ "B-age",
486
+ "I-age",
487
+ "E-age",
488
+ "S-age",
489
+ "B-api_key",
490
+ "I-api_key",
491
+ "E-api_key",
492
+ "S-api_key",
493
+ "B-bank_routing_number",
494
+ "I-bank_routing_number",
495
+ "E-bank_routing_number",
496
+ "S-bank_routing_number",
497
+ "B-biometric_identifier",
498
+ "I-biometric_identifier",
499
+ "E-biometric_identifier",
500
+ "S-biometric_identifier",
501
+ "B-blood_type",
502
+ "I-blood_type",
503
+ "E-blood_type",
504
+ "S-blood_type",
505
+ "B-certificate_license_number",
506
+ "I-certificate_license_number",
507
+ "E-certificate_license_number",
508
+ "S-certificate_license_number",
509
+ "B-city",
510
+ "I-city",
511
+ "E-city",
512
+ "S-city",
513
+ "B-company_name",
514
+ "I-company_name",
515
+ "E-company_name",
516
+ "S-company_name",
517
+ "B-coordinate",
518
+ "I-coordinate",
519
+ "E-coordinate",
520
+ "S-coordinate",
521
+ "B-country",
522
+ "I-country",
523
+ "E-country",
524
+ "S-country",
525
+ "B-county",
526
+ "I-county",
527
+ "E-county",
528
+ "S-county",
529
+ "B-credit_debit_card",
530
+ "I-credit_debit_card",
531
+ "E-credit_debit_card",
532
+ "S-credit_debit_card",
533
+ "B-customer_id",
534
+ "I-customer_id",
535
+ "E-customer_id",
536
+ "S-customer_id",
537
+ "B-cvv",
538
+ "I-cvv",
539
+ "E-cvv",
540
+ "S-cvv",
541
+ "B-date",
542
+ "I-date",
543
+ "E-date",
544
+ "S-date",
545
+ "B-date_of_birth",
546
+ "I-date_of_birth",
547
+ "E-date_of_birth",
548
+ "S-date_of_birth",
549
+ "B-date_time",
550
+ "I-date_time",
551
+ "E-date_time",
552
+ "S-date_time",
553
+ "B-device_identifier",
554
+ "I-device_identifier",
555
+ "E-device_identifier",
556
+ "S-device_identifier",
557
+ "B-education_level",
558
+ "I-education_level",
559
+ "E-education_level",
560
+ "S-education_level",
561
+ "B-email",
562
+ "I-email",
563
+ "E-email",
564
+ "S-email",
565
+ "B-employee_id",
566
+ "I-employee_id",
567
+ "E-employee_id",
568
+ "S-employee_id",
569
+ "B-employment_status",
570
+ "I-employment_status",
571
+ "E-employment_status",
572
+ "S-employment_status",
573
+ "B-fax_number",
574
+ "I-fax_number",
575
+ "E-fax_number",
576
+ "S-fax_number",
577
+ "B-first_name",
578
+ "I-first_name",
579
+ "E-first_name",
580
+ "S-first_name",
581
+ "B-gender",
582
+ "I-gender",
583
+ "E-gender",
584
+ "S-gender",
585
+ "B-health_plan_beneficiary_number",
586
+ "I-health_plan_beneficiary_number",
587
+ "E-health_plan_beneficiary_number",
588
+ "S-health_plan_beneficiary_number",
589
+ "B-http_cookie",
590
+ "I-http_cookie",
591
+ "E-http_cookie",
592
+ "S-http_cookie",
593
+ "B-ipv4",
594
+ "I-ipv4",
595
+ "E-ipv4",
596
+ "S-ipv4",
597
+ "B-ipv6",
598
+ "I-ipv6",
599
+ "E-ipv6",
600
+ "S-ipv6",
601
+ "B-language",
602
+ "I-language",
603
+ "E-language",
604
+ "S-language",
605
+ "B-last_name",
606
+ "I-last_name",
607
+ "E-last_name",
608
+ "S-last_name",
609
+ "B-license_plate",
610
+ "I-license_plate",
611
+ "E-license_plate",
612
+ "S-license_plate",
613
+ "B-mac_address",
614
+ "I-mac_address",
615
+ "E-mac_address",
616
+ "S-mac_address",
617
+ "B-medical_record_number",
618
+ "I-medical_record_number",
619
+ "E-medical_record_number",
620
+ "S-medical_record_number",
621
+ "B-national_id",
622
+ "I-national_id",
623
+ "E-national_id",
624
+ "S-national_id",
625
+ "B-occupation",
626
+ "I-occupation",
627
+ "E-occupation",
628
+ "S-occupation",
629
+ "B-password",
630
+ "I-password",
631
+ "E-password",
632
+ "S-password",
633
+ "B-phone_number",
634
+ "I-phone_number",
635
+ "E-phone_number",
636
+ "S-phone_number",
637
+ "B-pin",
638
+ "I-pin",
639
+ "E-pin",
640
+ "S-pin",
641
+ "B-political_view",
642
+ "I-political_view",
643
+ "E-political_view",
644
+ "S-political_view",
645
+ "B-postcode",
646
+ "I-postcode",
647
+ "E-postcode",
648
+ "S-postcode",
649
+ "B-race_ethnicity",
650
+ "I-race_ethnicity",
651
+ "E-race_ethnicity",
652
+ "S-race_ethnicity",
653
+ "B-religious_belief",
654
+ "I-religious_belief",
655
+ "E-religious_belief",
656
+ "S-religious_belief",
657
+ "B-sexuality",
658
+ "I-sexuality",
659
+ "E-sexuality",
660
+ "S-sexuality",
661
+ "B-ssn",
662
+ "I-ssn",
663
+ "E-ssn",
664
+ "S-ssn",
665
+ "B-state",
666
+ "I-state",
667
+ "E-state",
668
+ "S-state",
669
+ "B-street_address",
670
+ "I-street_address",
671
+ "E-street_address",
672
+ "S-street_address",
673
+ "B-swift_bic",
674
+ "I-swift_bic",
675
+ "E-swift_bic",
676
+ "S-swift_bic",
677
+ "B-tax_id",
678
+ "I-tax_id",
679
+ "E-tax_id",
680
+ "S-tax_id",
681
+ "B-time",
682
+ "I-time",
683
+ "E-time",
684
+ "S-time",
685
+ "B-unique_id",
686
+ "I-unique_id",
687
+ "E-unique_id",
688
+ "S-unique_id",
689
+ "B-url",
690
+ "I-url",
691
+ "E-url",
692
+ "S-url",
693
+ "B-user_name",
694
+ "I-user_name",
695
+ "E-user_name",
696
+ "S-user_name",
697
+ "B-vehicle_identifier",
698
+ "I-vehicle_identifier",
699
+ "E-vehicle_identifier",
700
+ "S-vehicle_identifier"
701
+ ],
702
+ "span_class_names": [
703
+ "O",
704
+ "account_number",
705
+ "age",
706
+ "api_key",
707
+ "bank_routing_number",
708
+ "biometric_identifier",
709
+ "blood_type",
710
+ "certificate_license_number",
711
+ "city",
712
+ "company_name",
713
+ "coordinate",
714
+ "country",
715
+ "county",
716
+ "credit_debit_card",
717
+ "customer_id",
718
+ "cvv",
719
+ "date",
720
+ "date_of_birth",
721
+ "date_time",
722
+ "device_identifier",
723
+ "education_level",
724
+ "email",
725
+ "employee_id",
726
+ "employment_status",
727
+ "fax_number",
728
+ "first_name",
729
+ "gender",
730
+ "health_plan_beneficiary_number",
731
+ "http_cookie",
732
+ "ipv4",
733
+ "ipv6",
734
+ "language",
735
+ "last_name",
736
+ "license_plate",
737
+ "mac_address",
738
+ "medical_record_number",
739
+ "national_id",
740
+ "occupation",
741
+ "password",
742
+ "phone_number",
743
+ "pin",
744
+ "political_view",
745
+ "postcode",
746
+ "race_ethnicity",
747
+ "religious_belief",
748
+ "sexuality",
749
+ "ssn",
750
+ "state",
751
+ "street_address",
752
+ "swift_bic",
753
+ "tax_id",
754
+ "time",
755
+ "unique_id",
756
+ "url",
757
+ "user_name",
758
+ "vehicle_identifier"
759
+ ]
760
+ },
761
+ "output_router_logits": false,
762
+ "pad_token_id": 199999,
763
+ "rms_norm_eps": 1e-05,
764
+ "rope_parameters": {
765
+ "beta_fast": 32.0,
766
+ "beta_slow": 1.0,
767
+ "factor": 32.0,
768
+ "original_max_position_embeddings": 4096,
769
+ "rope_theta": 150000.0,
770
+ "rope_type": "yarn",
771
+ "truncate": false
772
+ },
773
+ "router_aux_loss_coef": 0.001,
774
+ "sliding_window": 128,
775
+ "tie_word_embeddings": false,
776
+ "transformers.js_config": {
777
+ "use_external_data_format": {
778
+ "model": 1,
779
+ "model.onnx": 3,
780
+ "model_fp16.onnx": 2
781
+ }
782
+ },
783
+ "transformers_version": "5.7.0",
784
+ "use_cache": true,
785
+ "use_viterbi_decode": true,
786
+ "viterbi_replace_logits": true,
787
+ "vocab_size": 200064
788
+ }
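
The head size above is easy to sanity-check: 221 classes = 1 `O` tag plus 4 BIOES tags (`B/I/E/S`) for each of the 55 span categories, i.e. 55 × 4 + 1 = 221. A minimal sketch of that check, where `"path/to/checkpoint"` is a placeholder for this repo id or a local clone:

```python
# Hypothetical sanity check; the path is a placeholder, the field names come
# from the config above.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("path/to/checkpoint", trust_remote_code=True)
categories = [c for c in cfg.opf_metadata["span_class_names"] if c != "O"]
assert len(categories) == 55
assert len(cfg.id2label) == 4 * len(categories) + 1  # B/I/E/S per category, plus O
```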
configuration_haremb_pii.py ADDED
@@ -0,0 +1,47 @@
+ """
+ HaremPiiConfig — subclass of OpenAIPrivacyFilterConfig that:
+   * sets `model_type="haremb_pii"` (so AutoConfig + auto_map dispatch works
+     with `trust_remote_code=True`)
+   * paired with HaremPiiForTokenClassification in modeling_haremb_pii.py
+     via `auto_map`
+
+ This release is a 1-layer surgical slice of the OpenMed teacher:
+   * num_hidden_layers=1
+   * inference-only — Viterbi decoding is built into the forward pass.
+ """
+ from __future__ import annotations
+
+ from transformers.models.openai_privacy_filter.configuration_openai_privacy_filter import (
+     OpenAIPrivacyFilterConfig,
+ )
+
+
+ class HaremPiiConfig(OpenAIPrivacyFilterConfig):
+     """
+     HarEmb config. `model_type="haremb_pii"` disambiguates from upstream so
+     AutoConfig + AutoModel mappings can target our subclasses without
+     colliding with the registered OpenAIPrivacyFilterConfig entry.
+     `modeling_haremb_pii` performs the auto-registration at import time.
+     """
+     model_type = "haremb_pii"
+
+     def __init__(
+         self,
+         use_viterbi_decode: bool = True,
+         viterbi_replace_logits: bool = True,
+         **kwargs,
+     ):
+         super().__init__(**kwargs)
+         # When True (and model is in eval mode), HaremPiiForTokenClassification.forward
+         # runs constrained BIOES Viterbi over logits and attaches `predicted_labels`
+         # to the output. Set to False to skip Viterbi entirely.
+         self.use_viterbi_decode = bool(use_viterbi_decode)
+         # When True (and Viterbi is on), forward replaces `outputs.logits` with a
+         # one-hot-shaped tensor whose argmax equals the Viterbi prediction. This
+         # makes HF `pipeline()` and any naive `logits.argmax(-1)` consumer use
+         # Viterbi predictions automatically. The raw logits are preserved on
+         # the output as `raw_logits`.
+         self.viterbi_replace_logits = bool(viterbi_replace_logits)
+
+
+ __all__ = ["HaremPiiConfig"]
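
A minimal usage sketch of the two flags documented above, assuming only the loading path the docstrings describe; `"path/to/checkpoint"` is a placeholder for this repo id or a local clone:

```python
# Hypothetical usage sketch; the flag names come from HaremPiiConfig above.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

repo = "path/to/checkpoint"  # placeholder
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForTokenClassification.from_pretrained(
    repo, trust_remote_code=True
).eval()  # Viterbi only runs in eval mode

enc = tok("email sarah.johnson@example.com", return_tensors="pt")
with torch.no_grad():
    out = model(**enc)

# With both flags at their defaults, argmax over logits already equals the
# Viterbi path, and the unconstrained scores survive as `raw_logits`.
print(out.predicted_labels)    # constrained-BIOES tag ids
print(out.logits.argmax(-1))   # same path, via the replaced logits

model.config.use_viterbi_decode = False        # skip Viterbi entirely
# model.config.viterbi_replace_logits = False  # keep Viterbi, leave logits raw
```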
eval_confusion.png ADDED
Git LFS Details
  • SHA256: d3f48f7c2692cd641800d883952b19b7aa3ca2b1e3cfbe9846f082dcbd2a2b16
  • Pointer size: 131 Bytes
  • Size of remote file: 114 kB
eval_performance.png ADDED
eval_summary.png ADDED
Git LFS Details
  • SHA256: 77c7da68b68cfa9223fc8423ad2be6124e3c14e6d9f49d1d683db2630f2f5b71
  • Pointer size: 131 Bytes
  • Size of remote file: 167 kB
haremb.png ADDED
Git LFS Details
  • SHA256: 3d53d359c14545f778e4789ac53d40eb8a4b0fc0ed7bf05a72c101ee60e33db1
  • Pointer size: 131 Bytes
  • Size of remote file: 767 kB
infer.log ADDED
@@ -0,0 +1,51 @@
+ Inference benchmark: A: openmed-base vs B: haremb
+ device : cuda dtype: torch.bfloat16
+ ctx    : 1024
+
+ A: openmed-base (reference / teacher)
+   load : 0.71s
+   eval : 64.66s on 212,909 tokens (3293 tok/s)
+   Performance:
+     total params              : 1399.61M (139.35M dense + 1260.26M MoE-experts)
+     active params / token     : 178.73M (memory footprint — embed lookup + top_4/128 experts: 128.04M embed + 39.38M MoE-active + 11.31M attn/norm/head)
+     compute params / token    : 50.69M (matmul FLOPs only — embedding lookup excluded)
+     GFLOP / token (fwd, MAC×2): 0.101
+     weights size (on disk)    : —
+     weights size (in RAM)     : 2.61 GiB
+     weights resident (GPU)    : 2.61 GiB
+     peak GPU mem (eval, ctx=1024) : 3.30 GiB
+
+ B: haremb (this checkpoint)
+   load : 0.10s
+   eval : 33.56s on 212,909 tokens (6343 tok/s)
+   Performance:
+     total params              : 287.11M (129.58M dense + 157.53M MoE-experts)
+     active params / token     : 134.50M (memory footprint — embed lookup + top_4/128 experts: 128.04M embed + 4.92M MoE-active + 1.54M attn/norm/head)
+     compute params / token    : 6.46M (matmul FLOPs only — embedding lookup excluded)
+     GFLOP / token (fwd, MAC×2): 0.013
+     weights size (on disk)    : 547.6 MiB
+     weights size (in RAM)     : 547.6 MiB
+     weights resident (GPU)    : 548.3 MiB
+     peak GPU mem (eval, ctx=1024) : 1.22 GiB
+
+ B vs A (haremb vs openmed-base):
+   total params           : 4.87× smaller
+   active params / token  : 1.33× less [memory]
+   compute params / token : 7.85× cheaper [FLOPs]
+   GFLOP / token          : 7.85× cheaper
+   weights size (on disk) : —
+   weights in RAM         : 4.87× smaller
+   peak GPU mem (eval)    : 2.70× less
+   throughput             : 1.93× faster
+
+ Sample inference (load → tokenize → forward → viterbi-decode → spans):
+   text: 'Patient Sarah Johnson (DOB 03/15/1985), MRN 4872910, phone 415-555-0123, email sarah.johnson@example.com, credit card 4111-1111-1111-1111.'
+   forward latency: 65.8ms (53 tokens)
+   detected 7 spans:
+     [ 1,  2) first_name        'Sarah'
+     [ 2,  3) last_name         'Johnson'
+     [ 6, 12) date              '03/15/1985'
+     [16, 19) phone_number      '4872910'
+     [22, 28) phone_number      '415-555-0123'
+     [30, 37) email             'sarah.johnson@example.com'
+     [41, 52) credit_debit_card '4111-1111-1111-1111'
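
The last step of that pipeline (tags → spans) is not shown in the log; a minimal sketch of what such a collapse could look like (this helper is illustrative and not shipped in the repo):

```python
def bioes_spans(tag_ids, id2label):
    """Collapse a constrained-BIOES tag sequence into (start, end, category) spans."""
    spans, start, cat = [], None, None
    for i, tid in enumerate(tag_ids):
        label = id2label[int(tid)] if int(tid) >= 0 else "O"  # -1 marks padding
        prefix, _, category = label.partition("-")
        if prefix == "S":                          # single-token entity
            spans.append((i, i + 1, category))
        elif prefix == "B":                        # entity opens
            start, cat = i, category
        elif prefix == "E" and start is not None:  # entity closes
            spans.append((start, i + 1, cat))
            start, cat = None, None
    return spans
```

Because the decoder only emits legal transitions, every `B` is eventually closed by an `E`, so a helper like this never leaves a dangling span.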
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:64059006ac732bd608cc478ee579cc96f56174d8910603b6d4747688b130b8a2
+ size 574224842
modeling_haremb_pii.py ADDED
@@ -0,0 +1,270 @@
+ """
+ HaremPii — 1-layer surgical inference wrapper over OpenAI Privacy Filter.
+
+ Defines:
+   * HaremPiiForTokenClassification — subclass of
+     OpenAIPrivacyFilterForTokenClassification. Reuses the upstream forward
+     pass and adds eval-time constrained-BIOES Viterbi decoding so
+     `outputs.logits.argmax(-1)` returns the Viterbi path.
+   * HaremPiiModel — encoder alias pinned to HaremPiiConfig.
+
+ The model class is auto-registered so
+ `AutoModelForTokenClassification.from_pretrained(repo, trust_remote_code=True)`
+ dispatches to us via `config.auto_map` (model_type "haremb_pii").
+
+ This file is the released, inference-only copy. It contains no
+ training-related utilities.
+ """
+ from __future__ import annotations
+
+ from typing import Optional
+
+ import torch
+ import torch.nn as nn
+ from transformers import (
+     AutoConfig,
+     AutoModel,
+     AutoModelForTokenClassification,
+ )
+ from transformers.models.openai_privacy_filter.modeling_openai_privacy_filter import (
+     OpenAIPrivacyFilterForTokenClassification,
+     OpenAIPrivacyFilterModel,
+ )
+
+ from configuration_haremb_pii import HaremPiiConfig
+
+
+ # ---------------------------------------------------------------------------
+ # Constrained BIOES Viterbi (inlined so the checkpoint is self-contained)
+ # ---------------------------------------------------------------------------
+ # Transition rules:
+ #   O   -> {O, B-X, S-X}
+ #   B-X -> {I-X, E-X}
+ #   I-X -> {I-X, E-X}
+ #   E-X -> {O, B-Y, S-Y}
+ #   S-X -> {O, B-Y, S-Y}
+ # Initial state allows {O, B-X, S-X} only.
+
+ def _parse_bioes(label: str):
+     if label == "O" or "-" not in label:
+         return "O", None
+     pref, cat = label.split("-", 1)
+     return pref, cat
+
+
+ def _build_bioes_transition_mask(id2label) -> torch.Tensor:
+     C = len(id2label)
+     mask = torch.full((C, C), float("-inf"))
+     parsed = {i: _parse_bioes(id2label[i]) for i in range(C)}
+     for i, (p_prev, c_prev) in parsed.items():
+         for j, (p_cur, c_cur) in parsed.items():
+             ok = False
+             if p_prev == "O":
+                 if p_cur in ("O", "B", "S"):
+                     ok = True
+             elif p_prev == "B":
+                 if p_cur in ("I", "E") and c_cur == c_prev:
+                     ok = True
+             elif p_prev == "I":
+                 if p_cur in ("I", "E") and c_cur == c_prev:
+                     ok = True
+             elif p_prev in ("E", "S"):
+                 if p_cur in ("O", "B", "S"):
+                     ok = True
+             if ok:
+                 mask[i, j] = 0.0
+     return mask
+
+
+ def _build_bioes_initial_mask(id2label) -> torch.Tensor:
+     C = len(id2label)
+     mask = torch.full((C,), float("-inf"))
+     for i, lbl in id2label.items():
+         p, _ = _parse_bioes(lbl)
+         if p in ("O", "B", "S"):
+             mask[i] = 0.0
+     return mask
+
+
+ def _bioes_viterbi(
+     logits: torch.Tensor,
+     transition_mask: torch.Tensor,
+     initial_mask: torch.Tensor,
+ ) -> torch.Tensor:
+     if logits.dim() != 2:
+         raise ValueError(f"expected 2D logits, got {logits.shape}")
+     T = logits.shape[0]
+     mask = torch.ones((1, T), dtype=torch.long, device=logits.device)
+     out = _bioes_viterbi_batched(
+         logits.unsqueeze(0), mask, transition_mask, initial_mask,
+     )
+     return out[0]
+
+
+ def _bioes_viterbi_batched(
+     logits: torch.Tensor,
+     attention_mask: torch.Tensor,
+     transition_mask: torch.Tensor,
+     initial_mask: torch.Tensor,
+ ) -> torch.Tensor:
+     """Vectorized constrained BIOES Viterbi.
+
+     Args:
+         logits: [B, T, C] float
+         attention_mask: [B, T] {0, 1} long/bool
+         transition_mask: [C, C] 0 valid, -inf invalid
+         initial_mask: [C] 0 allowed first tag, -inf forbidden
+
+     Returns:
+         [B, T] LongTensor of best constrained-BIOES tag id per token; padded
+         positions hold -1.
+     """
+     if logits.dim() != 3:
+         raise ValueError(f"expected 3D logits [B,T,C], got {logits.shape}")
+     device = logits.device
+     B, T, C = logits.shape
+     scores = logits.float()
+     trans = transition_mask.to(device).float()
+     init = initial_mask.to(device).float()
+     mask = attention_mask.to(device).bool()
+
+     dp = scores[:, 0] + init.unsqueeze(0)
+     back = torch.zeros((B, T, C), dtype=torch.long, device=device)
+     trans_b = trans.unsqueeze(0)
+     for t in range(1, T):
+         cand = dp.unsqueeze(2) + trans_b
+         best_val, best_prev = cand.max(dim=1)
+         new_dp = best_val + scores[:, t]
+         keep = mask[:, t].unsqueeze(1)
+         dp = torch.where(keep, new_dp, dp)
+         back[:, t] = best_prev
+
+     last_t = (mask.sum(dim=1) - 1).clamp_min(0)
+     # A valid BIOES sequence may not end inside a span (on B-/I-). A tag is a
+     # legal terminal exactly when it can transition into some legal initial
+     # tag, so the terminal mask falls out of the two masks we already have.
+     final_ok = torch.isfinite(trans + init.unsqueeze(0)).any(dim=1)
+     best_last = dp.masked_fill(~final_ok.unsqueeze(0), float("-inf")).argmax(dim=1)
+     out = torch.full((B, T), -1, dtype=torch.long, device=device)
+     batch_idx = torch.arange(B, device=device)
+     out[batch_idx, last_t] = best_last
+     current = best_last.clone()
+     for t in range(T - 1, 0, -1):
+         new_current = torch.gather(
+             back[:, t, :], 1, current.unsqueeze(1)
+         ).squeeze(1)
+         active = (t <= last_t)
+         current = torch.where(active, new_current, current)
+         out[batch_idx, t - 1] = torch.where(
+             active, current, out[batch_idx, t - 1],
+         )
+     return out
+
+
+ # ---------------------------------------------------------------------------
+ # Architecture classes
+ # ---------------------------------------------------------------------------
+
+ class HaremPiiModel(OpenAIPrivacyFilterModel):
+     """Thin alias of the upstream encoder pinned to HaremPiiConfig."""
+     config_class = HaremPiiConfig
+
+
+ class HaremPiiForTokenClassification(OpenAIPrivacyFilterForTokenClassification):
+     """1-layer student. Wraps the upstream forward with eval-time
+     constrained-BIOES Viterbi decoding."""
+     config_class = HaremPiiConfig
+
+     def __init__(self, config: HaremPiiConfig):
+         # Bypass GenericForTokenClassification.__init__ because it calls
+         # AutoModel.from_config(config), which uses type(config) as the
+         # registry key. Under the trust_remote_code Hub-loading path the
+         # cached HaremPiiConfig class identity differs from whatever was
+         # registered at module import (the cache hosts the class under a
+         # synthetic, sha-qualified module name). Constructing the encoder
+         # directly avoids the registry dispatch entirely.
+         from transformers.modeling_utils import PreTrainedModel as _PreTrainedModel
+         _PreTrainedModel.__init__(self, config)
+         self.num_labels = config.num_labels
+         self.model = OpenAIPrivacyFilterModel(config)
+         if getattr(config, "classifier_dropout", None) is not None:
+             classifier_dropout = config.classifier_dropout
+         elif getattr(config, "hidden_dropout", None) is not None:
+             classifier_dropout = config.hidden_dropout
+         else:
+             classifier_dropout = 0.1
+         self.dropout = nn.Dropout(classifier_dropout)
+         self.score = nn.Linear(config.hidden_size, config.num_labels)
+         self.post_init()
+
+         self._viterbi_trans_mask = None
+         self._viterbi_init_mask = None
+
+     def _ensure_viterbi_masks(self):
+         if self._viterbi_trans_mask is None:
+             id2label = {int(k): v for k, v in self.config.id2label.items()}
+             self._viterbi_trans_mask = _build_bioes_transition_mask(id2label)
+             self._viterbi_init_mask = _build_bioes_initial_mask(id2label)
+         return self._viterbi_trans_mask, self._viterbi_init_mask
+
+     @torch.no_grad()
+     def decode_predictions(
+         self,
+         logits: torch.Tensor,
+         attention_mask: Optional[torch.Tensor] = None,
+     ) -> torch.Tensor:
+         trans, init = self._ensure_viterbi_masks()
+         if logits.dim() == 2:
+             # Single unbatched sequence: reuse the 2D helper directly.
+             return _bioes_viterbi(logits, trans, init)
+         if attention_mask is None:
+             attention_mask = torch.ones(
+                 logits.shape[:2], dtype=torch.long, device=logits.device,
+             )
+         return _bioes_viterbi_batched(logits, attention_mask, trans, init)
+
+     def forward(self, *args, **kwargs):
+         outputs = super().forward(*args, **kwargs)
+         if self.training:
+             return outputs
+         if not getattr(self.config, "use_viterbi_decode", True):
+             return outputs
+
+         attn_mask = kwargs.get("attention_mask", None)
+         if attn_mask is None and len(args) >= 2:
+             attn_mask = args[1]
+
+         decoded = self.decode_predictions(outputs.logits, attention_mask=attn_mask)
+         try:
+             outputs.predicted_labels = decoded
+         except Exception:
+             outputs.__dict__["predicted_labels"] = decoded
+
+         if getattr(self.config, "viterbi_replace_logits", True):
+             raw = outputs.logits
+             fake = torch.full_like(raw, fill_value=-1e9)
+             fake.scatter_(-1, decoded.clamp_min(0).unsqueeze(-1), 1e9)
+             try:
+                 outputs.raw_logits = raw
+                 outputs.logits = fake
+             except Exception:
+                 outputs.__dict__["raw_logits"] = raw
+                 outputs.__dict__["logits"] = fake
+         return outputs
+
+
+ # --- Auto-registry ---
+ AutoConfig.register("haremb_pii", HaremPiiConfig, exist_ok=True)
+ AutoModel.register(HaremPiiConfig, HaremPiiModel, exist_ok=True)
+ AutoModelForTokenClassification.register(
+     HaremPiiConfig, HaremPiiForTokenClassification, exist_ok=True,
+ )
+
+ HaremPiiConfig.register_for_auto_class("AutoConfig")
+ HaremPiiForTokenClassification.register_for_auto_class("AutoModelForTokenClassification")
+
+
+ __all__ = [
+     "HaremPiiConfig",
+     "HaremPiiModel",
+     "HaremPiiForTokenClassification",
+ ]
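
A small sanity sketch of the mask semantics above, assuming the file is importable as `modeling_haremb_pii` (the toy label set is illustrative):

```python
import torch
from modeling_haremb_pii import (
    _bioes_viterbi_batched,
    _build_bioes_initial_mask,
    _build_bioes_transition_mask,
)

id2label = {0: "O", 1: "B-email", 2: "I-email", 3: "E-email", 4: "S-email"}
trans = _build_bioes_transition_mask(id2label)
init = _build_bioes_initial_mask(id2label)

assert torch.isinf(trans[0, 2])  # O -> I-email is forbidden
assert trans[1, 3] == 0.0        # B-email -> E-email is allowed
assert torch.isinf(init[2])      # a sequence cannot open inside a span

# Logits whose naive argmax would be the illegal path I,I,I still decode legally:
logits = torch.zeros(1, 3, 5)
logits[0, :, 2] = 5.0
mask = torch.ones(1, 3, dtype=torch.long)
print(_bioes_viterbi_batched(logits, mask, trans, init))
# tensor([[1, 2, 3]])  -> B-email, I-email, E-email
```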
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0614fe83cadab421296e664e1f48f4261fa8fef6e03e63bb75c20f38e37d07d3
+ size 27868174
tokenizer_config.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "backend": "tokenizers",
+   "eos_token": "<|endoftext|>",
+   "is_local": false,
+   "local_files_only": false,
+   "model_input_names": [
+     "input_ids",
+     "attention_mask"
+   ],
+   "model_max_length": 128000,
+   "pad_token": "<|endoftext|>",
+   "tokenizer_class": "TokenizersBackend"
+ }
+ }