louis030195 committed
Commit 3b6d06c · verified · 1 Parent(s): 2af8f9a

upstream-license compliance + rigor disclaimer + correct email

Files changed (1)
  1. README.md +31 -4
README.md CHANGED
@@ -30,7 +30,7 @@ metrics:
 extra_gated_prompt: >-
   This model is licensed CC BY-NC 4.0 (non-commercial). For commercial
   use — production deployment, SaaS / API embedding, agent privacy
-  middleware, custom fine-tunes — contact hi@screenpi.pe.
+  middleware, custom fine-tunes — contact hi@louis030195.com.
 ---
 
 # screenpipe-pii-redactor
@@ -64,7 +64,7 @@ strings, private-key block markers, password prompts).
 
 > **License: CC BY-NC 4.0** (non-commercial). For commercial use —
 > production redaction, SaaS / API embedding, AI-agent privacy
-> middleware, custom fine-tunes — contact **hi@screenpi.pe**. See
+> middleware, custom fine-tunes — contact **hi@louis030195.com**. See
 > [`LICENSE`](LICENSE).
 
 ## TL;DR
@@ -181,6 +181,33 @@ zero-leak rate.
 | previous internal version | 74.5% (71.8–77.5) | 9.1% | 0.763 | 0.932 |
 | OpenAI Privacy Filter (base) | 14.0% (11.7–16.2) | 16.5% | 0.591 | 0.579 |
 
+> **What "14% zero-leak" for the base actually means** (read this before
+> citing the gap). Zero-leak is a strict, taxonomy-coupled metric: a
+> single example counts as "leaked" if the model misses ANY gold span
+> in it under our 12-class label mapping. The published OpenAI Privacy
+> Filter result is **F1 ≈ 96%** on PII-Masking-300k under THEIR
+> 49-class taxonomy — that's a much more lenient setup. The base scores
+> 14% zero-leak under our metric for two compounding reasons:
+>
+> 1. **Label-space mismatch** dominates. We map 28 source 300k labels
+>    into our 12 classes; the base model can't predict our label names.
+>    On categories where the base's native taxonomy DOES align with ours
+>    (`private_email`, `private_phone`, `private_url`, `secret`), the
+>    base scores **0.90–1.00 recall** — strong. On categories where it
+>    doesn't (`private_id` covering IDCARD/SOCIALNUMBER/PASSPORT,
+>    `private_handle` covering USERNAME), it scores **0.00** by
+>    definition because it never emits the right label.
+> 2. **Zero-leak is all-or-nothing per example.** With ~6 spans per
+>    300k example and any unmappable category present, the base fails
+>    the whole example. Token-level F1 (0.591 above) is the more honest
+>    cross-comparison number.
+>
+> The +63 pp claim **is** real and useful for the deployment context
+> (anyone shipping a system that needs the screenpipe 12-class
+> taxonomy gets +63 pp out of the box vs the base). It would be
+> misleading to read it as "this model is 5× more accurate at PII
+> detection" — that's not what the metric measures.
+
 ### Multilingual generalization (n=200 per language)
 
 This model was trained on English-only data. Cross-language transfer:
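The all-or-nothing semantics of the zero-leak metric described in the added disclaimer can be sketched in a few lines. This is a minimal illustration, not the project's actual eval harness: the source-label map and the toy examples are hypothetical, and only the class names follow the README's taxonomy.

```python
# Sketch of example-level "zero-leak" vs. span-level recall.
# LABEL_MAP and the examples below are illustrative, not the real mapping.
LABEL_MAP = {
    "EMAIL": "private_email",
    "TEL": "private_phone",
    "USERNAME": "private_handle",   # a class the base model never emits
    "SOCIALNUMBER": "private_id",   # likewise unmappable for the base
}

def is_leak_free(gold_spans, pred_spans):
    """An example is leak-free only if EVERY mapped gold span is predicted."""
    gold = {(s, e, LABEL_MAP[label]) for (s, e, label) in gold_spans}
    return gold <= set(pred_spans)

def zero_leak_rate(examples):
    """Fraction of examples with no missed gold span (all-or-nothing)."""
    return sum(is_leak_free(g, p) for g, p in examples) / len(examples)

def span_recall(examples):
    """Span-level recall: partial credit per span, unlike zero-leak."""
    total = found = 0
    for gold_spans, pred_spans in examples:
        preds = set(pred_spans)
        for (s, e, label) in gold_spans:
            total += 1
            found += (s, e, LABEL_MAP[label]) in preds
    return found / total

# Example 1 has 3 gold spans and the model finds 2, so the WHOLE example
# counts as leaked; example 2 is fully clean.
examples = [
    (
        [(0, 5, "EMAIL"), (10, 14, "TEL"), (20, 28, "USERNAME")],
        [(0, 5, "private_email"), (10, 14, "private_phone")],
    ),
    (
        [(3, 9, "EMAIL")],
        [(3, 9, "private_email")],
    ),
]

print(zero_leak_rate(examples))  # 0.5 — example 1 leaks despite 2/3 spans found
print(span_recall(examples))     # 0.75 — 3 of 4 gold spans found overall
```

This is why a model that never emits an unmappable class (e.g. `private_handle`) fails every example containing one, even with high recall on the classes it does cover.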
@@ -262,7 +289,7 @@ python examples/inference.py
 ```
 
 Verifying the eval scores requires the held-out benchmark. Contact
-**hi@screenpi.pe** for benchmark access if you have a research or
+**hi@louis030195.com** for benchmark access if you have a research or
 commercial use case.
 
 ## License
@@ -270,7 +297,7 @@ commercial use case.
 [CC BY-NC 4.0](LICENSE) — non-commercial use only.
 
 For commercial licensing (production deployment, redistribution rights,
-SaaS / API embedding, custom fine-tunes for your domain): **hi@screenpi.pe**.
+SaaS / API embedding, custom fine-tunes for your domain): **hi@louis030195.com**.
 
 ## Citation
 
 