# screenpipe-pii-redactor

> A [screenpipe](https://screenpi.pe) project.

A fine-tuned PII redactor for the **three surfaces an AI agent actually
sees a user's machine through**:

1. **Accessibility-tree dumps** — the structured AX hierarchy macOS /
   Windows expose to assistive tech. Short, structured, often containing
   labels like `AXButton[Send to marcus@helios-ai.io]`.
2. **OCR'd screen text** — what tools like screenpipe extract from
   screen recordings. A mix of window-title-shaped artifacts, app chrome,
   and occasional long-form text (emails, docs).
3. **Computer-use traces** — what an agentic model (Claude Computer Use,
   OpenAI Operator, etc.) reads when it controls a desktop. A mix of all
   of the above plus interaction-trace metadata.

These surfaces are short, sparse-context, and full of identifiers that
slip past redactors trained on chat-style prose. This model is fine-tuned
specifically for them — while still handling long-form text (chat
transcripts, document bodies, support tickets) at competitive accuracy.

Built on top of the [OpenAI Privacy Filter](https://github.com/openai/privacy-filter)
(1.5B parameters, 50M active). Fine-tuned on a mixed corpus combining
synthetic accessibility / window-title / OCR data, a slice of
[ai4privacy/pii-masking-300k](https://huggingface.co/datasets/ai4privacy/pii-masking-300k),
and targeted secret-shape augmentation (API keys, JWTs, DB connection
strings, private-key block markers, password prompts).

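Those secret shapes can also be pre-screened deterministically before (or alongside) the model. A minimal sketch; the regexes below are illustrative assumptions, not the augmentation patterns actually used in training:

```python
# Illustrative regexes for a few of the secret shapes named above
# (JWTs, DB connection strings, private-key block markers). These are
# sketch patterns, NOT the model's training spec: the model is learned.
import re

SECRET_SHAPES = {
    "jwt": re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+"),
    "db_url": re.compile(r"\b\w+://\w+:[^@\s]+@[\w.-]+(?::\d+)?/\w+"),
    "key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}

def shape_hits(text: str) -> list[str]:
    """Names of the secret shapes that appear anywhere in `text`."""
    return [name for name, pat in SECRET_SHAPES.items() if pat.search(text)]

print(shape_hits("postgres://admin:hunter2@db.internal:5432/prod"))  # ['db_url']
```

A regex pass like this catches fixed-format secrets cheaply; the model exists for everything regexes can't pin down (names, addresses, free-form identifiers).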
> **License: CC BY-NC 4.0** (non-commercial). For commercial use —
> production redaction, SaaS / API embedding, AI-agent privacy
> middleware, custom fine-tunes — contact **hi@screenpi.pe**. See
> [`LICENSE`](LICENSE).

## TL;DR

| | base OPF | **this model** | gap |
|---|---:|---:|---:|
| Accessibility / window-title PII zero-leak (n=422) | 38.6% (33.6–43.8) | **79.1% (74.8–83.5)** | **+40.5 pp** |
| Long-form text PII zero-leak — PII-Masking-300k EN (n=1000) | 14.0% (11.7–16.2) | **77.5% (74.5–80.3)** | **+63.5 pp** |
| Macro-F1 on 300k EN | 0.591 | **0.934** | +0.343 |
| Targeted secret-redaction probe (n=34 realistic shapes) | not measured | **31/34 strict** | — |
| p50 inference latency (CUDA) | ~23 ms | ~23 ms | flat |

All gaps statistically significant (non-overlapping 95 % bootstrap CIs).

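For readers who want to reproduce the CI math: a plain percentile bootstrap over per-example zero-leak outcomes gives intervals of the quoted width. The outcomes below are synthetic, and the private benchmark's exact resampling procedure is an assumption:

```python
# Percentile-bootstrap 95% CI for a zero-leak rate. Outcomes are
# synthetic (775 passes out of n=1000, matching the long-form row);
# the benchmark's exact procedure may differ from this sketch.
import random

def bootstrap_ci(successes: int, n: int, resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    rng = random.Random(seed)
    outcomes = [1] * successes + [0] * (n - successes)
    rates = sorted(
        sum(rng.choice(outcomes) for _ in range(n)) / n
        for _ in range(resamples)
    )
    lo = rates[int(resamples * alpha / 2)]            # 2.5th percentile
    hi = rates[int(resamples * (1 - alpha / 2)) - 1]  # 97.5th percentile
    return lo, hi

lo, hi = bootstrap_ci(775, 1000)
print(f"{lo:.1%}-{hi:.1%}")  # close in width to the table's 74.5-80.3 interval
```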
## Why this exists (vs the base Privacy Filter)

The OpenAI Privacy Filter (and most other public PII redactors) is
trained on prose-shaped data — letters, Q&A turns, chat corpora.
A typical accessibility-tree node, OCR'd window title, or computer-use
log line looks nothing like that:

```
AXButton[Send to marcus@helios-ai.io]
Welcome | Acme Corp | xAI Console
[ScreenCapture 09:14] Slack — #compai-tessera (12 unread)
```

These are 30-character strings with one or two PII tokens and almost
no surrounding context. A model trained on chat corpora will conflate
brand names with people, miss `Arc | Marcus Chen` because it expects
sentence context, and tag `Raycast` and `Claude` as people. The base
Privacy Filter scored 38.6 % zero-leak on this surface; this model
scores 79.1 %.

If you're building an **agentic system that reads screen state** — a
desktop-control agent, a memory layer for browsing, anything that
streams accessibility/OCR/screen-capture data into an LLM — this is
the redactor designed for that pipe.

## What it does

Span-level redaction. Given a string, returns `[(start, end, label, text)]`

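Applying those `(start, end, label, text)` spans to a string can be sketched as follows; `apply_redactions` and the mask format are an illustrative helper of my own, not part of the published API:

```python
# Sketch: rewrite a string given the [(start, end, label, text)] spans
# this model returns. The spans below are hand-written examples, not
# real model output.

def apply_redactions(text, spans, mask="[{label}]"):
    """Replace each (start, end, label, matched_text) span with a mask.

    Spans are applied right-to-left so that earlier character offsets
    stay valid while the string changes length.
    """
    for start, end, label, _matched in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + mask.format(label=label) + text[end:]
    return text

line = "AXButton[Send to marcus@helios-ai.io]"
spans = [(17, 36, "EMAIL", "marcus@helios-ai.io")]
print(apply_redactions(line, spans))  # AXButton[Send to [EMAIL]]
```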
All numbers come from a held-out benchmark (private; access available
under commercial license). 95 % bootstrap CIs (1,000 resamples) on
zero-leak rate.

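One plausible reading of the zero-leak metric (an assumption; the benchmark's strict definition is not published here) is that an example passes iff every gold PII character is covered by some predicted span:

```python
# Hedged sketch of a per-example zero-leak check: pass iff every gold
# PII character falls inside at least one predicted span. This is one
# reading of the metric, not the private benchmark's exact definition.
def zero_leak(gold_spans, pred_spans):
    covered = set()
    for start, end, *_ in pred_spans:
        covered.update(range(start, end))
    return all(
        i in covered
        for start, end, *_ in gold_spans
        for i in range(start, end)
    )

# Under this reading, a prediction that over-covers still passes
# zero-leak; the excess might instead show up as "Oversmash" below.
print(zero_leak([(17, 36, "EMAIL")], [(10, 38, "EMAIL")]))  # True
```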
### Accessibility / window-title PII (n=422 — 345 with gold spans, 77 negatives)

| Adapter | Zero-leak | Oversmash | Macro-F1 | Micro-F1 | p50 (ms) |
|---|---:|---:|---:|---:|---:|

```bibtex
@misc{screenpipe-pii-redactor-2026,
  title = {screenpipe-pii-redactor: a PII redactor for accessibility
           trees, OCR'd screen text, and computer-use traces},
  author = {{screenpipe}},
  year = {2026},
  url = {https://huggingface.co/screenpipe/pii-redactor}
}
```