# screenpipe-pii-redactor

> A [screenpipe](https://screenpi.pe) project.

A fine-tuned PII redactor for the **three surfaces an AI agent actually
sees a user's machine through**:

1. **Accessibility-tree dumps** — the structured AX hierarchy macOS /
   Windows expose to assistive tech. Short, structured, often containing
   labels like `AXButton[Send to marcus@helios-ai.io]`.
2. **OCR'd screen text** — what tools like screenpipe extract from
   screen recordings. A mix of window-title-shaped artifacts, app chrome,
   and occasional long-form text (emails, docs).
3. **Computer-use traces** — what an agentic model (Claude Computer Use,
   OpenAI Operator, etc.) reads when it controls a desktop. A mix of all
   of the above plus interaction-trace metadata.

These surfaces are short, sparse-context, and full of identifiers that
slip past redactors trained on chat-style prose. This model is fine-tuned
specifically for them — while still handling long-form text (chat
transcripts, document bodies, support tickets) at competitive accuracy.

Built on top of the [OpenAI Privacy Filter](https://github.com/openai/privacy-filter)
(1.5B parameters, 50M active). Fine-tuned on a mixed corpus combining
synthetic accessibility / window-title / OCR data, a slice of
[ai4privacy/pii-masking-300k](https://huggingface.co/datasets/ai4privacy/pii-masking-300k),
and targeted secret-shape augmentation (API keys, JWTs, DB connection
strings, private-key block markers, password prompts).

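Those secret shapes can also be pre-screened deterministically before (or alongside) the model. A minimal sketch; the regexes below are illustrative assumptions, not the augmentation patterns actually used in training:

```python
# Illustrative regexes for a few of the secret shapes named above
# (JWTs, DB connection strings, private-key block markers). These are
# sketch patterns, NOT the model's training spec: the model is learned.
import re

SECRET_SHAPES = {
    "jwt": re.compile(r"\beyJ[\w-]+\.[\w-]+\.[\w-]+"),
    "db_url": re.compile(r"\b\w+://\w+:[^@\s]+@[\w.-]+(?::\d+)?/\w+"),
    "key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}

def shape_hits(text: str) -> list[str]:
    """Names of the secret shapes that appear anywhere in `text`."""
    return [name for name, pat in SECRET_SHAPES.items() if pat.search(text)]

print(shape_hits("postgres://admin:hunter2@db.internal:5432/prod"))  # ['db_url']
```

A regex pass like this catches fixed-format secrets cheaply; the model exists for everything regexes can't pin down (names, addresses, free-form identifiers).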
> **License: CC BY-NC 4.0** (non-commercial). For commercial use —
> production redaction, SaaS / API embedding, AI-agent privacy
> middleware, custom fine-tunes — contact **hi@screenpi.pe**. See
> [`LICENSE`](LICENSE).

## TL;DR

| | base OPF | **this model** | gap |
|---|---:|---:|---:|
| Accessibility / window-title PII zero-leak (n=422) | 38.6% (33.6–43.8) | **79.1% (74.8–83.5)** | **+40.5 pp** |
| Long-form text PII zero-leak — PII-Masking-300k EN (n=1000) | 14.0% (11.7–16.2) | **77.5% (74.5–80.3)** | **+63.5 pp** |
| Macro-F1 on 300k EN | 0.591 | **0.934** | +0.343 |
| Targeted secret-redaction probe (n=34 realistic shapes) | not measured | **31/34 strict** | — |
| p50 inference latency (CUDA) | ~23 ms | ~23 ms | flat |

All gaps statistically significant (non-overlapping 95 % bootstrap CIs).

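For readers who want to reproduce the CI math: a plain percentile bootstrap over per-example zero-leak outcomes gives intervals of the quoted width. The outcomes below are synthetic, and the private benchmark's exact resampling procedure is an assumption:

```python
# Percentile-bootstrap 95% CI for a zero-leak rate. Outcomes are
# synthetic (775 passes out of n=1000, matching the long-form row);
# the benchmark's exact procedure may differ from this sketch.
import random

def bootstrap_ci(successes: int, n: int, resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    rng = random.Random(seed)
    outcomes = [1] * successes + [0] * (n - successes)
    rates = sorted(
        sum(rng.choice(outcomes) for _ in range(n)) / n
        for _ in range(resamples)
    )
    lo = rates[int(resamples * alpha / 2)]            # 2.5th percentile
    hi = rates[int(resamples * (1 - alpha / 2)) - 1]  # 97.5th percentile
    return lo, hi

lo, hi = bootstrap_ci(775, 1000)
print(f"{lo:.1%}-{hi:.1%}")  # close in width to the table's 74.5-80.3 interval
```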
## Why this exists (vs the base Privacy Filter)

The OpenAI Privacy Filter (and most other public PII redactors) is
trained on prose-shaped data — letters, Q&A turns, chat corpora.
A typical accessibility-tree node, OCR'd window title, or computer-use
log line looks nothing like that:

```
AXButton[Send to marcus@helios-ai.io]
Welcome | Acme Corp | xAI Console
[ScreenCapture 09:14] Slack — #compai-tessera (12 unread)
```

These are 30-character strings with one or two PII tokens and almost
no surrounding context. A model trained on chat corpora will conflate
brand names with people, miss `Arc | Marcus Chen` because it expects
sentence context, and tag `Raycast` and `Claude` as people. The base
Privacy Filter scored 38.6 % zero-leak on this surface; this model
scores 79.1 %.

If you're building an **agentic system that reads screen state** — a
desktop-control agent, a memory layer for browsing, anything that
streams accessibility/OCR/screen-capture data into an LLM — this is
the redactor designed for that pipe.

## What it does

Span-level redaction. Given a string, returns `[(start, end, label, text)]`

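Applying those `(start, end, label, text)` spans to a string can be sketched as follows; `apply_redactions` and the mask format are an illustrative helper of my own, not part of the published API:

```python
# Sketch: rewrite a string given the [(start, end, label, text)] spans
# this model returns. The spans below are hand-written examples, not
# real model output.

def apply_redactions(text, spans, mask="[{label}]"):
    """Replace each (start, end, label, matched_text) span with a mask.

    Spans are applied right-to-left so that earlier character offsets
    stay valid while the string changes length.
    """
    for start, end, label, _matched in sorted(spans, key=lambda s: s[0], reverse=True):
        text = text[:start] + mask.format(label=label) + text[end:]
    return text

line = "AXButton[Send to marcus@helios-ai.io]"
spans = [(17, 36, "EMAIL", "marcus@helios-ai.io")]
print(apply_redactions(line, spans))  # AXButton[Send to [EMAIL]]
```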
All numbers come from a held-out benchmark (private; access available
under commercial license). 95 % bootstrap CIs (1,000 resamples) on
zero-leak rate.

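One plausible reading of the zero-leak metric (an assumption; the benchmark's strict definition is not published here) is that an example passes iff every gold PII character is covered by some predicted span:

```python
# Hedged sketch of a per-example zero-leak check: pass iff every gold
# PII character falls inside at least one predicted span. This is one
# reading of the metric, not the private benchmark's exact definition.
def zero_leak(gold_spans, pred_spans):
    covered = set()
    for start, end, *_ in pred_spans:
        covered.update(range(start, end))
    return all(
        i in covered
        for start, end, *_ in gold_spans
        for i in range(start, end)
    )

# Under this reading, a prediction that over-covers still passes
# zero-leak; the excess might instead show up as "Oversmash" below.
print(zero_leak([(17, 36, "EMAIL")], [(10, 38, "EMAIL")]))  # True
```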
### Accessibility / window-title PII (n=422 — 345 with gold spans, 77 negatives)

| Adapter | Zero-leak | Oversmash | Macro-F1 | Micro-F1 | p50 (ms) |
|---|---:|---:|---:|---:|---:|

```bibtex
@misc{screenpipe-pii-redactor-2026,
  title = {screenpipe-pii-redactor: a PII redactor for accessibility
           trees, OCR'd screen text, and computer-use traces},
  author = {{screenpipe}},
  year = {2026},
  url = {https://huggingface.co/screenpipe/pii-redactor}
}
```