louis030195 committed on
Commit e4ebe6f · verified · 1 Parent(s): 5436158

reframe: PII redactor for accessibility / OCR / computer-use

  1. README.md +58 -16
README.md CHANGED
@@ -1,32 +1,74 @@
1
  # screenpipe-pii-redactor
2
 
3
- A fine-tuned PII redactor specialized for **desktop activity logs** — the
4
- short, sparse-context strings produced by screen-recording tools (window
5
- titles, browser tabs, IDE buffers, calendar entries) while still
6
- handling long-form text (chat transcripts, document body, support
7
- tickets) at competitive accuracy with the upstream baseline.
8
 
9
  Built on top of the [OpenAI Privacy Filter](https://github.com/openai/privacy-filter)
10
  (1.5B parameters, 50M active). Fine-tuned on a mixed corpus combining
11
- synthetic window-title data, a slice of
12
  [ai4privacy/pii-masking-300k](https://huggingface.co/datasets/ai4privacy/pii-masking-300k),
13
- and targeted secret-shape augmentation.
 
14
 
15
- > **License: CC BY-NC 4.0** (non-commercial). For commercial use,
16
- > contact **hi@screenpi.pe** see [`LICENSE`](LICENSE).
 
 
17
 
18
  ## TL;DR
19
 
20
  | | base OPF | **this model** | gap |
21
  |---|---:|---:|---:|
22
- | Window-title PII zero-leak (n=422) | 38.6% (33.6–43.8) | **79.1% (74.8–83.5)** | **+40.5 pp** |
23
- | Long-form PII zero-leak — PII-Masking-300k EN (n=1000) | 14.0% (11.7–16.2) | **77.5% (74.5–80.3)** | **+63.5 pp** |
24
  | Macro-F1 on 300k EN | 0.591 | **0.934** | +0.343 |
25
  | Targeted secret-redaction probe (n=34 realistic shapes) | not measured | **31/34 strict** | — |
26
  | p50 inference latency (CUDA) | ~23 ms | ~23 ms | flat |
27
 
28
  All gaps statistically significant (non-overlapping 95 % bootstrap CIs).
29
 
30
  ## What it does
31
 
32
  Span-level redaction. Given a string, returns `[(start, end, label, text)]`
@@ -85,7 +127,7 @@ All numbers come from a held-out benchmark (private; access available
85
  under commercial license). 95 % bootstrap CIs (1,000 resamples) on
86
  zero-leak rate.
87
 
88
- ### Window-title PII (n=422 — 345 with gold spans, 77 negatives)
89
 
90
  | Adapter | Zero-leak | Oversmash | Macro-F1 | Micro-F1 | p50 (ms) |
91
  |---|---:|---:|---:|---:|---:|
@@ -199,10 +241,10 @@ SaaS / API embedding, custom fine-tunes for your domain): **hi@screenpi.pe**.
199
 
200
  ```bibtex
201
  @misc{screenpipe-pii-redactor-2026,
202
- title = {screenpipe-pii-redactor: a fine-tuned PII redactor for
203
- desktop activity logs},
204
- author = {Beaumont, Louis},
205
  year = {2026},
206
- url = {https://github.com/screenpipe/pii-redactor}
207
  }
208
  ```
 
1
  # screenpipe-pii-redactor
2
 
3
+ > A [screenpipe](https://screenpi.pe) project.
4
+
5
+ A fine-tuned PII redactor for the **three surfaces an AI agent actually
6
+ sees a user's machine through**:
7
+
8
+ 1. **Accessibility-tree dumps** — the structured AX hierarchy macOS /
9
+ Windows expose to assistive tech. Short, structured, often containing
10
+ labels like `AXButton[Send to marcus@helios-ai.io]`.
11
+ 2. **OCR'd screen text** — what tools like screenpipe extract from
12
+ screen recordings. Mix of window-title-shaped artifacts, app chrome,
13
+ and occasional long-form (emails, docs).
14
+ 3. **Computer-use traces** — what an agentic model (Claude Computer Use,
15
+ GPT operator, etc.) reads when it controls a desktop. Mix of all of
16
+ the above plus interaction-trace metadata.
17
+
18
+ These surfaces are short, sparse-context, and full of identifiers that
19
+ slip past redactors trained on chat-style prose. This model is fine-tuned
20
+ specifically for them — while still handling long-form text (chat
21
+ transcripts, document body, support tickets) at competitive accuracy.
22
 
23
  Built on top of the [OpenAI Privacy Filter](https://github.com/openai/privacy-filter)
24
  (1.5B parameters, 50M active). Fine-tuned on a mixed corpus combining
25
+ synthetic accessibility / window-title / OCR data, a slice of
26
  [ai4privacy/pii-masking-300k](https://huggingface.co/datasets/ai4privacy/pii-masking-300k),
27
+ and targeted secret-shape augmentation (API keys, JWTs, DB connection
28
+ strings, private-key block markers, password prompts).
29
 
30
+ > **License: CC BY-NC 4.0** (non-commercial). For commercial use
31
+ > (production redaction, SaaS / API embedding, AI-agent privacy
32
+ > middleware, custom fine-tunes), contact **hi@screenpi.pe**. See
33
+ > [`LICENSE`](LICENSE).
34
 
35
  ## TL;DR
36
 
37
  | | base OPF | **this model** | gap |
38
  |---|---:|---:|---:|
39
+ | Accessibility / window-title PII zero-leak (n=422) | 38.6% (33.6–43.8) | **79.1% (74.8–83.5)** | **+40.5 pp** |
40
+ | Long-form text PII zero-leak — PII-Masking-300k EN (n=1000) | 14.0% (11.7–16.2) | **77.5% (74.5–80.3)** | **+63.5 pp** |
41
  | Macro-F1 on 300k EN | 0.591 | **0.934** | +0.343 |
42
  | Targeted secret-redaction probe (n=34 realistic shapes) | not measured | **31/34 strict** | — |
43
  | p50 inference latency (CUDA) | ~23 ms | ~23 ms | flat |
44
 
45
  All gaps statistically significant (non-overlapping 95 % bootstrap CIs).
46
 
47
+ ## Why this exists (vs the base Privacy Filter)
48
+
49
+ The OpenAI Privacy Filter (and most other public PII redactors) is
50
+ trained on prose-shaped data — letters, Q&A turns, chat corpora.
51
+ A typical accessibility-tree node, OCR'd window title, or computer-use
52
+ log line looks nothing like that:
53
+
54
+ ```
55
+ AXButton[Send to marcus@helios-ai.io]
56
+ Welcome | Acme Corp | xAI Console
57
+ [ScreenCapture 09:14] Slack — #compai-tessera (12 unread)
58
+ ```
59
+
60
+ These are 30-character strings with one or two PII tokens and almost
61
+ no surrounding context. A model trained on chat corpora will conflate
62
+ brand names with people, miss `Arc | Marcus Chen` because it expects
63
+ sentence context, and tag `Raycast` and `Claude` as people. The base
64
+ Privacy Filter scored 38.6 % zero-leak on this surface; this model
65
+ scores 79.1 %.
66
+
67
+ If you're building an **agentic system that reads screen state** — a
68
+ desktop-control agent, a memory layer for browsing, anything that
69
+ streams accessibility/OCR/screen-capture data into an LLM — this is
70
+ the redactor designed for that pipe.
71
+
72
  ## What it does
73
 
74
  Span-level redaction. Given a string, returns `[(start, end, label, text)]`
 
127
  under commercial license). 95 % bootstrap CIs (1,000 resamples) on
128
  zero-leak rate.
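
For readers unfamiliar with the statistic, here is a minimal sketch of a percentile bootstrap CI on a zero-leak rate. It is not the benchmark's actual code: the function names are hypothetical, the outcomes are made-up illustration data, and it assumes a sample counts as "zero-leak" when no gold PII span is left unredacted.

```python
import random
from typing import List, Tuple


def zero_leak_rate(outcomes: List[bool]) -> float:
    """Fraction of samples with no leaked PII (True = zero-leak)."""
    return sum(outcomes) / len(outcomes)


def bootstrap_ci(
    outcomes: List[bool],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI: resample with replacement, recompute the rate."""
    rng = random.Random(seed)
    n = len(outcomes)
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return rates[lo_idx], rates[hi_idx]


# Illustrative data only: 80 zero-leak samples out of 100.
outcomes = [True] * 80 + [False] * 20
rate = zero_leak_rate(outcomes)
ci_low, ci_high = bootstrap_ci(outcomes)
print(f"zero-leak {rate:.1%} ({ci_low:.1%} to {ci_high:.1%})")
```

Two models can then be compared by checking whether their intervals overlap, which is the significance criterion the TL;DR states.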
129
 
130
+ ### Accessibility / window-title PII (n=422 — 345 with gold spans, 77 negatives)
131
 
132
  | Adapter | Zero-leak | Oversmash | Macro-F1 | Micro-F1 | p50 (ms) |
133
  |---|---:|---:|---:|---:|---:|
 
241
 
242
  ```bibtex
243
  @misc{screenpipe-pii-redactor-2026,
244
+ title = {screenpipe-pii-redactor: a PII redactor for accessibility
245
+ trees, OCR'd screen text, and computer-use traces},
246
+ author = {{screenpipe}},
247
  year = {2026},
248
+ url = {https://huggingface.co/screenpipe/pii-redactor}
249
  }
250
  ```
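
The README's "What it does" section says the model returns spans shaped like `[(start, end, label, text)]`. As a rough sketch of how a caller might consume that output, the snippet below replaces each span with a placeholder. The helper `apply_redactions`, the `EMAIL` label, and the exclusive `end` offset are assumptions for illustration, not the project's documented API.

```python
from typing import List, Tuple

# Assumed span shape, following the README: (start, end, label, text).
Span = Tuple[int, int, str, str]


def apply_redactions(text: str, spans: List[Span]) -> str:
    """Replace each span with a [LABEL] placeholder (end offset exclusive)."""
    pieces = []
    cursor = 0
    for start, end, label, _ in sorted(spans, key=lambda s: s[0]):
        if start < cursor:
            # Skip spans that overlap one already redacted.
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hand-built span for the accessibility-node example from the README.
text = "AXButton[Send to marcus@helios-ai.io]"
email = "marcus@helios-ai.io"
start = text.index(email)
spans = [(start, start + len(email), "EMAIL", email)]
redacted = apply_redactions(text, spans)
print(redacted)  # AXButton[Send to [EMAIL]]
```

Processing spans left to right with a cursor keeps the offsets valid, since each span indexes into the original string rather than the partially redacted one.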