---
license: cc-by-nc-4.0
language:
  - en
library_name: onnxruntime
pipeline_tag: object-detection
tags:
  - pii
  - privacy
  - redaction
  - object-detection
  - rf-detr
  - screen-capture
  - accessibility
  - computer-use
  - agentic
  - screenpipe
metrics:
  - zero-leak
  - oversmash
  - precision
  - recall
extra_gated_prompt: >-
  This model is licensed CC BY-NC 4.0 (non-commercial). For commercial
  use — production deployment, SaaS / API embedding, agent privacy
  middleware, custom fine-tunes — contact louis@screenpi.pe.
---

# screenpipe-pii-image-redactor

> A [screenpipe](https://screenpi.pe) project. The image-modality
> companion to [`screenpipe/pii-redactor`](https://huggingface.co/screenpipe/pii-redactor).

A fine-tuned **image PII detector** for the same three surfaces an AI
agent sees a user's machine through:

1. **Screen captures** — JPGs / PNGs of the user's screen, rendered
   text and structured chrome (Slack, Outlook, Cursor, Terminal,
   Confluence, GitHub, 1Password, calendars, browsers).
2. **Computer-use traces** — the visual frames an agentic model
   (Claude Computer Use, GPT operator, etc.) reads when it controls a
   desktop.
3. **Accessibility-tree visualizations** — when an agent screenshots
   what it inferred from the AX tree to debug a tool call.

These surfaces are **dense, multi-PII, semi-structured** in ways no
prose-trained PII detector handles well. Returns pixel-space bounding
boxes for 12 canonical PII categories.

ONNX, ~108 MB. Same `.onnx` ships across macOS / Windows / Linux —
the user's ONNX Runtime selects the Execution Provider at load time
(CoreML, DirectML, CUDA, or CPU baseline).

> **License: CC BY-NC 4.0** (non-commercial). For commercial use —
> production redaction, SaaS / API embedding, AI-agent privacy
> middleware, custom fine-tunes — contact **louis@screenpi.pe**. See
> [`LICENSE`](LICENSE).

## Headline numbers

`rfdetr_v8` on a held-out 221-image validation split (190 PII-bearing,
31 hard negatives) of the [screenpipe-pii-bench-image](https://github.com/screenpipe/screenpipe-pii-bench-image)
corpus, IoU ≥ 0.30:

| metric | this model | regex+OCR floor | Microsoft Presidio (published OSS) |
|---|---:|---:|---:|
| **zero-leak** (every gold span caught) | **95.3%** | 2.6% | 0.5% |
| **oversmash** (false-fire on negatives) | **0.0%** | 3.2% | 48.4% |
| micro-precision | 99% | 87% | 47% |
| micro-recall | 97% | 26% | 42% |
| macro-F1 | 0.871 | 0.318 | 0.190 |

Per-label recall (a few highlights): `private_person` 0.99 ·
`private_company` 1.00 · `private_repo` 1.00 · `private_url` 1.00 ·
`secret` 0.99 · `private_email` 0.98 · `private_phone` 0.92 ·
`private_address` 0.92.
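
The two headline metrics can be sketched in a few lines — a minimal reading, assuming gold and predicted boxes are `[x, y, w, h]` in pixels and a simple any-match rule at IoU ≥ 0.30; the benchmark's exact matching logic may differ:

```python
def iou(a, b):
    """Intersection-over-union of two [x, y, w, h] pixel boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    ix = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    iy = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def zero_leak(gold, preds, thr=0.30):
    """True iff every gold box is covered by some prediction at IoU >= thr."""
    return all(any(iou(g, p) >= thr for p in preds) for g in gold)

def oversmash(preds):
    """True iff the model fired at all on a PII-free (negative) image."""
    return len(preds) > 0
```

An image passes zero-leak only if *nothing* leaks — one missed gold span fails the whole frame, which is why it's a much harsher number than recall.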

### Latency (rfdetr_v8, 320×320 input, FP32)

| platform                      | EP        | p50       |
|-------------------------------|-----------|----------:|
| macOS Apple Silicon (M-series) | CoreML    | **66 ms** ([real-screen sample](https://github.com/screenpipe/screenpipe-pii-bench-image)) |
| macOS Apple Silicon (M-series) | CPU       | 163 ms |
| Windows + DX12 GPU            | DirectML  | ~30-60 ms (estimated) |
| Linux + NVIDIA                | CUDA      | ~10-20 ms (estimated) |
| Linux/Windows CPU-only        | CPU       | ~140 ms |

Same `.onnx` everywhere — Execution Provider is selected at load time
by the user's ONNX Runtime build. **No CUDA / Vulkan / GPU vendor SDKs
required at the consumer.**

## Why this exists (vs Presidio Image Redactor and friends)

The published baselines are trained on prose / generic-document
imagery. A typical screenpipe frame looks nothing like that:

- A Slack channel sidebar with 8 names, 12 channel mentions, 3 emails,
  and 1 pasted AWS key — all in 1440×900 px at 14 px font.
- A 1Password vault entry with structured `[Username | Password |
  Server | One-time password]` rows, half of which are masked dots.
- A Cursor workspace open on `.env.production` with five secret-shaped
  values stacked top-to-bottom.

These images are **dense** (10-20 PII spans per frame), **structured**
(rows / columns / aligned chrome), and **layout-cued** (a thing in the
"Username" cell is a username regardless of its surface text). A
generic NER-on-OCR pipeline misfires by over-redacting UI chrome
(48% false-fire on negatives in our bench, vs. 0% for this model).

If you're building an **agentic system that reads screen state** — a
desktop-control agent, a memory layer for browsing, anything that
streams screen captures into an LLM — this is the redactor designed
for that pipe.

## What it does

Per-image **object detection**. Given a JPG or PNG, returns
`[(bbox, label, score)]` where each detection is a region the model
thinks is PII, classified into one of the 12 canonical categories
shared with [`screenpipe/pii-redactor`](https://huggingface.co/screenpipe/pii-redactor):

```
private_person, private_email, private_phone, private_address,
private_url, private_company, private_repo, private_handle,
private_channel, private_id, private_date, secret
```

`secret` covers passwords, API keys, JWTs, DB connection strings,
PRIVATE-KEY block markers, etc. — same coverage as the text model.

## Inference

```python
# pip install onnxruntime pillow numpy
import numpy as np
import onnxruntime as ort
from PIL import Image

CLASSES = [
    "private_person",   "private_email",   "private_phone",
    "private_address",  "private_url",     "private_company",
    "private_repo",     "private_handle",  "private_channel",
    "private_id",       "private_date",    "secret",
]
INPUT_SIZE = 320  # rfdetr_v8 was exported at 320x320
THRESHOLD  = 0.30

sess = ort.InferenceSession(
    "rfdetr_v8.onnx",
    providers=["CoreMLExecutionProvider", "CPUExecutionProvider"],
)

img = Image.open("screenshot.png").convert("RGB")
W, H = img.size
resized = img.resize((INPUT_SIZE, INPUT_SIZE), Image.BILINEAR)
arr = np.asarray(resized, dtype=np.float32) / 255.0
arr = (arr - [0.485, 0.456, 0.406]) / [0.229, 0.224, 0.225]
arr = arr.transpose(2, 0, 1)[None].astype(np.float32)  # NCHW

boxes, logits = sess.run(None, {sess.get_inputs()[0].name: arr})
boxes  = boxes[0]    # (300, 4) cx, cy, w, h normalized
logits = logits[0]   # (300, 13) — last channel is "no-object"

probs = 1.0 / (1.0 + np.exp(-logits[:, :12]))   # per-class sigmoid
best_class = probs.argmax(axis=1)
best_score = probs[np.arange(300), best_class]
keep = best_score >= THRESHOLD

for q in np.where(keep)[0]:
    cx, cy, bw, bh = boxes[q]
    x1 = (cx - bw / 2) * W
    y1 = (cy - bh / 2) * H
    print(f"  {CLASSES[best_class[q]]:18} score={best_score[q]:.2f} "
          f"bbox=[{int(x1)}, {int(y1)}, {int(bw*W)}, {int(bh*H)}]")
```

Full example with image overlay → `examples/inference.py`.

For Rust integration via the `ort` crate, see the
[`rust_smoke/`](https://github.com/screenpipe/screenpipe-pii-bench-image/tree/main/rust_smoke)
prototype and the production wiring in PR
[`screenpipe/screenpipe#3188`](https://github.com/screenpipe/screenpipe/pull/3188).

## Redacting the image (vs. just detecting)

This model **detects**. To actually remove the PII, draw a solid
rectangle over each detected bbox. Solid black, **not blur** — blur
is reversible by super-resolution attacks; opaque rectangles aren't.

```python
from PIL import ImageDraw

draw = ImageDraw.Draw(img)
for q in np.where(keep)[0]:              # same kept queries as above
    cx, cy, bw, bh = boxes[q]
    x1, y1 = (cx - bw / 2) * W, (cy - bh / 2) * H
    draw.rectangle([x1, y1, x1 + bw * W, y1 + bh * H], fill=(0, 0, 0))
img.save("screenshot_redacted.png")
```

That's the entire redactor wrapper. ~5 lines.

## Architecture

- Base: [RF-DETR-Nano](https://github.com/roboflow/rf-detr) (Roboflow,
  ICLR 2026) — a DINOv2-backbone real-time detection transformer, ~25 M
  params, claimed to be the first real-time model to break 60 mAP on COCO.
- Fine-tuned at 320×320 input on a 2,833-image synthetic + WebPII
  union (synthetic via DOM-truth bbox extraction; WebPII via the
  [arxiv 2603.17357 release](https://arxiv.org/abs/2603.17357)).
- Output head: 300 detection queries × 13 channels (12 PII classes +
  no-object). Per-class sigmoid (NOT softmax — RF-DETR uses
  independent classification per query).
- Trained on a single A100 80 GB; ~100 minutes wall-clock for the
  best-EMA epoch.
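
The sigmoid-vs-softmax point matters in practice: with a softmax head, the 12 class scores would compete, so a query matching nothing would still crown a "winner". A toy illustration (3 classes for brevity; not the model's actual head):

```python
import numpy as np

# A detection query that matches nothing: all class logits are low.
logits = np.array([-4.0, -4.0, -4.0])

softmax = np.exp(logits) / np.exp(logits).sum()
sigmoid = 1.0 / (1.0 + np.exp(-logits))

print(softmax)  # ~[0.333, 0.333, 0.333] — softmax forces a confident "winner"
print(sigmoid)  # ~[0.018, 0.018, 0.018] — all classes fall below any threshold
```

This is why the inference snippet thresholds per-class sigmoid scores rather than softmax probabilities.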

## What was the training data

| source | size | labels | notes |
|---|---:|---|---|
| **synthetic bench** | 2,206 imgs | DOM-truth bboxes (pixel-perfect) | 9 templates rendered via headless Chromium with `data-span` attributes — labels come from the same DOM tree the browser laid out. |
| **WebPII** | 500 imgs (balanced sample) | bbox-labeled by the original authors | March 2026 release, e-commerce screenshots. Class-imbalance capped at 2× our synthetic frequency. |
| **cascade auto-labels** | 100 imgs | OCR + text-PII model alignment | Old screenshots from this project's own bench, weakly labeled. |

**No real user data was used during fine-tuning.** Membership
inference attacks recover no real-user content because no real-user
content was in the training set. If you discover a failure mode on
your real screens, the project's recipe is to add a new SYNTHETIC
template that reproduces it — the screenshot becomes a bug report,
never a training row.

## Limitations

1. **Hand-curated gold set is small** — bench `data/` has 5
   manually-built cases. Larger-scale held-out evaluation depends on
   the synthetic corpus, which is in-distribution by construction.
2. **`private_handle` and `private_id` recall is 0%** in the
   reference numbers because the val split has only 2 and 1 examples
   respectively. Don't deploy without a domain-specific eval pass.
3. **Synthetic-template ceiling.** 95.3% zero-leak is the bench's
   stable ceiling at this corpus size. Gains beyond that come from
   training on more real-screen failure modes (tracked in the bench's
   backlog).
4. **WebPII is e-commerce-heavy.** Adding the full WebPII split
   actually *hurt* dev-app accuracy in our experiments (rfdetr_v4 at
   90.5% zero-leak vs. v8's 95.3%). The 500-image balanced sample is
   our best-of-both compromise.
5. **CPU-only floors at ~140 ms p50.** INT8 quantization (planned)
   gets that under 100 ms, but the FP32 release is what's on this
   page today.
6. **English-only.** Synthetic templates render Latin-script text;
   the WebPII supplement is English. CJK / Arabic / Cyrillic not
   evaluated — don't deploy without a locale-specific eval.
7. **Adversarial robustness not tested.** A user who knows the
   detector exists could craft layouts that confuse it (handwritten
   PII, embedded-image PII, partial occlusion). Use this for
   honest-user privacy, not as a security boundary.

## Files

```
rfdetr_v8.onnx                108 MB · the model · sha256 below
README.md                      this file
LICENSE                        CC BY-NC 4.0
NOTICE                         attribution to base model + datasets
examples/
  inference.py                 the snippet above, runnable
```

SHA-256 of `rfdetr_v8.onnx`:
`431acc0f0beb22a39572b7a50af4fc446e799840fb71320dc124fbd79a121eb3`
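
Before loading the model, it's worth verifying the download against that digest. A small stdlib-only sketch (the chunked read just keeps the 108 MB file out of memory):

```python
import hashlib

def sha256_of(path, chunk=1 << 20):
    """SHA-256 of a file, streamed in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# usage (after `git lfs pull`):
# assert sha256_of("rfdetr_v8.onnx").startswith("431acc0f")
```

A mismatch usually means Git LFS served a pointer file instead of the weights — re-run `git lfs pull`.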

## Reproducing inference

```bash
git clone https://huggingface.co/screenpipe/pii-image-redactor
cd pii-image-redactor
git lfs pull
pip install onnxruntime pillow numpy
python examples/inference.py path/to/your_screenshot.png
```

Reproducing the eval scores requires the screenpipe-pii-bench-image
benchmark, which is not redistributed (it's the training corpus).
Contact **louis@screenpi.pe** for benchmark access or commercial
licensing.

## License

[CC BY-NC 4.0](LICENSE) — non-commercial use only. The base model
(RF-DETR) is Apache-2.0; obligations are preserved (see
[`NOTICE`](NOTICE)).

For commercial licensing (production deployment, redistribution
rights, SaaS / API embedding, custom fine-tunes for your domain):
**louis@screenpi.pe**.

## Citation

```bibtex
@misc{screenpipe-pii-image-redactor-2026,
  title  = {screenpipe-pii-image-redactor: a screen-PII detector for
            accessibility-aware agents},
  author = {{screenpipe}},
  year   = {2026},
  url    = {https://huggingface.co/screenpipe/pii-image-redactor}
}
```