---
license: cc-by-nc-4.0
language:
- en
- fr
- de
- it
- es
- nl
library_name: opf
pipeline_tag: token-classification
tags:
- pii
- privacy
- redaction
- accessibility-tree
- ocr
- computer-use
- agentic
- screen-capture
- screenpipe
base_model:
- openai/privacy-filter
metrics:
- f1
- recall
- precision
extra_gated_prompt: >-
  This model is licensed CC BY-NC 4.0 (non-commercial). For commercial use —
  production deployment, SaaS / API embedding, agent privacy middleware,
  custom fine-tunes — contact louis@screenpi.pe.
---

# screenpipe-pii-redactor

> A [screenpipe](https://screenpi.pe) project.

A fine-tuned PII redactor for the **three surfaces an AI agent actually sees a user's machine through**:

1. **Accessibility-tree dumps** — the structured AX hierarchy macOS / Windows expose to assistive tech. Short, structured, often containing labels like `AXButton[Send to marcus@helios-ai.io]`.
2. **OCR'd screen text** — what tools like screenpipe extract from screen recordings. A mix of window-title-shaped artifacts, app chrome, and occasional long-form text (emails, docs).
3. **Computer-use traces** — what an agentic model (Claude Computer Use, OpenAI Operator, etc.) reads when it controls a desktop.

These surfaces are short, sparse-context, and full of identifiers that slip past redactors trained on chat-style prose. This model is fine-tuned specifically for them — while still handling long-form text at competitive accuracy. Built on top of the [OpenAI Privacy Filter](https://github.com/openai/privacy-filter) (1.5B parameters, 50M active).

> **License: CC BY-NC 4.0** (non-commercial). For commercial use —
> production redaction, SaaS / API embedding, AI-agent privacy
> middleware, custom fine-tunes — contact **louis@screenpi.pe**. See
> [`LICENSE`](LICENSE).

## Headline numbers

| | base OPF | **this model** |
|---|---:|---:|
| Accessibility / window-title PII zero-leak | 38.6% (33.6–43.8) | **79.1% (74.8–83.5)** |
| Long-form PII zero-leak (English) | 14.0% (11.7–16.2) | **77.5% (74.5–80.3)** |
| Long-form PII macro-F1 (English) | 0.591 | **0.934** |
| Targeted secret-redaction (34 realistic shapes) | not measured | **31/34** |
| p50 inference latency (CUDA) | ~23 ms | ~23 ms |

95% bootstrap CIs in parentheses. Zero-leak: % of cases where the model caught **all** gold spans — the metric that matters for privacy.

## Why this exists (vs the base Privacy Filter)

The OpenAI Privacy Filter (and most other public PII redactors) is trained on prose-shaped data. A typical accessibility-tree node, OCR'd window title, or computer-use log line looks nothing like that:

```
AXButton[Send to marcus@helios-ai.io]
Welcome | Acme Corp | xAI Console
[ScreenCapture 09:14] Slack — #compai-tessera (12 unread)
```

These are 30-character strings with one or two PII tokens and almost no surrounding context. A model trained on chat corpora will conflate brand names with people, miss `Arc | Marcus Chen` because it expects sentence context, and tag `Raycast` and `Claude` as people.

If you're building an **agentic system that reads screen state** — a desktop-control agent, a memory layer for browsing, anything that streams accessibility / OCR / screen-capture data into an LLM — this is the redactor designed for that pipe.

## What it does

Span-level redaction. Given a string, the model returns `[(start, end, label, text)]`, where each span is a region it thinks is PII, classified into one of 12 canonical categories:

```
private_person, private_email, private_phone, private_address,
private_url, private_company, private_repo, private_handle,
private_channel, private_id, private_date, secret
```

`secret` covers passwords, API keys, JWTs, DB connection strings, PRIVATE-KEY block markers, etc.

## Inference

```python
# pip install opf (currently from source: github.com/openai/privacy-filter)
from opf import OPF

filt = OPF(model="./model", device="cuda")  # or "cpu"

out = filt.redact("Welcome | Marcus Chen — Confluence")
for span in out.detected_spans:
    print(f"  [{span.start}:{span.end}] {span.label} = {span.text!r}")
# -> [10:21] private_person = 'Marcus Chen'
```

See [`examples/inference.py`](examples/inference.py) for a longer example covering window titles, long-form text, and secrets.
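`redact` returns spans rather than rewritten text, so masking is left to the caller. Below is a minimal sketch of splicing detected spans into `[LABEL]` placeholders. It assumes only the span fields shown above (`start`, `end`, `label`); the `apply_spans` helper is illustrative, not part of the opf API.

```python
# Minimal sketch: turn detected spans into a masked string.
# Assumes only the span fields shown above (start, end, label);
# apply_spans is an illustrative helper, not part of the opf API.
from opf import OPF

def apply_spans(text: str, spans) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    # Splice from the end of the string so earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"[{span.label.upper()}]" + text[span.end:]
    return text

filt = OPF(model="./model", device="cpu")
title = "Welcome | Marcus Chen — Confluence"
print(apply_spans(title, filt.redact(title).detected_spans))
# -> Welcome | [PRIVATE_PERSON] — Confluence
```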
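The multilingual table below reports the same zero-leak metric as the headline numbers. For concreteness, here is our reading of that definition (an example counts as clean only if every gold span is covered by a predicted span). This is an illustrative sketch, not the actual benchmark harness.

```python
# Sketch of the zero-leak metric described above: an example is "clean"
# only if every gold span is covered by some predicted span. This is our
# reading of the card's definition, not the benchmark code itself.

Span = tuple[int, int]  # (start, end) character offsets

def zero_leak_rate(examples: list[tuple[list[Span], list[Span]]]) -> float:
    def covered(gold: Span, preds: list[Span]) -> bool:
        return any(p[0] <= gold[0] and gold[1] <= p[1] for p in preds)

    clean = sum(
        all(covered(g, preds) for g in golds)
        for golds, preds in examples
    )
    return clean / len(examples)

# Toy check: one fully-caught example, one with a leaked span.
print(zero_leak_rate([
    ([(10, 21)], [(8, 21)]),        # all gold spans caught -> clean
    ([(0, 5), (9, 14)], [(0, 5)]),  # second span missed -> leak
]))  # -> 0.5
```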
## Multilingual

The model handles six languages. Performance on a public long-form PII benchmark (n=200 per language):

| Language | zero-leak |
|---|---:|
| English | 76.8% (70.1–83.1) |
| Spanish | 73.2% (66.5–79.3) |
| Italian | 70.8% (64.3–77.4) |
| German | 70.6% (63.5–77.1) |
| French | 68.1% (61.5–75.3) |
| Dutch | 56.1% (48.9–63.3) |

The other Romance and Germanic languages sit 3–9 pp below English. **Dutch is the weakest** — flagged as a known gap.

## Limitations

1. **Sudo / login password prompts leak.** In a pattern like `[sudo] password for alice: hunter2`, the username is redacted but the password survives. This is a known hard miss in the targeted secret probe; mitigate with an OS-level keystroke-suppression policy alongside this model.
2. **Dutch is the weakest language**, at −20.7 pp from English. Indic, Asian, African, and Cyrillic scripts were not evaluated at meaningful sample sizes — don't deploy for those locales without a locale-specific eval pass.
3. **Synthetic training data only.** No real user data was used during fine-tuning. Validate on YOUR data before deploying.
4. **Oversmash (over-redaction).** 7.8% on accessibility / window titles, 16.5% on long-form text. The model over-redacts. Acceptable for privacy-first deployments; a caveat if you need clean OCR text downstream.
5. **Strict label-space evaluation.** The numbers above use a 12-class taxonomy and a strict per-example zero-leak metric. Absolute values depend on the evaluator's label taxonomy and metric choice; macro-F1 is a more lenient point of comparison.

## Reproducing inference

```bash
git clone https://huggingface.co/screenpipe/pii-redactor
cd pii-redactor
git lfs pull
pip install git+https://github.com/openai/privacy-filter.git
python examples/inference.py
```

Reproducing the eval scores requires our held-out benchmark, which is not redistributed. Contact **louis@screenpi.pe** for benchmark access or commercial licensing.

## License

[CC BY-NC 4.0](LICENSE) — non-commercial use only. The base model is Apache-2.0; its obligations are preserved (see [`NOTICE`](NOTICE)).

For commercial licensing (production deployment, redistribution rights, SaaS / API embedding, custom fine-tunes for your domain): **louis@screenpi.pe**.

## Citation

```bibtex
@misc{screenpipe-pii-redactor-2026,
  title  = {screenpipe-pii-redactor: a PII redactor for accessibility trees,
            OCR'd screen text, and computer-use traces},
  author = {{screenpipe}},
  year   = {2026},
  url    = {https://huggingface.co/screenpipe/pii-redactor}
}
```