louis030195 committed
Commit ded341d · verified · 1 Parent(s): b3a9a6f

public-ready: vague methodology, headline numbers only

Files changed (1): README.md (+61 −175)
README.md CHANGED
@@ -21,8 +21,6 @@ tags:
  - screenpipe
  base_model:
  - openai/privacy-filter
- datasets:
- - ai4privacy/pii-masking-300k
  metrics:
  - f1
  - recall
@@ -47,44 +45,39 @@ sees a user's machine through**:
  screen recordings. Mix of window-title-shaped artifacts, app chrome,
  and occasional long-form (emails, docs).
  3. **Computer-use traces** — what an agentic model (Claude Computer Use,
- GPT Operator, etc.) reads when it controls a desktop. Mix of all of
- the above plus interaction-trace metadata.
+ GPT Operator, etc.) reads when it controls a desktop.

  These surfaces are short, sparse-context, and full of identifiers that
  slip past redactors trained on chat-style prose. This model is fine-tuned
- specifically for them — while still handling long-form text (chat
- transcripts, document body, support tickets) at competitive accuracy.
+ specifically for them — while still handling long-form text at
+ competitive accuracy.

  Built on top of the [OpenAI Privacy Filter](https://github.com/openai/privacy-filter)
- (1.5B parameters, 50M active). Fine-tuned on a mixed corpus combining
- synthetic accessibility / window-title / OCR data, a slice of
- [ai4privacy/pii-masking-300k](https://huggingface.co/datasets/ai4privacy/pii-masking-300k),
- and targeted secret-shape augmentation (API keys, JWTs, DB connection
- strings, private-key block markers, password prompts).
+ (1.5B parameters, 50M active).

  > **License: CC BY-NC 4.0** (non-commercial). For commercial use —
  > production redaction, SaaS / API embedding, AI-agent privacy
  > middleware, custom fine-tunes — contact **hi@louis030195.com**. See
  > [`LICENSE`](LICENSE).

- ## TL;DR
-
- | | base OPF | **this model** | gap |
- |---|---:|---:|---:|
- | Accessibility / window-title PII zero-leak (n=422) | 38.6% (33.6–43.8) | **79.1% (74.8–83.5)** | **+40.5 pp** |
- | Long-form text PII zero-leak — PII-Masking-300k EN (n=1000) | 14.0% (11.7–16.2) | **77.5% (74.5–80.3)** | **+63.5 pp** |
- | Macro-F1 on 300k EN | 0.591 | **0.934** | +0.343 |
- | Targeted secret-redaction probe (n=34 realistic shapes) | not measured | **31/34 strict** | — |
- | p50 inference latency (CUDA) | ~23 ms | ~23 ms | flat |
-
- All gaps statistically significant (non-overlapping 95 % bootstrap CIs).
+ ## Headline numbers
+
+ | | base OPF | **this model** |
+ |---|---:|---:|
+ | Accessibility / window-title PII zero-leak | 38.6% (33.6–43.8) | **79.1% (74.8–83.5)** |
+ | Long-form PII zero-leak (English) | 14.0% (11.7–16.2) | **77.5% (74.5–80.3)** |
+ | Long-form PII macro-F1 (English) | 0.591 | **0.934** |
+ | Targeted secret-redaction (34 realistic shapes) | not measured | **31/34** |
+ | p50 inference latency (CUDA) | ~23 ms | ~23 ms |
+
+ 95% bootstrap CIs in brackets. Zero-leak: % of cases where the model
+ caught all gold spans (the metric that matters for privacy).

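Zero-leak is simple to state in code. A minimal sketch, assuming per-example
span sets compared by exact match (helper names are illustrative; the private
bench may differ in details):

```python
import random

def zero_leak_rate(per_example):
    """per_example: list of (predicted_spans, gold_spans) pairs.
    An example counts as clean only if every gold span was caught."""
    clean = [all(g in pred for g in gold) for pred, gold in per_example]
    return sum(clean) / len(clean)

def bootstrap_ci(per_example, resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap over examples (the CIs above use 1,000 resamples)."""
    rng = random.Random(seed)
    rates = sorted(
        zero_leak_rate([rng.choice(per_example) for _ in per_example])
        for _ in range(resamples)
    )
    return rates[int(resamples * alpha / 2)], rates[int(resamples * (1 - alpha / 2)) - 1]
```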
  ## Why this exists (vs the base Privacy Filter)

  The OpenAI Privacy Filter (and most other public PII redactors) is
- trained on prose-shaped data: letters, Q&A turns, chat corpora.
- A typical accessibility-tree node, OCR'd window title, or computer-use
- log line looks nothing like that:
+ trained on prose-shaped data. A typical accessibility-tree node, OCR'd
+ window title, or computer-use log line looks nothing like that:

  ```
  AXButton[Send to marcus@helios-ai.io]
@@ -95,14 +88,12 @@ Welcome | Acme Corp | xAI Console
  These are 30-character strings with one or two PII tokens and almost
  no surrounding context. A model trained on chat corpora will conflate
  brand names with people, miss `Arc | Marcus Chen` because it expects
- sentence context, and tag `Raycast` and `Claude` as people. The base
- Privacy Filter scored 38.6 % zero-leak on this surface; this model
- scores 79.1 %.
+ sentence context, and tag `Raycast` and `Claude` as people.

  If you're building an **agentic system that reads screen state** — a
  desktop-control agent, a memory layer for browsing, anything that
- streams accessibility/OCR/screen-capture data into an LLM — this is
- the redactor designed for that pipe.
+ streams accessibility / OCR / screen-capture data into an LLM — this
+ is the redactor designed for that pipe.

  ## What it does

@@ -117,28 +108,7 @@ private_channel, private_id, private_date, secret
  ```

  `secret` covers passwords, API keys, JWTs, DB connection strings,
- PRIVATE-KEY block markers, etc. Per the secret-redaction probe, this
- model catches 31 of 34 realistic secret shapes — see Limitations for
- the lone known miss.
+ PRIVATE-KEY block markers, etc.
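Concretely, the shapes in question look like this (synthetic illustrations of
the categories above, not samples from the private probe set):

```python
# Synthetic examples of secret shapes — illustrative only.
SECRET_SHAPES = [
    "password: hunter2",                              # password prompt
    "sk-live-9aF3xQ7Lm2Np8RtV5wYb",                   # API-key-like token
    "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiI0MiJ9.c2ln",     # JWT (header.payload.sig)
    "postgres://admin:s3cr3t@db.internal:5432/prod",  # DB connection string
    "-----BEGIN RSA PRIVATE KEY-----",                # private-key block marker
]
```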
-
- ## Architecture
-
- Identical to the upstream Privacy Filter. We did not modify the model
- architecture. We re-initialized the output head for our 12-label space
- (49 output classes after BIOES tagging + O) and fine-tuned on a mixed
- corpus, with `n_ctx` raised from 128 → 256 to accommodate sentence-level
- context.
-
- | | |
- |---|---|
- | Base | OpenAI Privacy Filter (1.5B params, 50M active) |
- | Output head | 49-class (12 × BIOES + O), 29 rows copied exactly from base, 20 fallback (zero-init) |
- | Dtype | bfloat16 |
- | Encoding | `o200k_base` |
- | Training | 3 epochs, batch_size 4, lr 1e-4, n_ctx 256 |
- | Hardware | 1 × NVIDIA A100 SXM4 40GB |
- | Training time | ~11 minutes |
- | Best epoch | 2 (val_loss 0.106) |
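The 49-class arithmetic: 12 span labels × 4 BIOES tags + the O class. A sketch
(ten of the 12 label names appear in this card; the remaining two are elided):

```python
# Label inventory named in this card; the full 12-label list ships with
# the model config. BIOES: Begin / Inside / End / Single, plus Outside.
LABELS = [
    "private_address", "private_channel", "private_date", "private_email",
    "private_handle", "private_id", "private_person", "private_phone",
    "private_url", "secret",
    # ...plus the 2 remaining labels of the 12-class taxonomy
]

def bioes_classes(labels):
    return ["O"] + [f"{tag}-{lab}" for lab in labels for tag in "BIES"]

# With all 12 labels: 12 * 4 + 1 == 49 output classes.
```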
 
  ## Inference

@@ -154,147 +124,63 @@ for span in out.detected_spans:
  ```

  See [`examples/inference.py`](examples/inference.py) for a longer example
- including batched redaction across a screen-capture log file.
+ covering window titles, long-form text, and secrets.
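The snippet itself is elided from this diff; only the `detected_spans` loop is
visible in the hunk header above. A minimal sketch, where everything except
`detected_spans` is an assumed placeholder (see `examples/inference.py` for
the real API):

```python
# Hypothetical usage sketch — import path, loader, and entry point are
# assumptions; only `detected_spans` appears in this card.
from privacy_filter import PrivacyFilter  # assumed import

model = PrivacyFilter.from_pretrained("screenpipe/pii-redactor")  # assumed loader
out = model.redact("AXButton[Send to marcus@helios-ai.io]")       # assumed entry point
for span in out.detected_spans:
    print(span)  # presumably a label plus character offsets
```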
-
- ## Evaluation
-
- All numbers come from a held-out benchmark (private; access available
- under commercial license). 95 % bootstrap CIs (1,000 resamples) on
- zero-leak rate.
-
- ### Accessibility / window-title PII (n=422 — 345 with gold spans, 77 negatives)
-
- | Adapter | Zero-leak | Oversmash | Macro-F1 | Micro-F1 | p50 (ms) |
- |---|---:|---:|---:|---:|---:|
- | **this model** | **79.1% (74.8–83.5)** | 7.8% | 0.690 | 0.822 | 23 |
- | previous internal version | 78.0% (73.6–82.3) | 6.5% | 0.698 | 0.829 | 23 |
- | OpenAI Privacy Filter (base) | 38.6% (33.6–43.8) | 9.1% | 0.346 | 0.526 | 23 |
- | `layered` (regex + base + heuristics) | 65.8% (60.9–71.0) | 2.6% | 0.712 | 0.765 | 23 |
- | `gliner_pii` | 62.6% (57.1–67.5) | 79.2% | 0.444 | 0.526 | 104 |
- | Microsoft Presidio | 35.4% (30.4–40.3) | 22.1% | 0.199 | 0.430 | 6 |
-
- ### PII-Masking-300k cross-eval (English val, n=1000)
-
- | Adapter | Zero-leak | Oversmash | Macro-F1 | Micro-F1 |
- |---|---:|---:|---:|---:|
- | **this model** | **77.5% (74.5–80.3)** | 16.5% | **0.934** | **0.933** |
- | previous internal version | 74.5% (71.8–77.5) | 9.1% | 0.763 | 0.932 |
- | OpenAI Privacy Filter (base) | 14.0% (11.7–16.2) | 16.5% | 0.591 | 0.579 |
-
- > **What "14% zero-leak" for the base actually means** (read this before
- > citing the gap). Zero-leak is a strict, taxonomy-coupled metric: a
- > single example counts as "leaked" if the model misses ANY gold span
- > in it under our 12-class label mapping. The published OpenAI Privacy
- > Filter result is **F1 ≈ 96 %** on PII-Masking-300k under THEIR
- > 49-class taxonomy — that's a much more lenient setup. The base scores
- > 14 % zero-leak under our metric for two compounding reasons:
- >
- > 1. **Label-space mismatch** dominates. We map 28 source 300k labels
- >    into our 12 classes; the base model can't predict our label names.
- >    On categories where the base's native taxonomy DOES align with ours
- >    (`private_email`, `private_phone`, `private_url`, `secret`), the
- >    base scores **0.90–1.00 recall** — strong. On categories where it
- >    doesn't (`private_id` covering IDCARD/SOCIALNUMBER/PASSPORT,
- >    `private_handle` covering USERNAME), it scores **0.00** by
- >    definition because it never emits the right label.
- > 2. **Zero-leak is all-or-nothing per example.** With ~6 spans per
- >    300k example and any unmappable category present, base fails the
- >    whole example. Token-level F1 (0.591 above) is the more honest
- >    cross-comparison number.
- >
- > The +63 pp claim **is** real and useful for the deployment context
- > (anyone shipping a system that needs the screenpipe 12-class
- > taxonomy gets +63 pp out of the box vs the base). It would be
- > misleading to read it as "this model is 5× more accurate at PII
- > detection" — that's not what the metric measures.
-
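The mechanics of that mismatch are a label map. A hypothetical fragment of the
28 → 12 mapping (only the pairs named in the quote are grounded; the full
table is part of the private bench):

```python
# Source labels follow PII-Masking-300k naming; targets are the
# 12-class taxonomy. Illustrative fragment, not the bench table.
LABEL_MAP = {
    "USERNAME": "private_handle",
    "IDCARD": "private_id",
    "SOCIALNUMBER": "private_id",
    "PASSPORT": "private_id",
    # ...24 more source labels map into the remaining classes
}
```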
- ### Multilingual generalization (n=200 per language)
-
- This model was trained on English-only data. Cross-language transfer:
-
- | Language | this model zero-leak | base zero-leak | Δ vs base |
- |---|---:|---:|---:|
- | English | 76.8% (70.1–83.1) | 14.0% (11.7–16.2) | +62.8 |
- | Spanish | 73.2% (66.5–79.3) | — | — |
- | Italian | 70.8% (64.3–77.4) | — | — |
- | German | 70.6% (63.5–77.1) | 11.8% (7.6–16.5) | +58.8 |
- | French | 68.1% (61.5–75.3) | 14.8% (9.9–20.3) | +53.3 |
- | Dutch | 56.1% (48.9–63.3) | — | — |
-
- Romance + Germanic languages drop −3 to −9 pp from English. **Dutch is
- the weakest at −20.7 pp** — flagged as a known gap.
-
- ### Per-category recall (English, n=1000)
-
- | Category | base | this model |
- |---|---:|---:|
- | `private_address` | 0.65 | 0.93 |
- | `private_date` | 0.54 | 0.96 |
- | `private_email` | 1.00 | 0.97 |
- | `private_handle` | 0.00 | 0.82 |
- | `private_id` | 0.00 | 0.95 |
- | `private_person` | 0.71 | 0.93 |
- | `private_phone` | 0.97 | 0.93 |
- | `private_url` | 0.98 | 1.00 |
- | `secret` | 0.90 | 0.90 |
-
- ## Limitations and known failure modes
-
- 1. **Sudo / login password prompts leak.** A pattern like `[sudo]
+
+ ## Multilingual
+
+ This model handles 6 languages. Performance on a public long-form PII
+ benchmark (n=200 per language):
+
+ | Language | zero-leak |
+ |---|---:|
+ | English | 76.8% (70.1–83.1) |
+ | Spanish | 73.2% (66.5–79.3) |
+ | Italian | 70.8% (64.3–77.4) |
+ | German | 70.6% (63.5–77.1) |
+ | French | 68.1% (61.5–75.3) |
+ | Dutch | 56.1% (48.9–63.3) |
+
+ Romance + Germanic languages drop −3 to −9 pp from English.
+ **Dutch is the weakest**, flagged as a known gap.
+
+ ## Limitations
+
+ 1. **Sudo / login password prompts leak.** A pattern like `[sudo]
  password for alice: hunter2` results in the username being redacted
- but the password surviving. Targeted augmentation closed 4 of 5 such
- patterns; this is the lone surviving hard miss. **Mitigation**: use
- an OS-level keystroke-suppression policy alongside this model when
- the screen capture surface includes terminal sessions.
- 2. **Dutch is the weakest language** at −20.7 pp from English. Romance +
- Germanic languages other than Dutch generalize at −3 to −9 pp. Indic,
- Asian, African, Cyrillic scripts NOT evaluated at meaningful sample
+ but the password surviving. One known hard miss in the targeted
+ secret probe; mitigate with an OS-level keystroke-suppression policy
+ alongside this model.
+ 2. **Dutch is the weakest language** at −20.7 pp from English. Indic,
+ Asian, African, Cyrillic scripts not evaluated at meaningful sample
  sizes — don't deploy without a locale-specific eval pass.
- 3. **In-distribution generalization on 300k.** The model's training
-    corpus included a slice of the PII-Masking-300k *train* split; the
-    eval reports above are on the *val* split (disjoint examples but
-    same distribution). The window-title score (79.1 %) is the cleaner
-    generalization signal.
- 4. **Synthetic training data only.** Validated qualitatively on real
-    screen captures, but the corpus is fully synthetic. Validate on
-    YOUR data before deploying.
- 5. **Single-annotator gold labels** on the in-bench data. Absolute
-    numbers may shift under a 2nd-annotator pass; relative ordering
-    between adapters is more stable.
- 6. **Oversmash is non-trivial.** 7.8 % on window titles, 16.5 % on
+ 3. **Synthetic training data only.** No real user data was used during
+    fine-tuning. Validate on YOUR data before deploying.
+ 4. **Oversmash.** 7.8% on accessibility / window titles, 16.5% on
  long-form text. The model over-redacts. Acceptable for privacy-first
  deployments; flag if you need clean OCR text downstream.
- 7. **Soft taxonomy hits.** Sometimes redacts secrets correctly but
-    under a different label (`private_id` for `rk_live_…` Stripe keys,
-    `private_url` for whole DB connection strings). Privacy-correct,
-    per-category accounting blurry.
+ 5. **Strict label-space evaluation.** The numbers above use a
+    12-class taxonomy and a strict per-example zero-leak metric.
+    Absolute values depend on the evaluator's label taxonomy and metric
+    choice; macro-F1 is a more lenient point of comparison.
-
- ## Reproducing the inference numbers
+
+ ## Reproducing inference

- The held-out benchmark and training methodology are in a private
- repository. Inference is reproducible from the artifacts in this repo:

  ```bash
- git clone https://github.com/screenpipe/pii-redactor
+ git clone https://huggingface.co/screenpipe/pii-redactor
  cd pii-redactor
-
- # pull the model weights via Git LFS
  git lfs pull
-
- # install opf (currently from source)
  pip install git+https://github.com/openai/privacy-filter.git
-
- # run the inference example
  python examples/inference.py
  ```

- Verifying the eval scores requires the held-out benchmark. Contact
- **hi@louis030195.com** for benchmark access if you have a research or
- commercial use case.
+ Reproducing the eval scores requires our held-out benchmark, which is
+ not redistributed. Contact **hi@louis030195.com** for benchmark access
+ or commercial licensing.

  ## License

- [CC BY-NC 4.0](LICENSE) — non-commercial use only.
+ [CC BY-NC 4.0](LICENSE) — non-commercial use only. The base model is
+ Apache-2.0; obligations are preserved (see [`NOTICE`](NOTICE)).

  For commercial licensing (production deployment, redistribution rights,
  SaaS / API embedding, custom fine-tunes for your domain): **hi@louis030195.com**.
 