Update app_v6.py
app_v6.py
CHANGED
|
@@ -1,33 +1,7 @@
|
|
| 1 |
"""
|
| 2 |
-
PII Reveal - Document Privacy Explorer (v6)
|
| 3 |
============================================
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
1. Scan got slower in v5 (225s vs ~120s in v4)
|
| 7 |
-
The v5 change moved the Viterbi decoder loop to CPU because I assumed
|
| 8 |
-
the dominant cost on T4 was kernel-launch overhead. That was wrong.
|
| 9 |
-
PyTorch's per-op dispatch cost on CPU (~50-200µs) is much higher than
|
| 10 |
-
a CUDA kernel launch, so a 100k-iteration loop with 4-5 ops per iter
|
| 11 |
-
regressed ~90-100s. Viterbi goes back on GPU (v4 behaviour). The
|
| 12 |
-
remaining inference cost on T4 is fundamental - bf16 attention
|
| 13 |
-
matmuls are emulated on Turing and there's no code fix for that, only
|
| 14 |
-
a hardware change to L4 / A10 / A100.
|
| 15 |
-
|
| 16 |
-
2. Redacted PDF export took "several minutes" on large files
|
| 17 |
-
The v5 implementation called page.search_for(needle) for every
|
| 18 |
-
unique PII string on every page, even when the string didn't appear
|
| 19 |
-
on the page. For a 100-page doc with 200 unique PII strings that's
|
| 20 |
-
20k search calls.
|
| 21 |
-
|
| 22 |
-
v6 does a cheap `needle in page_text` prefilter per page (one
|
| 23 |
-
page.get_text() call) before issuing any search_for, skips
|
| 24 |
-
apply_redactions on pages with no matches, and saves with
|
| 25 |
-
garbage=1 instead of 4 (the aggressive stream-recompression setting
|
| 26 |
-
is slow and brings little benefit after redaction).
|
| 27 |
-
|
| 28 |
-
3. Retained from v5: light-theme refresh, PDF redaction endpoint, UI.
|
| 29 |
-
Retained code-level perf win: single torch.cat instead of the
|
| 30 |
-
unbind-then-stack roundtrip in predict_text.
|
| 31 |
"""
|
| 32 |
|
| 33 |
# ── stdlib ───────────────────────────────────────────────────────
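The removed note refers to a Viterbi decoder loop without showing it. As a rough illustration only, here is a generic textbook Viterbi decoder in pure Python. It is not the decoder in app_v6.py (which presumably operates on PyTorch tensors), and the function name and argument layout are assumptions. It shows why the loop is inherently sequential: each step reads the scores produced by the previous step, so every iteration is a handful of small operations whose fixed per-op overhead cannot be amortised across steps.

```python
from typing import List, Sequence


def viterbi(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[int]:
    """Return the highest-scoring label path (all scores are additive, e.g. log-probs).

    emissions[t][j]   -- score of label j at step t
    transitions[i][j] -- score of moving from label i to label j
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    backptrs: List[List[int]] = []

    # Sequential loop: step t cannot start before step t-1 has finished.
    for t in range(1, len(emissions)):
        new_score: List[float] = []
        ptr: List[int] = []
        for j in range(n_labels):
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score = new_score
        backptrs.append(ptr)

    # Backtrack from the best final label.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


# Switching labels is heavily penalised, so the best path stays on label 0
# even though label 1 has the higher per-step score at steps 1 and 2.
demo_path = viterbi(
    emissions=[[3.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    transitions=[[0.0, -5.0], [-5.0, 0.0]],
)
print(demo_path)  # [0, 0, 0]
```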
|
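The regression size quoted in the removed note can be sanity-checked with back-of-envelope arithmetic. The sketch below uses only the figures from the docstring (100k iterations, 4-5 ops per iteration, ~50-200 µs per op); the helper name is hypothetical, and the result is an estimate, not a measurement.

```python
def dispatch_overhead_seconds(iterations: int, ops_per_iter: float, per_op_us: float) -> float:
    """Estimated total per-op dispatch overhead, in seconds, for a sequential loop."""
    return iterations * ops_per_iter * per_op_us / 1_000_000


low = dispatch_overhead_seconds(100_000, 4, 50)    # cheapest case quoted
high = dispatch_overhead_seconds(100_000, 5, 200)  # most expensive case quoted
print(f"estimated overhead: {low:.0f}s to {high:.0f}s")  # estimated overhead: 20s to 100s
```

The reported ~90-100 s regression sits at the upper end of this range, which matches a per-op cost near 200 µs (for example, 4.5 ops at 200 µs over 100k iterations gives 90 s).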
|
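The prefilter described in item 2 of the removed note can be sketched without any PDF library. In the sketch below, `page_texts` stands in for one `page.get_text()` result per page and the `search` callback stands in for `page.search_for`; `find_matches` and `fake_search` are hypothetical names, and this is an illustration of the idea under those assumptions, not the code in app_v6.py.

```python
from typing import Callable, Dict, List, Sequence


def find_matches(
    page_texts: Sequence[str],
    needles: Sequence[str],
    search: Callable[[int, str], List[str]],
) -> Dict[int, List[str]]:
    """Call the expensive search only for needles that occur in the page text."""
    hits: Dict[int, List[str]] = {}
    for page_no, text in enumerate(page_texts):
        for needle in needles:
            if needle not in text:  # cheap substring prefilter
                continue
            found = search(page_no, needle)
            if found:
                hits.setdefault(page_no, []).extend(found)
    return hits


calls: List[tuple] = []


def fake_search(page_no: int, needle: str) -> List[str]:
    """Stand-in for the expensive per-page search; records each call."""
    calls.append((page_no, needle))
    return [needle]


page_texts = [
    "Contact Alice at alice@example.com",
    "No personal data here",
    "Alice again",
]
needles = ["Alice", "alice@example.com", "555-0100"]

hits = find_matches(page_texts, needles, fake_search)
print(len(calls), "search calls instead of", len(page_texts) * len(needles))
```

Pages that never appear in `hits` (page 1 here) need no redaction pass at all, which corresponds to skipping apply_redactions on pages with no matches. One caveat: a plain substring test is stricter than a text search that ignores case or tolerates line breaks, so the prefilter is best treated as a heuristic and checked against the search behaviour actually in use.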
@@ -809,7 +783,7 @@ FRONTEND_HTML = r"""<!DOCTYPE html>
|
|
| 809 |
<head>
|
| 810 |
<meta charset="UTF-8">
|
| 811 |
<meta name="viewport" content="width=device-width,initial-scale=1">
|
| 812 |
-
<title>PII
|
| 813 |
<link rel="preconnect" href="https://fonts.googleapis.com">
|
| 814 |
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
|
| 815 |
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&family=JetBrains+Mono:wght@400;500&family=Source+Serif+4:opsz,wght@8..60,400;8..60,500;8..60,600&display=swap" rel="stylesheet">
|
|
@@ -1072,7 +1046,7 @@ button{font:inherit;color:inherit;background:transparent;border:0;cursor:pointer
|
|
| 1072 |
<circle cx="8.5" cy="8.5" r="3.2" stroke="var(--block-background-fill)" stroke-width="1.4" fill="none"/>
|
| 1073 |
<line x1="11.2" y1="11.2" x2="14.2" y2="14.2" stroke="var(--block-background-fill)" stroke-width="1.4" stroke-linecap="round"/>
|
| 1074 |
</svg>
|
| 1075 |
-
<span class="u-brand-name">PII
|
| 1076 |
</div>
|
| 1077 |
<h1 class="u-title">See what your documents are leaking.</h1>
|
| 1078 |
<p class="u-sub">Find every PII span in a PDF, DOC or DOCX - names, accounts, secrets and five other entity types - then export a fully redacted copy.</p>
|
|
@@ -1100,10 +1074,10 @@ button{font:inherit;color:inherit;background:transparent;border:0;cursor:pointer
|
|
| 1100 |
</div>
|
| 1101 |
|
| 1102 |
<div class="u-meta">
|
| 1103 |
-
<span>
|
| 1104 |
<span>128k ctx</span>
|
| 1105 |
-
<span>bfloat16</span>
|
| 1106 |
<span>apache 2.0</span>
|
|
|
|
| 1107 |
</div>
|
| 1108 |
</div>
|
| 1109 |
|
|
@@ -1136,7 +1110,7 @@ button{font:inherit;color:inherit;background:transparent;border:0;cursor:pointer
|
|
| 1136 |
<!-- ============ results view ============ -->
|
| 1137 |
<div id="results-view">
|
| 1138 |
<div class="shell">
|
| 1139 |
-
<div class="pr-app" aria-label="PII
|
| 1140 |
|
| 1141 |
<div class="pr-top">
|
| 1142 |
<div class="pr-logo">
|
|
@@ -1145,7 +1119,7 @@ button{font:inherit;color:inherit;background:transparent;border:0;cursor:pointer
|
|
| 1145 |
<circle cx="8.5" cy="8.5" r="3.2" stroke="var(--block-background-fill)" stroke-width="1.4" fill="none"/>
|
| 1146 |
<line x1="11.2" y1="11.2" x2="14.2" y2="14.2" stroke="var(--block-background-fill)" stroke-width="1.4" stroke-linecap="round"/>
|
| 1147 |
</svg>
|
| 1148 |
-
<span class="pr-name">PII
|
| 1149 |
</div>
|
| 1150 |
<span class="pr-file-chip" id="file-chip"></span>
|
| 1151 |
<span class="pr-status" id="scan-status"><span class="pr-status-dot"></span>Scan complete</span>
|
|
|
|
| 1 |
"""
|
|
|
|
| 2 |
============================================
|
| 3 |
+
PII Explorer - Document Privacy Explorer / Playground
|
| 4 |
+
============================================
|
| 5 |
"""
|
| 6 |
|
| 7 |
# ── stdlib ───────────────────────────────────────────────────────
|
|
|
|
| 783 |
<head>
|
| 784 |
<meta charset="UTF-8">
|
| 785 |
<meta name="viewport" content="width=device-width,initial-scale=1">
|
| 786 |
+
<title>PII Explorer - Playground</title>
|
| 787 |
<link rel="preconnect" href="https://fonts.googleapis.com">
|
| 788 |
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
|
| 789 |
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&family=JetBrains+Mono:wght@400;500&family=Source+Serif+4:opsz,wght@8..60,400;8..60,500;8..60,600&display=swap" rel="stylesheet">
|
|
|
|
| 1046 |
<circle cx="8.5" cy="8.5" r="3.2" stroke="var(--block-background-fill)" stroke-width="1.4" fill="none"/>
|
| 1047 |
<line x1="11.2" y1="11.2" x2="14.2" y2="14.2" stroke="var(--block-background-fill)" stroke-width="1.4" stroke-linecap="round"/>
|
| 1048 |
</svg>
|
| 1049 |
+
<span class="u-brand-name">PII Explorer<span class="sub">/ Playground</span></span>
|
| 1050 |
</div>
|
| 1051 |
<h1 class="u-title">See what your documents are leaking.</h1>
|
| 1052 |
<p class="u-sub">Find every PII span in a PDF, DOC or DOCX - names, accounts, secrets and five other entity types - then export a fully redacted copy.</p>
|
|
|
|
| 1074 |
</div>
|
| 1075 |
|
| 1076 |
<div class="u-meta">
|
| 1077 |
+
<span><b>OpenAI Privacy Filter</b></span>
|
| 1078 |
<span>128k ctx</span>
|
|
|
|
| 1079 |
<span>apache 2.0</span>
|
| 1080 |
+
<span><b>gr.Server</b></span>
|
| 1081 |
</div>
|
| 1082 |
</div>
|
| 1083 |
|
|
|
|
| 1110 |
<!-- ============ results view ============ -->
|
| 1111 |
<div id="results-view">
|
| 1112 |
<div class="shell">
|
| 1113 |
+
<div class="pr-app" aria-label="PII Explorer Playground">
|
| 1114 |
|
| 1115 |
<div class="pr-top">
|
| 1116 |
<div class="pr-logo">
|
|
|
|
| 1119 |
<circle cx="8.5" cy="8.5" r="3.2" stroke="var(--block-background-fill)" stroke-width="1.4" fill="none"/>
|
| 1120 |
<line x1="11.2" y1="11.2" x2="14.2" y2="14.2" stroke="var(--block-background-fill)" stroke-width="1.4" stroke-linecap="round"/>
|
| 1121 |
</svg>
|
| 1122 |
+
<span class="pr-name">PII Explorer<span class="pr-name-sub">/ Playground</span></span>
|
| 1123 |
</div>
|
| 1124 |
<span class="pr-file-chip" id="file-chip"></span>
|
| 1125 |
<span class="pr-status" id="scan-status"><span class="pr-status-dot"></span>Scan complete</span>
|