ysharma HF Staff committed on
Commit 47b2347 · verified · 1 Parent(s): a4f68e5

Update app_v6.py

Files changed (1)
  1. app_v6.py +8 -34
app_v6.py CHANGED
@@ -1,33 +1,7 @@
  """
- PII Reveal - Document Privacy Explorer (v6)
  ============================================
- Fixes regressions reported against v5.
-
- 1. Scan got slower in v5 (225s vs ~120s in v4)
- The v5 change moved the Viterbi decoder loop to CPU because I assumed
- the dominant cost on T4 was kernel-launch overhead. That was wrong.
- PyTorch's per-op dispatch cost on CPU (~50-200µs) is much higher than
- a CUDA kernel launch, so a 100k-iteration loop with 4-5 ops per iter
- regressed ~90-100s. Viterbi goes back on GPU (v4 behaviour). The
- remaining inference cost on T4 is fundamental — bf16 attention
- matmuls are emulated on Turing and there's no code fix for that, only
- a hardware change to L4 / A10 / A100.
-
- 2. Redacted PDF export took "several minutes" on large files
- The v5 implementation called page.search_for(needle) for every
- unique PII string on every page, even when the string didn't appear
- on the page. For a 100-page doc with 200 unique PII strings that's
- 20k search calls.
-
- v6 does a cheap `needle in page_text` prefilter per page (one
- page.get_text() call) before issuing any search_for, skips
- apply_redactions on pages with no matches, and saves with
- garbage=1 instead of 4 (the aggressive stream-recompression setting
- is slow and brings little benefit after redaction).
-
- 3. Retained from v5: light-theme refresh, PDF redaction endpoint, UI.
- Retained code-level perf win: single torch.cat instead of the
- unbind-then-stack roundtrip in predict_text.
  """

  # ── stdlib ───────────────────────────────────────────────────────
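
Review note: the per-page prefilter described in item 2 of the removed docstring can be sketched roughly as below. This is an illustrative reconstruction, not the code in app_v6.py. The function name `redact_pages` and its return value are hypothetical. The page objects are assumed to expose the PyMuPDF `Page` methods named in the docstring (`get_text`, `search_for`, `apply_redactions`) plus `add_redact_annot`.

```python
def redact_pages(pages, needles):
    """Redact each needle where it occurs; return how many search_for calls were made.

    `pages` is any iterable of PyMuPDF-like page objects (assumption).
    """
    search_calls = 0
    for page in pages:
        # One text extraction per page, reused for every needle.
        page_text = page.get_text()
        # Cheap substring prefilter: only needles that appear in the page text.
        present = [needle for needle in needles if needle in page_text]
        if not present:
            # No candidate on this page: skip search_for and apply_redactions.
            continue
        hits = 0
        for needle in present:
            search_calls += 1
            for rect in page.search_for(needle):
                # Black box over each hit; fill is an RGB triple in [0, 1].
                page.add_redact_annot(rect, fill=(0, 0, 0))
                hits += 1
        if hits:
            page.apply_redactions()
    return search_calls
```

With a real PyMuPDF document one would presumably pass the document itself as `pages` and then save with `garbage=1`, as the docstring describes. One caveat worth checking: a plain substring test on `get_text()` output may miss a string that is split across a line break, so the prefilter can trade a little recall for speed.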
@@ -809,7 +783,7 @@ FRONTEND_HTML = r"""<!DOCTYPE html>
  <head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width,initial-scale=1">
- <title>PII Reveal — Inspector</title>
  <link rel="preconnect" href="https://fonts.googleapis.com">
  <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
  <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&family=JetBrains+Mono:wght@400;500&family=Source+Serif+4:opsz,wght@8..60,400;8..60,500;8..60,600&display=swap" rel="stylesheet">
@@ -1072,7 +1046,7 @@ button{font:inherit;color:inherit;background:transparent;border:0;cursor:pointer
  <circle cx="8.5" cy="8.5" r="3.2" stroke="var(--block-background-fill)" stroke-width="1.4" fill="none"/>
  <line x1="11.2" y1="11.2" x2="14.2" y2="14.2" stroke="var(--block-background-fill)" stroke-width="1.4" stroke-linecap="round"/>
  </svg>
- <span class="u-brand-name">PII Reveal<span class="sub">/ inspector</span></span>
  </div>
  <h1 class="u-title">See what your documents are leaking.</h1>
  <p class="u-sub">Find every PII span in a PDF, DOC or DOCX — names, accounts, secrets and five other entity types — then export a fully redacted copy.</p>
@@ -1100,10 +1074,10 @@ button{font:inherit;color:inherit;background:transparent;border:0;cursor:pointer
  </div>

  <div class="u-meta">
- <span>openai privacy filter</span>
  <span>128k ctx</span>
- <span>bfloat16</span>
  <span>apache 2.0</span>
  </div>
  </div>

@@ -1136,7 +1110,7 @@ button{font:inherit;color:inherit;background:transparent;border:0;cursor:pointer
  <!-- ============ results view ============ -->
  <div id="results-view">
  <div class="shell">
- <div class="pr-app" aria-label="PII Reveal inspector">

  <div class="pr-top">
  <div class="pr-logo">
@@ -1145,7 +1119,7 @@ button{font:inherit;color:inherit;background:transparent;border:0;cursor:pointer
  <circle cx="8.5" cy="8.5" r="3.2" stroke="var(--block-background-fill)" stroke-width="1.4" fill="none"/>
  <line x1="11.2" y1="11.2" x2="14.2" y2="14.2" stroke="var(--block-background-fill)" stroke-width="1.4" stroke-linecap="round"/>
  </svg>
- <span class="pr-name">PII Reveal<span class="pr-name-sub">/ inspector</span></span>
  </div>
  <span class="pr-file-chip" id="file-chip"></span>
  <span class="pr-status" id="scan-status"><span class="pr-status-dot"></span>Scan complete</span>
 
  """
  ============================================
+ PII Explorer - Document Privacy Explorer / Playground
+ ============================================
  """

  # ── stdlib ───────────────────────────────────────────────────────
 
  <head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width,initial-scale=1">
+ <title>PII Explorer — Playground</title>
  <link rel="preconnect" href="https://fonts.googleapis.com">
  <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
  <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&family=JetBrains+Mono:wght@400;500&family=Source+Serif+4:opsz,wght@8..60,400;8..60,500;8..60,600&display=swap" rel="stylesheet">
 
  <circle cx="8.5" cy="8.5" r="3.2" stroke="var(--block-background-fill)" stroke-width="1.4" fill="none"/>
  <line x1="11.2" y1="11.2" x2="14.2" y2="14.2" stroke="var(--block-background-fill)" stroke-width="1.4" stroke-linecap="round"/>
  </svg>
+ <span class="u-brand-name">PII Explorer<span class="sub">/ Playground</span></span>
  </div>
  <h1 class="u-title">See what your documents are leaking.</h1>
  <p class="u-sub">Find every PII span in a PDF, DOC or DOCX — names, accounts, secrets and five other entity types — then export a fully redacted copy.</p>
 
  </div>

  <div class="u-meta">
+ <span><b>OpenAI Privacy Filter</b></span>
  <span>128k ctx</span>
  <span>apache 2.0</span>
+ <span><b>gr.Server</b></span>
  </div>
  </div>

 
  <!-- ============ results view ============ -->
  <div id="results-view">
  <div class="shell">
+ <div class="pr-app" aria-label="PII Explorer Playground">

  <div class="pr-top">
  <div class="pr-logo">
 
  <circle cx="8.5" cy="8.5" r="3.2" stroke="var(--block-background-fill)" stroke-width="1.4" fill="none"/>
  <line x1="11.2" y1="11.2" x2="14.2" y2="14.2" stroke="var(--block-background-fill)" stroke-width="1.4" stroke-linecap="round"/>
  </svg>
+ <span class="pr-name">PII Explorer<span class="pr-name-sub">/ Playground</span></span>
  </div>
  <span class="pr-file-chip" id="file-chip"></span>
  <span class="pr-status" id="scan-status"><span class="pr-status-dot"></span>Scan complete</span>