---
license: apache-2.0
base_model: openai/privacy-filter
tags:
- token-classification
- text-classification
- multi-task
- pii-detection
- document-classification
- privacy
datasets:
- ai4privacy/pii-masking-400k
- community-datasets/yahoo_answers_topics
metrics:
- f1
- accuracy
model-index:
- name: privacy-filter-multitask
  results:
  - task:
      type: token-classification
      name: PII Detection (NER)
    dataset:
      name: ai4privacy/pii-masking-400k
      type: ai4privacy/pii-masking-400k
    metrics:
    - type: f1
      value: 0.4925
      name: F1 (strict span-level)
    - type: precision
      value: 0.6968
    - type: recall
      value: 0.3809
  - task:
      type: text-classification
      name: Document Classification (10 classes)
    dataset:
      name: yahoo_answers_topics
      type: community-datasets/yahoo_answers_topics
    metrics:
    - type: accuracy
      value: 0.4776
      name: Test Accuracy
---

# Privacy Filter Multi-Task 🔒📄

A **single model** for simultaneous **PII Detection (NER)** and **Document Classification (10 categories)**.

Adapted from [openai/privacy-filter](https://huggingface.co/openai/privacy-filter) — a 1.4B Sparse MoE transformer with only ~50M active parameters per token.

## Architecture

```
Input → BPE Tokenizer (o200k_base, 200K vocab)

8-layer Sparse MoE Transformer
  • 128 experts, top-4 routing (~50M active params/token)
  • Banded sliding-window attention (window=128)
  • GQA: 14 query heads, 2 KV heads, head_dim=64
  • Hidden size: 640
  ↓                          ↓
NER Head (640→33)        Doc Head (mean-pool → 640→10)
  ↓                          ↓
BIOES PII tags            10-class document category
```
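
Both heads read the same backbone states: the NER head is applied per token, while the doc head sees a masked mean-pool of the final hidden layer. A minimal sketch of that layout (the wrapper class and the backbone's call signature are hypothetical; the shipped weights live in `model.safetensors` and `doc_head.pt`):

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Hypothetical wrapper showing how the two heads share one backbone."""

    def __init__(self, backbone: nn.Module, hidden: int = 640,
                 ner_labels: int = 33, doc_classes: int = 10):
        super().__init__()
        self.backbone = backbone                       # 8-layer sparse MoE
        self.ner_head = nn.Linear(hidden, ner_labels)  # per-token BIOES logits
        self.doc_head = nn.Linear(hidden, doc_classes) # per-document logits

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask)          # [B, T, 640]
        ner_logits = self.ner_head(hidden)                         # [B, T, 33]
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1) # masked mean
        doc_logits = self.doc_head(pooled)                         # [B, 10]
        return ner_logits, doc_logits
```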

## Results

### PII Detection (NER)

| Metric | Value |
|--------|-------|
| **F1 (strict span-level)** | **0.493** |
| Precision | 0.697 |
| Recall | 0.381 |
| Token Accuracy | 0.944 |

8 entity types: `private_person` · `private_email` · `private_phone` · `private_address` · `private_date` · `private_url` · `account_number` · `secret`
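
The NER head's 33 outputs follow directly from this list: under BIOES, each of the 8 types expands into `B-`/`I-`/`E-`/`S-` variants, plus one shared `O` tag (8 × 4 + 1 = 33). A sketch of that label space (exact strings and ordering come from `model.config.id2label`):

```python
# BIOES tagging: four positional tags per entity type, plus one shared "O".
entity_types = [
    "private_person", "private_email", "private_phone", "private_address",
    "private_date", "private_url", "account_number", "secret",
]
labels = ["O"] + [f"{p}-{t}" for t in entity_types for p in ("B", "I", "E", "S")]
assert len(labels) == 33  # matches the NER head's 640 -> 33 projection
```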

### Document Classification (10 classes)

| Split | Accuracy |
|-------|----------|
| Val | 0.470 |
| **Test** | **0.478** |

Per-class test accuracy:

| Category | Accuracy |
|----------|----------|
| Computers & Internet | 0.688 |
| Family & Relationships | 0.615 |
| Science & Mathematics | 0.556 |
| Health | 0.524 |
| Sports | 0.523 |
| Politics & Government | 0.493 |
| Entertainment & Music | 0.444 |
| Society & Culture | 0.363 |
| Education & Reference | 0.310 |
| Business & Finance | 0.263 |

---

## 🚀 Production Inference Guide

All numbers below were measured on real hardware with both task heads (NER + doc classification) executing on every call: each benchmarked forward pass produces the PII entity tags **and** the document category simultaneously.
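
The exact benchmark script isn't reproduced here, but the measurement pattern is standard. A minimal sketch, assuming a CUDA device, with warmup iterations and explicit synchronization around each timed forward pass:

```python
import time
import torch

def bench_latency(model, inputs, warmup=10, iters=100):
    # Warm up kernels/caches so the timed loop measures steady-state latency.
    for _ in range(warmup):
        with torch.no_grad():
            model(**inputs, output_hidden_states=True)
    torch.cuda.synchronize()

    times_ms = []
    for _ in range(iters):
        t0 = time.perf_counter()
        with torch.no_grad():
            model(**inputs, output_hidden_states=True)
        torch.cuda.synchronize()  # wait for the GPU before stopping the clock
        times_ms.append((time.perf_counter() - t0) * 1e3)

    times_ms.sort()
    return {
        "mean": sum(times_ms) / len(times_ms),
        "p95": times_ms[int(0.95 * len(times_ms)) - 1],
        "p99": times_ms[int(0.99 * len(times_ms)) - 1],
    }
```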

### Resource Requirements

| Resource | Value |
|----------|-------|
| Model weights (bf16) | **2.8 GB** GPU VRAM / RAM |
| Model weights (fp32) | **5.6 GB** RAM |
| ONNX variants available upstream | fp16, int8, q4 (see [openai/privacy-filter](https://huggingface.co/openai/privacy-filter/tree/main/onnx)) |
| Min GPU VRAM (bs=1, seq≤512) | **2.9 GB** |
| Min GPU VRAM (bs=64, seq=512) | **6.2 GB** |
| Fits on | T4 (16 GB), L4 (24 GB), A10G (24 GB), A100, any ≥8 GB GPU |

### GPU — Single-Document Latency (NVIDIA A10G, bf16)

Time from raw text to both NER tags + document category:

| Sequence Length | Latency (mean) | Latency (p95) | Latency (p99) |
|:-:|:-:|:-:|:-:|
| 64 tokens | 113 ms | 117 ms | 122 ms |
| 128 tokens | 106 ms | 110 ms | 115 ms |
| 256 tokens | 106 ms | 111 ms | 113 ms |
| 512 tokens | 106 ms | 113 ms | 116 ms |

> Latency is dominated by a fixed ~105 ms kernel-launch overhead from the Sparse MoE routing — it barely changes with sequence length up to 512 tokens.

### GPU — Batched Throughput (NVIDIA A10G, bf16)

| Batch Size | Seq 64 | Seq 128 | Seq 256 | Seq 512 |
|:-:|:-:|:-:|:-:|:-:|
| **1** | 8.9 docs/s | 9.4 docs/s | 9.4 docs/s | 9.4 docs/s |
| **4** | 36 docs/s | 37 docs/s | 37 docs/s | 32 docs/s |
| **8** | 73 docs/s | 73 docs/s | 69 docs/s | 53 docs/s |
| **16** | 139 docs/s | 138 docs/s | 114 docs/s | 73 docs/s |
| **32** | 265 docs/s | 238 docs/s | 165 docs/s | 89 docs/s |
| **64** | **460 docs/s** | **348 docs/s** | **207 docs/s** | **101 docs/s** |

### GPU — Batched Latency Detail (NVIDIA A10G, bf16)

<details>
<summary>Full latency table (click to expand)</summary>

| Batch | Seq Len | Batch Latency (ms) | Per-Doc (ms) | p95 (ms) | p99 (ms) |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 64 | 113 | 112.7 | 117 | 122 |
| 4 | 64 | 111 | 27.8 | 116 | 118 |
| 8 | 64 | 110 | 13.8 | 114 | 126 |
| 16 | 64 | 115 | 7.2 | 121 | 125 |
| 32 | 64 | 121 | 3.8 | 127 | 135 |
| 64 | 64 | 139 | 2.2 | 144 | 144 |
| 1 | 128 | 106 | 105.9 | 110 | 115 |
| 4 | 128 | 107 | 26.9 | 112 | 115 |
| 8 | 128 | 110 | 13.7 | 115 | 116 |
| 16 | 128 | 116 | 7.3 | 121 | 128 |
| 32 | 128 | 134 | 4.2 | 139 | 143 |
| 64 | 128 | 184 | 2.9 | 189 | 191 |
| 1 | 256 | 106 | 106.1 | 111 | 113 |
| 4 | 256 | 109 | 27.2 | 114 | 115 |
| 8 | 256 | 117 | 14.6 | 123 | 126 |
| 16 | 256 | 140 | 8.8 | 145 | 147 |
| 32 | 256 | 194 | 6.1 | 199 | 202 |
| 64 | 256 | 309 | 4.8 | 314 | 315 |
| 1 | 512 | 106 | 106.5 | 113 | 116 |
| 4 | 512 | 125 | 31.2 | 129 | 130 |
| 8 | 512 | 152 | 19.0 | 158 | 165 |
| 16 | 512 | 219 | 13.7 | 223 | 225 |
| 32 | 512 | 358 | 11.2 | 361 | 364 |
| 64 | 512 | 636 | 9.9 | 639 | 641 |

</details>

### GPU — Peak VRAM Usage (bf16)

| Batch Size | Seq 128 | Seq 256 | Seq 512 |
|:-:|:-:|:-:|:-:|
| 1 | 2,817 MB | 2,824 MB | 2,862 MB |
| 8 | 2,857 MB | 2,936 MB | 3,237 MB |
| 32 | 3,000 MB | 3,309 MB | 4,522 MB |
| 64 | 3,189 MB | 3,809 MB | **6,236 MB** |

> The model is extremely memory-efficient. Even at batch=64, seq=512 it uses only 6.2 GB, fitting comfortably on a T4 (16 GB), because the Sparse MoE activates only 4 of 128 experts per token.

### CPU — Latency & Throughput (AMD EPYC 7R32, 8 cores, fp32)

| Batch | Seq 64 | Seq 128 | Seq 256 | Seq 512 |
|:-:|:-:|:-:|:-:|:-:|
| **1** | 152 ms (6.6/s) | 193 ms (5.2/s) | 302 ms (3.3/s) | 569 ms (1.8/s) |
| **4** | 278 ms (14.4/s) | 468 ms (8.6/s) | 935 ms (4.3/s) | 2,464 ms (1.6/s) |
| **8** | 467 ms (17.1/s) | 862 ms (9.3/s) | 1,728 ms (4.6/s) | 4,745 ms (1.7/s) |
| **16** | 837 ms (19.1/s) | 1,624 ms (9.9/s) | 3,814 ms (4.2/s) | 9,143 ms (1.7/s) |

> On CPU the model runs at ~152 ms/doc for short texts (seq=64, bs=1) — suitable for low-volume or batch-offline pipelines.

### Daily Throughput Projections

Sustained throughput for a **single device**, running 24/7 at the optimal batch size:

| Sequence Length | GPU (A10G, bf16) | CPU (8-core, fp32) |
|:-:|:-:|:-:|
| 64 tokens | **39.8M docs/day** (460/s, bs=64) | 1.7M docs/day (19/s, bs=16) |
| 128 tokens | **30.1M docs/day** (348/s, bs=64) | 855K docs/day (10/s, bs=16) |
| 256 tokens | **17.9M docs/day** (207/s, bs=64) | 397K docs/day (4.6/s, bs=8) |
| 512 tokens | **8.7M docs/day** (101/s, bs=64) | 156K docs/day (1.8/s, bs=1) |
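
The projections are plain arithmetic: sustained docs/day = docs/s × 86,400 seconds. For example:

```python
# Daily projection = sustained throughput x seconds per day.
docs_per_sec = 460  # A10G, bf16, bs=64, seq=64 (throughput table above)
print(f"{docs_per_sec * 86_400 / 1e6:.1f}M docs/day")
# -> 39.7M (the table's 39.8M rounds from the unrounded rate)
```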

#### Multi-GPU Scaling Estimates

| Config | seq=128 | seq=256 | seq=512 |
|--------|:-:|:-:|:-:|
| 1× A10G (24 GB, ~$1/hr) | 30M/day | 18M/day | 8.7M/day |
| 1× A100 (80 GB, ~$3/hr) | ~70M/day¹ | ~42M/day¹ | ~20M/day¹ |
| 4× A10G data-parallel | 120M/day | 72M/day | 35M/day |
| 8× A10G data-parallel | 240M/day | 143M/day | 70M/day |

<sub>¹ A100 estimates are linearly extrapolated from A10G numbers using A100's ~2.3× higher memory bandwidth and larger batch capacity. Actual numbers will vary — benchmark on your target hardware.</sub>
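
The linear scaling assumed for the 4×/8× rows is plain data parallelism: independent replicas, one per GPU, each consuming its own shard of pre-batched documents. A minimal sketch (the corpus and round-robin sharding are stand-ins; the doc head is omitted for brevity):

```python
import torch
import torch.multiprocessing as mp
from transformers import AutoModelForTokenClassification, AutoTokenizer

def worker(rank, shards):
    # One independent replica per GPU.
    device = f"cuda:{rank}"
    tokenizer = AutoTokenizer.from_pretrained("binga/privacy-filter-multitask")
    model = AutoModelForTokenClassification.from_pretrained(
        "binga/privacy-filter-multitask", dtype=torch.bfloat16
    ).to(device).eval()
    for batch in shards[rank]:
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=256).to(device)
        with torch.no_grad():
            model(**inputs, output_hidden_states=True)

if __name__ == "__main__":
    corpus = [f"document {i}" for i in range(4096)]            # stand-in corpus
    batches = [corpus[i:i + 64] for i in range(0, len(corpus), 64)]
    n = torch.cuda.device_count()
    shards = [batches[r::n] for r in range(n)]                 # round-robin shard
    mp.spawn(worker, args=(shards,), nprocs=n)
```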

### Serving Recommendations

| Deployment Scenario | Recommended Config | Expected Perf |
|---|---|---|
| **Real-time API** (SLA <200ms) | 1× GPU, bs=1, seq≤512 | ~106 ms p50, ~113 ms p95 |
| **Near-real-time** (SLA <500ms) | 1× GPU, bs=8–16, seq≤512 | 53–73 docs/s, p95 <225 ms |
| **High-throughput batch** | 1× GPU, bs=64, seq=256 | 207 docs/s, 17.9M/day |
| **Max throughput batch** | 1× GPU, bs=64, seq=64² | 460 docs/s, 39.8M/day |
| **CPU offline / dev** | CPU, bs=1, seq≤256 | 3–7 docs/s |

<sub>² At seq=64 most documents will be truncated. Use seq=128–256 for production balance.</sub>

**Key observations:**
- The model has a **fixed ~105 ms overhead** per forward pass regardless of sequence length (MoE routing + expert dispatch). Batching amortizes this cost across documents — the per-doc cost drops from 106 ms (bs=1) to under 10 ms (bs=64).
- **Memory is not the bottleneck** — even at bs=64/seq=512 the model uses only 6.2 GB. You can run this on a T4 (16 GB) with room to spare.
- **Optimal batch size for throughput**: bs=64 for all sequence lengths on A10G.
- **Optimal batch size for latency-constrained**: bs=8–16 gives a good per-doc latency (13–19 ms) while keeping batch latency under 225 ms.

---

## Training Strategy

Two-phase training approach:

1. **Phase 1 — Multi-task fine-tuning**: Partially unfroze the backbone (last 4 MoE layers) + both task heads. Trained on 20K NER examples (ai4privacy) + 20K doc-classification examples (Yahoo Answers) with a weighted multi-task loss (NER×1.0 + Doc×0.5); 2 epochs, LR=2e-5.

2. **Phase 2 — Doc head retraining** (head-only): Froze the entire backbone + NER head. Pre-computed 640-dim pooled features for 100K Yahoo Answers examples, then trained a fresh `Linear(640→10)` classifier for 10 epochs, LR=1e-3, cosine decay (see the sketch after this list). This approach:
   - Preserves NER performance exactly (backbone untouched)
   - Is extremely fast (~seconds per epoch on cached features)
   - Achieves **47.8% test accuracy** (up from 24.8% in phase 1)
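
A minimal sketch of the Phase 2 recipe, using random tensors as stand-ins for the cached pooled features and labels (the batch size of 512 is an assumption; epochs, LR, and schedule follow the recipe above):

```python
import torch
import torch.nn as nn

# Stand-ins for the cached 640-dim pooled features and their Yahoo Answers labels.
features = torch.randn(100_000, 640)
labels = torch.randint(0, 10, (100_000,))
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(features, labels), batch_size=512, shuffle=True
)

# Fresh head; the backbone and NER head are frozen and never touched here.
doc_head = nn.Linear(640, 10)
opt = torch.optim.AdamW(doc_head.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10 * len(loader))

for epoch in range(10):
    for x, y in loader:
        loss = nn.functional.cross_entropy(doc_head(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
```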

## Usage

```python
import torch
import torch.nn as nn
from transformers import AutoModelForTokenClassification, AutoTokenizer
from huggingface_hub import hf_hub_download

# Load model + tokenizer
tokenizer = AutoTokenizer.from_pretrained("binga/privacy-filter-multitask")
model = AutoModelForTokenClassification.from_pretrained(
    "binga/privacy-filter-multitask", dtype=torch.bfloat16, device_map="auto"
)

# Load document classification head
doc_head = nn.Linear(640, 10)
doc_head.load_state_dict(torch.load(
    hf_hub_download("binga/privacy-filter-multitask", "doc_head.pt"),
    weights_only=True, map_location=model.device
))
doc_head = doc_head.to(dtype=torch.bfloat16, device=model.device)
doc_head.eval()

# Inference
text = "John Smith (SSN: 123-45-6789) emailed john@corp.com about Q3 earnings."
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# === PII Detection ===
print("PII entities:")
for tok, pred in zip(
    tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
    outputs.logits.argmax(-1)[0]
):
    label = model.config.id2label[pred.item()]
    if label != "O":
        print(f"  {tok} → {label}")

# === Document Classification ===
categories = [
    "Society & Culture", "Science & Math", "Health", "Education",
    "Computers & Internet", "Sports", "Business & Finance",
    "Entertainment", "Family", "Politics"
]
hidden = outputs.hidden_states[-1]
mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
probs = torch.softmax(doc_head(pooled)[0].float(), dim=-1)
top = probs.argmax().item()
print(f"\nCategory: {categories[top]} ({probs[top]:.1%})")
```
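
The per-token printout above is convenient for inspection, but production callers usually want character-level spans. A hedged sketch of BIOES-to-span decoding, assuming tag strings of the form `B-private_email` (verify against `model.config.id2label`) and a fast tokenizer that supports offset mapping:

```python
def extract_spans(text, tokenizer, model):
    """Group BIOES token tags into (entity_type, surface_text) spans."""
    enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
    offsets = enc.pop("offset_mapping")[0].tolist()
    enc = {k: v.to(model.device) for k, v in enc.items()}
    with torch.no_grad():
        preds = model(**enc).logits.argmax(-1)[0].tolist()

    spans, cur = [], None  # cur = [entity_type, char_start, char_end]

    def flush():
        nonlocal cur
        if cur is not None:
            spans.append((cur[0], text[cur[1]:cur[2]]))
        cur = None

    for (start, end), pred in zip(offsets, preds):
        label = model.config.id2label[pred]
        if label == "O" or start == end:           # non-entity or special token
            flush()
            continue
        prefix, etype = label.split("-", 1)
        if prefix in ("B", "S") or cur is None or etype != cur[0]:
            flush()                                # a new entity begins
            cur = [etype, start, end]
        cur[2] = end
        if prefix in ("E", "S"):                   # entity ends on this token
            flush()
    flush()
    return spans

print(extract_spans(text, tokenizer, model))
```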

### Batched Inference (Production)

```python
# Process a batch of documents — both tasks in a single forward pass
texts = ["doc1...", "doc2...", "doc3..."]  # your raw document strings
inputs = tokenizer(texts, return_tensors="pt", padding=True,
                   truncation=True, max_length=256).to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# NER predictions for all docs: [batch, seq_len]
ner_preds = outputs.logits.argmax(dim=-1)

# Doc class for all docs: [batch]
hidden = outputs.hidden_states[-1]
mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
doc_preds = doc_head(pooled).argmax(dim=-1)
```
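
To run a whole corpus at the recommended high-throughput operating point (bs=64, seq=256 from the benchmark tables), a simple chunked loop suffices. `classify_corpus` is a hypothetical helper that reuses `tokenizer`, `model`, and `doc_head` from the snippets above:

```python
def classify_corpus(texts, batch_size=64, max_length=256):
    # Chunk the corpus into fixed-size batches and classify each in one pass.
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True, max_length=max_length).to(model.device)
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        hidden = out.hidden_states[-1]
        mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
        results.extend(doc_head(pooled).argmax(dim=-1).tolist())
    return results
```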

## Example Outputs

| Input | PII Detected | Category (confidence) |
|-------|-------------|----------------------|
| "My name is John Smith... email john@example.com" | ✅ John Smith, john@example.com, 123 Main St | Computers & Internet (56%) |
| "Liverpool FC defeated Manchester City 3-1" | ❌ None | **Sports (98%)** |
| "Federal Reserve announced a rate cut" | ❌ None | **Politics (52%)** |
| "health benefits of meditation and yoga" | ❌ None | **Health (38%)** |
| "Patient Jane Doe (SSN: 123-45-6789)" | ✅ Jane Doe, 123-45-6789, jane.doe@hospital.com | Education (41%) |
| "learn programming? I want to learn Python" | ❌ None | **Education (53%)** |
| "legal to record phone calls in California?" | ❌ None | **Politics (64%)** |

## Files

| File | Size | Description |
|------|------|-------------|
| `model.safetensors` | 2.6 GB | Backbone + NER head (1.4B MoE params) |
| `doc_head.pt` | 26 KB | Document classification head (640→10) |
| `config.json` | 3 KB | Model architecture config |
| `tokenizer.json` | 27 MB | BPE tokenizer (o200k_base) |
| `multitask_config.json` | 349 B | Multi-task metadata |