---
license: apache-2.0
base_model: openai/privacy-filter
tags:
- token-classification
- text-classification
- multi-task
- pii-detection
- document-classification
- privacy
datasets:
- ai4privacy/pii-masking-400k
- community-datasets/yahoo_answers_topics
metrics:
- f1
- accuracy
model-index:
- name: privacy-filter-multitask
  results:
  - task:
      type: token-classification
      name: PII Detection (NER)
    dataset:
      name: ai4privacy/pii-masking-400k
      type: ai4privacy/pii-masking-400k
    metrics:
    - type: f1
      value: 0.4925
      name: F1 (strict span-level)
    - type: precision
      value: 0.6968
    - type: recall
      value: 0.3809
  - task:
      type: text-classification
      name: Document Classification (10 classes)
    dataset:
      name: yahoo_answers_topics
      type: community-datasets/yahoo_answers_topics
    metrics:
    - type: accuracy
      value: 0.4776
      name: Test Accuracy
---

# Privacy Filter Multi-Task 🔒📄

A **single model** for simultaneous **PII Detection (NER)** and **Document Classification (10 categories)**.

Adapted from [openai/privacy-filter](https://huggingface.co/openai/privacy-filter) — a 1.4B Sparse MoE transformer with only ~50M active parameters per token.

## Architecture

```
Input → BPE Tokenizer (o200k_base, 200K vocab)
          ↓
8-layer Sparse MoE Transformer
  • 128 experts, top-4 routing (~50M active params/token)
  • Banded sliding-window attention (window=128)
  • GQA: 14 query heads, 2 KV heads, head_dim=64
  • Hidden size: 640
       ↓                        ↓
NER Head (640→33)    Doc Head (mean-pool → 640→10)
       ↓                        ↓
BIOES PII tags       10-class document category
```
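
The two heads share one backbone pass: the NER head maps each token's hidden state to 33 BIOES logits, while the doc head mean-pools the hidden states (ignoring padding) before a 640→10 linear layer. A shape-level sketch with randomly initialized stand-in layers, using only the dimensions from the diagram above (not the trained weights):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size, n_ner_labels, n_classes = 640, 33, 10

# Stand-in for backbone output: [batch, seq, hidden]
hidden = torch.randn(2, 16, hidden_size)
mask = torch.ones(2, 16)
mask[1, 10:] = 0                          # second doc is padded after 10 tokens

ner_head = nn.Linear(hidden_size, n_ner_labels)
doc_head = nn.Linear(hidden_size, n_classes)

ner_logits = ner_head(hidden)             # [2, 16, 33]: one BIOES tag per token
m = mask.unsqueeze(-1)
pooled = (hidden * m).sum(1) / m.sum(1).clamp(min=1)   # mask-aware mean pool
doc_logits = doc_head(pooled)             # [2, 10]: one category per document

print(ner_logits.shape, doc_logits.shape)
```

Both outputs come from the same forward pass, which is what makes the single-call multi-task inference below possible.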

## Results

### PII Detection (NER)

| Metric | Value |
|--------|-------|
| **F1 (strict span-level)** | **0.493** |
| Precision | 0.697 |
| Recall | 0.381 |
| Token Accuracy | 0.944 |

8 entity types: `private_person` · `private_email` · `private_phone` · `private_address` · `private_date` · `private_url` · `account_number` · `secret`
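Strict span-level F1 only credits a prediction when its decoded span matches a gold span exactly, which is why it sits far below token accuracy (0.944). A minimal, illustrative BIOES decoder (the `decode_bioes` helper is not part of this repo):

```python
def decode_bioes(tokens, tags):
    """Collapse per-token BIOES tags into (entity_type, text) spans.

    Under strict matching, a span counts only if the full B-I-...-E
    (or lone S-) sequence is well formed with a consistent type.
    """
    spans, start, ent = [], None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":                        # single-token entity
            spans.append((label, tokens[i]))
            start, ent = None, None
        elif prefix == "B":                      # open a new span
            start, ent = i, label
        elif prefix == "E" and start is not None and label == ent:
            spans.append((label, " ".join(tokens[start:i + 1])))
            start, ent = None, None
        elif prefix == "I" and label == ent:
            continue                             # extend the open span
        else:                                    # "O" or a malformed sequence
            start, ent = None, None
    return spans

tokens = ["Hi", "John", "Smith", "at", "j@x.com"]
tags = ["O", "B-private_person", "E-private_person", "O", "S-private_email"]
print(decode_bioes(tokens, tags))
# [('private_person', 'John Smith'), ('private_email', 'j@x.com')]
```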

### Document Classification (10 classes)

| Split | Accuracy |
|-------|----------|
| Val | 0.470 |
| **Test** | **0.478** |

Per-class test accuracy:

| Category | Accuracy |
|----------|----------|
| Computers & Internet | 0.688 |
| Family & Relationships | 0.615 |
| Science & Mathematics | 0.556 |
| Health | 0.524 |
| Sports | 0.523 |
| Politics & Government | 0.493 |
| Entertainment & Music | 0.444 |
| Society & Culture | 0.363 |
| Education & Reference | 0.310 |
| Business & Finance | 0.263 |

---

## 🚀 Production Inference Guide

All numbers below were measured on real hardware with both task heads (NER + doc classification) executing on every call: a single forward pass produces PII entity tags **and** the document category simultaneously.

### Resource Requirements

| Resource | Value |
|----------|-------|
| Model weights (bf16) | **2.8 GB** GPU VRAM / RAM |
| Model weights (fp32) | **5.6 GB** RAM |
| ONNX variants available upstream | fp16, int8, q4 (see [openai/privacy-filter](https://huggingface.co/openai/privacy-filter/tree/main/onnx)) |
| Min GPU VRAM (bs=1, seq≤512) | **2.9 GB** |
| Min GPU VRAM (bs=64, seq=512) | **6.2 GB** |
| Fits on | T4 (16 GB), L4 (24 GB), A10G (24 GB), A100, any ≥8 GB GPU |

### GPU — Single-Document Latency (NVIDIA A10G, bf16)

Time from raw text to both NER tags + document category:

| Sequence Length | Latency (mean) | Latency (p95) | Latency (p99) |
|:-:|:-:|:-:|:-:|
| 64 tokens | 113 ms | 117 ms | 122 ms |
| 128 tokens | 106 ms | 110 ms | 115 ms |
| 256 tokens | 106 ms | 111 ms | 113 ms |
| 512 tokens | 106 ms | 113 ms | 116 ms |

> Latency is dominated by a fixed ~105 ms kernel-launch overhead from the Sparse MoE routing — it barely changes with sequence length up to 512 tokens.

### GPU — Batched Throughput (NVIDIA A10G, bf16)

| Batch Size | Seq 64 | Seq 128 | Seq 256 | Seq 512 |
|:-:|:-:|:-:|:-:|:-:|
| **1** | 8.9 docs/s | 9.4 docs/s | 9.4 docs/s | 9.4 docs/s |
| **4** | 36 docs/s | 37 docs/s | 37 docs/s | 32 docs/s |
| **8** | 73 docs/s | 73 docs/s | 69 docs/s | 53 docs/s |
| **16** | 139 docs/s | 138 docs/s | 114 docs/s | 73 docs/s |
| **32** | 265 docs/s | 238 docs/s | 165 docs/s | 89 docs/s |
| **64** | **460 docs/s** | **348 docs/s** | **207 docs/s** | **101 docs/s** |

### GPU — Batched Latency Detail (NVIDIA A10G, bf16)

<details>
<summary>Full latency table (click to expand)</summary>

| Batch | Seq Len | Batch Latency (ms) | Per-Doc (ms) | p95 (ms) | p99 (ms) |
|:-:|:-:|:-:|:-:|:-:|:-:|
| 1 | 64 | 113 | 112.7 | 117 | 122 |
| 4 | 64 | 111 | 27.8 | 116 | 118 |
| 8 | 64 | 110 | 13.8 | 114 | 126 |
| 16 | 64 | 115 | 7.2 | 121 | 125 |
| 32 | 64 | 121 | 3.8 | 127 | 135 |
| 64 | 64 | 139 | 2.2 | 144 | 144 |
| 1 | 128 | 106 | 105.9 | 110 | 115 |
| 4 | 128 | 107 | 26.9 | 112 | 115 |
| 8 | 128 | 110 | 13.7 | 115 | 116 |
| 16 | 128 | 116 | 7.3 | 121 | 128 |
| 32 | 128 | 134 | 4.2 | 139 | 143 |
| 64 | 128 | 184 | 2.9 | 189 | 191 |
| 1 | 256 | 106 | 106.1 | 111 | 113 |
| 4 | 256 | 109 | 27.2 | 114 | 115 |
| 8 | 256 | 117 | 14.6 | 123 | 126 |
| 16 | 256 | 140 | 8.8 | 145 | 147 |
| 32 | 256 | 194 | 6.1 | 199 | 202 |
| 64 | 256 | 309 | 4.8 | 314 | 315 |
| 1 | 512 | 106 | 106.5 | 113 | 116 |
| 4 | 512 | 125 | 31.2 | 129 | 130 |
| 8 | 512 | 152 | 19.0 | 158 | 165 |
| 16 | 512 | 219 | 13.7 | 223 | 225 |
| 32 | 512 | 358 | 11.2 | 361 | 364 |
| 64 | 512 | 636 | 9.9 | 639 | 641 |

</details>

### GPU — Peak VRAM Usage (bf16)

| Batch Size | Seq 128 | Seq 256 | Seq 512 |
|:-:|:-:|:-:|:-:|
| 1 | 2,817 MB | 2,824 MB | 2,862 MB |
| 8 | 2,857 MB | 2,936 MB | 3,237 MB |
| 32 | 3,000 MB | 3,309 MB | 4,522 MB |
| 64 | 3,189 MB | 3,809 MB | **6,236 MB** |

> The model is extremely memory-efficient. Even at batch=64, seq=512, it uses only 6.2 GB — comfortably fitting on a T4 (16 GB). This is because the Sparse MoE activates only 4 of its 128 experts per token.

### CPU — Latency & Throughput (AMD EPYC 7R32, 8 cores, fp32)

| Batch | Seq 64 | Seq 128 | Seq 256 | Seq 512 |
|:-:|:-:|:-:|:-:|:-:|
| **1** | 152 ms (6.6/s) | 193 ms (5.2/s) | 302 ms (3.3/s) | 569 ms (1.8/s) |
| **4** | 278 ms (14.4/s) | 468 ms (8.6/s) | 935 ms (4.3/s) | 2,464 ms (1.6/s) |
| **8** | 467 ms (17.1/s) | 862 ms (9.3/s) | 1,728 ms (4.6/s) | 4,745 ms (1.7/s) |
| **16** | 837 ms (19.1/s) | 1,624 ms (9.9/s) | 3,814 ms (4.2/s) | 9,143 ms (1.7/s) |

> On CPU the model runs at ~152 ms/doc for short texts (seq=64, bs=1) — suitable for low-volume or batch-offline pipelines.

### Daily Throughput Projections

Sustained throughput for a **single device**, running 24/7 at the optimal batch size:

| Sequence Length | GPU (A10G, bf16) | CPU (8-core, fp32) |
|:-:|:-:|:-:|
| 64 tokens | **39.8M docs/day** (460/s, bs=64) | 1.7M docs/day (19/s, bs=16) |
| 128 tokens | **30.1M docs/day** (348/s, bs=64) | 855K docs/day (10/s, bs=16) |
| 256 tokens | **17.9M docs/day** (207/s, bs=64) | 397K docs/day (4.6/s, bs=8) |
| 512 tokens | **8.7M docs/day** (101/s, bs=64) | 156K docs/day (1.8/s, bs=1) |
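These projections are straight multiplication of sustained docs/s by 86,400 seconds, with no headroom for failures, restarts, or load imbalance, so derate for your own SLOs. The arithmetic (small differences from the table come from rounding of the measured rates):

```python
SECONDS_PER_DAY = 86_400

def docs_per_day(docs_per_sec: float) -> float:
    """Project sustained per-second throughput to a 24/7 daily total."""
    return docs_per_sec * SECONDS_PER_DAY

# A10G, bs=64: the four measured sequence lengths from the table above
for seq, rate in [(64, 460), (128, 348), (256, 207), (512, 101)]:
    print(f"seq={seq}: {docs_per_day(rate) / 1e6:.1f}M docs/day")
```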

#### Multi-GPU Scaling Estimates

| Config | seq=128 | seq=256 | seq=512 |
|--------|:-:|:-:|:-:|
| 1× A10G (24 GB, ~$1/hr) | 30M/day | 18M/day | 8.7M/day |
| 1× A100 (80 GB, ~$3/hr) | ~70M/day¹ | ~42M/day¹ | ~20M/day¹ |
| 4× A10G data-parallel | 120M/day | 72M/day | 35M/day |
| 8× A10G data-parallel | 240M/day | 143M/day | 70M/day |

<sub>¹ A100 estimates are linearly extrapolated from A10G numbers using the A100's ~2.3× higher memory bandwidth and larger batch capacity. Actual numbers will vary — benchmark on your target hardware.</sub>

### Serving Recommendations

| Deployment Scenario | Recommended Config | Expected Perf |
|---|---|---|
| **Real-time API** (SLA <200 ms) | 1× GPU, bs=1, seq≤512 | ~106 ms p50, ~113 ms p95 |
| **Near-real-time** (SLA <500 ms) | 1× GPU, bs=8–16, seq≤512 | 53–73 docs/s, p95 <225 ms |
| **High-throughput batch** | 1× GPU, bs=64, seq=256 | 207 docs/s, 17.9M/day |
| **Max throughput batch** | 1× GPU, bs=64, seq=64² | 460 docs/s, 39.8M/day |
| **CPU offline / dev** | CPU, bs=1, seq≤256 | 3–7 docs/s |

<sub>² At seq=64 most documents will be truncated. Use seq=128–256 for a production balance.</sub>

**Key observations:**
- The model has a **fixed ~105 ms overhead** per forward pass regardless of sequence length (MoE routing + expert dispatch). Batching amortizes this cost across documents — the per-doc cost drops from 106 ms (bs=1) to under 10 ms (bs=64).
- **Memory is not the bottleneck** — even at bs=64/seq=512 the model uses only 6.2 GB. You can run this on a T4 (16 GB) with room to spare.
- **Optimal batch size for throughput**: bs=64 for all sequence lengths on A10G.
- **Optimal batch size for latency-constrained serving**: bs=8–16 gives good per-doc latency (13–19 ms) while keeping batch latency under 225 ms.
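The fixed-overhead behavior fits a simple cost model: batch latency ≈ fixed overhead + batch size × marginal per-doc cost. A rough illustration, where the ~105 ms overhead is from the measurements above but the 8.4 ms marginal cost is an approximate fit to the seq=512 column, not a measured value:

```python
def per_doc_latency_ms(batch_size: int,
                       fixed_overhead_ms: float = 105.0,
                       marginal_ms_per_doc: float = 8.4) -> float:
    """Amortized per-document latency under a fixed-overhead cost model.

    Batch latency is modeled as overhead + batch_size * marginal cost,
    so the per-doc cost approaches marginal_ms_per_doc for large batches.
    """
    total = fixed_overhead_ms + batch_size * marginal_ms_per_doc
    return total / batch_size

for bs in (1, 8, 64):
    print(f"bs={bs}: {per_doc_latency_ms(bs):.1f} ms/doc")
```

At bs=1 the model predicts ~113 ms/doc and at bs=64 ~10 ms/doc, close to the measured seq=512 numbers (106.5 and 9.9 ms), which is why bs=8–16 is the sweet spot when a per-request latency SLA still applies.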

---

## Training Strategy

Two-phase training approach:

1. **Phase 1 — Multi-task fine-tuning**: Unfroze the last 4 MoE layers plus both task heads. Trained on 20K NER examples (ai4privacy) + 20K doc examples (Yahoo Answers) with a multi-task loss (NER×1.0 + Doc×0.5), 2 epochs, LR=2e-5.

2. **Phase 2 — Doc head retraining** (head-only): Froze the entire backbone + NER head. Pre-computed 640-dim pooled features for 100K Yahoo Answers examples, then trained a fresh `Linear(640→10)` classifier for 10 epochs, LR=1e-3, cosine decay. This approach:
   - Preserves NER performance exactly (backbone untouched)
   - Is extremely fast (~seconds per epoch on cached features)
   - Achieves **47.8% test accuracy** (up from 24.8% in phase 1)
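Phase 2 reduces to ordinary supervised training of one linear layer over cached features. A minimal sketch of that loop, with random stand-in features in place of the real pooled activations (hyperparameters follow the description above; everything else is illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-ins for the cached 640-dim pooled features and their labels.
# In the real phase 2 these come from one frozen-backbone pass over 100K docs.
features = torch.randn(1024, 640)
labels = torch.randint(0, 10, (1024,))

head = nn.Linear(640, 10)
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):                      # epochs over cached features are cheap
    for i in range(0, len(features), 256):   # simple mini-batching
        logits = head(features[i:i + 256])
        loss = loss_fn(logits, labels[i:i + 256])
        opt.zero_grad()
        loss.backward()
        opt.step()
    sched.step()                             # cosine LR decay once per epoch

torch.save(head.state_dict(), "doc_head.pt")  # what ships in this repo
```

Because the backbone never sees a gradient, the NER metrics from phase 1 are preserved bit-for-bit.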

## Usage

```python
import torch
import torch.nn as nn
from transformers import AutoModelForTokenClassification, AutoTokenizer
from huggingface_hub import hf_hub_download

# Load model + tokenizer
tokenizer = AutoTokenizer.from_pretrained("binga/privacy-filter-multitask")
model = AutoModelForTokenClassification.from_pretrained(
    "binga/privacy-filter-multitask", dtype=torch.bfloat16, device_map="auto"
)

# Load the document classification head
doc_head = nn.Linear(640, 10)
doc_head.load_state_dict(torch.load(
    hf_hub_download("binga/privacy-filter-multitask", "doc_head.pt"),
    weights_only=True, map_location=model.device
))
doc_head = doc_head.to(dtype=torch.bfloat16, device=model.device)
doc_head.eval()

# Inference
text = "John Smith (SSN: 123-45-6789) emailed john@corp.com about Q3 earnings."
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# === PII Detection ===
print("PII entities:")
for tok, pred in zip(
    tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
    outputs.logits.argmax(-1)[0]
):
    label = model.config.id2label[pred.item()]
    if label != "O":
        print(f"  {tok} → {label}")

# === Document Classification ===
categories = [
    "Society & Culture", "Science & Mathematics", "Health",
    "Education & Reference", "Computers & Internet", "Sports",
    "Business & Finance", "Entertainment & Music",
    "Family & Relationships", "Politics & Government"
]
hidden = outputs.hidden_states[-1]
mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
probs = torch.softmax(doc_head(pooled)[0].float(), dim=-1)
top = probs.argmax().item()
print(f"\nCategory: {categories[top]} ({probs[top]:.1%})")
```

### Batched Inference (Production)

```python
# Process a batch of documents — both tasks in a single forward pass
texts = ["doc1...", "doc2...", "doc3..."]  # your documents
inputs = tokenizer(texts, return_tensors="pt", padding=True,
                   truncation=True, max_length=256).to(model.device)

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# NER predictions for all docs: [batch, seq_len]
ner_preds = outputs.logits.argmax(dim=-1)

# Doc class for all docs: [batch]
hidden = outputs.hidden_states[-1]
mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
doc_preds = doc_head(pooled).argmax(dim=-1)
```

## Example Outputs

| Input | PII Detected | Category (confidence) |
|-------|-------------|----------------------|
| "My name is John Smith... email john@example.com" | ✅ John Smith, john@example.com, 123 Main St | Computers & Internet (56%) |
| "Liverpool FC defeated Manchester City 3-1" | ❌ None | **Sports (98%)** |
| "Federal Reserve announced a rate cut" | ❌ None | **Politics (52%)** |
| "health benefits of meditation and yoga" | ❌ None | **Health (38%)** |
| "Patient Jane Doe (SSN: 123-45-6789)" | ✅ Jane Doe, 123-45-6789, jane.doe@hospital.com | Education (41%) |
| "learn programming? I want to learn Python" | ❌ None | **Education (53%)** |
| "legal to record phone calls in California?" | ❌ None | **Politics (64%)** |

## Files

| File | Size | Description |
|------|------|-------------|
| `model.safetensors` | 2.6 GB | Backbone + NER head (1.4B MoE params) |
| `doc_head.pt` | 26 KB | Document classification head (640→10) |
| `config.json` | 3 KB | Model architecture config |
| `tokenizer.json` | 27 MB | BPE tokenizer (o200k_base) |
| `multitask_config.json` | 349 B | Multi-task metadata |
|