---
license: apache-2.0
base_model: openai/privacy-filter
language:
- fr
library_name: transformers
pipeline_tag: token-classification
tags:
- pii
- privacy
- token-classification
- ner
- bioes
- french
- insurance
- crm
datasets:
- ai4privacy/open-pii-masking-500k-ai4privacy
metrics:
- f1
- precision
- recall
widget:
- text: "Bonjour, je m'appelle Alice Dupont et mon email est alice@acme.fr"
- text: "Mon IBAN est FR76 3000 4000 0312 3456 7890 143 et mon téléphone le 06 12 34 56 78."
- text: "Le sinistre N°2024-FR-98341 concerne M. Jean-Baptiste Leclerc, né le 1987-05-12, au 15 rue de Rivoli 75001 Paris."
model-index:
- name: openai-privacy-filter-fr
results:
- task:
type: token-classification
name: PII span detection (French)
dataset:
name: ai4privacy/open-pii-masking-500k-ai4privacy (French slice, held-out test)
type: ai4privacy/open-pii-masking-500k-ai4privacy
config: fr
split: test
metrics:
- type: f1
value: 0.9522
name: Overall span-F1 (BIOES strict)
- type: precision
value: 0.9600
name: Overall precision
- type: recall
value: 0.9446
name: Overall recall
---
# openai-privacy-filter-fr
French fine-tune of [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) for PII detection in a **French insurance CRM** context (policyholder names, contact details, IBAN/RIB, addresses, dates of birth, claim / contract identifiers).
Full fine-tuning with AdamW 8-bit (bitsandbytes), bf16 autocast, on a single RTX 4090.
---
## Results (held-out French test set, 1 035 examples)
| Metric | Zero-shot baseline (`openai/privacy-filter`) | This model | Δ |
|---|---:|---:|---:|
| **Overall span-F1 (BIOES strict)** | 0.7068 | **0.9522** | **+0.2454 (+34.7 %)** |
| Precision | 0.8037 | 0.9600 | +0.1563 |
| Recall | 0.6308 | 0.9446 | +0.3138 |
### Per-class F1 (span-level, strict BIOES)
| Class | Baseline | This model | Δ |
|---|---:|---:|---:|
| `private_email` | 0.960 | **1.000** | +0.040 |
| `private_phone` | 0.870 | **1.000** | +0.130 |
| `private_date` | 0.652 | **0.997** | +0.345 |
| `account_number` | 0.874 | **0.995** | +0.121 |
| `private_person` | 0.683 | 0.931 | +0.248 |
| `private_address` | 0.428 | 0.906 | **+0.478** |
*(`private_url` and `secret` classes are preserved from the base model but not present in this test set, so not reported.)*
---
## Intended use
Designed for **French-language on-premises PII redaction** in enterprise flows: emails, chat logs, CRM notes, claim reports, scanned document transcripts. Primary target: insurance back-office (souscripteurs, sinistres), but the label set is generic enough for banking, healthcare admin, HR, and customer support.
**Not suitable for:**
- Languages other than French (use the base model or retrain for your target language).
- Content with no training-time analogue (e.g. medical free-text, legal case citations).
- Final anonymisation guarantee — always combine with rule-based recognisers (Presidio) and human review for high-sensitivity workflows.
---
## Label schema
Same as the base model: 33 classes = `O` + 8 entity types × 4 BIOES boundary tags.
| Entity type | Covers |
|---|---|
| `private_person` | Policyholder names, usernames, titles (M., Mme., Dr.). |
| `private_email` | Personal email addresses. |
| `private_phone` | Phone numbers (mobile / landline / fax). |
| `private_address` | Street, building number, city, ZIP, state/country. |
| `account_number` | IBAN/RIB, credit card, BIC/SWIFT, customer/contract IDs, ID card, passport, tax and social numbers. |
| `private_date` | DOB, birth year, date/time references tied to a person. |
| `private_url` | Personal URLs / IP addresses. *(preserved from base model; not retrained)* |
| `secret` | API keys, passwords, tokens. *(preserved from base model; not retrained)* |
Inference returns subword-level BIOES tags that the HuggingFace `token-classification` pipeline aggregates into spans.
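The authoritative id ↔ label mapping is the one in `config.json`; as a purely illustrative sketch of how the 8 entity types expand into the 33-label BIOES set (index order and prefix order are assumptions here):

```python
# Illustrative only: the real id -> label mapping is the one shipped in config.json.
ENTITY_TYPES = [
    "private_person", "private_email", "private_phone", "private_address",
    "account_number", "private_date", "private_url", "secret",
]

labels = ["O"] + [
    f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I", "E", "S")
]
assert len(labels) == 33  # O + 8 entity types x 4 BIOES boundary tags
```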
---
## How to use
```python
from transformers import pipeline

nlp = pipeline(
    task="token-classification",
    model="YLOD/openai-privacy-filter-fr",
    aggregation_strategy="simple",
)

text = (
    "Bonjour, je suis Alice Dupont, née le 1987-05-12. "
    "Mon email : alice.dupont@acme.fr, mobile 06 12 34 56 78. "
    "IBAN : FR76 3000 4000 0312 3456 7890 143."
)

for span in nlp(text):
    print(f"[{span['entity_group']:>16}] {span['word']!r} ({span['score']:.3f})")
```
For a ready-to-use masker that merges adjacent subword spans correctly, see the [demo script in the GitHub repo](https://github.com/autoresearch-demo/privacy-filter-fr) (if published) or reuse the `merge_spans` helper from the training code.
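If you only need redaction rather than span inspection, here is a minimal masking sketch built on the pipeline's character offsets. It is an illustration only (not the repo's `merge_spans` helper), and the 0.5 score threshold is an arbitrary example value:

```python
from transformers import pipeline

nlp = pipeline(
    "token-classification",
    model="YLOD/openai-privacy-filter-fr",
    aggregation_strategy="simple",
)

def mask_pii(text: str, score_threshold: float = 0.5) -> str:
    """Replace every detected span with a [LABEL] placeholder.

    Spans are applied right-to-left so earlier character offsets stay valid.
    """
    spans = [s for s in nlp(text) if s["score"] >= score_threshold]
    for s in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[: s["start"]] + f"[{s['entity_group'].upper()}]" + text[s["end"]:]
    return text

print(mask_pii("Bonjour, je suis Alice Dupont et mon email est alice@acme.fr"))
```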
---
## Training details
### Data
- **Source**: [`ai4privacy/open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy) (CC-BY-4.0)
- **Language filter**: `language == "fr"` → 89 670 examples available; **10 005 / 460 / 1 035** train / validation / test (seed 42)
- **Label mapping**: 60+ source classes collapsed into the 8-class privacy-filter taxonomy (`FIRSTNAME / LASTNAME / GIVENNAME / SURNAME / TITLE → private_person`, `TELEPHONENUM / PHONEIMEI → private_phone`, `BUILDINGNUM / CITY / ZIPCODE / STREET → private_address`, `IBAN / IDCARDNUM / PASSPORTNUM / TAXNUM / SOCIALNUM → account_number`, etc.).
- **Alignment**: char-offset spans aligned to subword tokens with strict BIOES at the subword level (first / middle / last subwords of a span get `B-` / `I-` / `E-`, singletons get `S-`). Whitespace-only subwords inside a span inherit `I-` to bridge IBAN-like groups.
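For concreteness, a partial sketch of the label collapse and the per-span BIOES assignment described in the two bullets above. Only the mappings named above are listed (the full table covers the 60+ source classes and lives in the training code), and the helper is illustrative rather than the training script's implementation:

```python
# Partial mapping -- only the pairs named above; remaining source classes elided.
SOURCE_TO_TARGET = {
    "FIRSTNAME": "private_person", "LASTNAME": "private_person",
    "GIVENNAME": "private_person", "SURNAME": "private_person", "TITLE": "private_person",
    "TELEPHONENUM": "private_phone", "PHONEIMEI": "private_phone",
    "BUILDINGNUM": "private_address", "CITY": "private_address",
    "ZIPCODE": "private_address", "STREET": "private_address",
    "IBAN": "account_number", "IDCARDNUM": "account_number",
    "PASSPORTNUM": "account_number", "TAXNUM": "account_number",
    "SOCIALNUM": "account_number",
    # ...
}

def bioes_tags(entity: str, n_subwords: int) -> list[str]:
    """BIOES tags for the n subword tokens covered by one entity span."""
    if n_subwords == 1:
        return [f"S-{entity}"]
    return [f"B-{entity}"] + [f"I-{entity}"] * (n_subwords - 2) + [f"E-{entity}"]
```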
### Hyperparameters (optimum found via an 8-iteration autoresearch sweep)
| | Value |
|---|---|
| Base checkpoint | `openai/privacy-filter` (1.4 B params total, 50 M active — MoE) |
| Strategy | **Full fine-tuning** (all 1.4 B params trainable) |
| Optimizer | **AdamW 8-bit** (bitsandbytes) |
| Learning rate | **2 × 10⁻⁴** |
| Batch size | 16 × grad-accum 2 = effective 32 |
| Epochs | 2 |
| Warmup ratio | 0.03 |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Scheduler | cosine |
| Precision | bf16 autocast (fp32 master weights) |
| Gradient checkpointing | disabled (short sequences, ~30 tokens median) |
| Seed | 0 |
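These settings map directly onto the standard `transformers` Trainer API. A minimal sketch, assuming the run used `TrainingArguments` with the bitsandbytes 8-bit AdamW exposed as `optim="adamw_bnb_8bit"` (the actual training script may differ):

```python
from transformers import TrainingArguments

# Sketch of the reported configuration, assuming the standard Trainer API.
args = TrainingArguments(
    output_dir="out/privacy-filter-fr",   # hypothetical path
    learning_rate=2e-4,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,        # effective batch size 32
    num_train_epochs=2,
    warmup_ratio=0.03,
    weight_decay=0.01,
    max_grad_norm=1.0,
    lr_scheduler_type="cosine",
    optim="adamw_bnb_8bit",               # AdamW 8-bit (bitsandbytes)
    bf16=True,                            # bf16 autocast, fp32 master weights
    gradient_checkpointing=False,         # short sequences (~30 tokens median)
    seed=0,
)
```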
### Hardware
- Single **NVIDIA RTX 4090 (24 GB)** capped at **225 W**
- WSL2 on Windows, PyTorch 2.11 + CUDA 13.0, transformers 5.6
- ~25 minutes wall-clock per training run
### Noise floor
Seed-to-seed variance on the same config ≈ **±0.003 F1** (measured with seeds 0 and 1). Gains smaller than that are not meaningful.
### Autoresearch iteration summary
| # | Change | Test F1 | Δ vs baseline | Outcome |
|---|---|---:|---:|---|
| 0 | — (zero-shot baseline) | 0.7068 | — | baseline |
| 1 | LR 1e-5 → 3e-4 | 0.9473 | +0.2405 | keep |
| **2** | **LR 3e-4 → 2e-4** | **0.9522** | **+0.2454** | **keep (best)** |
| 3 | LR 2e-4 → 1e-4 | 0.9356 | +0.2288 | discard |
| 4 | Epochs 2 → 3 | 0.9500 | +0.2432 | discard |
| 5 | Warmup 0.03 → 0.10 | 0.9382 | +0.2314 | discard |
| 6 | Weight decay 0.01 → 0.1 | 0.9492 | +0.2424 | discard |
| 7 | Seed 0 → 1 (noise check) | 0.9491 | +0.2423 | discard |
---
## Limitations & ethical considerations
- **No privacy guarantee.** ML-based PII detection can miss uncommon formats, aliased references, adversarial spacing, or novel identifier types. This model should always be paired with regex-based recognisers and human review for high-sensitivity outputs.
- **French-only distribution shift.** Trained on French data only; performance on other languages will regress sharply from the base model baseline.
- **Synthetic data bias.** ai4privacy is largely template-generated. Real-world free-text (handwritten claim descriptions, casual customer emails) may be underrepresented. A domain-specific holdout from your actual CRM is essential before production deployment.
- **`private_address` is the weakest class (F1 0.91).** Ambiguous short addresses (single street name, abbreviations, PO boxes) are the main failure mode.
- **`private_url` and `secret` were not retrained.** Their behaviour is inherited from the base model; if these matter in your domain, run a follow-up fine-tune that includes them.
- **Label collisions.** When a token plausibly belongs to two classes (e.g. a phone number embedded in an address block), the model picks one; span splitting is not guaranteed to follow human intuition.
- **Not suitable for medical, legal or regulated decision-making** without explicit compliance review.
## ONNX quantized variants
In addition to the default PyTorch `model.safetensors`, this repository ships four
ONNX variants under `onnx/`, benchmarked on the same French test set (1 035 examples).
The ONNX graph reuses OpenAI's base-model export (which correctly handles the MoE
routing and attention sinks) with the fine-tuned weights swapped in. INT8 and INT4
variants combine standard `quantize_dynamic` / `MatMulNBitsQuantizer` on the MatMul
nodes **with** a custom `MoE → QMoE` conversion of the expert tensors (block-symmetric,
block_size=32), so the MoE experts — which hold ~90 % of the parameters — are also
quantized.
| Variant | F1 | Δ F1 | Precision | Recall | File size | Compression | CPU latency p50 | CPU latency p95 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| **PyTorch (transformers)** | 0.9522 | — | 0.9595 | 0.9451 | 2.80 GB | 2.0× | 99.5 ms | 144.8 ms |
| **ONNX fp32** (`onnx/model.onnx`) | 0.9522 | 0.0000 | 0.9595 | 0.9451 | 5.63 GB | 1.0× | **38.4 ms** | 53.0 ms |
| **ONNX fp16** (`onnx/model_fp16.onnx`) | 0.9517 | -0.0005 | 0.9584 | 0.9451 | 2.82 GB | 2.0× | 39.3 ms | 54.6 ms |
| **ONNX INT8** (`onnx/model_int8.onnx`) | 0.9516 | -0.0006 | 0.9605 | 0.9429 | **1.60 GB** | **3.5×** | 52.7 ms | 68.0 ms |
| **ONNX INT4** (`onnx/model_int4.onnx`) | 0.9509 | -0.0013 | 0.9573 | 0.9446 | **1.35 GB** | **4.2×** | 344.8 ms ⚠ | 428.0 ms |
*Benchmark setup: 1 035 FR test examples, batch size 1, single-threaded ONNX Runtime
(CPU provider) on an AMD Ryzen-class CPU under WSL2. Compression is relative to the
ONNX fp32 export (5.63 GB). Peak-memory readings via `ru_maxrss` sit around ~9 GB for
every variant because ORT memory-maps the external-data files instead of loading them
fully, so `ru_maxrss` is not a reliable measure of the true resident set for the
mmapped weights.*
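The latency figures can be reproduced with a simple single-threaded timing loop. A sketch under the assumptions already used in this card (the `logits` output name, the fine-tuned tokenizer, and `np.percentile` as the p50/p95 estimator):

```python
import time
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

opts = ort.SessionOptions()
opts.intra_op_num_threads = 1  # single-threaded, as in the table above
sess = ort.InferenceSession("onnx/model_int8.onnx", opts, providers=["CPUExecutionProvider"])
tok = AutoTokenizer.from_pretrained("YLOD/openai-privacy-filter-fr")

def latency_ms(texts):
    """Return (p50, p95) per-example latency in milliseconds, batch size 1."""
    times = []
    for text in texts:
        enc = tok(text, return_tensors="np")
        feed = {"input_ids": enc["input_ids"].astype(np.int64),
                "attention_mask": enc["attention_mask"].astype(np.int64)}
        t0 = time.perf_counter()
        sess.run(["logits"], feed)
        times.append((time.perf_counter() - t0) * 1e3)
    return np.percentile(times, 50), np.percentile(times, 95)
```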
### Key findings
- **ONNX fp32 is 2.6× faster than PyTorch** at identical precision and F1 (graph
optimisation, fused ops, no Python-side MoE loop).
- **INT8 is the practical sweet spot on CPU**: 3.5× smaller than the fp32 ONNX export
  (1.60 GB vs 5.63 GB) and well below the 2.80 GB PyTorch checkpoint, with F1 unchanged
  within the training noise floor (±0.003), and still ~2× faster than PyTorch.
- **INT4 gives the smallest footprint** (1.35 GB, 4.2× compression) with a
negligible F1 loss, but the CPU `QMoE` kernel for int4 is not as optimized as
its int8 counterpart — **expect ~9× slowdown on CPU**. INT4 is best suited for
GPU inference or specialized runtimes (CUDA, OpenVINO, WebGPU via Transformers.js)
where the int4 dequant path is kernel-fused.
- All four variants are within the training noise floor (±0.003) on overall F1,
so pick based on the target runtime and memory budget.
### GPU (CUDA) benchmark (RTX 4090, ONNX Runtime 1.25 CUDA EP)
| Variant | F1 | Size | CUDA latency p50 | CUDA latency p95 | Notes |
|---|---:|---:|---:|---:|---|
| **ONNX fp32** | 0.9522 | 5.63 GB | **5.0 ms** | 21.5 ms | MoE CUDA kernel (FasterTransformer) — fastest |
| ONNX fp16 | — | 2.82 GB | fail | — | MoE FT kernel templated for SM80, fails on Ada (SM89) in ORT 1.25 |
| ONNX INT8 | 0.9508 | 1.60 GB | 68.7 ms | 89.4 ms | QMoE CUDA int8 kernel currently unoptimized |
| ONNX INT4 | 0.9506 | 1.35 GB | 347.1 ms | 426.2 ms | Same — QMoE CUDA int4 path unoptimized in 1.25 |
**GPU takeaway:** the ONNX **fp32** graph benefits from a highly optimized
FasterTransformer-based MoE kernel and reaches **5 ms / example (~200 ex/s)**,
a 7.7× speed-up over the CPU provider. The quantized (`QMoE`) CUDA kernels exist and
run correctly but are currently much slower than the fp32 kernel, so the quantized
variants are **not currently recommended for latency-critical GPU inference**.
Their value on GPU is memory footprint (1.3 – 1.6 GB of VRAM) rather than speed.
Future ORT releases, or TensorRT-LLM / custom kernels, should close this gap.
The fp16 failure on Ada (RTX 4090, SM89) stems from the bundled CUTLASS MoE
GEMM being templated against SM80: a shared-memory check rejects the kernel
at launch. Rebuilding ORT with SM89 kernels, or running on A100/A10/H100,
should restore fp16 MoE support.
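To reproduce the CUDA numbers, the same fp32 graph can be loaded with the CUDA execution provider, with CPU as a fallback. A minimal sketch:

```python
import onnxruntime as ort

sess = ort.InferenceSession(
    "onnx/model.onnx",  # fp32 graph: the fastest variant on GPU per the table above
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # confirm CUDAExecutionProvider is actually active
```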
### Quantization details
- **fp16**: whole-graph float16 cast (`onnxconverter_common.float16.convert_float_to_float16`).
- **INT8** = `quantize_dynamic` (per-channel int8) on regular `MatMul` / `Gemm` nodes
**+** block-symmetric int8 QMoE on the expert tensors (`block_size=32`).
- **INT4** = `MatMulNBitsQuantizer` (4-bit weight-only) on regular `MatMul` / `Gemm`
nodes **+** block-symmetric int4 QMoE on the expert tensors (`block_size=32`,
symmetric, default zero-point 2^(bits-1)).
- Quantization script for the MoE part is included in the training repo as
`training/quantize_moe.py` — the stock ORT quantizers don't crack open the
custom `com.microsoft.MoE` op, so we manually block-quantize `gate_up_proj` and
`down_proj` per expert and rewrite the node to `com.microsoft.QMoE`.
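The stock ORT part of the pipeline looks roughly like the sketch below. Treat it as an illustration only: exact module paths and keyword arguments vary across ONNX Runtime releases, and the custom MoE → QMoE rewrite from `training/quantize_moe.py` is not reproduced here.

```python
import onnx
from onnxconverter_common import float16
from onnxruntime.quantization import QuantType, quantize_dynamic
from onnxruntime.quantization.matmul_nbits_quantizer import MatMulNBitsQuantizer

# fp16: whole-graph float16 cast of the exported fp32 model
fp32 = onnx.load("onnx/model.onnx")
onnx.save(float16.convert_float_to_float16(fp32), "onnx/model_fp16.onnx",
          save_as_external_data=True)

# INT8: dynamic per-channel quantization of regular MatMul / Gemm nodes
quantize_dynamic("onnx/model.onnx", "onnx/model_int8_matmul.onnx",
                 per_channel=True, weight_type=QuantType.QInt8,
                 use_external_data_format=True)

# INT4: 4-bit weight-only quantization of regular MatMul nodes
q4 = MatMulNBitsQuantizer(onnx.load("onnx/model.onnx"), block_size=32, is_symmetric=True)
q4.process()
q4.model.save_model_to_file("onnx/model_int4_matmul.onnx", use_external_data_format=True)

# The MoE experts are handled separately (block-symmetric MoE -> QMoE conversion);
# see training/quantize_moe.py in the training repo.
```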
### How to use the ONNX variants
```python
import numpy as np
import onnxruntime as ort
from transformers import AutoConfig, AutoTokenizer

tok = AutoTokenizer.from_pretrained("YLOD/openai-privacy-filter-fr")
id2label = AutoConfig.from_pretrained("YLOD/openai-privacy-filter-fr").id2label  # full 33-class map

# Download one variant first, e.g. fp16 (half the fp32 export size, negligible quality loss):
# huggingface-cli download YLOD/openai-privacy-filter-fr onnx/model_fp16.onnx onnx/model_fp16.onnx_data
sess = ort.InferenceSession(
    "onnx/model_fp16.onnx",
    providers=["CPUExecutionProvider"],  # or ["CUDAExecutionProvider"] on GPU
)

text = "Alice Dupont, IBAN FR76 3000 4000 0312 3456 7890 143, née le 1987-05-12."
enc = tok(text, return_tensors="np")
logits = sess.run(
    ["logits"],
    {"input_ids": enc["input_ids"].astype(np.int64),
     "attention_mask": enc["attention_mask"].astype(np.int64)},
)[0]
pred_ids = logits[0].argmax(-1)
# → merge adjacent BIOES subword tags into spans as per the base model card
```
For a ready-to-use span merger compatible with the model's subword-level BIOES outputs,
see the training repo that produced this model.
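As a lighter-weight alternative, here is a minimal span-merging sketch over the raw per-subword predictions. It is an illustration, not the training repo's helper, and it assumes the `text`, `tok`, `pred_ids`, and `id2label` variables from the ONNX snippet above plus a fast tokenizer (for `return_offsets_mapping`):

```python
def merge_bioes_spans(text, tok, pred_ids, id2label):
    """Greedy merge of subword-level BIOES tags into (label, start, end, text) spans."""
    offsets = tok(text, return_offsets_mapping=True)["offset_mapping"]
    spans, current = [], None  # current = [label, char_start, char_end]
    for (start, end), pred in zip(offsets, pred_ids):
        if start == end:               # special tokens have empty offsets
            continue
        tag = id2label[int(pred)]
        if tag == "O":
            current = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix in ("B", "S") or current is None or current[0] != label:
            current = [label, start, end]
            spans.append(current)
        else:                          # I- / E- continuing the open span
            current[2] = end
    return [(label, s, e, text[s:e]) for label, s, e in spans]

print(merge_bioes_spans(text, tok, pred_ids, id2label))
```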
## License
Apache-2.0, inherited from [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter). The training data ([`ai4privacy/open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy)) is CC-BY-4.0 — attribution preserved above.
## Citation
If you use this model, please cite the underlying base model and dataset:
```bibtex
@misc{openai2026privacyfilter,
title = {OpenAI Privacy Filter},
author = {OpenAI},
year = {2026},
howpublished = {\url{https://huggingface.co/openai/privacy-filter}},
}
@dataset{ai4privacy_open_pii_500k,
title = {Open PII Masking 500k (ai4privacy)},
author = {ai4privacy},
howpublished = {\url{https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy}},
license = {CC-BY-4.0},
}
```