---
license: apache-2.0
base_model: openai/privacy-filter
language:
  - fr
library_name: transformers
pipeline_tag: token-classification
tags:
  - pii
  - privacy
  - token-classification
  - ner
  - bioes
  - french
  - insurance
  - crm
datasets:
  - ai4privacy/open-pii-masking-500k-ai4privacy
metrics:
  - f1
  - precision
  - recall
widget:
  - text: Bonjour, je m'appelle Alice Dupont et mon email est alice@acme.fr
  - text: >-
      Mon IBAN est FR76 3000 4000 0312 3456 7890 143 et mon téléphone le 06 12
      34 56 78.
  - text: >-
      Le sinistre N°2024-FR-98341 concerne M. Jean-Baptiste Leclerc, né le
      1987-05-12, au 15 rue de Rivoli 75001 Paris.
model-index:
  - name: openai-privacy-filter-fr
    results:
      - task:
          type: token-classification
          name: PII span detection (French)
        dataset:
          name: >-
            ai4privacy/open-pii-masking-500k-ai4privacy (French slice, held-out
            test)
          type: ai4privacy/open-pii-masking-500k-ai4privacy
          config: fr
          split: test
        metrics:
          - type: f1
            value: 0.9522
            name: Overall span-F1 (BIOES strict)
          - type: precision
            value: 0.96
            name: Overall precision
          - type: recall
            value: 0.9446
            name: Overall recall
---

# openai-privacy-filter-fr

French fine-tune of `openai/privacy-filter` for PII detection in a French insurance CRM context (policyholder names, contact details, IBAN/RIB, addresses, dates of birth, claim / contract identifiers).

Training used full fine-tuning with 8-bit AdamW (bitsandbytes) under bf16 autocast on a single RTX 4090.


## Results (held-out French test set, 1 035 examples)

| Metric | Zero-shot baseline (`openai/privacy-filter`) | This model | Δ |
|---|---|---|---|
| Overall span-F1 (BIOES strict) | 0.7068 | 0.9522 | +0.2454 (+34.7 %) |
| Precision | 0.8037 | 0.9600 | +0.1563 |
| Recall | 0.6308 | 0.9446 | +0.3138 |

### Per-class F1 (span-level, strict BIOES)

| Class | Baseline | This model | Δ |
|---|---|---|---|
| `private_email` | 0.960 | 1.000 | +0.040 |
| `private_phone` | 0.870 | 1.000 | +0.130 |
| `private_date` | 0.652 | 0.997 | +0.345 |
| `account_number` | 0.874 | 0.995 | +0.121 |
| `private_person` | 0.683 | 0.931 | +0.248 |
| `private_address` | 0.428 | 0.906 | +0.478 |

(The `private_url` and `secret` classes are preserved from the base model but absent from this test set, so they are not reported.)


## Intended use

Designed for French-language, on-premises PII redaction in enterprise flows: emails, chat logs, CRM notes, claim reports, and scanned-document transcripts. The primary target is insurance back-office work (underwriting, claims), but the label set is generic enough for banking, healthcare administration, HR, and customer support.

Not suitable for:

- Languages other than French (use the base model or retrain for your target language).
- Content with no training-time analogue (e.g. medical free-text, legal case citations).
- Final anonymisation guarantees. Always combine the model with rule-based recognisers (Presidio) and human review for high-sensitivity workflows.

## Label schema

Same as the base model: 33 classes = `O` + 8 entity types × 4 BIOES boundary tags (B / I / E / S).

| Entity type | Covers |
|---|---|
| `private_person` | Policyholder names, usernames, titles (M., Mme., Dr.). |
| `private_email` | Personal email addresses. |
| `private_phone` | Phone numbers (mobile / landline / fax). |
| `private_address` | Street, building number, city, ZIP, state/country. |
| `account_number` | IBAN/RIB, credit card, BIC/SWIFT, customer/contract IDs, ID card, passport, tax and social numbers. |
| `private_date` | DOB, birth year, date/time references tied to a person. |
| `private_url` | Personal URLs / IP addresses (preserved from the base model; not retrained). |
| `secret` | API keys, passwords, tokens (preserved from the base model; not retrained). |
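For reference, the full 33-tag set can be enumerated mechanically from these eight types; a minimal sketch (the authoritative id-to-label order lives in config.json):

```python
# Illustrative only: the authoritative label ids live in config.json.
ENTITY_TYPES = [
    "private_person", "private_email", "private_phone", "private_address",
    "account_number", "private_date", "private_url", "secret",
]
LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I", "E", "S")]
assert len(LABELS) == 33  # O + 8 entity types × 4 boundary tags
```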

Inference returns subword-level BIOES tags, which the Hugging Face token-classification pipeline aggregates into spans.


## How to use

```python
from transformers import pipeline

nlp = pipeline(
    task="token-classification",
    model="YLOD/openai-privacy-filter-fr",
    aggregation_strategy="simple",
)

text = (
    "Bonjour, je suis Alice Dupont, née le 1987-05-12. "
    "Mon email : alice.dupont@acme.fr, mobile 06 12 34 56 78. "
    "IBAN : FR76 3000 4000 0312 3456 7890 143."
)
for span in nlp(text):
    print(f"[{span['entity_group']:>16}] {span['word']!r} ({span['score']:.3f})")
```

For a ready-to-use masker that merges adjacent subword spans correctly, see the demo script in the GitHub repo (if published) or reuse the `merge_spans` helper from the training code.


## Training details

### Data

- Source: ai4privacy/open-pii-masking-500k-ai4privacy (CC-BY-4.0)
- Language filter: `language == "fr"` → 89 670 examples available; 10 005 / 460 / 1 035 train / validation / test (seed 42)
- Label mapping: 60+ source classes collapsed into the 8-class privacy-filter taxonomy (FIRSTNAME / LASTNAME / GIVENNAME / SURNAME / TITLE → `private_person`; TELEPHONENUM / PHONEIMEI → `private_phone`; BUILDINGNUM / CITY / ZIPCODE / STREET → `private_address`; IBAN / IDCARDNUM / PASSPORTNUM / TAXNUM / SOCIALNUM → `account_number`; etc.)
- Alignment: char-offset spans aligned to subword tokens with strict BIOES at the subword level (first / middle / last subwords of a span get B- / I- / E-; singletons get S-). Whitespace-only subwords inside a span inherit I- to bridge IBAN-like groups. A sketch of this step follows below.
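A hedged sketch of this preprocessing; the names (`LABEL_MAP`, `align_bioes`) are illustrative rather than the repo's actual helpers, and the map excerpt shows only the collapses listed above:

```python
# Excerpt of the class collapse described above (the full map covers 60+ source classes).
LABEL_MAP = {
    "FIRSTNAME": "private_person", "LASTNAME": "private_person", "TITLE": "private_person",
    "TELEPHONENUM": "private_phone", "PHONEIMEI": "private_phone",
    "BUILDINGNUM": "private_address", "CITY": "private_address", "ZIPCODE": "private_address",
    "IBAN": "account_number", "PASSPORTNUM": "account_number", "SOCIALNUM": "account_number",
}

def align_bioes(text, char_spans, tokenizer):
    """char_spans: (start, end, target_label) triples. Returns one BIOES tag per
    subword: B-/I-/E- for first/middle/last subwords, S- for single-subword spans.
    (The real pipeline additionally lets whitespace-only subwords inside a span
    inherit I- so that IBAN-like groups stay bridged.)"""
    enc = tokenizer(text, return_offsets_mapping=True)
    tags = ["O"] * len(enc["input_ids"])
    for start, end, label in char_spans:
        idx = [i for i, (s, e) in enumerate(enc["offset_mapping"])
               if s < end and e > start and s != e]  # subwords overlapping the span
        if not idx:
            continue
        if len(idx) == 1:
            tags[idx[0]] = f"S-{label}"
        else:
            tags[idx[0]] = f"B-{label}"
            for i in idx[1:-1]:
                tags[i] = f"I-{label}"
            tags[idx[-1]] = f"E-{label}"
    return tags
```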

### Hyperparameters (best configuration from an 8-iteration autoresearch sweep)

| Setting | Value |
|---|---|
| Base checkpoint | `openai/privacy-filter` (1.4 B params total, 50 M active; MoE) |
| Strategy | Full fine-tuning (all 1.4 B params trainable) |
| Optimizer | AdamW 8-bit (bitsandbytes) |
| Learning rate | 2 × 10⁻⁴ |
| Batch size | 16 × grad-accum 2 = effective 32 |
| Epochs | 2 |
| Warmup ratio | 0.03 |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Scheduler | cosine |
| Precision | bf16 autocast (fp32 master weights) |
| Gradient checkpointing | disabled (short sequences, ~30 tokens median) |
| Seed | 0 |
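In transformers terms, the table above corresponds roughly to the following `TrainingArguments` (a sketch assuming the standard `Trainer` API; `optim="adamw_bnb_8bit"` selects the bitsandbytes 8-bit AdamW):

```python
from transformers import TrainingArguments

# Sketch of the winning configuration from the table above.
args = TrainingArguments(
    output_dir="out",
    optim="adamw_bnb_8bit",            # AdamW 8-bit via bitsandbytes
    learning_rate=2e-4,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,     # effective batch size 32
    num_train_epochs=2,
    warmup_ratio=0.03,
    weight_decay=0.01,
    max_grad_norm=1.0,
    lr_scheduler_type="cosine",
    bf16=True,                         # bf16 autocast, fp32 master weights
    gradient_checkpointing=False,      # short sequences (~30 tokens median)
    seed=0,
)
```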

### Hardware

- Single NVIDIA RTX 4090 (24 GB) capped at 225 W
- WSL2 on Windows; PyTorch 2.11 + CUDA 13.0, transformers 5.6
- ~25 minutes wall-clock per training run

### Noise floor

Seed-to-seed variance on the same config ≈ ±0.003 F1 (measured with seeds 0 and 1). Gains smaller than that are not meaningful.

### Autoresearch iteration summary

| # | Change | Test F1 | Δ vs baseline | Outcome |
|---|---|---|---|---|
| 0 | — (zero-shot baseline) | 0.7068 | — | baseline |
| 1 | LR 1e-5 → 3e-4 | 0.9473 | +0.2405 | keep |
| 2 | LR 3e-4 → 2e-4 | 0.9522 | +0.2454 | keep (best) |
| 3 | LR 2e-4 → 1e-4 | 0.9356 | +0.2288 | discard |
| 4 | Epochs 2 → 3 | 0.9500 | +0.2432 | discard |
| 5 | Warmup 0.03 → 0.10 | 0.9382 | +0.2314 | discard |
| 6 | Weight decay 0.01 → 0.1 | 0.9492 | +0.2424 | discard |
| 7 | Seed 0 → 1 (noise check) | 0.9491 | +0.2423 | discard |

## Limitations & ethical considerations

- **No privacy guarantee.** ML-based PII detection can miss uncommon formats, aliased references, adversarial spacing, or novel identifier types. Always pair this model with regex-based recognisers and human review for high-sensitivity outputs.
- **French-only distribution shift.** Trained on French data only; performance on other languages will regress sharply from the base-model baseline.
- **Synthetic data bias.** ai4privacy is largely template-generated. Real-world free text (handwritten claim descriptions, casual customer emails) may be underrepresented. A domain-specific holdout from your actual CRM is essential before production deployment.
- **`private_address` is the weakest class (F1 0.91).** Ambiguous short addresses (single street name, abbreviations, PO boxes) are the main failure mode.
- **`private_url` and `secret` were not retrained.** Their behaviour is inherited from the base model; if these classes matter in your domain, run a follow-up fine-tune that includes them.
- **Label collisions.** When a token plausibly belongs to two classes (e.g. a phone number embedded in an address block), the model picks one; span splitting is not guaranteed to follow human intuition.
- **Not suitable for medical, legal, or regulated decision-making** without explicit compliance review.

## ONNX quantized variants

In addition to the default PyTorch `model.safetensors`, this repository ships four ONNX variants under `onnx/`, benchmarked on the same French test set (1 035 examples). The ONNX graph reuses OpenAI's base-model export (which correctly handles the MoE routing and attention sinks) with the fine-tuned weights swapped in. The INT8 and INT4 variants combine the standard `quantize_dynamic` / `MatMulNBitsQuantizer` passes on the MatMul nodes with a custom MoE → QMoE conversion of the expert tensors (block-symmetric, `block_size=32`), so the MoE experts, which hold ~90 % of the parameters, are also quantized.
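As a numeric illustration of block-symmetric quantization with a zero-point of 2^(bits−1) (see Quantization details below), here is a standalone sketch; the actual per-expert graph rewrite lives in `training/quantize_moe.py` and may differ in detail:

```python
import numpy as np

def block_symmetric_quantize(w, bits, block_size=32):
    """Quantize a float weight tensor in blocks of `block_size` values,
    symmetric around zero, stored unsigned with zero-point 2^(bits-1)."""
    zp = 2 ** (bits - 1)                         # 8 for int4, 128 for int8
    flat = w.reshape(-1, block_size)             # assumes w.size % block_size == 0
    scale = np.abs(flat).max(axis=1, keepdims=True) / (zp - 1)
    scale = np.maximum(scale, np.finfo(w.dtype).tiny)  # guard all-zero blocks
    q = np.clip(np.round(flat / scale) + zp, 0, 2 * zp - 1).astype(np.uint8)
    return q, scale.astype(w.dtype)
```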

| Variant | F1 | Δ F1 | Precision | Recall | File size | Compression | CPU latency p50 | CPU latency p95 |
|---|---|---|---|---|---|---|---|---|
| PyTorch (transformers) | 0.9522 | — | 0.9595 | 0.9451 | 2.80 GB | 1.0× | 99.5 ms | 144.8 ms |
| ONNX fp32 (`onnx/model.onnx`) | 0.9522 | 0.0000 | 0.9595 | 0.9451 | 5.63 GB | 0.5× | 38.4 ms | 53.0 ms |
| ONNX fp16 (`onnx/model_fp16.onnx`) | 0.9517 | −0.0005 | 0.9584 | 0.9451 | 2.82 GB | 1.0× | 39.3 ms | 54.6 ms |
| ONNX INT8 (`onnx/model_int8.onnx`) | 0.9516 | −0.0006 | 0.9605 | 0.9429 | 1.60 GB | 3.5× | 52.7 ms | 68.0 ms |
| ONNX INT4 (`onnx/model_int4.onnx`) | 0.9509 | −0.0013 | 0.9573 | 0.9446 | 1.35 GB | 4.2× | 344.8 ms ⚠ | 428.0 ms |

Benchmark setup: 1 035 FR test examples, batch size 1, single-threaded ONNX Runtime (CPU provider) on an AMD Ryzen-class CPU under WSL2. Compression is vs PyTorch fp32. Memory readings via `ru_maxrss` show ~9 GB across all variants because ORT mem-maps external-data files instead of loading them fully; RSS therefore doesn't reflect the actual resident set for mmapped data.
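The measurement setup corresponds roughly to the sketch below; `encoded_test_set` is a hypothetical stand-in for the pre-tokenized FR test examples:

```python
import time
import numpy as np
import onnxruntime as ort

so = ort.SessionOptions()
so.intra_op_num_threads = 1   # single-threaded, as in the table above
so.inter_op_num_threads = 1
sess = ort.InferenceSession("onnx/model_int8.onnx", sess_options=so,
                            providers=["CPUExecutionProvider"])

latencies_ms = []
for feeds in encoded_test_set:  # hypothetical: one dict of int64 arrays per example
    t0 = time.perf_counter()
    sess.run(["logits"], feeds)
    latencies_ms.append((time.perf_counter() - t0) * 1e3)

print(f"p50 {np.percentile(latencies_ms, 50):.1f} ms / "
      f"p95 {np.percentile(latencies_ms, 95):.1f} ms")
```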

### Key findings

- ONNX fp32 is 2.6× faster than PyTorch at identical precision and F1 (graph optimisation, fused ops, no Python-side MoE loop).
- INT8 is the practical sweet spot on CPU: 3.5× smaller than PyTorch fp32 (1.60 GB vs 2.80 GB original, 5.63 GB vs fp32 ONNX), F1 unchanged within the training noise floor (±0.003), and still ~2× faster than PyTorch.
- INT4 gives the smallest footprint (1.35 GB, 4.2× compression) with negligible F1 loss, but the CPU QMoE kernel for int4 is not as optimized as its int8 counterpart; expect a ~9× slowdown on CPU. INT4 is best suited for GPU inference or specialized runtimes (CUDA, OpenVINO, WebGPU via Transformers.js) where the int4 dequant path is kernel-fused.
- All four variants are within the training noise floor (±0.003) on overall F1, so pick based on the target runtime and memory budget.

### GPU (CUDA) benchmark (RTX 4090, ONNX Runtime 1.25 CUDA EP)

| Variant | F1 | Size | CUDA latency p50 | CUDA latency p95 | Notes |
|---|---|---|---|---|---|
| ONNX fp32 | 0.9522 | 5.63 GB | 5.0 ms | 21.5 ms | MoE CUDA kernel (FasterTransformer); fastest |
| ONNX fp16 | — | 2.82 GB | fail | fail | MoE FT kernel templated for SM80, fails on Ada (SM89) in ORT 1.25 |
| ONNX INT8 | 0.9508 | 1.60 GB | 68.7 ms | 89.4 ms | QMoE CUDA int8 kernel currently unoptimized |
| ONNX INT4 | 0.9506 | 1.35 GB | 347.1 ms | 426.2 ms | Same: QMoE CUDA int4 path unoptimized in 1.25 |

GPU takeaway: the ONNX fp32 graph benefits from a highly optimized FasterTransformer-based MoE kernel and reaches 5 ms per example (~200 ex/s), a 7.7× speed-up over the CPU path. The quantized (QMoE) CUDA kernels exist and run correctly but are currently much slower than the fp32 kernel, so the quantized variants are not recommended for latency-critical GPU inference. Their value on GPU is memory footprint (1.3 – 1.6 GB of VRAM) rather than speed. Future ORT releases, or TensorRT-LLM / custom kernels, should close this gap.

The fp16 failure on Ada (RTX 4090, SM89) stems from the bundled CUTLASS MoE GEMM being templated against SM80: a shared-memory check rejects the kernel at launch. An ORT build rebuilt with SM89 kernels, or running on A100/A10/H100, should restore fp16 MoE support.

### Quantization details

- fp16: whole-graph float16 cast (`onnxconverter_common.float16.convert_float_to_float16`).
- INT8: `quantize_dynamic` (per-channel int8) on regular MatMul / Gemm nodes, plus block-symmetric int8 QMoE on the expert tensors (`block_size=32`).
- INT4: `MatMulNBitsQuantizer` (4-bit weight-only) on regular MatMul / Gemm nodes, plus block-symmetric int4 QMoE on the expert tensors (`block_size=32`, symmetric, default zero-point 2^(bits−1)).
- The quantization script for the MoE part is included in the training repo as `training/quantize_moe.py`; the stock ORT quantizers don't handle the custom `com.microsoft.MoE` op, so we manually block-quantize `gate_up_proj` and `down_proj` per expert and rewrite the node to `com.microsoft.QMoE`. A sketch of the stock (non-MoE) steps follows.
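For the regular (non-MoE) nodes, the stock ONNX Runtime entry points are enough. A hedged sketch follows; module paths and keyword arguments vary across ORT releases, and the dense-node block size is an assumption here:

```python
import onnx
from onnxruntime.quantization import QuantType, quantize_dynamic
from onnxruntime.quantization.matmul_nbits_quantizer import MatMulNBitsQuantizer

# INT8: per-channel dynamic quantization of the regular MatMul nodes.
quantize_dynamic(
    "onnx/model.onnx",
    "onnx/model_int8_dense.onnx",
    per_channel=True,
    weight_type=QuantType.QInt8,
    op_types_to_quantize=["MatMul"],
    use_external_data_format=True,  # the fp32 graph is > 2 GB
)

# INT4: 4-bit weight-only quantization of the same nodes.
model = onnx.load("onnx/model.onnx")
quantizer = MatMulNBitsQuantizer(model, block_size=32, is_symmetric=True)  # block size assumed
quantizer.process()
quantizer.model.save_model_to_file("onnx/model_int4_dense.onnx",
                                   use_external_data_format=True)
# The expert tensors then go through the custom MoE → QMoE pass in
# training/quantize_moe.py (block-symmetric, block_size=32).
```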

### How to use the ONNX variants

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoConfig, AutoTokenizer

tok = AutoTokenizer.from_pretrained("YLOD/openai-privacy-filter-fr")
id2label = AutoConfig.from_pretrained("YLOD/openai-privacy-filter-fr").id2label  # full 33-class map

# Download one variant (fp16: half the fp32 size with no measurable quality loss):
# huggingface-cli download YLOD/openai-privacy-filter-fr onnx/model_fp16.onnx onnx/model_fp16.onnx_data

sess = ort.InferenceSession(
    "onnx/model_fp16.onnx",
    providers=["CPUExecutionProvider"],  # or ["CUDAExecutionProvider"] on GPU
)

text = "Alice Dupont, IBAN FR76 3000 4000 0312 3456 7890 143, née le 1987-05-12."
enc = tok(text, return_tensors="np")
logits = sess.run(
    ["logits"],
    {"input_ids": enc["input_ids"].astype(np.int64),
     "attention_mask": enc["attention_mask"].astype(np.int64)},
)[0]

pred_ids = logits[0].argmax(-1)
tags = [id2label[int(i)] for i in pred_ids]  # one BIOES tag per subword
# → merge adjacent BIOES subword tags into spans as per the base model card
```

For a ready-to-use span merger compatible with the model's subword-level BIOES outputs, see the training repo that produced this model; a minimal illustrative version follows.
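For illustration only (not the repo's `merge_spans`), a minimal BIOES span merger over subword tags and character offsets might look like this:

```python
def merge_bioes_spans(tags, offsets):
    """tags: one BIOES tag per subword; offsets: (start, end) char offsets
    from the tokenizer. Returns (label, start_char, end_char) spans."""
    spans, start, end, label = [], None, None, None
    for (s, e), tag in zip(offsets, tags):
        if s == e:                      # special tokens carry empty offsets
            continue
        prefix, _, ent = tag.partition("-")
        if prefix == "O":
            if start is not None:       # close any dangling span
                spans.append((label, start, end))
                start = label = None
            continue
        if prefix in ("B", "S") or ent != label:
            if start is not None:       # close the previous span first
                spans.append((label, start, end))
            start, label = s, ent
        end = e
        if prefix in ("E", "S"):        # explicit span end
            spans.append((label, start, end))
            start = label = None
    if start is not None:               # trailing unterminated span
        spans.append((label, start, end))
    return spans
```

Feed it the `tags` from the snippet above together with `tok(text, return_offsets_mapping=True)["offset_mapping"]` to recover character spans for masking.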

## License

Apache-2.0, inherited from `openai/privacy-filter`. The training data (ai4privacy/open-pii-masking-500k-ai4privacy) is CC-BY-4.0; attribution is preserved above.

## Citation

If you use this model, please cite the underlying base model and dataset:

```bibtex
@misc{openai2026privacyfilter,
  title        = {OpenAI Privacy Filter},
  author       = {OpenAI},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/openai/privacy-filter}},
}

@dataset{ai4privacy_open_pii_500k,
  title        = {Open PII Masking 500k (ai4privacy)},
  author       = {ai4privacy},
  howpublished = {\url{https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy}},
  license      = {CC-BY-4.0},
}
```