---
license: apache-2.0
base_model: openai/privacy-filter
language:
- fr
library_name: transformers
pipeline_tag: token-classification
tags:
- pii
- privacy
- token-classification
- ner
- bioes
- french
- insurance
- crm
datasets:
- ai4privacy/open-pii-masking-500k-ai4privacy
metrics:
- f1
- precision
- recall
widget:
- text: "Bonjour, je m'appelle Alice Dupont et mon email est alice@acme.fr"
- text: "Mon IBAN est FR76 3000 4000 0312 3456 7890 143 et mon téléphone le 06 12 34 56 78."
- text: "Le sinistre N°2024-FR-98341 concerne M. Jean-Baptiste Leclerc, né le 1987-05-12, au 15 rue de Rivoli 75001 Paris."
model-index:
- name: openai-privacy-filter-fr
results:
- task:
type: token-classification
name: PII span detection (French)
dataset:
name: ai4privacy/open-pii-masking-500k-ai4privacy (French slice, held-out test)
type: ai4privacy/open-pii-masking-500k-ai4privacy
config: fr
split: test
metrics:
- type: f1
value: 0.9522
name: Overall span-F1 (BIOES strict)
- type: precision
value: 0.9600
name: Overall precision
- type: recall
value: 0.9446
name: Overall recall
---
# openai-privacy-filter-fr
French fine-tune of [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) for PII detection in a **French insurance CRM** context (policyholder names, contact details, IBAN/RIB, addresses, dates of birth, claim / contract identifiers).
Full fine-tuning with AdamW 8-bit (bitsandbytes), bf16 autocast, on a single RTX 4090.
---
## Results (held-out French test set, 1 035 examples)
| Metric | Zero-shot baseline (`openai/privacy-filter`) | This model | Δ |
|---|---:|---:|---:|
| **Overall span-F1 (BIOES strict)** | 0.7068 | **0.9522** | **+0.2454 (+34.7 %)** |
| Precision | 0.8037 | 0.9600 | +0.1563 |
| Recall | 0.6308 | 0.9446 | +0.3138 |
### Per-class F1 (span-level, strict BIOES)
| Class | Baseline | This model | Δ |
|---|---:|---:|---:|
| `private_email` | 0.960 | **1.000** | +0.040 |
| `private_phone` | 0.870 | **1.000** | +0.130 |
| `private_date` | 0.652 | **0.997** | +0.345 |
| `account_number` | 0.874 | **0.995** | +0.121 |
| `private_person` | 0.683 | 0.931 | +0.248 |
| `private_address` | 0.428 | 0.906 | **+0.478** |
*(`private_url` and `secret` classes are preserved from the base model but not present in this test set, so not reported.)*
---
## Intended use
Designed for **French-language on-premises PII redaction** in enterprise flows: emails, chat logs, CRM notes, claim reports, scanned document transcripts. Primary target: insurance back-office (souscripteurs, sinistres), but the label set is generic enough for banking, healthcare admin, HR, and customer support.
**Not suitable for:**
- Languages other than French (use the base model or retrain for your target language).
- Content with no training-time analogue (e.g. medical free-text, legal case citations).
- Final anonymisation guarantee — always combine with rule-based recognisers (Presidio) and human review for high-sensitivity workflows.
---
## Label schema
Same as the base model: 33 classes = `O` + 8 entity types × 4 BIOES boundary tags.
| Entity type | Covers |
|---|---|
| `private_person` | Policyholder names, usernames, titles (M., Mme., Dr.). |
| `private_email` | Personal email addresses. |
| `private_phone` | Phone numbers (mobile / landline / fax). |
| `private_address` | Street, building number, city, ZIP, state/country. |
| `account_number` | IBAN/RIB, credit card, BIC/SWIFT, customer/contract IDs, ID card, passport, tax and social numbers. |
| `private_date` | DOB, birth year, date/time references tied to a person. |
| `private_url` | Personal URLs / IP addresses. *(preserved from base model; not retrained)* |
| `secret` | API keys, passwords, tokens. *(preserved from base model; not retrained)* |
Inference returns subword-level BIOES tags that the HuggingFace `token-classification` pipeline aggregates into spans.
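The authoritative id ↔ label mapping is the one in `config.json`; as a purely illustrative sketch of how the 8 entity types expand into the 33-label BIOES set (index order and prefix order are assumptions here):

```python
# Illustrative only: the real id -> label mapping is the one shipped in config.json.
ENTITY_TYPES = [
    "private_person", "private_email", "private_phone", "private_address",
    "account_number", "private_date", "private_url", "secret",
]

labels = ["O"] + [
    f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I", "E", "S")
]
assert len(labels) == 33  # O + 8 entity types x 4 BIOES boundary tags
```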
---
## How to use
```python
from transformers import pipeline

nlp = pipeline(
    task="token-classification",
    model="YLOD/openai-privacy-filter-fr",
    aggregation_strategy="simple",
)

text = (
    "Bonjour, je suis Alice Dupont, née le 1987-05-12. "
    "Mon email : alice.dupont@acme.fr, mobile 06 12 34 56 78. "
    "IBAN : FR76 3000 4000 0312 3456 7890 143."
)

for span in nlp(text):
    print(f"[{span['entity_group']:>16}] {span['word']!r} ({span['score']:.3f})")
```
For a ready-to-use masker that merges adjacent subword spans correctly, see the [demo script in the GitHub repo](https://github.com/autoresearch-demo/privacy-filter-fr) (if published) or reuse the `merge_spans` helper from the training code.
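If you only need redaction rather than span inspection, here is a minimal masking sketch built on the pipeline's character offsets. It is an illustration only (not the repo's `merge_spans` helper), and the 0.5 score threshold is an arbitrary example value:

```python
from transformers import pipeline

nlp = pipeline(
    "token-classification",
    model="YLOD/openai-privacy-filter-fr",
    aggregation_strategy="simple",
)

def mask_pii(text: str, score_threshold: float = 0.5) -> str:
    """Replace every detected span with a [LABEL] placeholder.

    Spans are applied right-to-left so earlier character offsets stay valid.
    """
    spans = [s for s in nlp(text) if s["score"] >= score_threshold]
    for s in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[: s["start"]] + f"[{s['entity_group'].upper()}]" + text[s["end"]:]
    return text

print(mask_pii("Bonjour, je suis Alice Dupont et mon email est alice@acme.fr"))
```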
---
## Training details
### Data
- **Source**: [`ai4privacy/open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy) (CC-BY-4.0)
- **Language filter**: `language == "fr"` → 89 670 examples available; **10 005 / 460 / 1 035** train / validation / test (seed 42)
- **Label mapping**: 60+ source classes collapsed into the 8-class privacy-filter taxonomy (`FIRSTNAME / LASTNAME / GIVENNAME / SURNAME / TITLE → private_person`, `TELEPHONENUM / PHONEIMEI → private_phone`, `BUILDINGNUM / CITY / ZIPCODE / STREET → private_address`, `IBAN / IDCARDNUM / PASSPORTNUM / TAXNUM / SOCIALNUM → account_number`, etc.).
- **Alignment**: char-offset spans aligned to subword tokens with strict BIOES at the subword level (first / middle / last subwords of a span get `B-` / `I-` / `E-`, singletons get `S-`). Whitespace-only subwords inside a span inherit `I-` to bridge IBAN-like groups.
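For concreteness, a partial sketch of the label collapse and the per-span BIOES assignment described in the two bullets above. Only the mappings named above are listed (the full table covers the 60+ source classes and lives in the training code), and the helper is illustrative rather than the training script's implementation:

```python
# Partial mapping -- only the pairs named above; remaining source classes elided.
SOURCE_TO_TARGET = {
    "FIRSTNAME": "private_person", "LASTNAME": "private_person",
    "GIVENNAME": "private_person", "SURNAME": "private_person", "TITLE": "private_person",
    "TELEPHONENUM": "private_phone", "PHONEIMEI": "private_phone",
    "BUILDINGNUM": "private_address", "CITY": "private_address",
    "ZIPCODE": "private_address", "STREET": "private_address",
    "IBAN": "account_number", "IDCARDNUM": "account_number",
    "PASSPORTNUM": "account_number", "TAXNUM": "account_number",
    "SOCIALNUM": "account_number",
    # ...
}

def bioes_tags(entity: str, n_subwords: int) -> list[str]:
    """BIOES tags for the n subword tokens covered by one entity span."""
    if n_subwords == 1:
        return [f"S-{entity}"]
    return [f"B-{entity}"] + [f"I-{entity}"] * (n_subwords - 2) + [f"E-{entity}"]
```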
### Hyperparameters (optimum found via an 8-iteration autoresearch sweep)
| | Value |
|---|---|
| Base checkpoint | `openai/privacy-filter` (1.4 B params total, 50 M active — MoE) |
| Strategy | **Full fine-tuning** (all 1.4 B params trainable) |
| Optimizer | **AdamW 8-bit** (bitsandbytes) |
| Learning rate | **2 × 10⁻⁴** |
| Batch size | 16 × grad-accum 2 = effective 32 |
| Epochs | 2 |
| Warmup ratio | 0.03 |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Scheduler | cosine |
| Precision | bf16 autocast (fp32 master weights) |
| Gradient checkpointing | disabled (short sequences, ~30 tokens median) |
| Seed | 0 |
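These settings map directly onto the standard `transformers` Trainer API. A minimal sketch, assuming the run used `TrainingArguments` with the bitsandbytes 8-bit AdamW exposed as `optim="adamw_bnb_8bit"` (the actual training script may differ):

```python
from transformers import TrainingArguments

# Sketch of the reported configuration, assuming the standard Trainer API.
args = TrainingArguments(
    output_dir="out/privacy-filter-fr",   # hypothetical path
    learning_rate=2e-4,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,        # effective batch size 32
    num_train_epochs=2,
    warmup_ratio=0.03,
    weight_decay=0.01,
    max_grad_norm=1.0,
    lr_scheduler_type="cosine",
    optim="adamw_bnb_8bit",               # AdamW 8-bit (bitsandbytes)
    bf16=True,                            # bf16 autocast, fp32 master weights
    gradient_checkpointing=False,         # short sequences (~30 tokens median)
    seed=0,
)
```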
### Hardware
- Single **NVIDIA RTX 4090 (24 GB)** capped at **225 W**
- WSL2 on Windows, PyTorch 2.11 + CUDA 13.0, transformers 5.6
- ~25 minutes wall-clock per training run
### Noise floor
Seed-to-seed variance on the same config ≈ **±0.003 F1** (measured with seeds 0 and 1). Gains smaller than that are not meaningful.
### Autoresearch iteration summary
| # | Change | Test F1 | Δ vs baseline | Outcome |
|---|---|---:|---:|---|
| 0 | — (zero-shot baseline) | 0.7068 | — | baseline |
| 1 | LR 1e-5 → 3e-4 | 0.9473 | +0.2405 | keep |
| **2** | **LR 3e-4 → 2e-4** | **0.9522** | **+0.2454** | **keep (best)** |
| 3 | LR 2e-4 → 1e-4 | 0.9356 | +0.2288 | discard |
| 4 | Epochs 2 → 3 | 0.9500 | +0.2432 | discard |
| 5 | Warmup 0.03 → 0.10 | 0.9382 | +0.2314 | discard |
| 6 | Weight decay 0.01 → 0.1 | 0.9492 | +0.2424 | discard |
| 7 | Seed 0 → 1 (noise check) | 0.9491 | +0.2423 | discard |
---
## Limitations & ethical considerations
- **No privacy guarantee.** ML-based PII detection can miss uncommon formats, aliased references, adversarial spacing, or novel identifier types. This model should always be paired with regex-based recognisers and human review for high-sensitivity outputs.
- **French-only distribution shift.** Trained on French data only; performance on other languages will regress sharply from the base model baseline.
- **Synthetic data bias.** ai4privacy is largely template-generated. Real-world free-text (handwritten claim descriptions, casual customer emails) may be underrepresented. A domain-specific holdout from your actual CRM is essential before production deployment.
- **`private_address` is the weakest class (F1 0.91).** Ambiguous short addresses (single street name, abbreviations, PO boxes) are the main failure mode.
- **`private_url` and `secret` were not retrained.** Their behaviour is inherited from the base model; if these matter in your domain, run a follow-up fine-tune that includes them.
- **Label collisions.** When a token plausibly belongs to two classes (e.g. a phone number embedded in an address block), the model picks one; span splitting is not guaranteed to follow human intuition.
- **Not suitable for medical, legal or regulated decision-making** without explicit compliance review.
## ONNX quantized variants
In addition to the default PyTorch `model.safetensors`, this repository ships four
ONNX variants under `onnx/`, benchmarked on the same French test set (1 035 examples).
The ONNX graph reuses OpenAI's base-model export (which correctly handles the MoE
routing and attention sinks) with the fine-tuned weights swapped in. INT8 and INT4
variants combine standard `quantize_dynamic` / `MatMulNBitsQuantizer` on the MatMul
nodes **with** a custom `MoE → QMoE` conversion of the expert tensors (block-symmetric,
block_size=32), so the MoE experts — which hold ~90 % of the parameters — are also
quantized.
| Variant | F1 | Δ F1 | Precision | Recall | File size | Compression | CPU latency p50 | CPU latency p95 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| **PyTorch (transformers)** | 0.9522 | — | 0.9595 | 0.9451 | 2.80 GB | 2.0× | 99.5 ms | 144.8 ms |
| **ONNX fp32** (`onnx/model.onnx`) | 0.9522 | 0.0000 | 0.9595 | 0.9451 | 5.63 GB | 1.0× | **38.4 ms** | 53.0 ms |
| **ONNX fp16** (`onnx/model_fp16.onnx`) | 0.9517 | -0.0005 | 0.9584 | 0.9451 | 2.82 GB | 2.0× | 39.3 ms | 54.6 ms |
| **ONNX INT8** (`onnx/model_int8.onnx`) | 0.9516 | -0.0006 | 0.9605 | 0.9429 | **1.60 GB** | **3.5×** | 52.7 ms | 68.0 ms |
| **ONNX INT4** (`onnx/model_int4.onnx`) | 0.9509 | -0.0013 | 0.9573 | 0.9446 | **1.35 GB** | **4.2×** | 344.8 ms ⚠ | 428.0 ms |
*Benchmark setup: 1 035 FR test examples, batch size 1, single-threaded ONNX Runtime
(CPU provider) on an AMD Ryzen-class CPU under WSL2. Compression is relative to the
ONNX fp32 export (5.63 GB). Peak-memory readings via `ru_maxrss` sit around ~9 GB for
every variant because ORT memory-maps the external-data files instead of loading them
fully, so `ru_maxrss` is not a reliable measure of the true resident set for the
mmapped weights.*
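The latency figures can be reproduced with a simple single-threaded timing loop. A sketch under the assumptions already used in this card (the `logits` output name, the fine-tuned tokenizer, and `np.percentile` as the p50/p95 estimator):

```python
import time
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

opts = ort.SessionOptions()
opts.intra_op_num_threads = 1  # single-threaded, as in the table above
sess = ort.InferenceSession("onnx/model_int8.onnx", opts, providers=["CPUExecutionProvider"])
tok = AutoTokenizer.from_pretrained("YLOD/openai-privacy-filter-fr")

def latency_ms(texts):
    """Return (p50, p95) per-example latency in milliseconds, batch size 1."""
    times = []
    for text in texts:
        enc = tok(text, return_tensors="np")
        feed = {"input_ids": enc["input_ids"].astype(np.int64),
                "attention_mask": enc["attention_mask"].astype(np.int64)}
        t0 = time.perf_counter()
        sess.run(["logits"], feed)
        times.append((time.perf_counter() - t0) * 1e3)
    return np.percentile(times, 50), np.percentile(times, 95)
```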
### Key findings
- **ONNX fp32 is 2.6× faster than PyTorch** at identical precision and F1 (graph
optimisation, fused ops, no Python-side MoE loop).
- **INT8 is the practical sweet spot on CPU**: 3.5× smaller than the fp32 ONNX export
  (1.60 GB vs 5.63 GB) and well below the 2.80 GB PyTorch checkpoint, with F1 unchanged
  within the training noise floor (±0.003), and still ~2× faster than PyTorch.
- **INT4 gives the smallest footprint** (1.35 GB, 4.2× compression) with a
negligible F1 loss, but the CPU `QMoE` kernel for int4 is not as optimized as
its int8 counterpart — **expect ~9× slowdown on CPU**. INT4 is best suited for
GPU inference or specialized runtimes (CUDA, OpenVINO, WebGPU via Transformers.js)
where the int4 dequant path is kernel-fused.
- All four variants are within the training noise floor (±0.003) on overall F1,
so pick based on the target runtime and memory budget.
### GPU (CUDA) benchmark (RTX 4090, ONNX Runtime 1.25 CUDA EP)
| Variant | F1 | Size | CUDA latency p50 | CUDA latency p95 | Notes |
|---|---:|---:|---:|---:|---|
| **ONNX fp32** | 0.9522 | 5.63 GB | **5.0 ms** | 21.5 ms | MoE CUDA kernel (FasterTransformer) — fastest |
| ONNX fp16 | — | 2.82 GB | fail | — | MoE FT kernel templated for SM80, fails on Ada (SM89) in ORT 1.25 |
| ONNX INT8 | 0.9508 | 1.60 GB | 68.7 ms | 89.4 ms | QMoE CUDA int8 kernel currently unoptimized |
| ONNX INT4 | 0.9506 | 1.35 GB | 347.1 ms | 426.2 ms | Same — QMoE CUDA int4 path unoptimized in 1.25 |
**GPU takeaway:** the ONNX **fp32** graph benefits from a highly optimized
FasterTransformer-based MoE kernel and reaches **5 ms / example (~200 ex/s)**,
a 7.7× speed-up over the CPU provider. The quantized (`QMoE`) CUDA kernels exist and
run correctly but are currently much slower than the fp32 kernel, so the quantized
variants are **not currently recommended for latency-critical GPU inference**.
Their value on GPU is memory footprint (1.3 – 1.6 GB of VRAM) rather than speed.
Future ORT releases, or TensorRT-LLM / custom kernels, should close this gap.
The fp16 failure on Ada (RTX 4090, SM89) stems from the bundled CUTLASS MoE
GEMM being templated against SM80: a shared-memory check rejects the kernel
at launch. Rebuilding ORT with SM89 kernels, or running on A100/A10/H100,
should restore fp16 MoE support.
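To reproduce the CUDA numbers, the same fp32 graph can be loaded with the CUDA execution provider, with CPU as a fallback. A minimal sketch:

```python
import onnxruntime as ort

sess = ort.InferenceSession(
    "onnx/model.onnx",  # fp32 graph: the fastest variant on GPU per the table above
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # confirm CUDAExecutionProvider is actually active
```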
### Quantization details
- **fp16**: whole-graph float16 cast (`onnxconverter_common.float16.convert_float_to_float16`).
- **INT8** = `quantize_dynamic` (per-channel int8) on regular `MatMul` / `Gemm` nodes
**+** block-symmetric int8 QMoE on the expert tensors (`block_size=32`).
- **INT4** = `MatMulNBitsQuantizer` (4-bit weight-only) on regular `MatMul` / `Gemm`
nodes **+** block-symmetric int4 QMoE on the expert tensors (`block_size=32`,
symmetric, default zero-point 2^(bits-1)).
- Quantization script for the MoE part is included in the training repo as
`training/quantize_moe.py` — the stock ORT quantizers don't crack open the
custom `com.microsoft.MoE` op, so we manually block-quantize `gate_up_proj` and
`down_proj` per expert and rewrite the node to `com.microsoft.QMoE`.
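The stock ORT part of the pipeline looks roughly like the sketch below. Treat it as an illustration only: exact module paths and keyword arguments vary across ONNX Runtime releases, and the custom MoE → QMoE rewrite from `training/quantize_moe.py` is not reproduced here.

```python
import onnx
from onnxconverter_common import float16
from onnxruntime.quantization import QuantType, quantize_dynamic
from onnxruntime.quantization.matmul_nbits_quantizer import MatMulNBitsQuantizer

# fp16: whole-graph float16 cast of the exported fp32 model
fp32 = onnx.load("onnx/model.onnx")
onnx.save(float16.convert_float_to_float16(fp32), "onnx/model_fp16.onnx",
          save_as_external_data=True)

# INT8: dynamic per-channel quantization of regular MatMul / Gemm nodes
quantize_dynamic("onnx/model.onnx", "onnx/model_int8_matmul.onnx",
                 per_channel=True, weight_type=QuantType.QInt8,
                 use_external_data_format=True)

# INT4: 4-bit weight-only quantization of regular MatMul nodes
q4 = MatMulNBitsQuantizer(onnx.load("onnx/model.onnx"), block_size=32, is_symmetric=True)
q4.process()
q4.model.save_model_to_file("onnx/model_int4_matmul.onnx", use_external_data_format=True)

# The MoE experts are handled separately (block-symmetric MoE -> QMoE conversion);
# see training/quantize_moe.py in the training repo.
```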
### How to use the ONNX variants
```python
import numpy as np
import onnxruntime as ort
from transformers import AutoConfig, AutoTokenizer

tok = AutoTokenizer.from_pretrained("YLOD/openai-privacy-filter-fr")
id2label = AutoConfig.from_pretrained("YLOD/openai-privacy-filter-fr").id2label  # full 33-class map

# Download one variant first, e.g. fp16 (half the fp32 export size, negligible quality loss):
# huggingface-cli download YLOD/openai-privacy-filter-fr onnx/model_fp16.onnx onnx/model_fp16.onnx_data
sess = ort.InferenceSession(
    "onnx/model_fp16.onnx",
    providers=["CPUExecutionProvider"],  # or ["CUDAExecutionProvider"] on GPU
)

text = "Alice Dupont, IBAN FR76 3000 4000 0312 3456 7890 143, née le 1987-05-12."
enc = tok(text, return_tensors="np")
logits = sess.run(
    ["logits"],
    {"input_ids": enc["input_ids"].astype(np.int64),
     "attention_mask": enc["attention_mask"].astype(np.int64)},
)[0]
pred_ids = logits[0].argmax(-1)
# → merge adjacent BIOES subword tags into spans as per the base model card
```
For a ready-to-use span merger compatible with the model's subword-level BIOES outputs,
see the training repo that produced this model.
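As a lighter-weight alternative, here is a minimal span-merging sketch over the raw per-subword predictions. It is an illustration, not the training repo's helper, and it assumes the `text`, `tok`, `pred_ids`, and `id2label` variables from the ONNX snippet above plus a fast tokenizer (for `return_offsets_mapping`):

```python
def merge_bioes_spans(text, tok, pred_ids, id2label):
    """Greedy merge of subword-level BIOES tags into (label, start, end, text) spans."""
    offsets = tok(text, return_offsets_mapping=True)["offset_mapping"]
    spans, current = [], None  # current = [label, char_start, char_end]
    for (start, end), pred in zip(offsets, pred_ids):
        if start == end:               # special tokens have empty offsets
            continue
        tag = id2label[int(pred)]
        if tag == "O":
            current = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix in ("B", "S") or current is None or current[0] != label:
            current = [label, start, end]
            spans.append(current)
        else:                          # I- / E- continuing the open span
            current[2] = end
    return [(label, s, e, text[s:e]) for label, s, e in spans]

print(merge_bioes_spans(text, tok, pred_ids, id2label))
```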
## License
Apache-2.0, inherited from [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter). The training data ([`ai4privacy/open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy)) is CC-BY-4.0 — attribution preserved above.
## Citation
If you use this model, please cite the underlying base model and dataset:
```bibtex
@misc{openai2026privacyfilter,
title = {OpenAI Privacy Filter},
author = {OpenAI},
year = {2026},
howpublished = {\url{https://huggingface.co/openai/privacy-filter}},
}
@dataset{ai4privacy_open_pii_500k,
title = {Open PII Masking 500k (ai4privacy)},
author = {ai4privacy},
howpublished = {\url{https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy}},
license = {CC-BY-4.0},
}
```