| --- |
| license: apache-2.0 |
| base_model: openai/privacy-filter |
| language: |
| - fr |
| library_name: transformers |
| pipeline_tag: token-classification |
| tags: |
| - pii |
| - privacy |
| - token-classification |
| - ner |
| - bioes |
| - french |
| - insurance |
| - crm |
| datasets: |
| - ai4privacy/open-pii-masking-500k-ai4privacy |
| metrics: |
| - f1 |
| - precision |
| - recall |
| widget: |
| - text: "Bonjour, je m'appelle Alice Dupont et mon email est alice@acme.fr" |
| - text: "Mon IBAN est FR76 3000 4000 0312 3456 7890 143 et mon téléphone le 06 12 34 56 78." |
| - text: "Le sinistre N°2024-FR-98341 concerne M. Jean-Baptiste Leclerc, né le 1987-05-12, au 15 rue de Rivoli 75001 Paris." |
model-index:
- name: openai-privacy-filter-fr
  results:
  - task:
      type: token-classification
      name: PII span detection (French)
    dataset:
      name: ai4privacy/open-pii-masking-500k-ai4privacy (French slice, held-out test)
      type: ai4privacy/open-pii-masking-500k-ai4privacy
      config: fr
      split: test
    metrics:
    - type: f1
      value: 0.9522
      name: Overall span-F1 (BIOES strict)
    - type: precision
      value: 0.9600
      name: Overall precision
    - type: recall
      value: 0.9446
      name: Overall recall
| --- |
| |
| # openai-privacy-filter-fr |
|
|
| French fine-tune of [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) for PII detection in a **French insurance CRM** context (policyholder names, contact details, IBAN/RIB, addresses, dates of birth, claim / contract identifiers). |
|
|
| Full fine-tuning with AdamW 8-bit (bitsandbytes), bf16 autocast, on a single RTX 4090. |
|
|
| --- |
|
|
| ## Results (held-out French test set, 1 035 examples) |
|
|
| | Metric | Zero-shot baseline (`openai/privacy-filter`) | This model | Δ | |
| |---|---:|---:|---:| |
| **Overall span-F1 (BIOES strict)** | 0.7068 | **0.9522** | **+0.2454 (+34.7 %)** |
| | Precision | 0.8037 | 0.9600 | +0.1563 | |
| | Recall | 0.6308 | 0.9446 | +0.3138 | |
|
|
### Per-class F1 (span-level, strict BIOES)
|
|
| | Class | Baseline | This model | Δ | |
| |---|---:|---:|---:| |
| | `private_email` | 0.960 | **1.000** | +0.040 | |
| | `private_phone` | 0.870 | **1.000** | +0.130 | |
| | `private_date` | 0.652 | **0.997** | +0.345 | |
| | `account_number` | 0.874 | **0.995** | +0.121 | |
| | `private_person` | 0.683 | 0.931 | +0.248 | |
| | `private_address` | 0.428 | 0.906 | **+0.478** | |
|
|
| *(`private_url` and `secret` classes are preserved from the base model but not present in this test set, so not reported.)* |
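
As a minimal sketch of how such strict span-level scores can be computed with `seqeval` (whose name for the B/I/E/S scheme is IOBES), using toy tag sequences rather than the actual test-set outputs:

```python
from seqeval.metrics import classification_report, f1_score
from seqeval.scheme import IOBES

# Toy per-token BIOES tag sequences (one inner list per sentence); illustrative only.
y_true = [["B-private_person", "E-private_person", "O", "S-private_email"]]
y_pred = [["B-private_person", "E-private_person", "O", "O"]]

# mode="strict" only credits spans whose boundaries and type match exactly.
print(f1_score(y_true, y_pred, mode="strict", scheme=IOBES))
print(classification_report(y_true, y_pred, mode="strict", scheme=IOBES))
```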
|
|
| --- |
|
|
| ## Intended use |
|
|
| Designed for **French-language on-premises PII redaction** in enterprise flows: emails, chat logs, CRM notes, claim reports, scanned document transcripts. Primary target: insurance back-office (souscripteurs, sinistres), but the label set is generic enough for banking, healthcare admin, HR, and customer support. |
|
|
| **Not suitable for:** |
| - Languages other than French (use the base model or retrain for your target language). |
| - Content with no training-time analogue (e.g. medical free-text, legal case citations). |
- Providing a final anonymisation guarantee on its own: always combine with rule-based recognisers (e.g. Presidio) and human review for high-sensitivity workflows.
|
|
| --- |
|
|
| ## Label schema |
|
|
| Same as the base model: 33 classes = `O` + 8 entity types × 4 BIOES boundary tags. |
|
|
| | Entity type | Covers | |
| |---|---| |
| | `private_person` | Policyholder names, usernames, titles (M., Mme., Dr.). | |
| | `private_email` | Personal email addresses. | |
| | `private_phone` | Phone numbers (mobile / landline / fax). | |
| | `private_address` | Street, building number, city, ZIP, state/country. | |
| | `account_number` | IBAN/RIB, credit card, BIC/SWIFT, customer/contract IDs, ID card, passport, tax and social numbers. | |
| | `private_date` | DOB, birth year, date/time references tied to a person. | |
| | `private_url` | Personal URLs / IP addresses. *(preserved from base model; not retrained)* | |
| | `secret` | API keys, passwords, tokens. *(preserved from base model; not retrained)* | |
|
|
| Inference returns subword-level BIOES tags that the HuggingFace `token-classification` pipeline aggregates into spans. |
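
As a rough sketch of how that 33-label space is laid out (the exact label strings are defined in this repository's `config.json`; the `PREFIX-type` naming below is an assumption for illustration):

```python
ENTITY_TYPES = [
    "private_person", "private_email", "private_phone", "private_address",
    "account_number", "private_date", "private_url", "secret",
]
# "O" plus one B/I/E/S boundary tag per entity type: 1 + 8 * 4 = 33 labels.
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in "BIES"]
assert len(LABELS) == 33
```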
|
|
| --- |
|
|
| ## How to use |
|
|
| ```python |
| from transformers import pipeline |
| |
| nlp = pipeline( |
| task="token-classification", |
| model="YLOD/openai-privacy-filter-fr", |
| aggregation_strategy="simple", |
| ) |
| |
| text = ( |
| "Bonjour, je suis Alice Dupont, née le 1987-05-12. " |
| "Mon email : alice.dupont@acme.fr, mobile 06 12 34 56 78. " |
| "IBAN : FR76 3000 4000 0312 3456 7890 143." |
| ) |
| for span in nlp(text): |
    print(f"[{span['entity_group']:>16}] {span['word']!r} ({span['score']:.3f})")
| ``` |
|
|
| For a ready-to-use masker that merges adjacent subword spans correctly, see the [demo script in the GitHub repo](https://github.com/autoresearch-demo/privacy-filter-fr) (if published) or reuse the `merge_spans` helper from the training code. |
|
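If neither is available, the snippet below is a minimal sketch of such a masker (the `merge_spans` / `mask_pii` helpers here are illustrative stand-ins, not the training-repo implementation): it merges adjacent or overlapping spans of the same entity group, then masks right-to-left so earlier character offsets stay valid.

```python
from transformers import pipeline

nlp = pipeline(
    "token-classification",
    model="YLOD/openai-privacy-filter-fr",
    aggregation_strategy="simple",
)

def merge_spans(spans, max_gap=1):
    """Merge adjacent/overlapping spans of the same entity group (illustrative helper)."""
    merged = []
    for s in sorted(spans, key=lambda x: x["start"]):
        if (merged
                and s["entity_group"] == merged[-1]["entity_group"]
                and s["start"] - merged[-1]["end"] <= max_gap):
            merged[-1]["end"] = max(merged[-1]["end"], s["end"])
        else:
            merged.append({k: s[k] for k in ("entity_group", "start", "end")})
    return merged

def mask_pii(text):
    """Replace each merged span with an [ENTITY_GROUP] placeholder, right-to-left."""
    for span in reversed(merge_spans(nlp(text))):
        placeholder = f"[{span['entity_group'].upper()}]"
        text = text[:span["start"]] + placeholder + text[span["end"]:]
    return text

print(mask_pii("Alice Dupont, IBAN FR76 3000 4000 0312 3456 7890 143, alice@acme.fr"))
```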
|
| --- |
|
|
| ## Training details |
|
|
| ### Data |
|
|
| - **Source**: [`ai4privacy/open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy) (CC-BY-4.0) |
| - **Language filter**: `language == "fr"` → 89 670 examples available; **10 005 / 460 / 1 035** train / validation / test (seed 42) |
| - **Label mapping**: 60+ source classes collapsed into the 8-class privacy-filter taxonomy (`FIRSTNAME / LASTNAME / GIVENNAME / SURNAME / TITLE → private_person`, `TELEPHONENUM / PHONEIMEI → private_phone`, `BUILDINGNUM / CITY / ZIPCODE / STREET → private_address`, `IBAN / IDCARDNUM / PASSPORTNUM / TAXNUM / SOCIALNUM → account_number`, etc.). |
- **Alignment**: char-offset spans aligned to subword tokens with strict BIOES at the subword level (first / middle / last subwords of a span get `B-` / `I-` / `E-`, singletons get `S-`). Whitespace-only subwords inside a span inherit `I-` to bridge IBAN-like groups; a minimal sketch of this step follows below.
|
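A minimal sketch of the alignment step above (the `bioes_tags` helper and the example offsets are illustrative; the actual implementation lives in the training code):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("YLOD/openai-privacy-filter-fr")

def bioes_tags(text, spans):
    """spans: list of (char_start, char_end, entity_type). Returns one tag per subword."""
    enc = tok(text, return_offsets_mapping=True, add_special_tokens=False)
    tags = ["O"] * len(enc["offset_mapping"])
    for start, end, etype in spans:
        # Indices of the subwords that overlap the character span.
        idx = [i for i, (s, e) in enumerate(enc["offset_mapping"]) if s < end and e > start]
        if not idx:
            continue
        if len(idx) == 1:
            tags[idx[0]] = f"S-{etype}"          # singleton span
        else:
            tags[idx[0]] = f"B-{etype}"          # first subword
            tags[idx[-1]] = f"E-{etype}"         # last subword
            for i in idx[1:-1]:
                tags[i] = f"I-{etype}"           # middle subwords, incl. whitespace pieces
    return tags

print(bioes_tags("Mon IBAN est FR76 3000 4000 0312", [(13, 32, "account_number")]))
```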
|
### Hyperparameters (best configuration from the 8-iteration autoresearch sweep)
|
|
| | | Value | |
| |---|---| |
| | Base checkpoint | `openai/privacy-filter` (1.4 B params total, 50 M active — MoE) | |
| | Strategy | **Full fine-tuning** (all 1.4 B params trainable) | |
| | Optimizer | **AdamW 8-bit** (bitsandbytes) | |
| | Learning rate | **2 × 10⁻⁴** | |
| | Batch size | 16 × grad-accum 2 = effective 32 | |
| | Epochs | 2 | |
| | Warmup ratio | 0.03 | |
| | Weight decay | 0.01 | |
| | Max grad norm | 1.0 | |
| | Scheduler | cosine | |
| | Precision | bf16 autocast (fp32 master weights) | |
| | Gradient checkpointing | disabled (short sequences, ~30 tokens median) | |
| | Seed | 0 | |
|
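The table above corresponds roughly to the following `TrainingArguments` (a sketch for orientation; the actual training script lives in the training repo and may differ in details):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="privacy-filter-fr",
    learning_rate=2e-4,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,   # effective batch size 32
    num_train_epochs=2,
    warmup_ratio=0.03,
    weight_decay=0.01,
    max_grad_norm=1.0,
    lr_scheduler_type="cosine",
    bf16=True,                       # bf16 autocast, fp32 master weights
    optim="adamw_bnb_8bit",          # AdamW 8-bit via bitsandbytes
    gradient_checkpointing=False,
    seed=0,
)
```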
|
| ### Hardware |
|
|
| - Single **NVIDIA RTX 4090 (24 GB)** capped at **225 W** |
| - WSL2 on Windows, PyTorch 2.11 + CUDA 13.0, transformers 5.6 |
| - ~25 minutes wall-clock per training run |
|
|
| ### Noise floor |
|
|
| Seed-to-seed variance on the same config ≈ **±0.003 F1** (measured with seeds 0 and 1). Gains smaller than that are not meaningful. |
|
|
| ### Autoresearch iteration summary |
|
|
| | # | Change | Test F1 | Δ vs baseline | Outcome | |
| |---|---|---:|---:|---| |
| | 0 | — (zero-shot baseline) | 0.7068 | — | baseline | |
| | 1 | LR 1e-5 → 3e-4 | 0.9473 | +0.2405 | keep | |
| | **2** | **LR 3e-4 → 2e-4** | **0.9522** | **+0.2454** | **keep (best)** | |
| | 3 | LR 2e-4 → 1e-4 | 0.9356 | +0.2288 | discard | |
| | 4 | Epochs 2 → 3 | 0.9500 | +0.2432 | discard | |
| | 5 | Warmup 0.03 → 0.10 | 0.9382 | +0.2314 | discard | |
| | 6 | Weight decay 0.01 → 0.1 | 0.9492 | +0.2424 | discard | |
| | 7 | Seed 0 → 1 (noise check) | 0.9491 | +0.2423 | discard | |
|
|
| --- |
|
|
| ## Limitations & ethical considerations |
|
|
| - **No privacy guarantee.** ML-based PII detection can miss uncommon formats, aliased references, adversarial spacing, or novel identifier types. This model should always be paired with regex-based recognisers and human review for high-sensitivity outputs. |
| - **French-only distribution shift.** Trained on French data only; performance on other languages will regress sharply from the base model baseline. |
| - **Synthetic data bias.** ai4privacy is largely template-generated. Real-world free-text (handwritten claim descriptions, casual customer emails) may be underrepresented. A domain-specific holdout from your actual CRM is essential before production deployment. |
- **`private_address` is the weakest class (F1 0.91).** Ambiguous short addresses (single street name, abbreviations, PO boxes) are the main failure mode.
| - **`private_url` and `secret` were not retrained.** Their behaviour is inherited from the base model; if these matter in your domain, run a follow-up fine-tune that includes them. |
| - **Label collisions.** When a token plausibly belongs to two classes (e.g. a phone number embedded in an address block), the model picks one; span splitting is not guaranteed to follow human intuition. |
| - **Not suitable for medical, legal or regulated decision-making** without explicit compliance review. |
| |
| ## ONNX quantized variants |
| |
| In addition to the default PyTorch `model.safetensors`, this repository ships four |
| ONNX variants under `onnx/`, benchmarked on the same French test set (1 035 examples). |
| The ONNX graph reuses OpenAI's base-model export (which correctly handles the MoE |
| routing and attention sinks) with the fine-tuned weights swapped in. INT8 and INT4 |
| variants combine standard `quantize_dynamic` / `MatMulNBitsQuantizer` on the MatMul |
| nodes **with** a custom `MoE → QMoE` conversion of the expert tensors (block-symmetric, |
| block_size=32), so the MoE experts — which hold ~90 % of the parameters — are also |
| quantized. |
| |
| Variant | F1 | Δ F1 | Precision | Recall | File size | Compression (vs fp32 ONNX) | CPU latency p50 | CPU latency p95 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| **PyTorch (transformers)** | 0.9522 | — | 0.9595 | 0.9451 | 2.80 GB | 2.0× | 99.5 ms | 144.8 ms |
| **ONNX fp32** (`onnx/model.onnx`) | 0.9522 | 0.0000 | 0.9595 | 0.9451 | 5.63 GB | 1.0× | **38.4 ms** | 53.0 ms |
| **ONNX fp16** (`onnx/model_fp16.onnx`) | 0.9517 | -0.0005 | 0.9584 | 0.9451 | 2.82 GB | 2.0× | 39.3 ms | 54.6 ms |
| **ONNX INT8** (`onnx/model_int8.onnx`) | 0.9516 | -0.0006 | 0.9605 | 0.9429 | **1.60 GB** | **3.5×** | 52.7 ms | 68.0 ms |
| **ONNX INT4** (`onnx/model_int4.onnx`) | 0.9509 | -0.0013 | 0.9573 | 0.9446 | **1.35 GB** | **4.2×** | 344.8 ms ⚠ | 428.0 ms |
|
|
*Benchmark setup: 1 035 FR test examples, batch size 1, single-threaded ONNX Runtime
(CPU provider) on an AMD Ryzen-class CPU under WSL2. Compression is relative to the
fp32 ONNX graph (5.63 GB). Memory readings via `ru_maxrss` were ~9 GB for every
variant and are not meaningful here: ORT memory-maps the external-data files instead
of loading them fully, so RSS does not reflect the weights actually resident per variant.*
|
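For reference, a minimal sketch of a comparable single-threaded CPU latency measurement (the model path and the `examples` list are placeholders; this is not the exact benchmark script behind the table):

```python
import time

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("YLOD/openai-privacy-filter-fr")

opts = ort.SessionOptions()
opts.intra_op_num_threads = 1   # single-threaded CPU run, as in the table
opts.inter_op_num_threads = 1
sess = ort.InferenceSession("onnx/model_int8.onnx", opts,
                            providers=["CPUExecutionProvider"])

examples = ["Bonjour, je m'appelle Alice Dupont et mon email est alice@acme.fr"]  # placeholder
latencies_ms = []
for text in examples:
    enc = tok(text, return_tensors="np")
    feeds = {"input_ids": enc["input_ids"].astype(np.int64),
             "attention_mask": enc["attention_mask"].astype(np.int64)}
    t0 = time.perf_counter()
    sess.run(None, feeds)       # batch size 1
    latencies_ms.append((time.perf_counter() - t0) * 1000)

print(f"p50 {np.percentile(latencies_ms, 50):.1f} ms / p95 {np.percentile(latencies_ms, 95):.1f} ms")
```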
|
| ### Key findings |
|
|
| - **ONNX fp32 is 2.6× faster than PyTorch** at identical precision and F1 (graph |
| optimisation, fused ops, no Python-side MoE loop). |
- **INT8 is the practical sweet spot on CPU**: 3.5× smaller than the fp32 ONNX graph
(1.60 GB vs 5.63 GB) and well below the 2.80 GB PyTorch checkpoint, F1 unchanged
within the training noise floor (±0.003), and still ~2× faster than PyTorch.
| - **INT4 gives the smallest footprint** (1.35 GB, 4.2× compression) with a |
| negligible F1 loss, but the CPU `QMoE` kernel for int4 is not as optimized as |
its int8 counterpart, so **expect roughly a 9× slowdown vs the fp32 graph on CPU**. INT4 is best suited for
| GPU inference or specialized runtimes (CUDA, OpenVINO, WebGPU via Transformers.js) |
| where the int4 dequant path is kernel-fused. |
| - All four variants are within the training noise floor (±0.003) on overall F1, |
| so pick based on the target runtime and memory budget. |
|
|
| ### GPU (CUDA) benchmark (RTX 4090, ONNX Runtime 1.25 CUDA EP) |
|
|
| | Variant | F1 | Size | CUDA latency p50 | CUDA latency p95 | Notes | |
| |---|---:|---:|---:|---:|---| |
| | **ONNX fp32** | 0.9522 | 5.63 GB | **5.0 ms** | 21.5 ms | MoE CUDA kernel (FasterTransformer) — fastest | |
| | ONNX fp16 | — | 2.82 GB | fail | — | MoE FT kernel templated for SM80, fails on Ada (SM89) in ORT 1.25 | |
| | ONNX INT8 | 0.9508 | 1.60 GB | 68.7 ms | 89.4 ms | QMoE CUDA int8 kernel currently unoptimized | |
| | ONNX INT4 | 0.9506 | 1.35 GB | 347.1 ms | 426.2 ms | Same — QMoE CUDA int4 path unoptimized in 1.25 | |
|
|
| **GPU takeaway:** the ONNX **fp32** graph benefits from a highly optimized |
| FasterTransformer-based MoE kernel and reaches **5 ms / example (~200 ex/s)** — |
| a 7.7× speed-up over CPU. The quantized (`QMoE`) CUDA kernels exist and run |
| correctly but are currently much slower than the fp32 kernel, so the quantized |
| variants are **not currently recommended for latency-critical GPU inference**. |
| Their value on GPU is memory footprint (1.3 – 1.6 GB of VRAM) rather than speed. |
| Future ORT releases, or using TensorRT-LLM / custom kernels, should close this gap. |
|
|
| The fp16 failure on Ada (RTX 4090, SM89) stems from the bundled CUTLASS MoE |
| GEMM being templated against SM80 — a shared-memory check rejects the kernel |
| at launch. An ORT build rebuilt with SM89 kernels, or running on A100/A10/H100, |
| should restore fp16 MoE support. |
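
To check up front which architecture you are on before choosing a variant (the caveat above concerns Ada / SM89), a quick probe assuming PyTorch with CUDA is installed:

```python
import torch

# (8, 0)=A100, (8, 6)=A10, (8, 9)=Ada/RTX 4090, (9, 0)=H100
major, minor = torch.cuda.get_device_capability()
print(f"SM{major}{minor}")
```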
|
|
| ### Quantization details |
|
|
| - **fp16**: whole-graph float16 cast (`onnxconverter_common.float16.convert_float_to_float16`). |
| - **INT8** = `quantize_dynamic` (per-channel int8) on regular `MatMul` / `Gemm` nodes |
| **+** block-symmetric int8 QMoE on the expert tensors (`block_size=32`). |
| - **INT4** = `MatMulNBitsQuantizer` (4-bit weight-only) on regular `MatMul` / `Gemm` |
| nodes **+** block-symmetric int4 QMoE on the expert tensors (`block_size=32`, |
| symmetric, default zero-point 2^(bits-1)). |
| - Quantization script for the MoE part is included in the training repo as |
| `training/quantize_moe.py` — the stock ORT quantizers don't crack open the |
| custom `com.microsoft.MoE` op, so we manually block-quantize `gate_up_proj` and |
`down_proj` per expert and rewrite the node to `com.microsoft.QMoE`. The standard-node passes are sketched below.
|
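For the standard (non-MoE) nodes, those passes map roughly onto the stock ONNX tooling sketched below (the MoE → QMoE rewrite itself exists only in `training/quantize_moe.py`, and the INT4 `MatMulNBitsQuantizer` import path varies across ONNX Runtime releases, so it is only referenced in a comment here):

```python
import onnx
from onnxconverter_common import float16
from onnxruntime.quantization import QuantType, quantize_dynamic

# fp16: whole-graph float16 cast (graphs over 2 GB also need external-data saving, omitted here).
model = onnx.load("onnx/model.onnx")
onnx.save(float16.convert_float_to_float16(model), "onnx/model_fp16.onnx")

# INT8: dynamic, per-channel weight quantization of the regular MatMul / Gemm nodes.
quantize_dynamic(
    "onnx/model.onnx",
    "onnx/model_int8.onnx",
    per_channel=True,
    weight_type=QuantType.QInt8,
)

# INT4: MatMulNBitsQuantizer (4-bit weight-only) on the same nodes; the expert tensors
# inside com.microsoft.MoE are block-quantized separately by training/quantize_moe.py.
```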
|
| ### How to use the ONNX variants |
|
|
| ```python |
| import numpy as np |
| import onnxruntime as ort |
from transformers import AutoConfig, AutoTokenizer
| |
| tok = AutoTokenizer.from_pretrained("YLOD/openai-privacy-filter-fr") |
# Download one variant (fp16 shown here; INT8 is smaller still and the CPU sweet spot, see above):
| # huggingface-cli download YLOD/openai-privacy-filter-fr onnx/model_fp16.onnx onnx/model_fp16.onnx_data |
| |
| sess = ort.InferenceSession( |
| "onnx/model_fp16.onnx", |
| providers=["CPUExecutionProvider"], # or ["CUDAExecutionProvider"] on GPU |
| ) |
| |
| text = "Alice Dupont, IBAN FR76 3000 4000 0312 3456 7890 143, née le 1987-05-12." |
| enc = tok(text, return_tensors="np") |
| logits = sess.run( |
| ["logits"], |
| {"input_ids": enc["input_ids"].astype(np.int64), |
| "attention_mask": enc["attention_mask"].astype(np.int64)}, |
| )[0] |
| |
pred_ids = logits[0].argmax(-1)
# The full 33-class BIOES map lives in the model config (config.json).
id2label = AutoConfig.from_pretrained("YLOD/openai-privacy-filter-fr").id2label
tags = [id2label[int(i)] for i in pred_ids]
# → merge adjacent BIOES subword tags into spans as per the base model card
| ``` |
|
|
| For a ready-to-use span merger compatible with the model's subword-level BIOES outputs, |
| see the training repo that produced this model. |
|
|
| ## License |
|
|
| Apache-2.0, inherited from [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter). The training data ([`ai4privacy/open-pii-masking-500k-ai4privacy`](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy)) is CC-BY-4.0 — attribution preserved above. |
|
|
| ## Citation |
|
|
| If you use this model, please cite the underlying base model and dataset: |
|
|
| ```bibtex |
| @misc{openai2026privacyfilter, |
| title = {OpenAI Privacy Filter}, |
| author = {OpenAI}, |
| year = {2026}, |
| howpublished = {\url{https://huggingface.co/openai/privacy-filter}}, |
| } |
| |
| @dataset{ai4privacy_open_pii_500k, |
| title = {Open PII Masking 500k (ai4privacy)}, |
| author = {ai4privacy}, |
| howpublished = {\url{https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy}}, |
| license = {CC-BY-4.0}, |
| } |
| ``` |
|
|