---
license: apache-2.0
base_model: openai/privacy-filter
tags:
- token-classification
- pii-detection
- secrets-detection
- dlp
- india
- dpdp
language:
- en
- hi
metrics:
- f1
- precision
- recall
---

# privacy-filter-india-v5

Fine-tune of [openai/privacy-filter](https://huggingface.co/openai/privacy-filter) for **DPDP-grade Indian PII + secrets detection**. Adds 10 distinct categories on top of the base 8 so DLP audit reports can distinguish Aadhaar from PAN from GSTIN, etc.

## Label space: 18 categories (base 8 + 10 Indian-specific)

| Inherited from base | New in v5 |
|---|---|
| account_number, private_address, private_date, private_email, private_person, private_phone, private_url, secret | aadhaar, pan, gstin, ifsc, upi_id, abha_id, voter_id, indian_passport, credit_card, bank_account_in |

## Eval (431-case Janus DLP corpus)

| Metric | Base openai/privacy-filter | This model (v5) |
|---|---|---|
| Overall F1 | 0.8557 | **0.9589** |
| Precision | 0.875 | 0.970 |
| Recall | 0.838 | 0.948 |
| p50 latency (T4) | 26 ms | 27 ms |
| FP rate (negatives) | 30% | 7% |

### 11 categories at perfect 1.000 F1

gstin, passport_in, voter_id, address_in, multi_pii, abha_id, iban, ssn, private_keys, passwords, secrets_in_code

### Largest improvements over base

| Category | Base F1 | v5 F1 |
|---|---|---|
| passwords | 0.000 | **1.000** |
| abha_id | 0.000 | **1.000** |
| ifsc | 0.545 | 0.966 |
| credit_card | 0.643 | 0.933 |
| aws_keys | 0.667 | 0.966 |
| email | 0.769 | 0.938 |
| address_in | 0.815 | 1.000 |
| upi_id | 0.800 | 0.933 |
| gstin | 0.875 | 1.000 |

## Training

- 5-round iterative fine-tune (v1 → v5)
- 1797 training examples (1122 multi-entity Indian PII + 545 disambiguation/regression-protection + 130 targeted bank/password/AWS/UPI fixes)
- 198 validation examples (capped to 50 in training for memory)
- 3 epochs each round, batch 2 + grad-accum 4, lr 1e-4, bf16 weights
- T4 GPU, ~80 sec per round
- Final epoch best: val_loss 0.019, val_token_accuracy 99.39%

## Usage

```python
from opf._api import OPF

opf = OPF(model="Vasanth155/privacy-filter-india-v5", device="cuda")

text = "customer Rohan Sharma, Aadhaar 234567890123, mobile 9876543210, PAN ABCDE1234F"
result = opf.redact(text)
print(result.redacted_text)
# customer , Aadhaar , mobile , PAN
```

Drop-in for the base `openai/privacy-filter` model: same architecture, same MoE routing, same opf API.

## Limitations

- Trained primarily on English + Hinglish-context Indian PII. Non-Indian addresses fall back to base coverage.
- 16 false negatives across 431 cases (3.7% miss rate). Most are rare formats (sk-ant-example, AWS IOSFODNN canonical sample, edge UPI PSPs).
- DPDP-grade label distinction works for the 10 added categories. CVV, ABHA-Luhn validation, PCI-DSS-3.4 redaction policy live in the gateway layer, not the model.

## License

Apache 2.0, inherited from `openai/privacy-filter`.
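
## Appendix: gateway-side checksum sketch

The Limitations section notes that checksum validation lives in the gateway layer rather than in the model. As a rough illustration of what such a post-detection check might look like, the sketch below implements the standard Luhn checksum for card numbers. The function name `luhn_valid` is hypothetical, and applying it to spans the model labels `credit_card` is an assumption; the actual gateway implementation is not described in this card.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Non-digit characters (spaces, hyphens) are ignored, so formatted
    card numbers can be passed in directly.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False

    total = 0
    # Walk from the rightmost digit; double every second digit and
    # subtract 9 when the doubled value exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))  # True (well-known test number)
    print(luhn_valid("4111 1111 1111 1112"))  # False (last digit altered)
```

A gateway could run a check like this on each detected card-number span and treat failures as lower-confidence hits. Whether a failure should suppress or merely flag the span is a policy choice outside the model.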