privacy-filter-india-v5
Fine-tune of openai/privacy-filter for DPDP-grade Indian PII + secrets detection. Adds 10 distinct categories on top of the base 8 so DLP audit reports can distinguish Aadhaar from PAN from GSTIN, etc.
Label space: 18 categories (base 8 + 10 Indian-specific)
| Inherited from base | New in v5 |
|---|---|
| account_number, private_address, private_date, private_email, private_person, private_phone, private_url, secret | aadhaar, pan, gstin, ifsc, upi_id, abha_id, voter_id, indian_passport, credit_card, bank_account_in |
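Downstream code that needs to tell inherited labels from the ones added in v5 can mirror the table above as two tuples. This is a minimal sketch: the label strings come from the table, while the names `BASE_LABELS`, `INDIA_LABELS`, and `label_origin` are illustrative and not part of the opf API.

```python
# Label strings are copied from the table above.
# The constant and function names are hypothetical helpers, not opf identifiers.
BASE_LABELS = (
    "account_number", "private_address", "private_date", "private_email",
    "private_person", "private_phone", "private_url", "secret",
)
INDIA_LABELS = (
    "aadhaar", "pan", "gstin", "ifsc", "upi_id", "abha_id", "voter_id",
    "indian_passport", "credit_card", "bank_account_in",
)
ALL_LABELS = BASE_LABELS + INDIA_LABELS


def label_origin(label: str) -> str:
    """Return 'base' or 'v5' for a known label; raise ValueError otherwise."""
    if label in BASE_LABELS:
        return "base"
    if label in INDIA_LABELS:
        return "v5"
    raise ValueError(f"unknown label: {label!r}")


print(len(ALL_LABELS))        # 18
print(label_origin("gstin"))  # v5
```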
Eval (431-case Janus DLP corpus)
| Metric | Base openai/privacy-filter | This model (v5) |
|---|---|---|
| Overall F1 | 0.8557 | 0.9589 |
| Precision | 0.875 | 0.970 |
| Recall | 0.838 | 0.948 |
| p50 latency (T4) | 26 ms | 27 ms |
| FP rate (negatives) | 30% | 7% |
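As a sanity check, the reported F1 values can be recomputed from the precision and recall rows. The sketch below assumes the overall F1 is the harmonic mean of the overall precision and recall (the card does not say whether it is micro- or macro-averaged), so small differences from rounding are expected.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the eval table above.
reported = {
    "base": {"precision": 0.875, "recall": 0.838, "f1": 0.8557},
    "v5": {"precision": 0.970, "recall": 0.948, "f1": 0.9589},
}

for name, m in reported.items():
    recomputed = f1_score(m["precision"], m["recall"])
    print(f"{name}: reported F1 {m['f1']:.4f}, recomputed {recomputed:.4f}")
```

Both recomputed values agree with the table to within 0.001, which is consistent with the precision and recall figures having been rounded to three decimals.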
11 categories at perfect 1.000 F1
gstin, passport_in, voter_id, address_in, multi_pii, abha_id, iban, ssn, private_keys, passwords, secrets_in_code
Largest improvements over base
| Category | Base F1 | v5 F1 |
|---|---|---|
| passwords | 0.000 | 1.000 |
| abha_id | 0.000 | 1.000 |
| ifsc | 0.545 | 0.966 |
| credit_card | 0.643 | 0.933 |
| aws_keys | 0.667 | 0.966 |
| | 0.769 | 0.938 |
| address_in | 0.815 | 1.000 |
| upi_id | 0.800 | 0.933 |
| gstin | 0.875 | 1.000 |
Training
- 5-round iterative fine-tune (v1 through v5)
- 1797 training examples (1122 multi-entity Indian PII + 545 disambiguation/regression-protection + 130 targeted bank/password/AWS/UPI fixes)
- 198 validation examples (capped at 50 during training to save memory)
- 3 epochs per round, batch size 2 with gradient accumulation 4, lr 1e-4, bf16 weights
- T4 GPU, ~80 sec per round
- Best result in the final epoch: val_loss 0.019, val_token_accuracy 99.39%
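The figures in the list above imply an effective batch size and a step count per round. The following is a rough arithmetic sketch, assuming a single GPU and that a trailing partial accumulation window still counts as one optimizer step; the actual training loop may count steps differently, and the dictionary keys are illustrative names.

```python
import math

# Dataset composition as listed above; key names are illustrative.
composition = {
    "multi_entity_indian_pii": 1122,
    "disambiguation_regression_protection": 545,
    "targeted_fixes": 130,
}
total_examples = sum(composition.values())

per_device_batch = 2
grad_accum_steps = 4
epochs_per_round = 3

# One optimizer step consumes batch * accumulation examples.
effective_batch = per_device_batch * grad_accum_steps
steps_per_epoch = math.ceil(total_examples / effective_batch)
steps_per_round = steps_per_epoch * epochs_per_round

print(total_examples)   # 1797
print(effective_batch)  # 8
print(steps_per_epoch)  # 225
print(steps_per_round)  # 675
```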
Usage
from opf._api import OPF
opf = OPF(model="Vasanth155/privacy-filter-india-v5", device="cuda")
text = "customer Rohan Sharma, Aadhaar 234567890123, mobile 9876543210, PAN ABCDE1234F"
result = opf.redact(text)
print(result.redacted_text)
# customer <PRIVATE_PERSON>, Aadhaar <AADHAAR>, mobile <PRIVATE_PHONE>, PAN <PAN>
Drop-in replacement for the base openai/privacy-filter model: same architecture, same MoE routing, same opf API.
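Because each category gets its own placeholder, a DLP audit report can be built by counting placeholders in the redacted text. This is a minimal sketch that assumes placeholders always take the `<UPPERCASE_LABEL>` form shown in the sample output above; `count_redactions` is a hypothetical helper, not part of opf, and it will also count any literal `<TOKEN>` already present in the input.

```python
import re
from collections import Counter

# Matches placeholders such as <AADHAAR> or <PRIVATE_PHONE>.
# Assumption: the format is inferred from the sample output, not from opf docs.
PLACEHOLDER_RE = re.compile(r"<([A-Z][A-Z0-9_]*)>")


def count_redactions(redacted_text: str) -> Counter:
    """Count how many times each placeholder label appears in redacted text."""
    return Counter(PLACEHOLDER_RE.findall(redacted_text))


sample = "customer <PRIVATE_PERSON>, Aadhaar <AADHAAR>, mobile <PRIVATE_PHONE>, PAN <PAN>"
counts = count_redactions(sample)
for label, n in sorted(counts.items()):
    print(f"{label}: {n}")
```

If the opf result object exposes structured spans, reading those directly would be more robust than parsing the redacted string.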
Limitations
- Trained primarily on English + Hinglish-context Indian PII. Non-Indian addresses fall back to base coverage.
- 16 false negatives across 431 cases (3.7% miss rate). Most involve rare formats: sk-ant-example keys, the canonical AWS IOSFODNN sample key, and UPI handles from less common PSPs.
- DPDP-grade label distinction works for the 10 added categories. CVV handling, ABHA-Luhn validation, and PCI-DSS-3.4 redaction policy live in the gateway layer, not the model.
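To illustrate the kind of check that sits in a gateway layer and not in the model, here is a generic Luhn checksum and a last-four masking helper. It is a sketch under stated assumptions: the card does not describe the actual gateway code, `luhn_valid` and `mask_card` are hypothetical names, and the masking style is one common convention, not a statement of what PCI DSS requires.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def mask_card(number: str, keep_last: int = 4) -> str:
    """Replace all but the last `keep_last` digits with '*'."""
    digits = "".join(ch for ch in number if ch.isdigit())
    cut = max(len(digits) - keep_last, 0)
    return "*" * cut + digits[cut:]


card = "4111 1111 1111 1111"  # widely used test number, not a real card
print(luhn_valid(card))  # True
print(mask_card(card))   # ************1111
```

A gateway could run a check like this on spans the model labels as credit_card, dropping detections that fail the checksum before they reach the audit report.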
License
Apache 2.0, inherited from openai/privacy-filter.