---
license: apache-2.0
base_model: openai/privacy-filter
tags:
  - token-classification
  - pii-detection
  - secrets-detection
  - dlp
  - india
  - dpdp
language:
  - en
  - hi
metrics:
  - f1
  - precision
  - recall
---

# privacy-filter-india-v5

A fine-tune of [openai/privacy-filter](https://huggingface.co/openai/privacy-filter) for **DPDP-grade Indian PII + secrets detection**. It adds 10 categories to the base model's 8, so DLP audit reports can tell identifiers such as Aadhaar, PAN, and GSTIN apart.

## Label space — 18 categories (base 8 + 10 Indian-specific)

| Inherited from base | New in v5 |
|---|---|
| account_number, private_address, private_date, private_email, private_person, private_phone, private_url, secret | aadhaar, pan, gstin, ifsc, upi_id, abha_id, voter_id, indian_passport, credit_card, bank_account_in |

## Eval (431-case Janus DLP corpus)

| Metric | Base openai/privacy-filter | This model (v5) |
|---|---|---|
| Overall F1 | 0.8557 | **0.9589** |
| Precision | 0.875 | 0.970 |
| Recall | 0.838 | 0.948 |
| p50 latency (T4) | 26 ms | 27 ms |
| FP rate (negatives) | 30% | 7% |
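
As a quick sanity check on the table above, the overall F1 can be recomputed from the reported precision and recall. This is a minimal sketch that assumes the overall F1 is the harmonic mean of the overall precision and recall (the card does not say whether it is micro- or macro-averaged); `f1_score` is a helper defined here for illustration, not part of any library. Small differences from the reported values are expected because precision and recall are shown to three decimals.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the eval table above.
reported = {
    "base": {"precision": 0.875, "recall": 0.838, "f1": 0.8557},
    "v5": {"precision": 0.970, "recall": 0.948, "f1": 0.9589},
}

for name, row in reported.items():
    recomputed = f1_score(row["precision"], row["recall"])
    print(f"{name}: recomputed F1 = {recomputed:.4f}, reported = {row['f1']:.4f}")
```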

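### 11 categories at perfect 1.000 F1

gstin, passport_in, voter_id, address_in, multi_pii, abha_id, iban, ssn, private_keys, passwords, secrets_in_code

Category names in the per-category results are those of the eval corpus, not the model's 18 labels (for example, `multi_pii` and `iban` are corpus categories only).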

### Largest improvements over base

| Category | Base F1 | v5 F1 |
|---|---|---|
| passwords | 0.000 | **1.000** |
| abha_id | 0.000 | **1.000** |
| ifsc | 0.545 | 0.966 |
| credit_card | 0.643 | 0.933 |
| aws_keys | 0.667 | 0.966 |
| email | 0.769 | 0.938 |
| address_in | 0.815 | **1.000** |
| upi_id | 0.800 | 0.933 |
| gstin | 0.875 | **1.000** |

## Training

- 5-round iterative fine-tune (v1 → v5)
- 1797 training examples (1122 multi-entity Indian PII + 545 disambiguation/regression-protection + 130 targeted bank/password/AWS/UPI fixes)
- 198 validation examples (capped to 50 during training to save memory)
- 3 epochs per round, batch size 2 with gradient accumulation 4 (effective batch size 8), lr 1e-4, bf16 weights
- T4 GPU, ~80 s per round
- Final epoch best: val_loss 0.019, val_token_accuracy 99.39%
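
The arithmetic behind these figures can be sketched as follows. The subset sizes and hyperparameters are copied from the list above; the step counts are derived under two assumptions that the card does not confirm: that a round trains on all 1797 examples, and that the trailing partial batch still triggers an optimizer step. The dictionary keys are illustrative names only.

```python
import math

# Figures copied from the training list above; key names are illustrative.
subset_sizes = {
    "multi_entity_indian_pii": 1122,
    "disambiguation_regression_protection": 545,
    "targeted_fixes": 130,
}
per_device_batch = 2
grad_accum_steps = 4
epochs_per_round = 3

total_examples = sum(subset_sizes.values())
effective_batch = per_device_batch * grad_accum_steps

# Assumption: a round sees every example, and the last partial batch is not dropped.
steps_per_epoch = math.ceil(total_examples / effective_batch)
steps_per_round = steps_per_epoch * epochs_per_round

print(f"total examples:       {total_examples}")
print(f"effective batch size: {effective_batch}")
print(f"optimizer steps:      {steps_per_epoch} per epoch, {steps_per_round} per round")
```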

## Usage

```python
from opf._api import OPF

opf = OPF(model="Vasanth155/privacy-filter-india-v5", device="cuda")
text = "customer Rohan Sharma, Aadhaar 234567890123, mobile 9876543210, PAN ABCDE1234F"
result = opf.redact(text)
print(result.redacted_text)
# customer <PRIVATE_PERSON>, Aadhaar <AADHAAR>, mobile <PRIVATE_PHONE>, PAN <PAN>
```

Drop-in replacement for the base `openai/privacy-filter` model: same architecture, same MoE routing, same `opf` API.
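
For a DLP audit report, the per-category counts can be recovered from the redacted text itself. The sketch below assumes placeholders always have the `<UPPER_SNAKE_CASE>` form seen in the example output above; `count_redactions` is a hypothetical helper, not part of `opf`. If the `opf` result exposes structured spans, reading those would be more robust than parsing text, but that interface is not documented here, so it is not used.

```python
import re
from collections import Counter

# Assumption: placeholders look like <AADHAAR> or <PRIVATE_PHONE>, as in the example output.
PLACEHOLDER = re.compile(r"<([A-Z][A-Z0-9_]*)>")


def count_redactions(redacted_text: str) -> Counter:
    """Count how many times each placeholder label appears in redacted text."""
    return Counter(PLACEHOLDER.findall(redacted_text))


redacted = (
    "customer <PRIVATE_PERSON>, Aadhaar <AADHAAR>, "
    "mobile <PRIVATE_PHONE>, PAN <PAN>"
)
counts = count_redactions(redacted)
for label, n in sorted(counts.items()):
    print(f"{label}: {n}")
```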

## Limitations

- Trained primarily on English + Hinglish-context Indian PII. Non-Indian addresses fall back to base coverage.
- 16 false negatives across 431 cases (3.7% miss rate). Most involve rare formats: `sk-ant-example`-style keys, the canonical AWS sample key (`AKIAIOSFODNN7EXAMPLE`), and UPI handles from less common PSPs.
- DPDP-grade label distinction covers the 10 added categories. CVV handling, ABHA Luhn validation, and the PCI-DSS 3.4 redaction policy are enforced in the gateway layer, not by the model.
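
To illustrate the kind of check that belongs in the gateway layer rather than the model, here is a minimal sketch of the standard Luhn checksum as used for payment card numbers. It is not the gateway's actual code; `luhn_valid` is a hypothetical helper, and whether ABHA numbers use exactly this scheme should be confirmed against the relevant specification.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Non-digit characters (spaces, hyphens) are ignored.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


# A well-known test card number, and the same number with its last digit altered.
print(luhn_valid("4111 1111 1111 1111"))
print(luhn_valid("4111 1111 1111 1112"))
```

A gateway could run a check like this on spans the model labels `credit_card` to filter out digit strings that cannot be real card numbers before a finding is logged.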

## License

Apache 2.0 — inherited from `openai/privacy-filter`.