metadata
license: apache-2.0
base_model: openai/privacy-filter
tags:
- token-classification
- pii-detection
- secrets-detection
- dlp
- india
- dpdp
language:
- en
- hi
metrics:
- f1
- precision
- recall
privacy-filter-india-v2
Fine-tune of openai/privacy-filter for Indian-PII + secrets detection in enterprise DLP.
What this is
A token-classifier targeting the 8 base categories from openai/privacy-filter (account_number, private_address, private_date, private_email, private_person, private_phone, private_url, secret), trained to recognise the long tail of Indian-specific identifiers and the secret formats the base model under-detected:
- IFSC codes (Indian bank routing)
- Indian credit cards including Visa/MC/Amex/JCB/Diners/RuPay BINs
- Indian-domain emails (.in / .co.in)
- Indian addresses in short form
- UPI IDs across all 18 PSPs
- AWS keys in natural-language contexts (`my access key is AKIA...`)
- Database connection strings with embedded credentials
- Indian passport, voter ID, GSTIN, PAN, Aadhaar, ABHA, bank account patterns
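For readers unfamiliar with these identifiers, the sketch below shows the publicly documented layouts of a few of them as regular expressions. It is illustrative only: these patterns are not part of the model or its training pipeline, `PATTERNS` and `find_candidates` are hypothetical names, and no checksum validation is performed. Strict patterns like these tend to miss identifiers written in free-form text, which is the gap the token classifier is meant to cover.

```python
import re

# Illustrative format patterns (assumption: standard published layouts,
# no checksum validation, uppercase input only).
PATTERNS = {
    # 4 bank letters, a literal 0, then 6 alphanumeric branch characters
    "ifsc": re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),
    # 5 letters, 4 digits, 1 letter
    "pan": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    # 2-digit state code + PAN + entity character + 'Z' + check character
    "gstin": re.compile(r"\b[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9]\b"),
    # 12 digits, first digit 2-9, optionally grouped in fours
    "aadhaar": re.compile(r"\b[2-9][0-9]{3}\s?[0-9]{4}\s?[0-9]{4}\b"),
}


def find_candidates(text):
    """Return (label, matched_text) pairs for every pattern hit in text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(0)))
    return hits


print(find_candidates("my Aadhaar 234567890123 IFSC HDFC0001234"))
```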
Eval (431-case Janus DLP corpus, base model + this fine-tune)
| Metric | Base openai/privacy-filter | This fine-tune |
|---|---|---|
| Overall F1 | 0.8557 | 0.9398 |
| Precision | 0.875 | 0.941 |
| Recall | 0.838 | 0.938 |
| p50 latency (T4) | 26 ms | 27 ms |
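As a sanity check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values above; `f1_score` is a hypothetical helper, not part of any evaluation harness used for this model. The results land within about 0.001 of the reported overall F1, a gap that could come from rounding of the inputs or from a different averaging scheme (an assumption, since the card does not say how the overall figure is aggregated).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision/recall values taken from the table above.
base = f1_score(0.875, 0.838)
tuned = f1_score(0.941, 0.938)

print(f"base: {base:.4f} (reported 0.8557)")
print(f"fine-tune: {tuned:.4f} (reported 0.9398)")
```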
Largest gains over base
| Category | Base F1 | This F1 |
|---|---|---|
| ifsc | 0.545 | 0.938 |
| aws_keys | 0.667 | 0.968 |
| db_connection_strings | 0.737 | 1.000 |
| credit_card | 0.643 | 0.875 |
| | 0.769 | 0.938 |
| address_in | 0.815 | 0.968 |
| upi_id | 0.800 | 0.968 |
| gstin | 0.875 | 1.000 |
| voter_id | 0.909 | 1.000 |
| passport_in | 0.968 | 1.000 |
Categories that were already strong in the base model (private_keys, multi_pii, iban, secrets_in_code, ssn, etc.) are preserved.
Training
- 957 spans-format examples (770 train / 187 val) covering Indian PII templates + regression-protection negatives
- 3 epochs, batch 4, grad-accum 2, lr 1e-4, bf16 output
- Trained on a single Tesla T4 in 78 seconds total
- Best checkpoint: epoch 2 (val_loss 0.029, val_token_accuracy 98.83%)
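The hyperparameters above imply a small number of optimizer steps, which is consistent with the short training time. The arithmetic below is a rough sketch that assumes a single GPU and that the last partial batch of each epoch is kept; the variable names are illustrative and the actual step count depends on the trainer's settings.

```python
import math

train_examples = 770
val_examples = 187
per_device_batch = 4
grad_accum = 2
epochs = 3

# Gradient accumulation multiplies the batch size seen per optimizer step.
effective_batch = per_device_batch * grad_accum

# Assumption: the final partial batch is not dropped.
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(f"total examples: {train_examples + val_examples}")
print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {steps_per_epoch} per epoch, {total_steps} total")
```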
Usage
from opf._api import OPF
opf = OPF(model="janus-ai/privacy-filter-india-v2", device="cuda")
result = opf.redact("my Aadhaar 234567890123 IFSC HDFC0001234 password Admin#123")
print(result.redacted_text)
# my Aadhaar <ACCOUNT_NUMBER> IFSC <ACCOUNT_NUMBER> password <SECRET>
Drop-in replacement for the base openai/privacy-filter checkpoint — same tokenizer, same architecture, same 8-label space.
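In a DLP pipeline it is often useful to log which categories were redacted without storing the original values. The sketch below tallies placeholders in a redacted string. It relies only on the output format shown in the example above and assumes every placeholder is an uppercase label in angle brackets; `count_placeholders` is a hypothetical helper, not part of the `opf` package.

```python
import re
from collections import Counter

# Assumption: placeholders always look like <UPPER_SNAKE_CASE>,
# as in the example output above.
PLACEHOLDER = re.compile(r"<([A-Z_]+)>")


def count_placeholders(redacted_text):
    """Count how many times each redaction label appears in the text."""
    return Counter(PLACEHOLDER.findall(redacted_text))


sample = "my Aadhaar <ACCOUNT_NUMBER> IFSC <ACCOUNT_NUMBER> password <SECRET>"
counts = count_placeholders(sample)
print(dict(counts))
```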
Limitations
- Trained primarily on English + Hinglish-context Indian PII. Non-Indian addresses/names rely on base-model coverage.
- Date-of-birth detection (`private_date`) sits at F1 ~0.85; context-gating could help.
- 18 false positives across 431 cases (4.4% FP rate).
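One possible reading of "context-gating" for dates is a post-processing step that keeps a predicted date span only when a date-of-birth cue appears shortly before it. The sketch below illustrates that idea under stated assumptions: the cue list, the 20-character window, and the `gate_date_spans` helper are all hypothetical, and spans are taken to be `(start, end)` character offsets. It has not been evaluated against this model.

```python
import re

# Hypothetical cue list; a real deployment would need tuning (and Hindi cues).
DOB_CUES = re.compile(
    r"\b(dob|d\.o\.b|date of birth|born|birth ?date)\b", re.IGNORECASE
)


def gate_date_spans(text, spans, window=20):
    """Keep (start, end) spans that have a DOB cue in the preceding window."""
    kept = []
    for start, end in spans:
        context = text[max(0, start - window):start]
        if DOB_CUES.search(context):
            kept.append((start, end))
    return kept


text = "DOB: 12/03/1991, invoice dated 05/01/2024"
dob_start = text.index("12/03/1991")
inv_start = text.index("05/01/2024")
spans = [(dob_start, dob_start + 10), (inv_start, inv_start + 10)]

# Only the date preceded by "DOB:" should survive the gate.
print(gate_date_spans(text, spans))
```

A wider window catches cues that sit further from the date but also lets unrelated dates through, so the window size is a precision/recall trade-off.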
License
Apache 2.0 — inherited from openai/privacy-filter.