privacy-filter-india-v2

Fine-tune of openai/privacy-filter for detecting Indian PII and secrets in enterprise DLP (data loss prevention).

What this is

A token classifier that keeps the 8 base categories of openai/privacy-filter (account_number, private_address, private_date, private_email, private_person, private_phone, private_url, secret) and is trained to recognise the long tail of India-specific identifiers and the secret formats that the base model under-detects:

IFSC codes (Indian bank routing)
Indian credit cards including Visa/MC/Amex/JCB/Diners/RuPay BINs
Indian-domain emails (.in / .co.in)
Indian addresses in short form
UPI IDs across all 18 PSPs
AWS keys in natural-language contexts (my access key is AKIA...)
Database connection strings with embedded credentials
Indian passport, voter ID, GSTIN, PAN, Aadhaar, ABHA, bank account patterns
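
Several of the identifiers listed above have fixed, publicly documented surface formats, which can be handy for building test fixtures or sanity-checking model output. The sketch below is illustrative only. It is not part of the model (a token classifier, not a regex matcher), the names `PATTERNS` and `find_candidates` are hypothetical, and the patterns check shape only, not checksums or issuer validity.

```python
import re

# Hypothetical shape checks for a few identifier formats named above.
# They validate surface form only (no checksum or registry lookup).
PATTERNS = {
    # IFSC: 4-letter bank code, a literal 0, 6-character branch code.
    "ifsc": re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),
    # PAN: 5 letters, 4 digits, 1 letter.
    "pan": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    # GSTIN: 2-digit state code, embedded PAN, entity code, 'Z', check character.
    "gstin": re.compile(r"\b[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]\b"),
    # Aadhaar: 12 digits, first digit 2-9, optionally grouped in fours.
    "aadhaar": re.compile(r"\b[2-9][0-9]{3}\s?[0-9]{4}\s?[0-9]{4}\b"),
    # AWS access key ID: 'AKIA' followed by 16 uppercase alphanumerics.
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def find_candidates(text: str) -> dict[str, list[str]]:
    """Return every substring that matches one of the shape patterns."""
    return {
        name: pattern.findall(text)
        for name, pattern in PATTERNS.items()
        if pattern.search(text)
    }


print(find_candidates("my Aadhaar 234567890123 IFSC HDFC0001234 password Admin#123"))
```

Shape checks like these miss identifiers written in free-form or partially masked ways, which is the gap a learned classifier is meant to cover.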

Eval (431-case Janus DLP corpus, base model vs. this fine-tune)

Metric	Base `openai/privacy-filter`	This fine-tune
Overall F1	0.8557	0.9398
Precision	0.875	0.941
Recall	0.838	0.938
p50 latency (T4)	26 ms	27 ms
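
As a quick consistency check on the table, F1 can be recomputed as the harmonic mean of the reported precision and recall. This is a minimal sketch that assumes the overall F1 is micro-averaged over the same predictions as the precision and recall rows. The card does not say how it was aggregated, so small differences are expected, partly because the inputs are already rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision, recall and F1 as reported in the table (already rounded).
reported = {
    "base": {"precision": 0.875, "recall": 0.838, "f1": 0.8557},
    "fine_tune": {"precision": 0.941, "recall": 0.938, "f1": 0.9398},
}

for name, metrics in reported.items():
    recomputed = f1_score(metrics["precision"], metrics["recall"])
    print(f"{name}: reported F1 {metrics['f1']:.4f}, recomputed {recomputed:.4f}")
```

For both columns the recomputed value agrees with the reported F1 to within 0.001.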

Largest gains over base

Category	Base F1	This F1
ifsc	0.545	0.938
aws_keys	0.667	0.968
db_connection_strings	0.737	1.000
credit_card	0.643	0.875
email	0.769	0.938
address_in	0.815	0.968
upi_id	0.800	0.968
gstin	0.875	1.000
voter_id	0.909	1.000
passport_in	0.968	1.000
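
To read the table as absolute improvements, the two F1 columns can be differenced and sorted. A minimal sketch with the table's numbers copied in (the variable names are illustrative):

```python
# (base F1, fine-tuned F1) per category, copied from the table above.
f1_by_category = {
    "ifsc": (0.545, 0.938),
    "aws_keys": (0.667, 0.968),
    "db_connection_strings": (0.737, 1.000),
    "credit_card": (0.643, 0.875),
    "email": (0.769, 0.938),
    "address_in": (0.815, 0.968),
    "upi_id": (0.800, 0.968),
    "gstin": (0.875, 1.000),
    "voter_id": (0.909, 1.000),
    "passport_in": (0.968, 1.000),
}

# Absolute F1 gain per category, rounded to the table's precision.
gains = {
    category: round(tuned - base, 3)
    for category, (base, tuned) in f1_by_category.items()
}

# Largest gain first.
ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)
for category, gain in ranked:
    print(f"{category:<24}+{gain:.3f}")
```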

Categories that were already strong in the base model (private_keys, multi_pii, iban, secrets_in_code, ssn, etc.) are preserved.

Training

957 spans-format examples (770 train / 187 val) covering Indian PII templates + regression-protection negatives
3 epochs, batch 4, grad-accum 2, lr 1e-4, bf16 output
Trained on a single Tesla T4 in 78 seconds total
Best checkpoint: epoch 2 (val_loss 0.029, val_token_accuracy 98.83%)
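
The hyperparameters above imply a small number of optimizer updates, which helps explain the short training time. The arithmetic can be sketched as follows, assuming the last partial batch of each epoch is kept. The actual trainer's rounding is not stated in this card, so the step counts are approximate.

```python
import math

# Hyperparameters and split sizes as listed above.
train_examples = 770
val_examples = 187
epochs = 3
per_device_batch = 4
grad_accum_steps = 2

# Gradient accumulation multiplies the batch seen by each optimizer update.
effective_batch = per_device_batch * grad_accum_steps

# Assumption: the last partial batch is kept, so round up.
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

val_share = val_examples / (train_examples + val_examples)

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps: {steps_per_epoch} per epoch, {total_steps} total")
print(f"validation share: {val_share:.1%}")
```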

Usage

from opf._api import OPF

# Load the fine-tuned checkpoint on the GPU.
opf = OPF(model="janus-ai/privacy-filter-india-v2", device="cuda")
# Redact a string that mixes Indian PII with a secret.
result = opf.redact("my Aadhaar 234567890123 IFSC HDFC0001234 password Admin#123")
print(result.redacted_text)
# my Aadhaar <ACCOUNT_NUMBER> IFSC <ACCOUNT_NUMBER> password <SECRET>

Drop-in replacement for the base openai/privacy-filter checkpoint — same tokenizer, same architecture, same 8-label space.
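
In a DLP pipeline the redacted string is often turned into a policy decision. The sketch below works only on the redacted text and assumes that redacted spans always appear as uppercase `<LABEL>` placeholders, as in the example output above. The helper names and the blocking policy are hypothetical and not part of the `opf` package.

```python
import re
from collections import Counter

# Assumption: redacted spans appear as uppercase <LABEL> placeholders,
# as in the example output above.
PLACEHOLDER = re.compile(r"<([A-Z_]+)>")


def count_redactions(redacted_text: str) -> Counter:
    """Count how many spans of each label were replaced in a redacted string."""
    return Counter(PLACEHOLDER.findall(redacted_text))


def should_block(
    redacted_text: str,
    blocked_labels: frozenset = frozenset({"SECRET"}),
) -> bool:
    """Hypothetical policy: block if any span with a blocked label was found."""
    counts = count_redactions(redacted_text)
    return any(counts[label] > 0 for label in blocked_labels)


sample = "my Aadhaar <ACCOUNT_NUMBER> IFSC <ACCOUNT_NUMBER> password <SECRET>"
print(count_redactions(sample))
print(should_block(sample))
```

Counting placeholders is a shortcut. Input that already contains text such as `<SECRET>` would be miscounted, so a production gate would more likely read the detected spans directly if the API exposes them.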

Limitations

Trained primarily on English + Hinglish-context Indian PII. Non-Indian addresses/names rely on base-model coverage.
Date-of-birth detection (private_date) sits at F1 ~0.85 — context-gating could help.
18 false positives across 431 cases (4.4% FP rate).
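
One possible reading of the context-gating idea mentioned above is a post-processing step that keeps a private_date span only when a birth-date cue appears shortly before it. The sketch below is an assumption-heavy illustration: the `(start, end, label)` span format, the cue list, and the 20-character window are all hypothetical and are not taken from the model or the `opf` package.

```python
import re

# Hypothetical cue words suggesting that a nearby date is a date of birth.
DOB_CUES = re.compile(
    r"\b(?:dob|d\.o\.b|date of birth|born|birth\s?date)\b",
    re.IGNORECASE,
)


def gate_dates(text, spans, window=20):
    """Keep private_date spans only if a cue occurs within `window` characters
    before the span; spans with any other label pass through unchanged."""
    kept = []
    for start, end, label in spans:
        if label == "private_date":
            context = text[max(0, start - window):start]
            if not DOB_CUES.search(context):
                continue  # drop dates with no birth-date cue nearby
        kept.append((start, end, label))
    return kept


text = "DOB 12/03/1990, invoice dated 05/01/2024"
first = text.index("12/03/1990")
second = text.index("05/01/2024")
spans = [(first, first + 10, "private_date"), (second, second + 10, "private_date")]
print(gate_dates(text, spans))
```

A gate like this trades recall for precision, so it would need to be evaluated on the same corpus before being adopted.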

License

Apache 2.0 — inherited from openai/privacy-filter.

Model tree

Base model: openai/privacy-filter
Fine-tuned model: Vasanth155/privacy-filter-india-v2 (this model)