privacy-filter-india-v2

Fine-tune of openai/privacy-filter for Indian-PII + secrets detection in enterprise DLP.

What this is

A token classifier that keeps the 8 base categories from openai/privacy-filter (account_number, private_address, private_date, private_email, private_person, private_phone, private_url, secret) and is trained to recognise the long tail of Indian-specific identifiers and the secret formats the base model under-detects:

  • IFSC codes (Indian bank routing)
  • Indian credit cards including Visa/MC/Amex/JCB/Diners/RuPay BINs
  • Indian-domain emails (.in / .co.in)
  • Indian addresses in short form
  • UPI IDs across all 18 PSPs
  • AWS keys in natural-language contexts (my access key is AKIA...)
  • Database connection strings with embedded credentials
  • Indian passport, voter ID, GSTIN, PAN, Aadhaar, ABHA, bank account patterns
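
For orientation, the fixed layouts of a few of these identifiers can be written as regular expressions. This is only a sketch of the publicly documented formats, not how the model works: the fine-tune is a token classifier that uses surrounding context, and the pattern names and the find_candidates helper below are hypothetical.

```python
import re

# Illustrative layouts only; these regexes are NOT part of the model.
PATTERNS = {
    # 4-letter bank code, a literal '0', then a 6-character branch code
    "ifsc": re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),
    # 5 letters, 4 digits, 1 letter
    "pan": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    # 2-digit state code + PAN + entity character + 'Z' + check character
    "gstin": re.compile(r"\b[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9]\b"),
    # 12 digits, optionally written in groups of four
    "aadhaar": re.compile(r"\b[0-9]{4}\s?[0-9]{4}\s?[0-9]{4}\b"),
}

def find_candidates(text):
    """Return (pattern_name, matched_text) pairs for every regex hit."""
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((name, match.group()))
    return hits

print(find_candidates("my Aadhaar 234567890123 IFSC HDFC0001234 PAN ABCDE1234F"))
```

Patterns like these match on shape alone, so any 12-digit number looks like an Aadhaar candidate; a context-aware classifier is meant to avoid that kind of false positive.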

Eval (431-case Janus DLP corpus, base model vs. this fine-tune)

Metric              Base openai/privacy-filter   This fine-tune
Overall F1          0.8557                       0.9398
Precision           0.875                        0.941
Recall              0.838                        0.938
p50 latency (T4)    26 ms                        27 ms
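
As a sanity check, F1 is the harmonic mean of precision and recall, so the overall F1 values can be approximately recomputed from the rounded precision and recall figures above. The sketch assumes the reported F1 is a micro-averaged score; small differences are expected because the inputs are rounded to three decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

base_f1 = f1_score(0.875, 0.838)       # reported: 0.8557
finetune_f1 = f1_score(0.941, 0.938)   # reported: 0.9398

print(f"base: {base_f1:.4f}, fine-tune: {finetune_f1:.4f}")
```

Both recomputed values land within 0.001 of the reported ones, which is consistent with rounding in the precision and recall columns.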

Largest gains over base

Category                Base F1   This F1
ifsc                    0.545     0.938
aws_keys                0.667     0.968
db_connection_strings   0.737     1.000
credit_card             0.643     0.875
email                   0.769     0.938
address_in              0.815     0.968
upi_id                  0.800     0.968
gstin                   0.875     1.000
voter_id                0.909     1.000
passport_in             0.968     1.000

Categories already strong in the base model (private_keys, multi_pii, iban, secrets_in_code, ssn, etc.) are preserved.
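
The absolute gain per category follows directly from the table. A small sketch that computes the F1 deltas and ranks them, using only the figures listed above (the variable names are illustrative):

```python
# (base F1, fine-tuned F1) per category, copied from the table above
SCORES = {
    "ifsc": (0.545, 0.938),
    "aws_keys": (0.667, 0.968),
    "db_connection_strings": (0.737, 1.000),
    "credit_card": (0.643, 0.875),
    "email": (0.769, 0.938),
    "address_in": (0.815, 0.968),
    "upi_id": (0.800, 0.968),
    "gstin": (0.875, 1.000),
    "voter_id": (0.909, 1.000),
    "passport_in": (0.968, 1.000),
}

# Absolute F1 gain, rounded to the table's three-decimal precision
gains = {name: round(tuned - base, 3) for name, (base, tuned) in SCORES.items()}
ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)

for name, gain in ranked:
    print(f"{name:<24}+{gain:.3f}")
```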

Training

  • 957 spans-format examples (770 train / 187 val) covering Indian PII templates + regression-protection negatives
  • 3 epochs, batch 4, grad-accum 2, lr 1e-4, bf16 output
  • Trained on a single Tesla T4 in 78 seconds total
  • Best checkpoint: epoch 2 (val_loss 0.029, val_token_accuracy 98.83%)
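
The hyperparameters above imply a small number of optimizer updates. A back-of-the-envelope sketch, assuming a single device, that the final partial batch is kept rather than dropped, and that the 78 seconds covers all three epochs:

```python
import math

train_examples = 770
batch_size = 4
grad_accum = 2
epochs = 3
total_seconds = 78

# Each optimizer update sees batch_size * grad_accum examples
effective_batch = batch_size * grad_accum                     # 8
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

# Rough throughput; ignores time spent on validation and checkpointing
examples_per_second = train_examples * epochs / total_seconds

print(effective_batch, steps_per_epoch, total_steps, round(examples_per_second, 1))
```

With roughly 97 updates per epoch, the best checkpoint at epoch 2 corresponds to about 194 optimizer steps under these assumptions.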

Usage

from opf._api import OPF

opf = OPF(model="janus-ai/privacy-filter-india-v2", device="cuda")
result = opf.redact("my Aadhaar 234567890123 IFSC HDFC0001234 password Admin#123")
print(result.redacted_text)
# my Aadhaar <ACCOUNT_NUMBER> IFSC <ACCOUNT_NUMBER> password <SECRET>

Drop-in replacement for the base openai/privacy-filter checkpoint: same tokenizer, same architecture, same 8-label space.

Limitations

  • Trained primarily on English + Hinglish-context Indian PII. Non-Indian addresses/names rely on base-model coverage.
  • Date-of-birth detection (private_date) sits at F1 ~0.85; context-gating could help.
  • 18 false positives across 431 cases (18/431 ≈ 4.2% FP rate).
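
One way to read "context-gating" for private_date is as a post-processing step that keeps a predicted date span only when a date-of-birth cue appears nearby. The sketch below is an assumption about how such a gate could look, not part of this model or of the opf package: the (start, end, label) span format, the cue list, and the gate_date_spans helper are all hypothetical.

```python
import re

# Hypothetical cue words suggesting that a nearby date is a date of birth
DOB_CUES = re.compile(r"\b(dob|d\.o\.b|date of birth|born|birthday)\b", re.IGNORECASE)

def gate_date_spans(text, spans, window=40):
    """Keep private_date spans only if a DOB cue occurs within `window` characters.

    `spans` is a list of (start, end, label) character offsets (assumed format).
    Spans with any other label pass through unchanged.
    """
    kept = []
    for start, end, label in spans:
        if label != "private_date":
            kept.append((start, end, label))
            continue
        context = text[max(0, start - window):min(len(text), end + window)]
        if DOB_CUES.search(context):
            kept.append((start, end, label))
    return kept

text = "DOB 12/03/1991"
print(gate_date_spans(text, [(4, 14, "private_date")]))
```

A gate like this drops every date that lacks a cue, so it trades recall for precision; the window size and cue list would need tuning against the evaluation corpus.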

License

Apache 2.0, inherited from openai/privacy-filter.
