---
license: apache-2.0
base_model: openai/privacy-filter
tags:
- token-classification
- pii-detection
- secrets-detection
- dlp
- india
- dpdp
language:
- en
- hi
metrics:
- f1
- precision
- recall
---

# privacy-filter-india-v2

Fine-tune of [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) for Indian-PII + secrets detection in enterprise DLP.

## What this is

A token-classifier targeting the **8 base categories** from `openai/privacy-filter` (`account_number`, `private_address`, `private_date`, `private_email`, `private_person`, `private_phone`, `private_url`, `secret`), trained to recognise the long tail of Indian-specific identifiers and the secret formats the base model under-detected:

- **IFSC codes** (Indian bank routing)
- **Indian credit cards** including Visa/MC/Amex/JCB/Diners/RuPay BINs
- **Indian-domain emails** (.in / .co.in)
- **Indian addresses** in short form
- **UPI IDs** across all 18 PSPs
- **AWS keys** in natural-language contexts (`my access key is AKIA...`)
- **Database connection strings** with embedded credentials
- **Indian passport, voter ID, GSTIN, PAN, Aadhaar, ABHA, bank account** patterns

## Eval (431-case Janus DLP corpus, base model + this fine-tune)

| Metric | Base `openai/privacy-filter` | This fine-tune |
|---|---|---|
| Overall F1 | 0.8557 | **0.9398** |
| Precision | 0.875 | 0.941 |
| Recall | 0.838 | 0.938 |
| p50 latency (T4) | 26 ms | 27 ms |

### Largest gains over base

| Category | Base F1 | This F1 |
|---|---|---|
| ifsc | 0.545 | **0.938** |
| aws_keys | 0.667 | **0.968** |
| db_connection_strings | 0.737 | **1.000** |
| credit_card | 0.643 | **0.875** |
| email | 0.769 | **0.938** |
| address_in | 0.815 | **0.968** |
| upi_id | 0.800 | **0.968** |
| gstin | 0.875 | **1.000** |
| voter_id | 0.909 | **1.000** |
| passport_in | 0.968 | **1.000** |

Categories already strong in base (private_keys, multi_pii, iban, secrets_in_code, ssn, etc.) preserved.

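
As a quick sanity check on the overall numbers in the Eval table, F1 can be recomputed from the reported precision and recall. This is a minimal sketch that assumes the overall F1 is the harmonic mean of the overall precision and recall; the helper name `f1_score` is illustrative and not part of any evaluation harness used for this model. Small differences from the reported 0.8557 and 0.9398 are expected, because the table rounds precision and recall to three decimals and the card does not state how the overall score is averaged.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall precision / recall as reported in the Eval table.
base = f1_score(0.875, 0.838)
tuned = f1_score(0.941, 0.938)

print(f"base F1      ~ {base:.3f}")   # 0.856
print(f"fine-tune F1 ~ {tuned:.3f}")  # 0.939
```

Both recomputed values agree with the reported F1 scores to within rounding.
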
## Training

- 957 spans-format examples (770 train / 187 val) covering Indian PII templates + regression-protection negatives
- 3 epochs, batch 4, grad-accum 2, lr 1e-4, bf16 output
- Trained on a single Tesla T4 in 78 seconds total
- Best checkpoint: epoch 2 (val_loss 0.029, val_token_accuracy 98.83%)

## Usage

```python
from opf._api import OPF

# Load the fine-tuned checkpoint in place of the base model.
opf = OPF(model="janus-ai/privacy-filter-india-v2", device="cuda")

# Redact an input containing an Aadhaar number, an IFSC code, and a password.
result = opf.redact("my Aadhaar 234567890123 IFSC HDFC0001234 password Admin#123")
print(result.redacted_text)  # my Aadhaar IFSC password
```

Drop-in replacement for the base `openai/privacy-filter` checkpoint: same tokenizer, same architecture, same 8-label space.

## Limitations

- Trained primarily on English + Hinglish-context Indian PII. Non-Indian addresses/names rely on base-model coverage.
- Date-of-birth detection (`private_date`) sits at F1 ~0.85; context-gating could help.
- 18 false positives across 431 cases (about 4.2% of cases).

## License

Apache 2.0, inherited from `openai/privacy-filter`.
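
## Sketch: context-gating for `private_date`

The Limitations section suggests that context-gating could help date-of-birth detection. One possible reading of that idea is a post-processing step that keeps a `private_date` span only when a birth-related cue appears shortly before it. The sketch below is an assumption throughout: the `Span` class, the `gate_dates` function, the cue list, and the 30-character window are all hypothetical and are not part of the `opf` API or of this model's pipeline. It also assumes that dates without a birth cue are safe to leave unredacted, which may not match every DLP policy.

```python
import re
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical detected span: label plus character offsets."""
    label: str
    start: int
    end: int


# Illustrative cue list; a real deployment would tune this, including Hinglish cues.
DOB_CUES = re.compile(r"\b(?:dob|date of birth|born|birth date)\b", re.IGNORECASE)


def gate_dates(text: str, spans: list[Span], window: int = 30) -> list[Span]:
    """Keep private_date spans only if a DOB cue occurs in the preceding window.

    Spans with any other label pass through unchanged.
    """
    kept = []
    for span in spans:
        if span.label != "private_date":
            kept.append(span)
            continue
        # Look only at the text immediately before the date.
        preceding = text[max(0, span.start - window):span.start]
        if DOB_CUES.search(preceding):
            kept.append(span)
    return kept


text = "Invoice dated 12/03/2024 was paid. My DOB is 01/01/1990."
invoice = text.index("12/03/2024")
dob = text.index("01/01/1990")
spans = [
    Span("private_date", invoice, invoice + 10),
    Span("private_date", dob, dob + 10),
]
gated = gate_dates(text, spans)
print([text[s.start:s.end] for s in gated])
```

Looking only at the preceding text keeps the sketch simple, but it would miss phrasings such as "01/01/1990 is my date of birth"; a symmetric window or sentence-level context would trade some precision for that recall.
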