---
license: apache-2.0
base_model: openai/privacy-filter
tags:
- token-classification
- pii-detection
- secrets-detection
- dlp
- india
- dpdp
language:
- en
- hi
metrics:
- f1
- precision
- recall
---
# privacy-filter-india-v2
Fine-tune of [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) for Indian-PII + secrets detection in enterprise DLP.
## What this is
A token classifier over the **8 base categories** of `openai/privacy-filter` (`account_number`, `private_address`, `private_date`, `private_email`, `private_person`, `private_phone`, `private_url`, `secret`), fine-tuned to recognise the long tail of Indian-specific identifiers and the secret formats that the base model under-detected:
- **IFSC codes** (Indian bank routing)
- **Indian credit cards** including Visa/MC/Amex/JCB/Diners/RuPay BINs
- **Indian-domain emails** (.in / .co.in)
- **Indian addresses** in short form
- **UPI IDs** across all 18 PSPs
- **AWS keys** in natural-language contexts (`my access key is AKIA...`)
- **Database connection strings** with embedded credentials
- **Indian passport, voter ID, GSTIN, PAN, Aadhaar, ABHA, bank account** patterns
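
To give a feel for what two of these identifiers look like, here is a minimal sketch of format checks for IFSC codes and PANs. It is illustrative only: the model is a learned token classifier, not a regex matcher, and the names `IFSC_RE`, `PAN_RE`, and `find_format_matches` are hypothetical helpers that are not part of this repository.

```python
import re

# Assumption: these patterns follow the publicly documented shapes of the
# identifiers. They are NOT how the model detects spans.
# IFSC: 4-letter bank code, a literal "0", then a 6-character branch code.
IFSC_RE = re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b")
# PAN: 5 letters, 4 digits, 1 letter.
PAN_RE = re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b")


def find_format_matches(text: str) -> dict[str, list[str]]:
    """Return substrings of `text` that have the IFSC or PAN shape."""
    return {
        "ifsc": IFSC_RE.findall(text),
        "pan": PAN_RE.findall(text),
    }


matches = find_format_matches("IFSC HDFC0001234, PAN ABCDE1234F, ref ZZ99")
print(matches)
```

Shape checks like these cannot tell a real identifier from a lookalike string, which is part of why a context-aware classifier is useful here.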
## Evaluation (431-case Janus DLP corpus, base model vs. this fine-tune)
| | Base `openai/privacy-filter` | This fine-tune |
|---|---|---|
| Overall F1 | 0.8557 | **0.9398** |
| Precision | 0.875 | 0.941 |
| Recall | 0.838 | 0.938 |
| p50 latency (T4) | 26 ms | 27 ms |
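
As a quick sanity check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values. The card does not say how the overall F1 is aggregated, so treating it as a single harmonic mean is an assumption; the results (about 0.856 and 0.939) match the reported 0.8557 and 0.9398 to within what three-decimal rounding of the inputs allows.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision / recall as reported in the table (rounded to three decimals).
base_f1 = f1_score(0.875, 0.838)
tuned_f1 = f1_score(0.941, 0.938)

print(f"base:  {base_f1:.4f}")
print(f"tuned: {tuned_f1:.4f}")
```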
### Largest gains over base
| Category | Base F1 | This F1 |
|---|---|---|
| ifsc | 0.545 | **0.938** |
| aws_keys | 0.667 | **0.968** |
| db_connection_strings | 0.737 | **1.000** |
| credit_card | 0.643 | **0.875** |
| email | 0.769 | **0.938** |
| address_in | 0.815 | **0.968** |
| upi_id | 0.800 | **0.968** |
| gstin | 0.875 | **1.000** |
| voter_id | 0.909 | **1.000** |
| passport_in | 0.968 | **1.000** |

Categories where the base model was already strong (private_keys, multi_pii, iban, secrets_in_code, ssn, etc.) were preserved.
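
The absolute gain per category can be read off the table by subtracting the base F1 from the fine-tuned F1. A small sketch that does this and sorts by gain (the `scores` dictionary simply copies the table values; the variable names are illustrative):

```python
# (base F1, fine-tuned F1) per category, copied from the table above.
scores = {
    "ifsc": (0.545, 0.938),
    "aws_keys": (0.667, 0.968),
    "db_connection_strings": (0.737, 1.000),
    "credit_card": (0.643, 0.875),
    "email": (0.769, 0.938),
    "address_in": (0.815, 0.968),
    "upi_id": (0.800, 0.968),
    "gstin": (0.875, 1.000),
    "voter_id": (0.909, 1.000),
    "passport_in": (0.968, 1.000),
}

# Absolute F1 gain, largest first.
gains = sorted(
    ((name, round(tuned - base, 3)) for name, (base, tuned) in scores.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, gain in gains:
    print(f"{name:<22} +{gain:.3f}")
```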
## Training
- 957 spans-format examples (770 train / 187 val) covering Indian PII templates + regression-protection negatives
- 3 epochs, batch 4, grad-accum 2, lr 1e-4, bf16 output
- Trained on a single Tesla T4 in 78 seconds total
- Best checkpoint: epoch 2 (val_loss 0.029, val_token_accuracy 98.83%)
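
The hyperparameters above imply a small number of optimizer steps, which helps explain the short training time. A rough sketch of the arithmetic, assuming a single device and that the final partial batch of each epoch still triggers an update (the training framework is not stated, so the exact step count may differ by a few):

```python
import math

train_examples, val_examples = 770, 187
batch_size, grad_accum, epochs = 4, 2, 3

# Examples consumed per optimizer update.
effective_batch = batch_size * grad_accum

# Assumption: the last, partially filled accumulation window still steps.
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(f"total examples:   {train_examples + val_examples}")
print(f"effective batch:  {effective_batch}")
print(f"steps per epoch:  {steps_per_epoch}")
print(f"total steps:      {total_steps}")
```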
## Usage
```python
from opf._api import OPF

# Load the fine-tuned checkpoint on a GPU.
opf = OPF(model="janus-ai/privacy-filter-india-v2", device="cuda")

# Detect sensitive spans and replace each with its category placeholder.
result = opf.redact("my Aadhaar 234567890123 IFSC HDFC0001234 password Admin#123")
print(result.redacted_text)
# my Aadhaar <ACCOUNT_NUMBER> IFSC <ACCOUNT_NUMBER> password <SECRET>
```
Drop-in replacement for the base `openai/privacy-filter` checkpoint — same tokenizer, same architecture, same 8-label space.
## Limitations
- Trained primarily on English + Hinglish-context Indian PII. Non-Indian addresses/names rely on base-model coverage.
- Date-of-birth detection (`private_date`) sits at F1 ~0.85 — context-gating could help.
- 18 false positives across 431 cases (about 4.2% of cases).
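
One possible reading of "context-gating" for `private_date` is a post-processing step that keeps a predicted date span only when a date-of-birth cue appears nearby. The sketch below illustrates that idea under stated assumptions: the `Span` type, the `DOB_CUES` keyword list, and the 40-character window are all hypothetical choices, not part of this model or of the `opf` package.

```python
import re
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical predicted span: label plus character offsets."""
    label: str
    start: int
    end: int


# Assumption: a small, hand-picked list of date-of-birth cues.
DOB_CUES = re.compile(
    r"\b(dob|d\.o\.b|date of birth|born|birth ?date)\b", re.IGNORECASE
)


def gate_date_spans(text: str, spans: list[Span], window: int = 40) -> list[Span]:
    """Keep private_date spans only if a DOB cue occurs within `window` chars."""
    kept = []
    for span in spans:
        if span.label != "private_date":
            kept.append(span)  # other labels pass through untouched
            continue
        context = text[max(0, span.start - window): span.end + window]
        if DOB_CUES.search(context):
            kept.append(span)
    return kept


text = (
    "Customer DOB: 12/03/1990. "
    "The invoice for the annual maintenance contract was raised on 05/01/2024."
)
spans = [
    Span("private_date", text.index("12/03/1990"), text.index("12/03/1990") + 10),
    Span("private_date", text.index("05/01/2024"), text.index("05/01/2024") + 10),
]
kept = gate_date_spans(text, spans)
print([text[s.start:s.end] for s in kept])
```

A gate like this trades recall for precision: dates of birth mentioned without an explicit cue would be dropped, so the cue list and window size would need tuning against the evaluation corpus.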
## License
Apache 2.0 — inherited from `openai/privacy-filter`.