Initial push: janus-ai privacy-filter India v2 (F1=0.9398 vs base 0.8557)


	---
	license: apache-2.0
	base_model: openai/privacy-filter
	tags:
	- token-classification
	- pii-detection
	- secrets-detection
	- dlp
	- india
	- dpdp
	language:
	- en
	- hi
	metrics:
	- f1
	- precision
	- recall
	---

	# privacy-filter-india-v2

	Fine-tune of [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) for Indian-PII + secrets detection in enterprise DLP.

	## What this is

	A token classifier targeting the 8 base categories from `openai/privacy-filter` (`account_number`, `private_address`, `private_date`, `private_email`, `private_person`, `private_phone`, `private_url`, `secret`). It is fine-tuned to recognise the long tail of India-specific identifiers and the secret formats that the base model under-detected:

	- IFSC codes (Indian bank routing)
	- Indian credit cards including Visa/MC/Amex/JCB/Diners/RuPay BINs
	- Indian-domain emails (.in / .co.in)
	- Indian addresses in short form
	- UPI IDs across all 18 PSPs
	- AWS keys in natural-language contexts (`my access key is AKIA...`)
	- Database connection strings with embedded credentials
	- Indian passport, voter ID, GSTIN, PAN, Aadhaar, ABHA, bank account patterns
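
To give a feel for what some of these identifiers look like, here is a minimal sketch of format patterns for four of them. The regexes are assumptions based on the publicly documented layouts (IFSC: 4 letters, a `0`, 6 alphanumerics; PAN: 5 letters, 4 digits, 1 letter; GSTIN: 2-digit state code, a PAN, an entity character, `Z`, a check character; Aadhaar: 12 digits not starting with 0 or 1). They are not part of this model, which is a learned token classifier, and `PATTERNS` and `find_candidates` are hypothetical names used only for illustration.

```python
import re

# Illustrative format patterns only (assumed from public format descriptions).
# The fine-tuned model does NOT use these; it classifies tokens in context.
PATTERNS = {
    "ifsc": re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),
    "pan": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    "gstin": re.compile(r"\b[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]\b"),
    "aadhaar": re.compile(r"\b[2-9][0-9]{3}\s?[0-9]{4}\s?[0-9]{4}\b"),
}


def find_candidates(text):
    """Return (label, matched_text) pairs for every pattern hit in text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


print(find_candidates("my Aadhaar 234567890123 IFSC HDFC0001234"))
```

Patterns like these check shape only: they ignore checksums and cannot tell a 12-digit order number from an Aadhaar number, which is the kind of ambiguity a context-aware classifier is meant to resolve.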

	## Eval (431-case Janus DLP corpus, base model + this fine-tune)

	| Metric | Base `openai/privacy-filter` | This fine-tune |
	|---|---|---|
	| Overall F1 | 0.8557 | 0.9398 |
	| Precision | 0.875 | 0.941 |
	| Recall | 0.838 | 0.938 |
	| p50 latency (T4) | 26 ms | 27 ms |
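
As a sanity check on the table, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two rows above. This is a minimal sketch; the helper name `f1_score` is ours, and small differences from the reported F1 are expected because precision and recall are shown rounded to three decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in the eval table
rows = {
    "base": (0.875, 0.838, 0.8557),
    "fine-tune": (0.941, 0.938, 0.9398),
}

for name, (p, r, reported) in rows.items():
    recomputed = f1_score(p, r)
    print(f"{name}: recomputed F1 = {recomputed:.4f}, reported = {reported:.4f}")
```

Both recomputed values land within 0.001 of the reported figures, which is consistent with rounding of the inputs.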

	### Largest gains over base

	| Category | Base F1 | This F1 |
	|---|---|---|
	| ifsc | 0.545 | 0.938 |
	| aws_keys | 0.667 | 0.968 |
	| db_connection_strings | 0.737 | 1.000 |
	| credit_card | 0.643 | 0.875 |
	| email | 0.769 | 0.938 |
	| address_in | 0.815 | 0.968 |
	| upi_id | 0.800 | 0.968 |
	| gstin | 0.875 | 1.000 |
	| voter_id | 0.909 | 1.000 |
	| passport_in | 0.968 | 1.000 |

	Categories that were already strong in the base model (private_keys, multi_pii, iban, secrets_in_code, ssn, etc.) are preserved.
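
To rank the per-category improvements, the absolute F1 gain can be computed directly from the table. The sketch below copies the values from the table; the variable names are ours.

```python
# (base F1, fine-tune F1) per category, copied from the table above
scores = {
    "ifsc": (0.545, 0.938),
    "aws_keys": (0.667, 0.968),
    "db_connection_strings": (0.737, 1.000),
    "credit_card": (0.643, 0.875),
    "email": (0.769, 0.938),
    "address_in": (0.815, 0.968),
    "upi_id": (0.800, 0.968),
    "gstin": (0.875, 1.000),
    "voter_id": (0.909, 1.000),
    "passport_in": (0.968, 1.000),
}

# Absolute F1 gain, largest first
gains = sorted(
    ((name, new - base) for name, (base, new) in scores.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, gain in gains:
    print(f"{name:<24} +{gain:.3f}")
```

By this measure `ifsc` gains the most and `passport_in`, which was already near ceiling in the base model, gains the least.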

	## Training

	- 957 spans-format examples (770 train / 187 val) covering Indian PII templates + regression-protection negatives
	- 3 epochs, batch 4, grad-accum 2, lr 1e-4, bf16 output
	- Trained on a single Tesla T4 in 78 seconds total
	- Best checkpoint: epoch 2 (val_loss 0.029, val_token_accuracy 98.83%)
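
The hyperparameters above imply a small number of optimizer steps. The arithmetic below is a rough sketch under two assumptions that the card does not state: the last partial batch of each epoch is kept, and gradient accumulation is flushed at the end of each epoch. The throughput figure divides by the full 78 seconds, so it is a lower bound if that time also covers validation or model loading.

```python
import math

train_examples = 770
val_examples = 187
epochs = 3
micro_batch = 4
grad_accum = 2
wall_clock_s = 78

# Examples consumed per optimizer step
effective_batch = micro_batch * grad_accum

# Assumption: partial batches are kept and accumulation is flushed each epoch
micro_batches_per_epoch = math.ceil(train_examples / micro_batch)
steps_per_epoch = math.ceil(micro_batches_per_epoch / grad_accum)
total_steps = steps_per_epoch * epochs

# Lower bound on training throughput (78 s may include eval and setup)
examples_per_second = train_examples * epochs / wall_clock_s

print(f"total examples:    {train_examples + val_examples}")
print(f"effective batch:   {effective_batch}")
print(f"steps per epoch:   {steps_per_epoch}")
print(f"total steps:       {total_steps}")
print(f"examples / second: {examples_per_second:.1f}")
```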

	## Usage

	```python
	from opf._api import OPF

	opf = OPF(model="janus-ai/privacy-filter-india-v2", device="cuda")
	result = opf.redact("my Aadhaar 234567890123 IFSC HDFC0001234 password Admin#123")
	print(result.redacted_text)
	# my Aadhaar <ACCOUNT_NUMBER> IFSC <ACCOUNT_NUMBER> password <SECRET>
	```

	Drop-in replacement for the base `openai/privacy-filter` checkpoint — same tokenizer, same architecture, same 8-label space.
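
For readers who want to see how label placeholders such as `<ACCOUNT_NUMBER>` relate to detected spans, here is a minimal, library-free sketch. It assumes spans arrive as `(start, end, label)` character offsets with an exclusive end; that representation, and the helpers `span_for` and `redact_spans`, are hypothetical and are not the `opf` API or its internals.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label): an assumed format


def span_for(text: str, value: str, label: str) -> Span:
    """Build a span for the first occurrence of value in text."""
    start = text.index(value)
    return (start, start + len(value), label)


def redact_spans(text: str, spans: List[Span]) -> str:
    """Replace each span with an upper-cased <LABEL> placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            continue  # skip spans that overlap one already redacted
        pieces.append(text[cursor:start])
        pieces.append(f"<{label.upper()}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "my Aadhaar 234567890123 IFSC HDFC0001234 password Admin#123"
spans = [
    span_for(text, "234567890123", "account_number"),
    span_for(text, "HDFC0001234", "account_number"),
    span_for(text, "Admin#123", "secret"),
]
redacted = redact_spans(text, spans)
print(redacted)
# my Aadhaar <ACCOUNT_NUMBER> IFSC <ACCOUNT_NUMBER> password <SECRET>
```

Because the label space is unchanged from the base model, downstream code that maps labels to placeholders in this way should not need changes when switching checkpoints.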

	## Limitations

	- Trained primarily on English + Hinglish-context Indian PII. Non-Indian addresses/names rely on base-model coverage.
	- Date-of-birth detection (`private_date`) sits at F1 ~0.85 — context-gating could help.
	- 18 false positives across 431 cases (18/431 ≈ 4.2% of cases).

	## License

	Apache 2.0 — inherited from `openai/privacy-filter`.