OPF Russian PII 66K — Fine-tuned OpenAI Privacy Filter
Fine-tuned checkpoint of openai/privacy-filter for Russian PII detection and anonymization tasks.
This model is intended for detecting personal and sensitive information in Russian fintech-style text, with future extension toward Kazakh and mixed Russian/Kazakh call-center transcripts.
Model Details
- Base model: openai/privacy-filter
- Task: PII detection / token classification / anonymization support
- Domain: Russian fintech-style privacy filtering
- Training dataset: bbeglerov/russian-pi-66k-opf
- Checkpoint format: safetensors
- Output: OPF-compatible span predictions
Intended Use
This checkpoint is intended for:
- detecting PII in Russian text;
- privacy filtering before LLM/RAG processing;
- anonymization pipelines for fintech, customer-support, and call-center transcripts;
- research on multilingual PII detection for Russian and Kazakh.
Out-of-Scope Use
This model should not be used as the only privacy/security control in production without additional validation.
It may fail on:
- noisy ASR transcripts;
- Kazakh-only text;
- mixed Russian/Kazakh/English conversations;
- unseen document formats;
- adversarial or deliberately obfuscated PII;
- domains far from the training distribution.
For production use, combine the model with rule-based validators, logging, red-team tests, and human review for high-risk workflows.
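One way to pair the model with a rule-based validator is to run a few high-precision regular expressions alongside it and report anything the model did not cover. The sketch below is illustrative only: the `(label, start, end)` span tuples, the `missed_by_model` helper, and both patterns are assumptions made for this example and are not part of OPF; the real output schema may differ.

```python
import re

# Assumed span format: (label, start, end) character offsets into the text.
# This is a hypothetical schema, not the documented OPF output.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# "+7" or "8" followed by ten digits, allowing spaces, dashes, or parentheses.
PHONE_RE = re.compile(r"(?<!\d)(?:\+7|8)(?:[\s\-()]*\d){10}(?!\d)")

RULES = {"private_email": EMAIL_RE, "private_phone": PHONE_RE}


def missed_by_model(text, model_spans):
    """Return regex matches that no model span overlaps."""
    missed = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            overlaps = any(
                start < match.end() and match.start() < end
                for _, start, end in model_spans
            )
            if not overlaps:
                missed.append((label, match.start(), match.end()))
    return missed
```

Matches returned here can be logged for review or merged into the span list before redaction. The patterns are deliberately narrow, so they act as a safety net rather than a replacement for the model.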
Labels
The model was trained with a custom OPF-compatible label space.
Main labels include:
- private_person
- private_phone
- private_email
- private_address
- private_date
- social_number
- account_number
- id_card_number
- driver_license_number
- tax_number
- username
- secret
Usage
Install OpenAI Privacy Filter from the official repository:
git clone https://github.com/openai/privacy-filter.git
cd privacy-filter
pip install -e .
Download this checkpoint:
hf download YOUR_USERNAME/YOUR_MODEL_REPO \
--local-dir ./opf-russian-pii-66k
Run inference:
opf --checkpoint ./opf-russian-pii-66k \
"Меня зовут Ержан Ахметов, мой ИИН 990101300123, телефон +7 777 123 45 67, адрес Алматы, Абая 10."
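Once spans are available, anonymization amounts to replacing each span with a placeholder. The following is a minimal sketch, assuming spans have already been converted to `(label, start, end)` character offsets; that tuple format and the `redact` helper are hypothetical and not the OPF output schema.

```python
def redact(text, spans):
    """Replace each (label, start, end) span with a [LABEL] placeholder."""
    pieces = []
    cursor = 0
    for label, start, end in sorted(spans, key=lambda s: s[1]):
        if start < cursor:
            # Skip a span that overlaps one already replaced.
            continue
        pieces.append(text[cursor:start])
        pieces.append(f"[{label.upper()}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hand-built spans for illustration; real spans would come from the model.
text = "Меня зовут Ержан Ахметов, телефон +7 777 123 45 67."
name, phone = "Ержан Ахметов", "+7 777 123 45 67"
spans = [
    ("private_phone", text.index(phone), text.index(phone) + len(phone)),
    ("private_person", text.index(name), text.index(name) + len(name)),
]
redacted = redact(text, spans)
```

Here `redacted` is `Меня зовут [PRIVATE_PERSON], телефон [PRIVATE_PHONE].` Processing spans in order of start offset and rebuilding the string from pieces avoids offset drift that in-place replacement would cause.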
Training
Training was performed with OPF fine-tuning on bbeglerov/russian-pi-66k-opf.
Approximate training command:
opf train /workspace/data/russian-pi-66k-opf/data/train.jsonl \
--label-space-json /workspace/data/russian-pi-66k-opf/label_space.json \
--checkpoint /workspace/models/openai-privacy-filter \
--output-dir /workspace/outputs/opf-russian-pii-66k \
--epochs 1 \
--batch-size 8 \
--grad-accum-steps 2 \
--learning-rate 1e-5 \
--weight-decay 0.01 \
--max-grad-norm 1.0 \
--n-ctx 128 \
--output-param-dtype bf16 \
--device cuda
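The batch-related flags above combine as follows. This arithmetic assumes that `--batch-size` is the per-step micro-batch and that `--n-ctx` caps the number of tokens per example; both readings are assumptions about the CLI, not confirmed behaviour.

```python
batch_size = 8        # --batch-size
grad_accum_steps = 2  # --grad-accum-steps
n_ctx = 128           # --n-ctx

# Examples contributing to each optimizer update.
effective_batch = batch_size * grad_accum_steps
# Upper bound on tokens per update if every example fills the context.
max_tokens_per_update = effective_batch * n_ctx

print(effective_batch, max_tokens_per_update)  # 16 2048
```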
Evaluation
Evaluation was performed on the validation split of bbeglerov/russian-pi-66k-opf.
Typed Detection Metrics
| Metric | Value |
|---|---|
| Precision | 0.9990 |
| Recall | 0.9988 |
| F1 | 0.9989 |
| Span Precision | 0.9961 |
| Span Recall | 0.9961 |
| Span F1 | 0.9961 |
| Token Accuracy | 0.9984 |
| Loss | 0.0072 |
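As a consistency check, the reported F1 values can be recomputed from the precision and recall in the table using the standard harmonic-mean definition. The `f1` helper is written for this example only.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


typed_f1 = round(f1(0.9990, 0.9988), 4)  # 0.9989
span_f1 = round(f1(0.9961, 0.9961), 4)   # 0.9961
```

Both values agree with the table to four decimal places.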
Per-Class Span F1
| Label | F1 |
|---|---|
| account_number | 0.9988 |
| private_address | 0.9959 |
| private_date | 1.0000 |
| private_email | 0.9991 |
| private_person | 0.9929 |
| private_phone | 1.0000 |
| secret | 0.9968 |
| username | 0.9948 |
| tax_number | 0.9982 |
| id_card_number | 0.9912 |
| social_number | 1.0000 |
| driver_license_number | 0.9851 |
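The per-class scores can be ranked to see which labels deserve the most attention in follow-up evaluation. The snippet below copies the values from the table; the unweighted mean it computes is a derived figure for orientation only and is not a metric reported by the evaluation run.

```python
per_class_f1 = {
    "account_number": 0.9988,
    "private_address": 0.9959,
    "private_date": 1.0000,
    "private_email": 0.9991,
    "private_person": 0.9929,
    "private_phone": 1.0000,
    "secret": 0.9968,
    "username": 0.9948,
    "tax_number": 0.9982,
    "id_card_number": 0.9912,
    "social_number": 1.0000,
    "driver_license_number": 0.9851,
}

# Three labels with the lowest span F1.
weakest = sorted(per_class_f1, key=per_class_f1.get)[:3]
# Unweighted (macro) mean across the twelve labels.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
```

The three lowest-scoring labels are `driver_license_number`, `id_card_number`, and `private_person`.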
Important Evaluation Caveat
These results are measured on the validation split from the same dataset distribution used for fine-tuning.
High validation scores do not guarantee production performance on real call-center transcripts, noisy ASR output, Kazakh text, or mixed-language conversations.
Recommended additional evaluation:
- negative set with no PII;
- noisy ASR-style text;
- mixed Russian/Kazakh/English text;
- real manually reviewed fintech transcripts;
- comparison against the base openai/privacy-filter checkpoint.
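For the negative set, a simple document-level measure is the share of PII-free texts on which the model predicts at least one span. The sketch below assumes predictions have been collected as one list of spans per document; the `false_positive_rate` helper and that input format are hypothetical.

```python
def false_positive_rate(predictions):
    """Fraction of PII-free documents with at least one predicted span.

    predictions: one list of predicted spans per document in the negative set.
    """
    if not predictions:
        raise ValueError("negative set is empty")
    flagged = sum(1 for spans in predictions if spans)
    return flagged / len(predictions)


# Four PII-free documents, one of which triggered a spurious prediction.
rate = false_positive_rate([[], [("private_person", 0, 3)], [], []])
```

Here `rate` is 0.25. Running the same computation for the base checkpoint on the same negative set gives a direct comparison of over-triggering.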
Limitations
The model may overfit to the structure and formatting of bbeglerov/russian-pi-66k-opf.
Known risk areas:
- false positives on short person-like tokens;
- reduced robustness on noisy transcription;
- limited Kazakh coverage;
- possible label confusion between numeric identifiers;
- domain shift outside synthetic or converted PII datasets.
Roadmap
Planned next versions:
- v2: mixed Russian + Kazakh PII dataset;
- Kazakh-specific labels and examples;
- ASR-noisy call-center evaluation set;
- hard-negative benchmark;
- comparison against base OPF and rule-based baselines.
Citation / Attribution
Base model:
openai/privacy-filter
Training dataset:
bbeglerov/russian-pi-66k-opf
Upstream source dataset:
wolframko/russian-pii-66k
Please check upstream dataset license and attribution requirements before using this model in public or commercial settings.