OPF Russian PII 66K — Fine-tuned OpenAI Privacy Filter

Fine-tuned checkpoint of openai/privacy-filter for Russian PII detection and anonymization tasks.

This model detects personal and sensitive information in Russian fintech-style text. Extension to Kazakh and mixed Russian/Kazakh call-center transcripts is planned.

Model Details

Base model: openai/privacy-filter
Task: PII detection / token classification / anonymization support
Domain: Russian fintech-style privacy filtering
Training dataset: bbeglerov/russian-pi-66k-opf
Checkpoint format: safetensors
Output: OPF-compatible span predictions

Intended Use

This checkpoint is intended for:

detecting PII in Russian text;
privacy filtering before LLM/RAG processing;
anonymization pipelines for fintech, customer-support, and call-center transcripts;
research on multilingual PII detection for Russian and Kazakh.

Out-of-Scope Use

This model should not be the only privacy or security control in production, and it needs additional validation on your own data before deployment.

It may fail on:

noisy ASR transcripts;
Kazakh-only text;
mixed Russian/Kazakh/English conversations;
unseen document formats;
adversarial or deliberately obfuscated PII;
domains far from the training distribution.

For production use, combine the model with rule-based validators, logging, red-team tests, and human review for high-risk workflows.
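
One way to pair the model with rule-based validators is to run a few regular expressions alongside it and flag anything the model did not cover. The following is a minimal sketch, not part of OPF: the patterns, the `missed_by_model` helper, and the `(start, end, label)` tuple format for model spans are all assumptions made for illustration.

```python
import re

# Illustrative fallback patterns only; real deployments need patterns tuned to their data.
FALLBACK_PATTERNS = {
    "private_email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "private_phone": re.compile(r"\+7[\s-]?\d{3}[\s-]?\d{3}[\s-]?\d{2}[\s-]?\d{2}"),
}


def missed_by_model(text, model_spans):
    """Return regex matches that no model span overlaps.

    model_spans is assumed to be a list of (start, end, label) tuples
    with an exclusive end offset; the real OPF output may differ.
    """
    missed = []
    for label, pattern in FALLBACK_PATTERNS.items():
        for match in pattern.finditer(text):
            overlaps = any(
                start < match.end() and match.start() < end
                for start, end, _ in model_spans
            )
            if not overlaps:
                missed.append((match.start(), match.end(), label))
    return sorted(missed)


text = "Почта ivan@example.com, телефон +7 777 123 45 67"
email_start = text.index("ivan@example.com")
# Pretend the model found the email but missed the phone number.
model_spans = [(email_start, email_start + len("ivan@example.com"), "private_email")]
print(missed_by_model(text, model_spans))
```

Matches returned here can be logged, merged into the redaction step, or routed to human review, depending on the risk level of the workflow.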

Labels

The model was trained with a custom OPF-compatible label space.

Main labels include:

private_person
private_phone
private_email
private_address
private_date
social_number
account_number
id_card_number
driver_license_number
tax_number
username
secret

Usage

Install OpenAI Privacy Filter from the official repository:

git clone https://github.com/openai/privacy-filter.git
cd privacy-filter
pip install -e .

Download this checkpoint:

hf download bbeglerov/opf-russian-pii-66k \
  --local-dir ./opf-russian-pii-66k

Run inference:

opf --checkpoint ./opf-russian-pii-66k \
"Меня зовут Ержан Ахметов, мой ИИН 990101300123, телефон +7 777 123 45 67, адрес Алматы, Абая 10."
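
Once spans are available, anonymization reduces to replacing each span with a placeholder. The sketch below shows that step in Python. The `Span` schema (character offsets plus a label) and the `redact` helper are hypothetical; check the actual OPF output format before adapting it. Spans are assumed not to overlap.

```python
from typing import Iterable, TypedDict


class Span(TypedDict):
    # Hypothetical span schema; the real OPF output fields may differ.
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def redact(text: str, spans: Iterable[Span]) -> str:
    """Replace each non-overlapping span with a [LABEL] placeholder."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = f"[{span['label'].upper()}]"
        out = out[: span["start"]] + placeholder + out[span["end"]:]
    return out


text = "Меня зовут Ержан Ахметов, телефон +7 777 123 45 67."
name = "Ержан Ахметов"
phone = "+7 777 123 45 67"
spans = [
    {"start": text.index(name), "end": text.index(name) + len(name), "label": "private_person"},
    {"start": text.index(phone), "end": text.index(phone) + len(phone), "label": "private_phone"},
]
print(redact(text, spans))
# Меня зовут [PRIVATE_PERSON], телефон [PRIVATE_PHONE].
```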

Training

Training was performed with OPF fine-tuning on bbeglerov/russian-pi-66k-opf.

Approximate training command:

opf train /workspace/data/russian-pi-66k-opf/data/train.jsonl \
  --label-space-json /workspace/data/russian-pi-66k-opf/label_space.json \
  --checkpoint /workspace/models/openai-privacy-filter \
  --output-dir /workspace/outputs/opf-russian-pii-66k \
  --epochs 1 \
  --batch-size 8 \
  --grad-accum-steps 2 \
  --learning-rate 1e-5 \
  --weight-decay 0.01 \
  --max-grad-norm 1.0 \
  --n-ctx 128 \
  --output-param-dtype bf16 \
  --device cuda
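
The batch settings in the command above combine as follows. This is a small worked calculation, assuming that `--grad-accum-steps` multiplies the per-step batch and that each batch row holds one window of at most `--n-ctx` tokens. The example count of 60,000 is a hypothetical placeholder, not the real size of train.jsonl.

```python
import math

# Values taken from the training command above.
batch_size = 8
grad_accum_steps = 2
n_ctx = 128

# Examples contributing to each optimizer update.
effective_batch = batch_size * grad_accum_steps

# Upper bound on tokens per update, assuming one window of up to n_ctx tokens per row.
max_tokens_per_update = effective_batch * n_ctx


def optimizer_steps(num_examples: int, epochs: int = 1) -> int:
    """Optimizer updates needed, assuming a final partial batch still triggers an update."""
    return math.ceil(num_examples / effective_batch) * epochs


# 60_000 is a hypothetical placeholder for the number of training examples.
print(effective_batch, max_tokens_per_update, optimizer_steps(60_000))
# 16 2048 3750
```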

Evaluation

Evaluation was performed on the validation split of bbeglerov/russian-pi-66k-opf.

Typed Detection Metrics

Metric	Value
Precision	0.9990
Recall	0.9988
F1	0.9989
Span Precision	0.9961
Span Recall	0.9961
Span F1	0.9961
Token Accuracy	0.9984
Loss	0.0072
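
For readers reproducing the span metrics on their own data, here is a minimal sketch of span-level precision, recall, and F1. It assumes exact-match scoring on `(start, end, label)` tuples; the matching rule used by the OPF evaluation code may differ.

```python
def span_prf(gold, pred):
    """Exact-match span precision, recall, and F1 over (start, end, label) tuples."""
    gold, pred = set(gold), set(pred)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


gold = [(0, 5, "private_person"), (10, 20, "private_phone")]
# The second prediction is off by one character, so it does not count as a match.
pred = [(0, 5, "private_person"), (10, 19, "private_phone")]
print(span_prf(gold, pred))
# (0.5, 0.5, 0.5)
```

The reported Precision and Recall above are consistent with the usual definition F1 = 2PR / (P + R): 2 × 0.9990 × 0.9988 / (0.9990 + 0.9988) rounds to 0.9989.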

Per-Class Span F1

Label	F1
account_number	0.9988
private_address	0.9959
private_date	1.0000
private_email	0.9991
private_person	0.9929
private_phone	1.0000
secret	0.9968
username	0.9948
tax_number	0.9982
id_card_number	0.9912
social_number	1.0000
driver_license_number	0.9851

Important Evaluation Caveat

These results were measured on a validation split drawn from the same distribution as the fine-tuning data.

High validation scores do not guarantee production performance on real call-center transcripts, noisy ASR output, Kazakh text, or mixed-language conversations.

Recommended additional evaluation:

negative set with no PII;
noisy ASR-style text;
mixed Russian/Kazakh/English text;
real manually reviewed fintech transcripts;
comparison against the base openai/privacy-filter checkpoint.
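
The negative-set check in the list above can be sketched as a text-level false-positive rate: the share of PII-free texts on which the detector returns at least one span. The `detect` callable is a stand-in for whatever wrapper you write around the model, and the stub used here is purely illustrative.

```python
from typing import Callable, Iterable, List, Tuple

Span = Tuple[int, int, str]  # assumed (start, end, label) format


def false_positive_rate(
    negative_texts: Iterable[str],
    detect: Callable[[str], List[Span]],
) -> float:
    """Fraction of PII-free texts for which the detector returns any span."""
    texts = list(negative_texts)
    if not texts:
        raise ValueError("negative set is empty")
    flagged = sum(1 for text in texts if detect(text))
    return flagged / len(texts)


def stub_detect(text: str) -> List[Span]:
    # Illustrative stand-in for the model: flags any text that contains a digit.
    return [(0, len(text), "account_number")] if any(c.isdigit() for c in text) else []


negatives = ["Добрый день", "Заказ 12 готов", "Спасибо за обращение"]
print(false_positive_rate(negatives, stub_detect))
```

Running the same function with the base openai/privacy-filter checkpoint behind `detect` gives a direct comparison on identical inputs.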

Limitations

The model may overfit to the structure and formatting of bbeglerov/russian-pi-66k-opf.

Known risk areas:

false positives on short person-like tokens;
reduced robustness on noisy transcription;
limited Kazakh coverage;
possible label confusion between numeric identifiers;
domain shift outside synthetic or converted PII datasets.

Roadmap

Planned next versions:

v2: mixed Russian + Kazakh PII dataset;
Kazakh-specific labels and examples;
ASR-noisy call-center evaluation set;
hard-negative benchmark;
comparison against base OPF and rule-based baselines.

Citation / Attribution

Base model:

openai/privacy-filter

Training dataset:

bbeglerov/russian-pi-66k-opf

Upstream source dataset:

wolframko/russian-pii-66k

Please check the upstream dataset's license and attribution requirements before using this model in public or commercial settings.
