OPF Russian PII 66K — Fine-tuned OpenAI Privacy Filter

Fine-tuned checkpoint of openai/privacy-filter for Russian PII detection and anonymization tasks.

This model detects personal and sensitive information in Russian fintech-style text. Extension to Kazakh and mixed Russian/Kazakh call-center transcripts is planned for future versions.

Model Details

  • Base model: openai/privacy-filter
  • Task: PII detection / token classification / anonymization support
  • Domain: Russian fintech-style privacy filtering
  • Training dataset: bbeglerov/russian-pi-66k-opf
  • Checkpoint format: safetensors
  • Output: OPF-compatible span predictions

Intended Use

This checkpoint is intended for:

  • detecting PII in Russian text;
  • privacy filtering before LLM/RAG processing;
  • anonymization pipelines for fintech, customer-support, and call-center transcripts;
  • research on multilingual PII detection for Russian and Kazakh.

Out-of-Scope Use

Do not use this model as the sole privacy or security control in production. Validate it on your own data first.

It may fail on:

  • noisy ASR transcripts;
  • Kazakh-only text;
  • mixed Russian/Kazakh/English conversations;
  • unseen document formats;
  • adversarial or deliberately obfuscated PII;
  • domains far from the training distribution.

For production use, combine the model with rule-based validators, logging, red-team tests, and human review for high-risk workflows.
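As an illustration of pairing the model with rule-based validators, the sketch below flags emails and +7/8-prefixed phone numbers that no model span covers. The `(start, end, label)` character-offset span format, the helper names, and both regular expressions are assumptions made for this example; they are not part of OPF, and the patterns are far from exhaustive.

```python
import re

# Hypothetical span format: (start, end, label) character offsets into the text.
Span = tuple[int, int, str]

# Illustrative patterns only; production rules need broader, tested coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?:\+7|8)[\s\-(]*\d{3}[\s\-)]*\d{3}[\s-]*\d{2}[\s-]*\d{2}")


def rule_spans(text: str) -> list[Span]:
    """Find emails and phone numbers with simple regular expressions."""
    spans = [(m.start(), m.end(), "private_email") for m in EMAIL_RE.finditer(text)]
    spans += [(m.start(), m.end(), "private_phone") for m in PHONE_RE.finditer(text)]
    return spans


def missed_by_model(text: str, model_spans: list[Span]) -> list[Span]:
    """Return rule-based spans that do not overlap any model span."""
    def overlaps(a: Span, b: Span) -> bool:
        return a[0] < b[1] and b[0] < a[1]

    return [r for r in rule_spans(text)
            if not any(overlaps(r, m) for m in model_spans)]
```

Spans returned by `missed_by_model` can be logged or routed to human review rather than trusted blindly, since regular expressions produce false positives of their own.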

Labels

The model was trained with a custom OPF-compatible label space.

Main labels include:

  • private_person
  • private_phone
  • private_email
  • private_address
  • private_date
  • social_number
  • account_number
  • id_card_number
  • driver_license_number
  • tax_number
  • username
  • secret
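To show how these labels can drive anonymization, here is a minimal sketch that replaces each detected span with a bracketed label placeholder. It assumes predictions are available as non-overlapping `(start, end, label)` character-offset tuples; that format and the `anonymize` helper are hypothetical, so adapt them to whatever the OPF output actually contains.

```python
def anonymize(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Replace each (start, end, label) span with a [LABEL] placeholder.

    Assumes spans do not overlap (an assumption of this sketch).
    """
    out = text
    # Work from right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        out = out[:start] + f"[{label.upper()}]" + out[end:]
    return out
```

Working from the rightmost span backwards avoids recomputing offsets as the text length changes.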

Usage

Install OpenAI Privacy Filter from the official repository:

git clone https://github.com/openai/privacy-filter.git
cd privacy-filter
pip install -e .

Download this checkpoint:

hf download YOUR_USERNAME/YOUR_MODEL_REPO \
  --local-dir ./opf-russian-pii-66k

Run inference:

opf --checkpoint ./opf-russian-pii-66k \
"Меня зовут Ержан Ахметов, мой ИИН 990101300123, телефон +7 777 123 45 67, адрес Алматы, Абая 10."

Training

Training was performed with OPF fine-tuning on bbeglerov/russian-pi-66k-opf.

Approximate training command:

opf train /workspace/data/russian-pi-66k-opf/data/train.jsonl \
  --label-space-json /workspace/data/russian-pi-66k-opf/label_space.json \
  --checkpoint /workspace/models/openai-privacy-filter \
  --output-dir /workspace/outputs/opf-russian-pii-66k \
  --epochs 1 \
  --batch-size 8 \
  --grad-accum-steps 2 \
  --learning-rate 1e-5 \
  --weight-decay 0.01 \
  --max-grad-norm 1.0 \
  --n-ctx 128 \
  --output-param-dtype bf16 \
  --device cuda
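The batch settings above interact: with gradient accumulation, each optimizer update sees `batch-size × grad-accum-steps` sequences. The sketch below works through that arithmetic. It assumes a single device and that `--n-ctx` caps the number of tokens per sequence; both are assumptions about the run, and the example count passed to `optimizer_steps` is a placeholder, not the dataset size.

```python
import math

BATCH_SIZE = 8         # --batch-size
GRAD_ACCUM_STEPS = 2   # --grad-accum-steps
N_CTX = 128            # --n-ctx (assumed here to cap tokens per sequence)

# Sequences contributing to one optimizer update (single-device assumption).
effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS

# Upper bound on tokens per update, reached only if every sequence fills the context.
max_tokens_per_update = effective_batch * N_CTX


def optimizer_steps(n_examples: int, epochs: int = 1) -> int:
    """Optimizer updates needed to cover n_examples for the given number of epochs."""
    return epochs * math.ceil(n_examples / effective_batch)


print(effective_batch)        # 16
print(max_tokens_per_update)  # 2048
```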

Evaluation

Evaluation was performed on the validation split of bbeglerov/russian-pi-66k-opf.

Typed Detection Metrics

| Metric | Value |
|---|---|
| Precision | 0.9990 |
| Recall | 0.9988 |
| F1 | 0.9989 |
| Span Precision | 0.9961 |
| Span Recall | 0.9961 |
| Span F1 | 0.9961 |
| Token Accuracy | 0.9984 |
| Loss | 0.0072 |
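The reported F1 values are consistent with the harmonic mean of the listed precision and recall. The short check below recomputes them; the `f1` helper is written for this example and is not part of OPF.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1(0.9990, 0.9988), 4))  # 0.9989
print(round(f1(0.9961, 0.9961), 4))  # 0.9961
```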

Per-Class Span F1

| Label | F1 |
|---|---|
| account_number | 0.9988 |
| private_address | 0.9959 |
| private_date | 1.0000 |
| private_email | 0.9991 |
| private_person | 0.9929 |
| private_phone | 1.0000 |
| secret | 0.9968 |
| username | 0.9948 |
| tax_number | 0.9982 |
| id_card_number | 0.9912 |
| social_number | 1.0000 |
| driver_license_number | 0.9851 |

Important Evaluation Caveat

These results are measured on the validation split from the same dataset distribution used for fine-tuning.

High validation scores do not guarantee production performance on real call-center transcripts, noisy ASR output, Kazakh text, or mixed-language conversations.

Recommended additional evaluation:

  • negative set with no PII;
  • noisy ASR-style text;
  • mixed Russian/Kazakh/English text;
  • real manually reviewed fintech transcripts;
  • comparison against the base openai/privacy-filter checkpoint.
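For the additional evaluations listed above, a small scoring harness is often enough to start. The sketch below computes exact-match span precision, recall, and F1, plus the share of PII-free texts on which any span was predicted. Exact matching on `(start, end, label)` tuples is one common convention and an assumption here; OPF's own evaluation may score spans differently, and both function names are hypothetical.

```python
Span = tuple[int, int, str]  # hypothetical (start, end, label) format


def span_prf(gold: list[Span], pred: list[Span]) -> tuple[float, float, float]:
    """Exact-match span precision, recall and F1 for one text or a pooled set."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)  # spans matching on start, end and label
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


def negative_set_fp_rate(preds_per_text: list[list[Span]]) -> float:
    """Share of PII-free texts on which at least one span was predicted."""
    if not preds_per_text:
        return 0.0
    flagged = sum(1 for spans in preds_per_text if spans)
    return flagged / len(preds_per_text)
```

Running the same functions on predictions from this checkpoint and from the base openai/privacy-filter checkpoint gives a like-for-like comparison on each test set.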

Limitations

The model may overfit to the structure and formatting of bbeglerov/russian-pi-66k-opf.

Known risk areas:

  • false positives on short person-like tokens;
  • reduced robustness on noisy transcription;
  • limited Kazakh coverage;
  • possible label confusion between numeric identifiers;
  • degraded accuracy on text that differs from synthetic or converted PII datasets.

Roadmap

Planned next versions:

  • v2: mixed Russian + Kazakh PII dataset;
  • Kazakh-specific labels and examples;
  • ASR-noisy call-center evaluation set;
  • hard-negative benchmark;
  • comparison against base OPF and rule-based baselines.

Citation / Attribution

Base model:

openai/privacy-filter

Training dataset:

bbeglerov/russian-pi-66k-opf

Upstream source dataset:

wolframko/russian-pii-66k

Check the license and attribution requirements of the upstream dataset before using this model in public or commercial settings.
