---
base_model: openai/privacy-filter
pipeline_tag: token-classification
language:
- ru
tags:
- privacy
- pii
- token-classification
- russian
- opf
model-index:
- name: privacy-filter-ru
  results:
  - task:
      type: token-classification
      name: Token Classification
    dataset:
      name: ru_realistic_eval_v1
      type: local
    metrics:
    - type: f1
      value: 0.9916
      name: Raw span F1
  - task:
      type: token-classification
      name: Token Classification
    dataset:
      name: ru_raw_hard_v3_eval
      type: local
    metrics:
    - type: f1
      value: 1.0
      name: Raw span F1
---

# privacy-filter-ru

Russian PII fine-tune of [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter).

This checkpoint is the raw-model production candidate from the local `raw_hardening_v3` run. It is intended to run without deterministic post-processing.

## Labels

- `private_person`
- `private_phone`
- `private_email`
- `private_address`
- `private_date`
- `private_url`
- `account_number`
- `secret`

## Training

- Base checkpoint: `checkpoints/production_candidate_ru_v2`
- Original base model: `openai/privacy-filter`
- Epochs: 1
- Learning rate: `1e-6`
- Batch size: 1
- Gradient accumulation steps: 16
- Serialization dtype: `bfloat16`
- Train examples: 17,000
- Validation examples: 2,000

The v3 training mix targeted raw-model behavior that previously depended on a deterministic runtime layer: phone/account/secret label separation and person/date boundary cleanup.

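The batch size and gradient accumulation values listed in the Training section imply an effective batch of 16 examples per optimizer update. The arithmetic below is a quick sketch, assuming a single device and that a trailing partial accumulation group still triggers an update; neither assumption is stated in this card.

```bash
# Values copied from the Training list in this card.
batch_size=1
grad_accum_steps=16
train_examples=17000
epochs=1

# Effective batch = per-step batch size x accumulation steps (single device assumed).
effective_batch=$(( batch_size * grad_accum_steps ))

# Ceiling division: assumes a final partial accumulation group still produces an update.
steps_per_epoch=$(( (train_examples + effective_batch - 1) / effective_batch ))
total_steps=$(( steps_per_epoch * epochs ))

echo "effective batch size: ${effective_batch}"
echo "optimizer steps: ${total_steps}"
```

Under those assumptions the run performs 1,063 optimizer updates over its single epoch.
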
## Raw Evaluation

No deterministic post-processing was used for these metrics.

| Eval | v2 raw span F1 | v3 raw span F1 | v2 mismatch rows | v3 mismatch rows |
| --- | ---: | ---: | ---: | ---: |
| synthetic test | 1.0000 | 1.0000 | 0 | 0 |
| ru_realistic_eval_v1 | 0.8787 | 0.9916 | 158 | 11 |
| ru_phone_account_confusion_v1 | 1.0000 | 1.0000 | 0 | 0 |
| ru_date_negative_v1 | 1.0000 | 1.0000 | 0 | 0 |
| ru_raw_hard_v3_eval | 0.8350 | 1.0000 | 297 | 0 |
| ru_person_hard_eval | 0.8074 | 0.8074 | 183 | 183 |
| alexen2 | 0.8644 | 0.8547 | 228 | 241 |
| Rubai heldout | 0.8054 | 0.8036 | 3,131 | 3,136 |

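The card reports span-level F1 but does not spell out the matching rule. Under the common exact-match convention (an assumption here: a predicted span counts as correct only if its start, end, and label all match a gold span), the metric would be:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2PR}{P + R} = \frac{2\,TP}{2\,TP + FP + FN}
```

Here $TP$, $FP$, and $FN$ count spans rather than tokens. The "mismatch rows" columns appear to count evaluation rows whose predicted spans differ from the gold spans, which would be consistent with every F1 of 1.0000 in the table coinciding with 0 mismatch rows.
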
## Usage

```bash
opf --checkpoint apararti/privacy-filter-ru --device cuda "Мой номер 8 999 863 37 84, зовут Андрей Макаров."
```

For a local checkout:

```bash
opf --checkpoint ./privacy-filter-ru --device cuda "Перезвоните Наталье Никитиной на 8 903 914 81 88."
```

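For more than a handful of inputs, the one-shot invocation shown in this section can be wrapped in a loop. The sketch below is illustrative only: `filter_lines` is a hypothetical helper name, and it relies solely on the `--checkpoint` and `--device` flags and the positional text argument used in the commands in this section. It assumes one input text per line. Each call starts a new `opf` process, which likely reloads the checkpoint, so this suits small files rather than bulk traffic.

```bash
# Hypothetical helper: run opf once per non-empty line of a text file.
# Usage: filter_lines <checkpoint> <device> <input-file>
filter_lines() {
  local checkpoint="$1" device="$2" input="$3"
  local line
  # The "|| [ -n ... ]" clause keeps a final line that lacks a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    # Skip empty lines instead of sending empty text to the model.
    if [ -z "$line" ]; then
      continue
    fi
    # stdin is redirected so the command cannot consume the remaining input lines.
    opf --checkpoint "$checkpoint" --device "$device" "$line" < /dev/null
  done < "$input"
}
```

A call such as `filter_lines ./privacy-filter-ru cuda inputs.txt` would then process `inputs.txt` (a placeholder file name) line by line.
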
## Notes

This is a fine-tuned checkpoint, not the original OpenAI model. It is optimized for Russian PII filtering and should be validated on domain-specific shadow traffic before production rollout.