---
base_model: openai/privacy-filter
pipeline_tag: token-classification
language:
- ru
tags:
- privacy
- pii
- token-classification
- russian
- opf
model-index:
- name: privacy-filter-ru
  results:
  - task:
      type: token-classification
      name: Token Classification
    dataset:
      name: ru_realistic_eval_v1
      type: local
    metrics:
    - type: f1
      value: 0.9916
      name: Raw span F1
  - task:
      type: token-classification
      name: Token Classification
    dataset:
      name: ru_raw_hard_v3_eval
      type: local
    metrics:
    - type: f1
      value: 1.0
      name: Raw span F1
---
# privacy-filter-ru
Russian PII fine-tune of [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter).
This checkpoint is the raw-model production candidate from the local `raw_hardening_v3` run. It is intended to run without deterministic post-processing.
## Labels
- `private_person`
- `private_phone`
- `private_email`
- `private_address`
- `private_date`
- `private_url`
- `account_number`
- `secret`
## Training
- Base checkpoint: `checkpoints/production_candidate_ru_v2`
- Original base model: `openai/privacy-filter`
- Epochs: 1
- Learning rate: `1e-6`
- Batch size: 1
- Gradient accumulation steps: 16
- Serialization dtype: `bfloat16`
- Train examples: 17,000
- Validation examples: 2,000
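
As a quick sanity check on the hyperparameters above, the effective batch size and the approximate number of optimizer updates can be derived from the listed values. This is a minimal sketch, assuming single-device training and that a trailing partial accumulation window still triggers an update; neither detail is stated in this card.

```python
import math

# Values copied from the Training list above.
batch_size = 1
grad_accum_steps = 16
train_examples = 17_000
epochs = 1

# Examples contributing to each optimizer update
# (assumption: one device, no data parallelism).
effective_batch = batch_size * grad_accum_steps

# Assumption: the last, partially filled accumulation window still steps,
# hence ceil; a trainer that drops it would use floor instead.
optimizer_steps = math.ceil(train_examples * epochs / effective_batch)

print(effective_batch)   # 16
print(optimizer_steps)   # 1063
```

With a learning rate of `1e-6` and roughly a thousand updates, the v3 run is a light adjustment of the v2 checkpoint rather than a retrain.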
The v3 training mix targeted raw-model behavior that previously depended on a deterministic runtime layer: phone/account/secret label separation and person/date boundary cleanup.
## Raw Evaluation
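No deterministic post-processing was used for these metrics. Relative to v2, v3 improves on `ru_realistic_eval_v1` and `ru_raw_hard_v3_eval`, is unchanged on `ru_person_hard_eval`, and is slightly lower on `alexen2` and Rubai heldout.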
| Eval | v2 raw span F1 | v3 raw span F1 | v2 mismatch rows | v3 mismatch rows |
| --- | ---: | ---: | ---: | ---: |
| synthetic test | 1.0000 | 1.0000 | 0 | 0 |
| ru_realistic_eval_v1 | 0.8787 | 0.9916 | 158 | 11 |
| ru_phone_account_confusion_v1 | 1.0000 | 1.0000 | 0 | 0 |
| ru_date_negative_v1 | 1.0000 | 1.0000 | 0 | 0 |
| ru_raw_hard_v3_eval | 0.8350 | 1.0000 | 297 | 0 |
| ru_person_hard_eval | 0.8074 | 0.8074 | 183 | 183 |
| alexen2 | 0.8644 | 0.8547 | 228 | 241 |
| Rubai heldout | 0.8054 | 0.8036 | 3,131 | 3,136 |
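
This card does not define how "raw span F1" and "mismatch rows" are computed. The sketch below shows one plausible reading, assuming exact-match scoring on `(start, end, label)` triples and counting a row as a mismatch when its predicted span set differs from the gold set in any way. The function name and the span representation are hypothetical, not taken from the evaluation code behind the table.

```python
from typing import List, Tuple

# Hypothetical span representation: (start_char, end_char, label).
Span = Tuple[int, int, str]


def span_f1_and_mismatches(
    gold_rows: List[List[Span]], pred_rows: List[List[Span]]
) -> Tuple[float, int]:
    """Micro-averaged exact-match span F1 plus the number of mismatching rows."""
    if len(gold_rows) != len(pred_rows):
        raise ValueError("gold and predicted rows must align one-to-one")

    tp = fp = fn = 0
    mismatch_rows = 0
    for gold, pred in zip(gold_rows, pred_rows):
        gold_set, pred_set = set(gold), set(pred)
        tp += len(gold_set & pred_set)   # exact boundary and label match
        fp += len(pred_set - gold_set)   # predicted but not in gold
        fn += len(gold_set - pred_set)   # in gold but not predicted
        if gold_set != pred_set:
            mismatch_rows += 1

    denom = 2 * tp + fp + fn
    # Assumption: an evaluation with no gold and no predicted spans scores 1.0.
    f1 = 2 * tp / denom if denom else 1.0
    return f1, mismatch_rows


gold = [
    [(0, 5, "private_person")],
    [(3, 10, "private_phone"), (20, 30, "private_email")],
]
pred = [
    [(0, 5, "private_person")],
    # Same boundaries, wrong label: counts as one false positive and one false negative.
    [(3, 10, "account_number"), (20, 30, "private_email")],
]
f1, mismatches = span_f1_and_mismatches(gold, pred)
print(round(f1, 4), mismatches)  # 0.6667 1
```

Under this reading a phone/account label confusion costs both precision and recall, which matches the kind of error the v3 training mix targets.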
## Usage
```bash
opf --checkpoint apararti/privacy-filter-ru --device cuda "Мой номер 8 999 863 37 84, зовут Андрей Макаров."
```
For a local checkout:
```bash
opf --checkpoint ./privacy-filter-ru --device cuda "Перезвоните Наталье Никитиной на 8 903 914 81 88."
```
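
To call the same command from a script, a thin `subprocess` wrapper is one option. This is a minimal sketch that only reuses the flags shown above (`--checkpoint`, `--device`, and the positional text). The helper names are hypothetical, and because this card does not document the CLI's output format, the output is returned as raw text.

```python
import subprocess
from typing import List


def build_opf_command(
    text: str,
    checkpoint: str = "apararti/privacy-filter-ru",
    device: str = "cuda",
) -> List[str]:
    """Build the argument list for the `opf` invocation shown in this card."""
    return ["opf", "--checkpoint", checkpoint, "--device", device, text]


def run_opf(text: str, **kwargs: str) -> str:
    """Run `opf` and return its stdout unparsed (output format is not assumed)."""
    result = subprocess.run(
        build_opf_command(text, **kwargs),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the text as a separate list element avoids shell quoting problems with spaces and Cyrillic input, for example `run_opf("Перезвоните Наталье Никитиной на 8 903 914 81 88.", checkpoint="./privacy-filter-ru")`.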
## Notes
This is a fine-tuned checkpoint, not the original OpenAI model. It is optimized for Russian PII filtering and should be validated on domain-specific shadow traffic before production rollout.