Update README.md
Browse files
README.md
CHANGED
@@ -1,78 +1,151 @@
---
license: apache-2.0
base_model: openai/privacy-filter
tags:
- token-classification
- pii-detection
- onnx
-
- privacy
-- transformers.js
library_name: transformers
pipeline_tag: token-classification
---

-# Privacy Filter

-
-
-

-

-
-- `onnx/model_fp16.onnx.data` - weights (external data, ~2.6 GB)
-- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json` - tokenizer
-- `config.json` - model config with the 33 BIOES label taxonomy
-- `viterbi_calibration.json` - default operating-point biases for the Viterbi decoder

-##

-

-
-
-- `private_date`
-- `private_email`
-- `private_person`
-- `private_phone`
-- `private_url`
-- `secret`

-

-
-

-
-
-
-)

-
-
-

-
-

## Export notes

-- Exported with `torch.onnx.export(dynamo=True)` from `transformers>=5.6.0.dev0`
-
-
-
-
-
-

## License

-Apache 2.0,

-##

-
-
-
---
license: apache-2.0
base_model: openai/privacy-filter
tags:
- token-classification
- pii-detection
- pii-masking
- onnx
- onnxruntime
- privacy
library_name: transformers
pipeline_tag: token-classification
language:
- en
---

# Privacy Filter (ONNX, FP16)

FP16 ONNX export of [`openai/privacy-filter`](https://huggingface.co/openai/privacy-filter) for efficient inference with ONNX Runtime. The model detects eight categories of personally identifiable information (PII) in text and returns BIOES-tagged token spans.

The exported graph has dynamic batch and sequence dimensions, and has been validated against the original PyTorch implementation with 100% argmax agreement on reference prompts.

## Model details

|  |  |
|---|---|
| Base model | `openai/privacy-filter` |
| Parameters | 1.5 B total, 50 M active (128-expert top-4 MoE) |
| Precision | FP16 weights, FP32 router |
| Context length | Up to 128k tokens (dynamic) |
| Label set | 33 classes (`O` + BIOES × 8 categories) |
| License | Apache 2.0 |

### Detected categories

`account_number`, `private_address`, `private_date`, `private_email`, `private_person`, `private_phone`, `private_url`, `secret`

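The 33-class label set is `O` plus `B`/`I`/`E`/`S` prefixes over these eight categories. As an illustration only (the `id2label` ordering in `config.json` is authoritative, and the prefix order here is an assumption), the taxonomy can be enumerated like this:

```python
# Illustration: enumerate a BIOES taxonomy over the eight PII categories.
# The actual id2label ordering in config.json is authoritative.
CATEGORIES = [
    "account_number", "private_address", "private_date", "private_email",
    "private_person", "private_phone", "private_url", "secret",
]

labels = ["O"] + [
    f"{prefix}-{cat}" for cat in CATEGORIES for prefix in ("B", "I", "E", "S")
]

print(len(labels))  # 1 + 4 * 8 = 33, matching the model's 33 output classes
```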
### Repository contents

```
config.json               Model config including the 33-class id2label map
tokenizer.json            o200k tokenizer (tiktoken-compatible)
tokenizer_config.json
special_tokens_map.json
viterbi_calibration.json  Default operating-point biases for Viterbi decoding
onnx/
  model_fp16.onnx         Graph
  model_fp16.onnx.data    Weights (external data, ~2.6 GB)
```

## Installation

```bash
pip install onnxruntime transformers tiktoken numpy huggingface_hub
```

For GPU inference, substitute `onnxruntime-gpu` for `onnxruntime`.

## Usage

### Minimal example

```python
import json

import numpy as np
import onnxruntime as ort
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

repo = "yasserrmd/privacy-filter-ONNX"
local = snapshot_download(repo)

tokenizer = AutoTokenizer.from_pretrained(local)
session = ort.InferenceSession(
    f"{local}/onnx/model_fp16.onnx",
    providers=["CPUExecutionProvider"],  # or ["CUDAExecutionProvider"] for GPU
)

with open(f"{local}/config.json") as f:
    id2label = {int(k): v for k, v in json.load(f)["id2label"].items()}

text = "Hi, I'm Alice Smith, email alice@example.com."
enc = tokenizer(text, return_tensors="np", add_special_tokens=False)
logits = session.run(None, {
    "input_ids": enc["input_ids"].astype(np.int64),
    "attention_mask": enc["attention_mask"].astype(np.int64),
})[0]

labels = [id2label[int(i)] for i in logits[0].argmax(-1)]
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
for tok, lbl in zip(tokens, labels):
    if lbl != "O":
        print(f"{tok:<20} {lbl}")
```

### Complete usage with span decoding

The raw model output is per-token logits over 33 BIOES classes. For coherent spans, decode the logits with a constrained Viterbi pass using the biases in `viterbi_calibration.json`. A reference implementation is included in `examples/detect.py` in the export project; the essential steps are:

1. Tokenize the input with `return_offsets_mapping=True` to recover character positions.
2. Run the ONNX session to obtain logits of shape `[1, seq_len, 33]`.
3. Run Viterbi decoding over the 33 labels with legal BIOES transitions.
4. Group the resulting label sequence into spans and map token indices back to character spans via the offsets.

The `viterbi_calibration.json` file holds six transition-bias parameters under `operating_points.default.biases` that control the precision/recall trade-off. The defaults in this file are zeroed and match the reference implementation's `default` operating point.

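Steps 3 and 4 can be sketched as a plain Viterbi pass with a BIOES legality mask. This is not the reference `examples/detect.py` implementation and it omits the calibration biases; it is a minimal numpy sketch of constrained decoding under the usual BIOES rules (a span opened by `B-x` must continue with `I-x` or close with `E-x`, and a sequence may not begin mid-span):

```python
import numpy as np

def bioes_transition_mask(labels):
    """allowed[i, j] is True iff label j may legally follow label i."""
    def parts(lbl):
        return ("O", None) if lbl == "O" else tuple(lbl.split("-", 1))
    n = len(labels)
    allowed = np.zeros((n, n), dtype=bool)
    for i, a in enumerate(labels):
        pa, ca = parts(a)
        for j, b in enumerate(labels):
            pb, cb = parts(b)
            if pa in ("O", "E", "S"):  # no span open: stay outside or open one
                allowed[i, j] = pb in ("O", "B", "S")
            else:                      # span open (B/I): continue or close it
                allowed[i, j] = pb in ("I", "E") and cb == ca
    return allowed

def viterbi_decode(logits, labels):
    """Best label path under BIOES constraints; logits is [seq_len, n_labels]."""
    neg = -1e9
    allowed = bioes_transition_mask(labels)
    trans = np.where(allowed, 0.0, neg)
    # The sequence may not begin inside a span (no I-/E- at position 0).
    start_ok = np.array([lbl == "O" or lbl[0] in ("B", "S") for lbl in labels])
    score = np.where(start_ok, logits[0], neg)
    back = np.zeros((len(logits), len(labels)), dtype=int)
    for t in range(1, len(logits)):
        cand = score[:, None] + trans  # cand[i, j]: come from i, move to j
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + logits[t]
    path = [int(score.argmax())]
    for t in range(len(logits) - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return [labels[i] for i in reversed(path)]
```

Applying an operating-point bias from `viterbi_calibration.json` would roughly amount to adding the bias values onto the corresponding entries of `trans` before decoding.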
### Input and output shapes

| Tensor | Shape | Dtype |
|---|---|---|
| `input_ids` (input) | `[batch, sequence]` | `int64` |
| `attention_mask` (input) | `[batch, sequence]` | `int64` |
| `logits` (output) | `[batch, sequence, 33]` | `float32` |

Both `batch` and `sequence` are dynamic at runtime.

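Because both dimensions are dynamic, batching only requires padding the tokenized inputs to a common length and zeroing the padding in `attention_mask`. A minimal sketch (the helper name and the pad id of 0 are illustrative assumptions; any pad id works once the mask zeroes it out):

```python
import numpy as np

def pad_batch(id_lists, pad_id=0):
    """Pad variable-length token id lists into [batch, sequence] int64 tensors."""
    seq_len = max(len(ids) for ids in id_lists)
    input_ids = np.full((len(id_lists), seq_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(id_lists), seq_len), dtype=np.int64)
    for row, ids in enumerate(id_lists):
        input_ids[row, : len(ids)] = ids
        attention_mask[row, : len(ids)] = 1
    return input_ids, attention_mask

ids, mask = pad_batch([[101, 102, 103], [104]])
print(ids.shape, mask.tolist())  # (2, 3) [[1, 1, 1], [1, 0, 0]]
```

Feed both arrays to `session.run` and ignore logits at positions where the mask is 0.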
## Export notes

- Exported with `torch.onnx.export(dynamo=True)` from `transformers>=5.6.0.dev0` and `torch>=2.6`.
- The 128-expert top-4 MoE blocks in each decoder layer were rewritten to a dense-weighted-sum form to produce an ONNX-traceable graph while preserving reference arithmetic, including the clamped-SwiGLU activation (`alpha=1.702`, `limit=7.0`) and the post-experts scaling.
- The router linear layer remains in FP32 for numerical stability; all other weights are FP16.
- Parity validated against the PyTorch reference: maximum logit difference on the order of 1e-4, 100% argmax agreement across the standard evaluation prompts.

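For intuition, the clamped-SwiGLU activation and the dense-weighted-sum rewrite mentioned above can be sketched as follows. The activation formula is an assumed formulation built from the stated `alpha=1.702` and `limit=7.0`; the exported ONNX graph is the authoritative definition:

```python
import numpy as np

ALPHA, LIMIT = 1.702, 7.0  # constants stated in the export notes

def clamped_swiglu(gate, up, alpha=ALPHA, limit=LIMIT):
    """Assumed clamped-SwiGLU: clamp both branches, gate via a scaled sigmoid."""
    gate = np.minimum(gate, limit)   # gate pre-activation clamped from above
    up = np.clip(up, -limit, limit)  # linear branch clamped symmetrically
    return gate * (1.0 / (1.0 + np.exp(-alpha * gate))) * (up + 1.0)

def moe_dense_sum(expert_outputs, router_weights):
    """Dense-weighted-sum rewrite of top-k routing: unselected experts get
    weight 0, and the sum runs over all experts (ONNX-traceable, same math)."""
    # expert_outputs: [n_experts, hidden], router_weights: [n_experts]
    return (router_weights[:, None] * expert_outputs).sum(axis=0)

out = moe_dense_sum(np.ones((4, 2)), np.array([0.5, 0.5, 0.0, 0.0]))
print(out)  # [1. 1.] - only the two routed experts contribute
```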
## Intended use and limitations

This export preserves the behavior of the base model. Its intended use, evaluation results, and limitations are documented in the [base model card](https://huggingface.co/openai/privacy-filter) and the accompanying [OpenAI Privacy Filter Model Card (PDF)](https://cdn.openai.com/pdf/c66281ed-b638-456a-8ce1-97e9f5264a90/OpenAI-Privacy-Filter-Model-Card.pdf). In brief:

- Optimized primarily for English; multilingual performance varies.
- Model-based redaction is a data-minimization aid, not an anonymization guarantee or a compliance certification.
- For high-sensitivity domains (medical, legal, financial, government), pair it with human review and organization-specific policies.

## License

Apache 2.0, inherited from the base model.

## Citation

If you use this export, please cite the base model:

```bibtex
@misc{openai_privacy_filter_2026,
  title        = {OpenAI Privacy Filter},
  author       = {OpenAI},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/openai/privacy-filter}},
  note         = {Apache-2.0},
}
```