binga committed on
Commit e140ddc · verified · 1 Parent(s): 5d38e20

Multi-task: PII NER (F1=0.49) + 10-class doc clf (acc=0.25)
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,126 @@
---
license: apache-2.0
base_model: openai/privacy-filter
tags:
- token-classification
- text-classification
- multi-task
- pii-detection
- document-classification
- privacy
datasets:
- ai4privacy/pii-masking-400k
- community-datasets/yahoo_answers_topics
metrics:
- f1
- accuracy
model-index:
- name: privacy-filter-multitask
  results:
  - task:
      type: token-classification
      name: PII Detection
    dataset:
      name: ai4privacy/pii-masking-400k
      type: ai4privacy/pii-masking-400k
    metrics:
    - type: f1
      value: 0.4925
    - type: precision
      value: 0.6968
    - type: recall
      value: 0.3809
  - task:
      type: text-classification
      name: Document Classification
    dataset:
      name: yahoo_answers_topics
      type: community-datasets/yahoo_answers_topics
    metrics:
    - type: accuracy
      value: 0.2482
---
43
+
44
+ # Privacy Filter Multi-Task 🔒📄
45
+
46
+ A **single model** for simultaneous **PII Detection (NER)** and **Document Classification (10 categories)**.
47
+
48
+ Adapted from [openai/privacy-filter](https://huggingface.co/openai/privacy-filter) (1.4B Sparse MoE, ~50M active params/token).
49
+
50
+ ## Architecture
51
+
52
+ ```
53
+ Input → BPE Tokenizer (200K vocab)
54
+
55
+ 8-layer Sparse MoE Transformer (128 experts, top-4)
56
+ ↓ ↓
57
+ NER Head (640→33) Doc Head (mean-pool → 640→10)
58
+ ↓ ↓
59
+ BIOES PII tags 10 categories
60
+ ```
61
+
62
+ ## Results
63
+
64
+ | Task | Metric | Value |
65
+ |------|--------|-------|
66
+ | PII NER | F1 (strict, span) | **0.493** |
67
+ | PII NER | Precision | 0.697 |
68
+ | PII NER | Recall | 0.381 |
69
+ | PII NER | Token Accuracy | 0.944 |
70
+ | Doc Clf | Val Accuracy | 0.255 |
71
+ | Doc Clf | Test Accuracy | **0.248** |
72
+
73
+ ### Inference Speed
74
+ | Device | Latency |
75
+ |--------|---------|
76
+ | GPU A10G (bf16) | 178 ms |
77
+ | CPU (fp32) | 202 ms |
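
The "strict, span" F1 above counts a prediction as correct only when both the span boundaries and the entity type match a gold annotation exactly. A minimal sketch of that computation (the `gold`/`pred` span sets are illustrative, not values from this evaluation):

```python
def strict_span_f1(gold, pred):
    """Strict span-level P/R/F1: a predicted (start, end, type) span
    counts as a true positive only if it matches a gold span exactly."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: two gold spans; the model finds one exactly plus one spurious span
gold = [(0, 2, "private_person"), (5, 6, "private_email")]
pred = [(0, 2, "private_person"), (8, 9, "private_url")]
p, r, f = strict_span_f1(gold, pred)
print(p, r, f)  # 0.5 0.5 0.5
```

The high precision / low recall split in the table means the model under-predicts spans but is usually right when it does predict one.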

## PII Entity Types

`private_person` • `private_email` • `private_phone` • `private_address` • `private_date` • `private_url` • `account_number` • `secret`
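
The NER head emits BIOES tags (Begin / Inside / End / Single) over these 8 types, which decode into token spans. A minimal decoder sketch (the tag sequence is illustrative):

```python
def bioes_to_spans(tags):
    """Group a BIOES tag sequence into (start, end_exclusive, type) spans."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":                        # single-token entity
            spans.append((i, i + 1, label))
            start, etype = None, None
        elif prefix == "B":                      # entity begins
            start, etype = i, label
        elif prefix == "E" and etype == label:   # entity ends: emit span
            spans.append((start, i + 1, label))
            start, etype = None, None
        elif prefix != "I":                      # "O" or malformed: reset
            start, etype = None, None
    return spans

tags = ["B-private_person", "E-private_person", "O", "S-private_email"]
print(bioes_to_spans(tags))
# [(0, 2, 'private_person'), (3, 4, 'private_email')]
```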

## Document Categories

Society & Culture • Science & Math • Health • Education • Computers & Internet • Sports • Business & Finance • Entertainment • Family • Politics

## Usage

```python
import torch
import torch.nn as nn
from transformers import AutoModelForTokenClassification, AutoTokenizer
from huggingface_hub import hf_hub_download

tokenizer = AutoTokenizer.from_pretrained("binga/privacy-filter-multitask")
model = AutoModelForTokenClassification.from_pretrained(
    "binga/privacy-filter-multitask", dtype=torch.bfloat16, device_map="auto"
)

# The document-classification head ships separately as doc_head.pt
doc_head = nn.Linear(640, 10)
doc_head.load_state_dict(torch.load(
    hf_hub_download("binga/privacy-filter-multitask", "doc_head.pt"),
    weights_only=True, map_location=model.device
))
doc_head = doc_head.to(dtype=torch.bfloat16, device=model.device)

text = "John Smith (SSN: 123-45-6789) emailed john@corp.com"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# PII: print every token the NER head tags as an entity
for t, p in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
                out.logits.argmax(-1)[0]):
    label = model.config.id2label[p.item()]
    if label != "O":
        print(f"  {t} → {label}")

# Doc class: mask-aware mean-pool of the last hidden state → linear head
cats = ["Society", "Science", "Health", "Education", "Computers",
        "Sports", "Business", "Entertainment", "Family", "Politics"]
h = out.hidden_states[-1]
m = inputs["attention_mask"].unsqueeze(-1).to(h.dtype)
pooled = (h * m).sum(1) / m.sum(1).clamp(min=1)
print(f"Category: {cats[doc_head(pooled).argmax().item()]}")
```
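
A common follow-up is redacting what the NER head finds. A minimal sketch that masks character spans with their label; the `spans` list stands in for offsets you would derive from the model's predictions (e.g. via the tokenizer's `return_offsets_mapping=True`), and the example offsets are illustrative:

```python
def redact(text, spans, mask="[{}]"):
    """Replace character spans (start, end_exclusive, label) with a
    label placeholder. Applied right-to-left so earlier offsets stay valid."""
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + mask.format(label.upper()) + text[end:]
    return text

text = "John Smith (SSN: 123-45-6789) emailed john@corp.com"
# Character spans as they might come back from the NER head
spans = [(0, 10, "private_person"), (17, 28, "secret"), (38, 51, "private_email")]
print(redact(text, spans))
# [PRIVATE_PERSON] (SSN: [SECRET]) emailed [PRIVATE_EMAIL]
```

Redacting right-to-left avoids recomputing offsets after each replacement, since edits never shift text to their left.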

## Training

- Partial fine-tune: last 4/8 MoE layers + heads (636M/1.4B trainable)
- NER: ai4privacy/pii-masking-400k (20K en), Doc: yahoo_answers_topics (20K)
- Loss: NER×1.0 + Doc×0.5, AdamW LR=2e-5, cosine, 2 epochs, BS=16
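
The weighted objective in the last bullet (weights match `ner_weight`/`doc_weight` in `multitask_config.json`) combines both heads' losses into one scalar per batch. A sketch with illustrative loss values, not logged training numbers:

```python
def multitask_loss(ner_loss, doc_loss, ner_weight=1.0, doc_weight=0.5):
    """Single backprop target per batch: L = w_ner * L_ner + w_doc * L_doc."""
    return ner_weight * ner_loss + doc_weight * doc_loss

# Illustrative per-batch values from the two heads
total = multitask_loss(ner_loss=0.82, doc_loss=1.94)
print(round(total, 4))  # 1.79
```

Down-weighting the document loss keeps the (harder, noisier) 10-way classification task from dominating gradients for the shared trunk.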
config.json ADDED
@@ -0,0 +1,120 @@
{
  "architectures": [
    "OpenAIPrivacyFilterForTokenClassification"
  ],
  "attention_bias": true,
  "attention_dropout": 0.0,
  "bos_token_id": null,
  "classifier_dropout": 0.0,
  "default_n_ctx": 128000,
  "dtype": "bfloat16",
  "eos_token_id": 199999,
  "head_dim": 64,
  "hidden_act": "silu",
  "hidden_size": 640,
  "id2label": {
    "0": "O",
    "1": "B-account_number",
    "2": "I-account_number",
    "3": "E-account_number",
    "4": "S-account_number",
    "5": "B-private_address",
    "6": "I-private_address",
    "7": "E-private_address",
    "8": "S-private_address",
    "9": "B-private_date",
    "10": "I-private_date",
    "11": "E-private_date",
    "12": "S-private_date",
    "13": "B-private_email",
    "14": "I-private_email",
    "15": "E-private_email",
    "16": "S-private_email",
    "17": "B-private_person",
    "18": "I-private_person",
    "19": "E-private_person",
    "20": "S-private_person",
    "21": "B-private_phone",
    "22": "I-private_phone",
    "23": "E-private_phone",
    "24": "S-private_phone",
    "25": "B-private_url",
    "26": "I-private_url",
    "27": "E-private_url",
    "28": "S-private_url",
    "29": "B-secret",
    "30": "I-secret",
    "31": "E-secret",
    "32": "S-secret"
  },
  "initial_context_length": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 640,
  "label2id": {
    "B-account_number": 1,
    "B-private_address": 5,
    "B-private_date": 9,
    "B-private_email": 13,
    "B-private_person": 17,
    "B-private_phone": 21,
    "B-private_url": 25,
    "B-secret": 29,
    "E-account_number": 3,
    "E-private_address": 7,
    "E-private_date": 11,
    "E-private_email": 15,
    "E-private_person": 19,
    "E-private_phone": 23,
    "E-private_url": 27,
    "E-secret": 31,
    "I-account_number": 2,
    "I-private_address": 6,
    "I-private_date": 10,
    "I-private_email": 14,
    "I-private_person": 18,
    "I-private_phone": 22,
    "I-private_url": 26,
    "I-secret": 30,
    "O": 0,
    "S-account_number": 4,
    "S-private_address": 8,
    "S-private_date": 12,
    "S-private_email": 16,
    "S-private_person": 20,
    "S-private_phone": 24,
    "S-private_url": 28,
    "S-secret": 32
  },
  "max_position_embeddings": 131072,
  "model_type": "openai_privacy_filter",
  "num_attention_heads": 14,
  "num_experts_per_tok": 4,
  "num_hidden_layers": 8,
  "num_key_value_heads": 2,
  "num_local_experts": 128,
  "output_router_logits": false,
  "pad_token_id": 199999,
  "rms_norm_eps": 1e-05,
  "rope_parameters": {
    "beta_fast": 32.0,
    "beta_slow": 1.0,
    "factor": 32.0,
    "original_max_position_embeddings": 4096,
    "rope_theta": 150000.0,
    "rope_type": "yarn",
    "truncate": false
  },
  "router_aux_loss_coef": 0.001,
  "sliding_window": 128,
  "tie_word_embeddings": false,
  "transformers.js_config": {
    "use_external_data_format": {
      "model": 1,
      "model.onnx": 3,
      "model_fp16.onnx": 2
    }
  },
  "transformers_version": "5.6.2",
  "use_cache": false,
  "vocab_size": 200064
}
doc_head.pt ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8c9efd1c327f2e817f4247b645023afe31c927bdf1f17acbd4cad4aee3205b49
size 14701
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6b110c50cfd24bb8e63e0301e48033fe9ec444466192d00598ee5a3170ed9d1e
size 2798989498
multitask_config.json ADDED
@@ -0,0 +1,17 @@
{
  "num_doc_classes": 10,
  "doc_label_names": [
    "Society & Culture",
    "Science & Mathematics",
    "Health",
    "Education & Reference",
    "Computers & Internet",
    "Sports",
    "Business & Finance",
    "Entertainment & Music",
    "Family & Relationships",
    "Politics & Government"
  ],
  "ner_weight": 1.0,
  "doc_weight": 0.5
}
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e714c627d94fd333b14f9ff32436219a4d7ac969719efe340fdc3385e1c7cd3e
size 27868272
tokenizer_config.json ADDED
@@ -0,0 +1,13 @@
{
  "backend": "tokenizers",
  "eos_token": "<|endoftext|>",
  "is_local": true,
  "local_files_only": false,
  "model_input_names": [
    "input_ids",
    "attention_mask"
  ],
  "model_max_length": 128000,
  "pad_token": "<|endoftext|>",
  "tokenizer_class": "TokenizersBackend"
}