urchade committed on
Commit d4f264a · verified · 1 Parent(s): b56fb9a

Create README.md

Files changed (1)
  1. README.md +212 -0
README.md ADDED
@@ -0,0 +1,212 @@
+ ---
+ library_name: gliner2
+ language:
+ - en
+ - fr
+ - es
+ - de
+ - it
+ - pt
+ - nl
+ tags:
+ - pii
+ - ner
+ - privacy
+ - redaction
+ - gliner
+ - gliner2
+ - information-extraction
+ - span-extraction
+ license: apache-2.0
+ datasets:
+ - synthetic
+ pipeline_tag: token-classification
+ ---
+
+ # GLiNER2-PII: Multilingual PII Detection & Masking
+
+ **GLiNER2-PII** is a fine-tune of the [GLiNER2](https://github.com/fastino-ai/GLiNER2) model (205M parameters) for detecting and masking personally identifiable information across **42 entity types** and **7 languages**.
+
+ Trained entirely on a constraint-driven synthetic corpus of 4,910 annotated texts, it achieves the **highest span-level F1 (0.477, averaged over the legal and medical domains)** on the [SPY benchmark](https://aclanthology.org/2025.naacl-srw.23/) among the four systems compared; the other three are OpenAI Privacy Filter, NVIDIA GLiNER-PII, and urchade/gliner\_multi\_pii-v1.
+
+ 📄 **[Technical Report](https://github.com/fastino-ai/GLiNER2)**
+ 🔗 **[GitHub](https://github.com/fastino-ai/GLiNER2)**
+
+ ---
+
+ ## Quick Start
+
+ ```bash
+ pip install gliner2
+ ```
+
+ ```python
+ from gliner2 import GLiNER2
+
+ model = GLiNER2.from_pretrained("fastino/gliner2-pii-v1")
+
+ text = "Email john.smith@acme.com or call +1 415 555 0199."
+ labels = ["email", "phone_number", "person"]
+
+ result = model.extract_entities(
+     text,
+     labels,
+     threshold=0.5,
+     include_confidence=True,
+     include_spans=True,
+ )
+
+ print(result)
+ ```
+
+ You can pass **any subset** of the 42 supported labels; the model conditions on the labels you provide at inference time.
+
+ ---
+
+ ## Supported PII Labels (42 types)
+
+ | Group | Labels |
+ |---|---|
+ | **Person / names** | `person`, `full_name`, `first_name`, `middle_name`, `last_name`, `date_of_birth` |
+ | **Contact / address** | `email`, `phone_number`, `address`, `street_address`, `city`, `state_or_region`, `postal_code`, `country` |
+ | **Government / tax IDs** | `government_id`, `national_id_number`, `passport_number`, `drivers_license_number`, `license_number`, `tax_id`, `tax_number` |
+ | **Banking / payment** | `bank_account`, `account_number`, `routing_number`, `iban`, `payment_card`, `card_number`, `card_expiry`, `card_cvv` |
+ | **Digital identity** | `username`, `ip_address`, `account_id`, `sensitive_account_id` |
+ | **Secrets / credentials** | `password`, `secret`, `api_key`, `access_token`, `recovery_code` |
+ | **Sensitive dates** | `sensitive_date`, `document_date`, `expiration_date`, `transaction_date` |
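
Because any subset of labels can be passed at inference time, it can help to keep the groups from the table above in a small lookup and request one group at a time. The sketch below only transcribes the table; the names `LABEL_GROUPS`, `labels_for`, and `ALL_LABELS`, as well as the short group keys, are illustrative and not part of the `gliner2` package.

```python
# Label groups transcribed from the table above.
# LABEL_GROUPS, labels_for and ALL_LABELS are illustrative names, not gliner2 APIs.
LABEL_GROUPS = {
    "person": ["person", "full_name", "first_name", "middle_name",
               "last_name", "date_of_birth"],
    "contact": ["email", "phone_number", "address", "street_address", "city",
                "state_or_region", "postal_code", "country"],
    "government": ["government_id", "national_id_number", "passport_number",
                   "drivers_license_number", "license_number", "tax_id",
                   "tax_number"],
    "banking": ["bank_account", "account_number", "routing_number", "iban",
                "payment_card", "card_number", "card_expiry", "card_cvv"],
    "digital": ["username", "ip_address", "account_id", "sensitive_account_id"],
    "secrets": ["password", "secret", "api_key", "access_token", "recovery_code"],
    "dates": ["sensitive_date", "document_date", "expiration_date",
              "transaction_date"],
}


def labels_for(*groups):
    """Return the de-duplicated labels for the requested groups, in table order."""
    seen = []
    for group in groups:
        for label in LABEL_GROUPS[group]:
            if label not in seen:
                seen.append(label)
    return seen


ALL_LABELS = labels_for(*LABEL_GROUPS)
print(len(ALL_LABELS))  # 42
print(labels_for("secrets"))
```

A list built this way can be passed as the `labels` argument in the Quick Start call, for example `labels_for("contact", "banking")` for a payments-focused redaction pass.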
+
+ ---
+
+ ## Benchmark Results (SPY)
+
+ Evaluated on the [SPY benchmark](https://aclanthology.org/2025.naacl-srw.23/) (Savkin et al., 2025) with exact-match span-level metrics (best value per column in bold):
+
+ | Model | Legal P | Legal R | Legal F1 | Medical P | Medical R | Medical F1 | **Avg F1** |
+ |---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+ | **fastino/gliner2-pii-v1** | .346 | **.750** | **.473** | .369 | .686 | **.480** | **.477** |
+ | nvidia/gliner-PII | .343 | .452 | .390 | .368 | .465 | .411 | .400 |
+ | urchade/gliner\_multi\_pii-v1 | **.467** | .317 | .377 | **.518** | .351 | .419 | .398 |
+ | openai/privacy-filter | .242 | .656 | .354 | .287 | **.692** | .406 | .380 |
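
Exact-match span-level scoring counts a prediction as correct only when its start offset, end offset, and label all equal those of a gold span. A minimal sketch of that computation follows. The function name `span_prf`, the `(start, end, label)` tuple layout, and the toy spans are assumptions made for illustration; the actual SPY evaluation may differ in details such as label mapping.

```python
def span_prf(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) tuples."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: one exact hit, one boundary error, one spurious span.
gold = [(15, 27, "name"), (31, 54, "email")]
predicted = [(15, 27, "name"), (31, 50, "email"), (58, 72, "phone")]

p, r, f1 = span_prf(gold, predicted)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.333 0.5 0.4
```

Note that the email prediction with a wrong end offset earns no partial credit, which is why exact-match scores tend to be lower than overlap-based ones.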
+
+ ### Key takeaways
+
+ - **Highest F1** on both the legal and medical domains.
+ - **Best recall** among GLiNER-based detectors (0.718 avg), which is critical for redaction workflows where every missed span is a potential data leak.
+ - Consistent performance across domains (0.7-point F1 difference: .473 legal vs. .480 medical).
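
The figures quoted in these takeaways follow from the table. As a quick arithmetic check, F1 is the harmonic mean of precision and recall, `F1 = 2PR / (P + R)`, and the averages are unweighted means over the two domains. The snippet below recomputes them for the `fastino/gliner2-pii-v1` row. Since the table inputs are already rounded to three decimals, a difference of one unit in the last digit is expected (legal F1 comes out as 0.474 against the reported .473).

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Precision and recall for fastino/gliner2-pii-v1, copied from the table.
legal_f1 = f1(0.346, 0.750)
medical_f1 = f1(0.369, 0.686)
avg_f1 = (legal_f1 + medical_f1) / 2
avg_recall = (0.750 + 0.686) / 2

print(f"{legal_f1:.3f} {medical_f1:.3f} {avg_f1:.3f} {avg_recall:.3f}")
# 0.474 0.480 0.477 0.718
```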
+
+ ---
+
+ ## When to Use This Model
+
+ | Use case | Why GLiNER2-PII |
+ |---|---|
+ | **PII redaction / masking** | High recall minimises missed sensitive spans |
+ | **Data governance & GDPR/CCPA compliance** | 42 fine-grained types enable policy-specific routing |
+ | **Training-data hygiene** | Exact character spans for precise masking before model training |
+ | **Multi-language pipelines** | Trained on EN, FR, ES, DE, IT, PT, NL formats |
+
+ ---
+
+ ## Redaction Example
+
+ ```python
+ from gliner2 import GLiNER2
+
+ # Load the model once; reloading it inside redact() would repeat the cost on every call.
+ model = GLiNER2.from_pretrained("fastino/gliner2-pii-v1")
+
+
+ def redact(text, labels, threshold=0.5):
+     result = model.extract_entities(
+         text, labels, threshold=threshold,
+         include_spans=True,
+     )
+     entities = result.get("entities", {})
+     spans = []
+     for label, values in entities.items():
+         for value in values:
+             if isinstance(value, dict):
+                 # With include_spans=True, entities are expected to carry
+                 # "start"/"end" character offsets; use them directly.
+                 spans.append((value["start"], value["end"], label))
+             else:
+                 # Fallback for plain-string results: first occurrence only.
+                 start = text.find(value)
+                 if start != -1:
+                     spans.append((start, start + len(value), label))
+
+     # Replace from right to left so earlier offsets stay valid.
+     spans.sort(key=lambda s: s[0], reverse=True)
+     redacted = text
+     for start, end, label in spans:
+         redacted = redacted[:start] + f"[{label.upper()}]" + redacted[end:]
+     return redacted
+
+
+ text = "Please contact Maria Jensen at maria.jensen@example.dk or +45 20 12 34 56."
+ labels = ["person", "email", "phone_number"]
+ print(redact(text, labels))
+ # Expected: "Please contact [PERSON] at [EMAIL] or [PHONE_NUMBER]."
+ ```
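
When several related labels are requested together (for example `full_name` and `first_name`), the same characters may come back under more than one label, and replacing overlapping spans one after another would corrupt the offsets. One possible way to handle this, sketched below with plain tuples rather than model output, is to keep the longest span in each overlapping cluster before masking. The helper names `drop_overlaps` and `mask` are hypothetical, and the spans are written by hand for the example.

```python
def drop_overlaps(spans):
    """Keep a non-overlapping subset of (start, end, label) spans, preferring longer ones."""
    kept = []
    # Longest first; ties are broken by the earliest start offset.
    for start, end, label in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


def mask(text, spans):
    """Replace each kept span with [LABEL], working right to left so offsets stay valid."""
    for start, end, label in sorted(drop_overlaps(spans), reverse=True):
        text = text[:start] + f"[{label.upper()}]" + text[end:]
    return text


text = "Please contact Maria Jensen at maria.jensen@example.dk."
spans = [
    (15, 27, "full_name"),   # "Maria Jensen"
    (15, 20, "first_name"),  # "Maria", nested inside the full name
    (31, 54, "email"),       # "maria.jensen@example.dk"
]
print(mask(text, spans))
# Please contact [FULL_NAME] at [EMAIL].
```

Preferring the longest span is only one policy; when confidences are available, keeping the highest-scoring span in each cluster is an equally reasonable choice.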
+
+ ---
+
+ ## Training Details
+
+ | Detail | Value |
+ |---|---|
+ | Base model | GLiNER2 (205M parameters) |
+ | Training data | 4,910 synthetic annotated texts |
+ | PII mentions | 129,951 total (mean 26.5 per example) |
+ | Generator | GPT-5.4 (temperature 0.01) |
+ | Data framework | Constraint-driven generation (same framework as [Pioneer Agent](https://arxiv.org/abs/2604.09791)) |
+ | Languages | English, French, Spanish, German, Italian, Portuguese, Dutch |
+ | Label types | 42 PII entity types across 7 semantic groups |
+
+ ---
+
+ ## Limitations
+
+ - **Precision** (0.35–0.37 on SPY) leaves room for improvement; the model tends to over-predict `name` entities, sometimes confusing common nouns, organisation names, and product names with personal names.
+ - Evaluated on a **single benchmark** (SPY) covering two domains (legal and medical). Broader multilingual and fine-grained evaluation is ongoing.
+ - Training data is **fully synthetic** and has not been validated by human annotators.
+ - Performance on **non-European** locales and scripts has not been measured.
+
+ ### Improving precision
+
+ For production use, consider:
+ - Per-label confidence thresholds (raise the threshold for `person` / `full_name`)
+ - Dictionary-based filtering for common false positives
+ - Calibration on a small domain-specific development set
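
The first suggestion, per-label thresholds, can be applied as a post-filter: run the model with a permissive global threshold, then drop entities whose confidence falls below a stricter label-specific cut-off. The sketch below works on an already-flattened list of dictionaries. The `label` / `text` / `confidence` field names, the threshold values, and the helper `filter_by_label` are assumptions for illustration, and the cut-offs would need tuning on a development set.

```python
DEFAULT_THRESHOLD = 0.5
# Stricter cut-offs for labels that tend to be over-predicted (illustrative values).
LABEL_THRESHOLDS = {"person": 0.8, "full_name": 0.8}


def filter_by_label(entities, thresholds=LABEL_THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Keep entities whose confidence meets the threshold for their label."""
    return [
        e for e in entities
        if e["confidence"] >= thresholds.get(e["label"], default)
    ]


# Hand-written entities standing in for flattened model output.
entities = [
    {"label": "person", "text": "Acme", "confidence": 0.62},
    {"label": "person", "text": "John Smith", "confidence": 0.93},
    {"label": "email", "text": "john.smith@acme.com", "confidence": 0.55},
]
for entity in filter_by_label(entities):
    print(entity["label"], entity["text"])
# person John Smith
# email john.smith@acme.com
```

Raising a threshold trades recall for precision, so for redaction workloads it is worth checking how many true names are lost before adopting a stricter value.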
+
+ ---
+
+ ## Citation
+
+ ```bibtex
+ @misc{fastino2026gliner2pii,
+   title = {GLiNER2-PII: Multilingual PII Extraction via Synthetic Fine-Tuning},
+   author = {{Fastino AI Team}},
+   year = {2026},
+   url = {https://huggingface.co/fastino/gliner2-pii-v1}
+ }
+ ```
+
+ ### Related work
+
+ ```bibtex
+ @inproceedings{zaratiana-etal-2025-gliner2,
+   title = {GLiNER2: Schema-Driven Multi-Task Learning for Structured Information Extraction},
+   author = {Zaratiana, Urchade and Pasternak, Gil and Boyd, Oliver and Hurn-Maloney, George and Lewis, Ash},
+   booktitle = {Proceedings of EMNLP 2025: System Demonstrations},
+   year = {2025}
+ }
+
+ @inproceedings{zaratiana-etal-2024-gliner,
+   title = {GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer},
+   author = {Zaratiana, Urchade and Tomeh, Nadi and Holat, Pierre and Charnois, Thierry},
+   booktitle = {Proceedings of NAACL 2024},
+   year = {2024}
+ }
+
+ @misc{atreja2026pioneeragent,
+   title = {Pioneer Agent: Continual Improvement of Small Language Models in Production},
+   author = {Atreja, Dhruv and White, Julia and Nayak, Nikhil and Zhang, Kelton and Princis, Henrijs and Hurn-Maloney, George and Lewis, Ash and Zaratiana, Urchade},
+   year = {2026},
+   url = {https://arxiv.org/abs/2604.09791}
+ }
+ ```
+
+ ---
+
+ ## License
+
+ Apache 2.0