---
library_name: gliner2
language:
  - en
  - fr
  - es
  - de
  - it
  - pt
  - nl
tags:
  - pii
  - ner
  - privacy
  - redaction
  - gliner
  - gliner2
  - information-extraction
  - span-extraction
license: apache-2.0
datasets:
  - synthetic
pipeline_tag: token-classification
---
<div style="display: flex; flex-wrap: wrap; gap: 8px; margin-bottom: 16px;">
  <a href="https://arxiv.org/abs/2605.09973" target="_blank" rel="noreferrer" style="text-decoration:none;">
    <img src="https://img.shields.io/badge/arXiv-2605.07982-b31b1b.svg?logo=arxiv" alt="arXiv Paper" style="vertical-align:middle;">
  </a>
  <a href="https://pioneer.ai?utm_source=huggingface" target="_blank" rel="noreferrer" style="text-decoration:none;">
    <img src="https://img.shields.io/badge/Deploy-GLiNER2%20PII-FF7345" alt="Deploy GLiNER2-PII model with Pioneer" style="vertical-align:middle;">
  </a>
  <a href="https://x.com/fastinoAI" target="_blank" rel="noreferrer" style="text-decoration:none;">
    <img src="https://img.shields.io/twitter/follow/fastinoAI" alt="Follow @fastinoAI" style="vertical-align:middle;">
  </a>
</div>

# GLiNER2-PII: Multilingual PII Detection & Masking

**GLiNER2-PII** is a fine-tune of the [GLiNER2](https://github.com/fastino-ai/GLiNER2) model (205M parameters) for detecting and masking personally identifiable information across **42 entity types** and **7 languages**.

Trained entirely on a constraint-driven synthetic corpus of 4,910 annotated texts, it achieves the **highest span-level F1 (0.477)** on the [SPY benchmark](https://aclanthology.org/2025.naacl-srw.23/) among the four systems compared; the other three are OpenAI Privacy Filter, NVIDIA GLiNER-PII, and urchade/gliner\_multi\_pii-v1.

📄 **[Technical Report](https://github.com/fastino-ai/GLiNER2)**  
🔗 **[GitHub](https://github.com/fastino-ai/GLiNER2)**

---

## Quick Start

```bash
pip install gliner2
```

```python
from gliner2 import GLiNER2

model = GLiNER2.from_pretrained("fastino/gliner2-pii-v1")

text = "Email john.smith@acme.com or call +1 415 555 0199."
labels = ["email", "phone_number", "person"]

result = model.extract_entities(
    text,
    labels,
    threshold=0.5,
    include_confidence=True,
    include_spans=True,
)

print(result)
```

You can pass **any subset** of the 42 supported labels; the model conditions on the labels you provide at inference time.

---

## Supported PII Labels (42 types)

| Group | Labels |
|---|---|
| **Person / names** | `person`, `full_name`, `first_name`, `middle_name`, `last_name`, `date_of_birth` |
| **Contact / address** | `email`, `phone_number`, `address`, `street_address`, `city`, `state_or_region`, `postal_code`, `country` |
| **Government / tax IDs** | `government_id`, `national_id_number`, `passport_number`, `drivers_license_number`, `license_number`, `tax_id`, `tax_number` |
| **Banking / payment** | `bank_account`, `account_number`, `routing_number`, `iban`, `payment_card`, `card_number`, `card_expiry`, `card_cvv` |
| **Digital identity** | `username`, `ip_address`, `account_id`, `sensitive_account_id` |
| **Secrets / credentials** | `password`, `secret`, `api_key`, `access_token`, `recovery_code` |
| **Sensitive dates** | `sensitive_date`, `document_date`, `expiration_date`, `transaction_date` |
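
Because the model accepts any subset of labels, it can help to keep the groups from the table in one place and build label lists from them. The sketch below does that in plain Python; the `PII_LABEL_GROUPS` dict, its short group keys, and the `labels_for` helper are illustrative names, not part of the `gliner2` package.

```python
# Label groups copied from the table above. The dict name and the short
# group keys are illustrative, not part of the gliner2 API.
PII_LABEL_GROUPS = {
    "person": ["person", "full_name", "first_name", "middle_name",
               "last_name", "date_of_birth"],
    "contact": ["email", "phone_number", "address", "street_address",
                "city", "state_or_region", "postal_code", "country"],
    "government": ["government_id", "national_id_number", "passport_number",
                   "drivers_license_number", "license_number", "tax_id",
                   "tax_number"],
    "banking": ["bank_account", "account_number", "routing_number", "iban",
                "payment_card", "card_number", "card_expiry", "card_cvv"],
    "digital": ["username", "ip_address", "account_id",
                "sensitive_account_id"],
    "secrets": ["password", "secret", "api_key", "access_token",
                "recovery_code"],
    "dates": ["sensitive_date", "document_date", "expiration_date",
              "transaction_date"],
}


def labels_for(*groups):
    """Return a flat, de-duplicated label list for the requested groups."""
    labels = []
    for group in groups:
        for label in PII_LABEL_GROUPS[group]:
            if label not in labels:
                labels.append(label)
    return labels


# Iterating a dict yields its keys, so this expands every group.
ALL_LABELS = labels_for(*PII_LABEL_GROUPS)
print(len(ALL_LABELS))
print(labels_for("secrets", "digital"))
```

The resulting list can be passed as the `labels` argument in the Quick Start call, for example `labels_for("contact", "banking")` for a payments-focused scan.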

---

## Benchmark Results (SPY)

Evaluated on the [SPY benchmark](https://aclanthology.org/2025.naacl-srw.23/) (Savkin et al., 2025) with exact-match span-level metrics:

| Model | Legal P | Legal R | Legal F1 | Medical P | Medical R | Medical F1 | **Avg F1** |
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **fastino/gliner2-pii-v1** | .346 | **.750** | **.473** | .369 | .686 | **.480** | **.477** |
| nvidia/gliner-PII | .343 | .452 | .390 | .368 | .465 | .411 | .400 |
| urchade/gliner\_multi\_pii-v1 | **.467** | .317 | .377 | **.518** | .351 | .419 | .398 |
| openai/privacy-filter | .242 | .656 | .354 | .287 | **.692** | .406 | .380 |

### Key takeaways

- **Highest F1** on both legal and medical domains.
- **Best recall** among GLiNER-based detectors (0.718 avg), which is critical for redaction workflows where every missed span is a data leak.
- Consistent performance across domains (< 2-point F1 difference).
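
As a sanity check on how the table columns relate, the sketch below recomputes F1 as the harmonic mean of precision and recall and averages the two domains. It assumes the Avg F1 column is the plain mean of the Legal and Medical F1 scores, which matches the reported numbers. Because the inputs are the rounded table values, the results agree with the table only to within about 0.001.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) for fastino/gliner2-pii-v1, copied from the table above.
legal = (0.346, 0.750)
medical = (0.369, 0.686)

legal_f1 = f1(*legal)
medical_f1 = f1(*medical)

# Assumption: Avg F1 is the unweighted mean of the two domain F1 scores.
avg_f1 = (legal_f1 + medical_f1) / 2
avg_recall = (legal[1] + medical[1]) / 2

print(f"legal F1={legal_f1:.4f} medical F1={medical_f1:.4f}")
print(f"avg F1={avg_f1:.4f} avg recall={avg_recall:.3f}")
```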

---

## When to Use This Model

| Use case | Why GLiNER2-PII |
|---|---|
| **PII redaction / masking** | High recall minimises missed sensitive spans |
| **Data governance & GDPR/CCPA compliance** | 42 fine-grained types enable policy-specific routing |
| **Training-data hygiene** | Exact character spans for precise masking before model training |
| **Multi-language pipelines** | Trained on EN, FR, ES, DE, IT, PT, NL formats |

---

## Redaction Example

```python
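from gliner2 import GLiNER2

# Load the model once; reloading it on every call is slow.
model = GLiNER2.from_pretrained("fastino/gliner2-pii-v1")


def redact(text, labels, threshold=0.5):
    result = model.extract_entities(
        text, labels, threshold=threshold,
        include_spans=True,
    )
    entities = result.get("entities", {})
    spans = []
    for label, values in entities.items():
        for value in values:
            if isinstance(value, dict):
                # With include_spans=True each match is assumed to be a dict
                # carrying "start" and "end" character offsets.
                spans.append((value["start"], value["end"], label))
            else:
                # Fallback for plain-string matches: first occurrence only.
                start = text.find(value)
                if start != -1:
                    spans.append((start, start + len(value), label))

    # Replace from right to left so earlier offsets stay valid.
    spans.sort(key=lambda s: s[0], reverse=True)
    redacted = text
    for start, end, label in spans:
        redacted = redacted[:start] + f"[{label.upper()}]" + redacted[end:]
    return redacted


text = "Please contact Maria Jensen at maria.jensen@example.dk or +45 20 12 34 56."
labels = ["person", "email", "phone_number"]
print(redact(text, labels))
# Expected output (model-dependent):
# Please contact [PERSON] at [EMAIL] or [PHONE_NUMBER].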
```
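
When nested labels are requested together (for example `full_name` with `first_name`, or `address` with `city`), the detected spans can overlap, and replacing them one by one would corrupt the output. One possible way to handle this, sketched below, is to keep the longest span among overlapping candidates before redacting. The `resolve_overlaps` helper and the sample spans are hypothetical and not part of `gliner2`; a real pipeline might prefer the highest-confidence span instead.

```python
def resolve_overlaps(spans):
    """Keep the longest span among overlapping candidates.

    spans: list of (start, end, label) tuples, end exclusive.
    """
    # Longest first; ties broken by earliest start.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for start, end, label in ordered:
        # Accept a span only if it does not intersect anything already kept.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


# Hand-written spans for "Maria Jensen lives in Aarhus." (not model output).
candidates = [
    (0, 5, "first_name"),
    (6, 12, "last_name"),
    (0, 12, "full_name"),
    (22, 28, "city"),
]
resolved = resolve_overlaps(candidates)
print(resolved)
```

Here the two shorter name spans are dropped in favour of `full_name`, and the non-overlapping `city` span is kept.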

---

## Training Details

| Detail | Value |
|---|---|
| Base model | GLiNER2 (205M parameters) |
| Training data | 4,910 synthetic annotated texts |
| PII mentions | 129,951 total (mean 26.5 per example) |
| Generator | GPT-5.4 (temperature 0.01) |
| Data framework | Constraint-driven generation (same framework as [Pioneer Agent](https://arxiv.org/abs/2604.09791)) |
| Languages | English, French, Spanish, German, Italian, Portuguese, Dutch |
| Label types | 42 PII entity types across 7 semantic groups |

---

## Limitations

- **Precision** (0.35–0.37 on SPY) leaves room for improvement; the model tends to over-predict `name` entities, sometimes confusing common nouns, organisation names, and product names with personal names.
- Evaluated on a **single benchmark** (SPY) covering two domains. Broader multilingual and fine-grained evaluation is ongoing.
- Training data is **fully synthetic** and has not been validated by human annotators.
- Performance on **non-European** locales and scripts has not been measured.

### Improving precision

For production use, consider:
- Per-label confidence thresholds (raise the threshold for `person` / `full_name`)
- Dictionary-based filtering for common false positives
- Calibration on a small domain-specific development set
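
The first two suggestions can be sketched as a post-processing step on the model output. The example below assumes each match is a dict with `text` and `confidence` keys, as requested via `include_confidence=True`; that shape, the threshold values, and the deny-list entries are all illustrative assumptions to be tuned on your own development set.

```python
DEFAULT_THRESHOLD = 0.5

# Hypothetical per-label overrides: stricter for name-like labels.
LABEL_THRESHOLDS = {"person": 0.8, "full_name": 0.8}

# Hypothetical deny-list of known false positives, lower-cased.
FALSE_POSITIVES = {"person": {"acme", "support team"}}


def filter_entities(entities):
    """Apply per-label thresholds and a deny-list.

    entities: {label: [{"text": str, "confidence": float}, ...]}
    (assumed output shape, see the lead-in).
    """
    filtered = {}
    for label, matches in entities.items():
        threshold = LABEL_THRESHOLDS.get(label, DEFAULT_THRESHOLD)
        blocked = FALSE_POSITIVES.get(label, set())
        kept = [
            m for m in matches
            if m["confidence"] >= threshold and m["text"].lower() not in blocked
        ]
        if kept:
            filtered[label] = kept
    return filtered


# Hand-written sample in the assumed shape (not real model output).
sample = {
    "person": [
        {"text": "Maria Jensen", "confidence": 0.93},
        {"text": "Acme", "confidence": 0.88},
        {"text": "Invoice", "confidence": 0.61},
    ],
    "email": [{"text": "maria.jensen@example.dk", "confidence": 0.57}],
}
filtered = filter_entities(sample)
print(filtered)
```

For this to have any effect, the model should be called with a global `threshold` at or below the lowest per-label value so that borderline candidates reach the filter.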

---

## Citation

```bibtex
@misc{fastino2026gliner2pii,
  title   = {GLiNER2-PII: Multilingual PII Extraction via Synthetic Fine-Tuning},
  author  = {{Fastino AI Team}},
  year    = {2026},
  url     = {https://huggingface.co/fastino/gliner2-pii-v1}
}
```

### Related work

```bibtex
@inproceedings{zaratiana-etal-2025-gliner2,
  title     = {GLiNER2: Schema-Driven Multi-Task Learning for Structured Information Extraction},
  author    = {Zaratiana, Urchade and Pasternak, Gil and Boyd, Oliver and Hurn-Maloney, George and Lewis, Ash},
  booktitle = {Proceedings of EMNLP 2025: System Demonstrations},
  year      = {2025}
}

@inproceedings{zaratiana-etal-2024-gliner,
  title     = {GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer},
  author    = {Zaratiana, Urchade and Tomeh, Nadi and Holat, Pierre and Charnois, Thierry},
  booktitle = {Proceedings of NAACL 2024},
  year      = {2024}
}

@misc{atreja2026pioneeragent,
  title  = {Pioneer Agent: Continual Improvement of Small Language Models in Production},
  author = {Atreja, Dhruv and White, Julia and Nayak, Nikhil and Zhang, Kelton and Princis, Henrijs and Hurn-Maloney, George and Lewis, Ash and Zaratiana, Urchade},
  year   = {2026},
  url    = {https://arxiv.org/abs/2604.09791}
}
```

---

## License

Apache 2.0