---
library_name: gliner2
language:
- en
- fr
- es
- de
- it
- pt
- nl
tags:
- pii
- ner
- privacy
- redaction
- gliner
- gliner2
- information-extraction
- span-extraction
license: apache-2.0
datasets:
- synthetic
pipeline_tag: token-classification
---
<div style="display: flex; flex-wrap: wrap; gap: 8px; margin-bottom: 16px;">
<a href="https://arxiv.org/abs/2605.09973" target="_blank" rel="noreferrer" style="text-decoration:none;">
<img src="https://img.shields.io/badge/arXiv-2605.07982-b31b1b.svg?logo=arxiv" alt="arXiv Paper" style="vertical-align:middle;">
</a>
<a href="https://pioneer.ai?utm_source=huggingface" target="_blank" rel="noreferrer" style="text-decoration:none;">
<img src="https://img.shields.io/badge/Deploy-GLiNER2%20PII-FF7345" alt="Deploy GLiNER2-PII model with Pioneer" style="vertical-align:middle;">
</a>
<a href="https://x.com/fastinoAI" target="_blank" rel="noreferrer" style="text-decoration:none;">
<img src="https://img.shields.io/twitter/follow/fastinoAI" alt="Follow @fastinoAI" style="vertical-align:middle;">
</a>
</div>
# GLiNER2-PII: Multilingual PII Detection & Masking
**GLiNER2-PII** is a fine-tune of the [GLiNER2](https://github.com/fastino-ai/GLiNER2) model (205M parameters) for detecting and masking personally identifiable information across **42 entity types** and **7 languages**.
Trained entirely on a constraint-driven synthetic corpus of 4,910 annotated texts, it achieves the **highest average span-level F1 (0.477)** on the [SPY benchmark](https://aclanthology.org/2025.naacl-srw.23/) among the four systems compared; the other three are OpenAI Privacy Filter, NVIDIA GLiNER-PII, and urchade/gliner\_multi\_pii-v1.
**[Technical Report](https://github.com/fastino-ai/GLiNER2)**
**[GitHub](https://github.com/fastino-ai/GLiNER2)**
---
## Quick Start
```bash
pip install gliner2
```
```python
from gliner2 import GLiNER2
model = GLiNER2.from_pretrained("fastino/gliner2-pii-v1")
text = "Email john.smith@acme.com or call +1 415 555 0199."
labels = ["email", "phone_number", "person"]
result = model.extract_entities(
    text,
    labels,
    threshold=0.5,
    include_confidence=True,
    include_spans=True,
)
print(result)
```
You can pass **any subset** of the 42 supported labels: the model conditions on the labels you provide at inference time.
---
## Supported PII Labels (42 types)
| Group | Labels |
|---|---|
| **Person / names** | `person`, `full_name`, `first_name`, `middle_name`, `last_name`, `date_of_birth` |
| **Contact / address** | `email`, `phone_number`, `address`, `street_address`, `city`, `state_or_region`, `postal_code`, `country` |
| **Government / tax IDs** | `government_id`, `national_id_number`, `passport_number`, `drivers_license_number`, `license_number`, `tax_id`, `tax_number` |
| **Banking / payment** | `bank_account`, `account_number`, `routing_number`, `iban`, `payment_card`, `card_number`, `card_expiry`, `card_cvv` |
| **Digital identity** | `username`, `ip_address`, `account_id`, `sensitive_account_id` |
| **Secrets / credentials** | `password`, `secret`, `api_key`, `access_token`, `recovery_code` |
| **Sensitive dates** | `sensitive_date`, `document_date`, `expiration_date`, `transaction_date` |
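
To make it easier to pass a subset of labels by group, the table above can be mirrored as a plain Python mapping. This is a minimal sketch: `LABEL_GROUPS`, the short group keys, and `labels_for` are hypothetical names introduced here for illustration and are not part of the `gliner2` package.

```python
# Label groups copied from the table above. The dictionary name and the
# short group keys are illustrative, not part of the gliner2 API.
LABEL_GROUPS = {
    "person": ["person", "full_name", "first_name", "middle_name",
               "last_name", "date_of_birth"],
    "contact": ["email", "phone_number", "address", "street_address",
                "city", "state_or_region", "postal_code", "country"],
    "government": ["government_id", "national_id_number", "passport_number",
                   "drivers_license_number", "license_number", "tax_id",
                   "tax_number"],
    "banking": ["bank_account", "account_number", "routing_number", "iban",
                "payment_card", "card_number", "card_expiry", "card_cvv"],
    "digital": ["username", "ip_address", "account_id",
                "sensitive_account_id"],
    "secrets": ["password", "secret", "api_key", "access_token",
                "recovery_code"],
    "dates": ["sensitive_date", "document_date", "expiration_date",
              "transaction_date"],
}


def labels_for(*groups):
    """Flatten the requested groups into one label list, in table order."""
    return [label for group in groups for label in LABEL_GROUPS[group]]


# Every supported label, e.g. for a full redaction pass.
ALL_LABELS = labels_for(*LABEL_GROUPS)

# A narrower request, e.g. for a credentials-and-payments scan.
scan_labels = labels_for("secrets", "banking")
```

The resulting list can be passed as the `labels` argument of `extract_entities` in the Quick Start snippet.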
---
## Benchmark Results (SPY)
Evaluated on the [SPY benchmark](https://aclanthology.org/2025.naacl-srw.23/) (Savkin et al., 2025) with exact-match span-level metrics:
| Model | Legal P | Legal R | Legal F1 | Medical P | Medical R | Medical F1 | **Avg F1** |
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **fastino/gliner2-pii-v1** | .346 | **.750** | **.473** | .369 | .686 | **.480** | **.477** |
| nvidia/gliner-PII | .343 | .452 | .390 | .368 | .465 | .411 | .400 |
| urchade/gliner\_multi\_pii-v1 | **.467** | .317 | .377 | **.518** | .351 | .419 | .398 |
| openai/privacy-filter | .242 | .656 | .354 | .287 | **.692** | .406 | .380 |
### Key takeaways
- **Highest F1** on both legal and medical domains.
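- **Best recall** among GLiNER-based detectors (0.718 averaged over the two domains), which is critical for redaction workflows where missed spans are data leaks. On the medical domain alone, openai/privacy-filter recalls slightly more (.692 vs. .686).
- Consistent performance across domains: legal and medical F1 differ by less than one point (.473 vs. .480).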
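
For readers who want to score their own predictions in the same spirit, exact-match span-level precision, recall, and F1 can be sketched as below. This is an illustrative scorer under the assumption that a prediction counts only when its start offset, end offset, and label all match a gold span; it is not the official SPY evaluation script, and `span_prf` is a hypothetical helper name.

```python
def span_prf(gold, pred):
    """Exact-match span scoring.

    `gold` and `pred` are iterables of (start, end, label) tuples.
    A predicted span is a true positive only if the identical tuple
    appears in the gold set.
    """
    gold, pred = set(gold), set(pred)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: one span matches exactly, one is off by a character,
# and one prediction has no gold counterpart.
gold = [(15, 27, "person"), (31, 54, "email")]
pred = [(15, 27, "person"), (31, 53, "email"), (58, 73, "phone_number")]
precision, recall, f1 = span_prf(gold, pred)
```

Under exact matching, the off-by-one email span counts as both a false positive and a false negative, which is one reason exact-match scores are lower than overlap-based ones.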
---
## When to Use This Model
| Use case | Why GLiNER2-PII |
|---|---|
| **PII redaction / masking** | High recall minimises missed sensitive spans |
| **Data governance & GDPR/CCPA compliance** | 42 fine-grained types enable policy-specific routing |
| **Training-data hygiene** | Exact character spans for precise masking before model training |
| **Multi-language pipelines** | Trained on EN, FR, ES, DE, IT, PT, NL formats |
---
## Redaction Example
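```python
from gliner2 import GLiNER2

# Load the model once; reloading it on every call would be slow.
model = GLiNER2.from_pretrained("fastino/gliner2-pii-v1")


def redact(text, labels, threshold=0.5):
    result = model.extract_entities(
        text, labels, threshold=threshold,
        include_spans=True,
    )
    entities = result.get("entities", {})
    spans = []
    for label, values in entities.items():
        for value in values:
            if isinstance(value, dict):
                # Assumption: with include_spans=True each mention is a dict
                # carrying "start" and "end" character offsets.
                spans.append((value["start"], value["end"], label))
            else:
                # Plain-string mention: fall back to its first occurrence.
                start = text.find(value)
                if start != -1:
                    spans.append((start, start + len(value), label))
    # Replace from right to left so earlier offsets stay valid.
    spans.sort(key=lambda s: s[0], reverse=True)
    redacted = text
    for start, end, label in spans:
        redacted = redacted[:start] + f"[{label.upper()}]" + redacted[end:]
    return redacted


text = "Please contact Maria Jensen at maria.jensen@example.dk or +45 20 12 34 56."
labels = ["person", "email", "phone_number"]
print(redact(text, labels))
# "Please contact [PERSON] at [EMAIL] or [PHONE_NUMBER]."
```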
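
Because several supported labels nest inside one another (for example `first_name` inside `full_name`, or `city` inside `address`), requesting them together may yield overlapping spans, and replacing overlapping spans one by one can corrupt the output. One possible way to handle this, sketched here with the hypothetical helpers `resolve_overlaps` and `apply_redaction`, is to keep only the longest span in each overlapping cluster before masking. The longest-span rule is an assumption chosen for illustration, not a documented behaviour of the model.

```python
def resolve_overlaps(spans):
    """Keep the longest span wherever (start, end, label) spans overlap."""
    # Visit longer spans first so that they win over the spans they contain.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    for start, end, label in ordered:
        # Accept the span only if it is disjoint from everything kept so far.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)


def apply_redaction(text, spans):
    """Mask non-overlapping spans from right to left."""
    redacted = text
    for start, end, label in sorted(resolve_overlaps(spans), reverse=True):
        redacted = redacted[:start] + f"[{label.upper()}]" + redacted[end:]
    return redacted


sample = "Contact Maria Jensen today."
overlapping = [
    (8, 20, "full_name"),   # "Maria Jensen"
    (8, 13, "first_name"),  # "Maria"
    (14, 20, "last_name"),  # "Jensen"
]
masked = apply_redaction(sample, overlapping)
```

A pipeline that prefers the most specific label could instead keep the shortest spans; which policy is appropriate depends on the redaction requirements.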
---
## Training Details
| Detail | Value |
|---|---|
| Base model | GLiNER2 (205M parameters) |
| Training data | 4,910 synthetic annotated texts |
| PII mentions | 129,951 total (mean 26.5 per example) |
| Generator | GPT-5.4 (temperature 0.01) |
| Data framework | Constraint-driven generation (same framework as [Pioneer Agent](https://arxiv.org/abs/2604.09791)) |
| Languages | English, French, Spanish, German, Italian, Portuguese, Dutch |
| Label types | 42 PII entity types across 7 semantic groups |
---
## Limitations
- **Precision** (0.35–0.37 on SPY) leaves room for improvement; the model tends to over-predict name-type entities (such as `person` and `full_name`), sometimes confusing common nouns, organisation names, and product names with personal names.
- Evaluated on a **single benchmark** (SPY) covering two domains. Broader multilingual and fine-grained evaluation is ongoing.
- Training data is **fully synthetic** and has not been validated by human annotators.
- Performance on **non-European** locales and scripts has not been measured.
### Improving precision
For production use, consider:
- Per-label confidence thresholds (raise threshold for `person` / `full_name`)
- Dictionary-based filtering for common false positives
- Calibration on a small domain-specific development set
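
The first two suggestions can be combined in a small post-processing step. The sketch below assumes that a result obtained with `include_confidence=True` has the shape `{"entities": {label: [{"text": ..., "confidence": ...}, ...]}}`; that shape, the threshold values, the stoplist entries, and the helper name `filter_entities` are all illustrative assumptions and should be checked against the actual output and tuned on a development set.

```python
DEFAULT_THRESHOLD = 0.5

# Illustrative values only: stricter cut-offs for labels that over-fire.
LABEL_THRESHOLDS = {"person": 0.8, "full_name": 0.8}

# Illustrative stoplist of frequent false positives, matched case-insensitively.
STOPLIST = {"customer", "support team"}


def filter_entities(result, thresholds=LABEL_THRESHOLDS,
                    default=DEFAULT_THRESHOLD, stoplist=STOPLIST):
    """Drop mentions below their label's threshold or found in the stoplist."""
    kept = {}
    for label, mentions in result.get("entities", {}).items():
        cutoff = thresholds.get(label, default)
        kept[label] = [
            m for m in mentions
            if m["confidence"] >= cutoff
            and m["text"].strip().lower() not in stoplist
        ]
    return kept


# Hand-written result in the assumed shape, standing in for model output.
mock_result = {
    "entities": {
        "person": [
            {"text": "Maria Jensen", "confidence": 0.93},
            {"text": "Customer", "confidence": 0.95},  # removed by stoplist
            {"text": "Acme", "confidence": 0.60},      # below person cut-off
        ],
        "email": [{"text": "maria.jensen@example.dk", "confidence": 0.55}],
    }
}
filtered = filter_entities(mock_result)
```

Running the model with a low global `threshold` and filtering afterwards keeps the per-label cut-offs adjustable without re-running inference.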
---
## Citation
```bibtex
@misc{fastino2026gliner2pii,
title = {GLiNER2-PII: Multilingual PII Extraction via Synthetic Fine-Tuning},
author = {{Fastino AI Team}},
year = {2026},
url = {https://huggingface.co/fastino/gliner2-pii-v1}
}
```
### Related work
```bibtex
@inproceedings{zaratiana-etal-2025-gliner2,
title = {GLiNER2: Schema-Driven Multi-Task Learning for Structured Information Extraction},
author = {Zaratiana, Urchade and Pasternak, Gil and Boyd, Oliver and Hurn-Maloney, George and Lewis, Ash},
booktitle = {Proceedings of EMNLP 2025: System Demonstrations},
year = {2025}
}
@inproceedings{zaratiana-etal-2024-gliner,
title = {GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer},
author = {Zaratiana, Urchade and Tomeh, Nadi and Holat, Pierre and Charnois, Thierry},
booktitle = {Proceedings of NAACL 2024},
year = {2024}
}
@misc{atreja2026pioneeragent,
title = {Pioneer Agent: Continual Improvement of Small Language Models in Production},
author = {Atreja, Dhruv and White, Julia and Nayak, Nikhil and Zhang, Kelton and Princis, Henrijs and Hurn-Maloney, George and Lewis, Ash and Zaratiana, Urchade},
year = {2026},
url = {https://arxiv.org/abs/2604.09791}
}
```
---
## License
Apache 2.0