Create README.md
README.md
ADDED
|
@@ -0,0 +1,212 @@
---
library_name: gliner2
language:
- en
- fr
- es
- de
- it
- pt
- nl
tags:
- pii
- ner
- privacy
- redaction
- gliner
- gliner2
- information-extraction
- span-extraction
license: apache-2.0
datasets:
- synthetic
pipeline_tag: token-classification
---

# GLiNER2-PII: Multilingual PII Detection & Masking

**GLiNER2-PII** is a fine-tune of the [GLiNER2](https://github.com/fastino-ai/GLiNER2) model (205M parameters) for detecting and masking personally identifiable information across **42 entity types** and **7 languages**.

Trained entirely on a constraint-driven synthetic corpus of 4,910 annotated texts, it achieves the **highest span-level F1 (0.477, averaged over the legal and medical domains)** on the [SPY benchmark](https://aclanthology.org/2025.naacl-srw.23/) among the four systems compared, ahead of OpenAI Privacy Filter, NVIDIA GLiNER-PII, and urchade/gliner\_multi\_pii-v1.

**[Technical Report](https://github.com/fastino-ai/GLiNER2)**
**[GitHub](https://github.com/fastino-ai/GLiNER2)**

---

## Quick Start

```bash
pip install gliner2
```

```python
from gliner2 import GLiNER2

# Load the fine-tuned PII model from the Hugging Face Hub.
model = GLiNER2.from_pretrained("fastino/gliner2-pii-v1")

text = "Email john.smith@acme.com or call +1 415 555 0199."
labels = ["email", "phone_number", "person"]

result = model.extract_entities(
    text,
    labels,
    threshold=0.5,            # minimum confidence for a span to be returned
    include_confidence=True,  # attach a confidence score to each entity
    include_spans=True,       # attach character offsets to each entity
)

print(result)
```

You can pass **any subset** of the 42 supported labels; the model conditions on the labels you provide at inference time.
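
The exact shape of `result` depends on the installed `gliner2` version. The sketch below assumes that entities come back under an `"entities"` key, grouped by label, and that with `include_confidence=True` and `include_spans=True` each entity is a dict with `text`, `confidence`, `start`, and `end` keys. That layout, the made-up scores in `sample_result`, and the helper name `flatten_entities` are assumptions for illustration, not documented output of this model.

```python
from typing import Any


def flatten_entities(result: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten {"entities": {label: [...]}} into one list sorted by position."""
    rows = []
    for label, values in result.get("entities", {}).items():
        for value in values:
            # Plain strings may appear when spans/confidence are not requested.
            if isinstance(value, str):
                value = {"text": value}
            rows.append({"label": label, **value})
    # Entities without offsets sort last.
    return sorted(rows, key=lambda r: r.get("start", float("inf")))


text = "Email john.smith@acme.com or call +1 415 555 0199."

# Hypothetical result for the sentence above (assumed layout, invented scores).
sample_result = {
    "entities": {
        "email": [
            {"text": "john.smith@acme.com", "confidence": 0.98, "start": 6, "end": 25}
        ],
        "phone_number": [
            {"text": "+1 415 555 0199", "confidence": 0.95, "start": 34, "end": 49}
        ],
        "person": [],
    }
}

for row in flatten_entities(sample_result):
    print(row["label"], repr(row["text"]), row.get("start"), row.get("end"))
```

A flat, position-sorted list is usually easier to feed into masking or logging code than the per-label grouping.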

---

## Supported PII Labels (42 types)

| Group | Labels |
|---|---|
| **Person / names** | `person`, `full_name`, `first_name`, `middle_name`, `last_name`, `date_of_birth` |
| **Contact / address** | `email`, `phone_number`, `address`, `street_address`, `city`, `state_or_region`, `postal_code`, `country` |
| **Government / tax IDs** | `government_id`, `national_id_number`, `passport_number`, `drivers_license_number`, `license_number`, `tax_id`, `tax_number` |
| **Banking / payment** | `bank_account`, `account_number`, `routing_number`, `iban`, `payment_card`, `card_number`, `card_expiry`, `card_cvv` |
| **Digital identity** | `username`, `ip_address`, `account_id`, `sensitive_account_id` |
| **Secrets / credentials** | `password`, `secret`, `api_key`, `access_token`, `recovery_code` |
| **Sensitive dates** | `sensitive_date`, `document_date`, `expiration_date`, `transaction_date` |
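
Because labels are plain strings chosen at inference time, it can be convenient to keep the table above in code and select labels by group. The following is a minimal sketch: the label strings are copied from the table, while the group keys (`"person"`, `"contact"`, and so on) and the helper `labels_for` are illustrative names, not part of the `gliner2` API.

```python
# Label strings copied from the table; the dictionary keys are hypothetical.
PII_LABEL_GROUPS: dict[str, list[str]] = {
    "person": ["person", "full_name", "first_name", "middle_name",
               "last_name", "date_of_birth"],
    "contact": ["email", "phone_number", "address", "street_address",
                "city", "state_or_region", "postal_code", "country"],
    "government": ["government_id", "national_id_number", "passport_number",
                   "drivers_license_number", "license_number", "tax_id",
                   "tax_number"],
    "banking": ["bank_account", "account_number", "routing_number", "iban",
                "payment_card", "card_number", "card_expiry", "card_cvv"],
    "digital": ["username", "ip_address", "account_id", "sensitive_account_id"],
    "secrets": ["password", "secret", "api_key", "access_token",
                "recovery_code"],
    "dates": ["sensitive_date", "document_date", "expiration_date",
              "transaction_date"],
}


def labels_for(*groups: str) -> list[str]:
    """Return the labels of the requested groups, in table order."""
    return [label for group in groups for label in PII_LABEL_GROUPS[group]]


ALL_LABELS = labels_for(*PII_LABEL_GROUPS)

print(len(ALL_LABELS))
print(labels_for("secrets", "digital"))
```

The list returned by `labels_for(...)` can be passed as the `labels` argument of `extract_entities`.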

---

## Benchmark Results (SPY)

Evaluated on the [SPY benchmark](https://aclanthology.org/2025.naacl-srw.23/) (Savkin et al., 2025) with exact-match span-level metrics:

| Model | Legal P | Legal R | Legal F1 | Medical P | Medical R | Medical F1 | **Avg F1** |
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| **fastino/gliner2-pii-v1** | .346 | **.750** | **.473** | .369 | .686 | **.480** | **.477** |
| nvidia/gliner-PII | .343 | .452 | .390 | .368 | .465 | .411 | .400 |
| urchade/gliner\_multi\_pii-v1 | **.467** | .317 | .377 | **.518** | .351 | .419 | .398 |
| openai/privacy-filter | .242 | .656 | .354 | .287 | **.692** | .406 | .380 |

### Key takeaways

- **Highest F1** on both legal and medical domains.
- **Best recall** among GLiNER-based detectors (0.718 avg), which is critical for redaction workflows where missed spans are data leaks.
- Consistent performance across domains: legal and medical F1 differ by less than one point (.473 vs. .480).
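
For readers unfamiliar with exact-match span scoring, the sketch below shows one common way to compute it: a predicted span counts as correct only if its start, end, and label all equal those of a gold span. This is an illustration under that assumption; the benchmark's own scoring script may differ in details such as label mapping. The spans in the example are invented.

```python
Span = tuple[int, int, str]  # (start, end, label)


def span_prf(gold: set[Span], pred: set[Span]) -> tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over sets of spans."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: one exact hit, one boundary error, one spurious span.
gold = {(15, 27, "person"), (31, 54, "email")}
pred = {(15, 27, "person"), (31, 50, "email"), (58, 72, "phone_number")}

precision, recall, f1 = span_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Note that the truncated email span earns no partial credit, which is why exact-match scores are typically lower than token-level or overlap-based ones.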

---

## When to Use This Model

| Use case | Why GLiNER2-PII |
|---|---|
| **PII redaction / masking** | High recall minimises missed sensitive spans |
| **Data governance & GDPR/CCPA compliance** | 42 fine-grained types enable policy-specific routing |
| **Training-data hygiene** | Exact character spans for precise masking before model training |
| **Multi-language pipelines** | Trained on EN, FR, ES, DE, IT, PT, NL formats |

---

## Redaction Example

```python
from gliner2 import GLiNER2

# Load the model once and reuse it across calls.
model = GLiNER2.from_pretrained("fastino/gliner2-pii-v1")


def redact(text, labels, threshold=0.5):
    result = model.extract_entities(
        text, labels, threshold=threshold,
        include_spans=True,
    )
    entities = result.get("entities", {})
    spans = []
    for label, values in entities.items():
        for value in values:
            # Depending on the gliner2 version, an entity may be a dict with
            # character offsets or a plain string; handle both.
            if isinstance(value, dict) and "start" in value and "end" in value:
                spans.append((value["start"], value["end"], label))
                continue
            surface = value["text"] if isinstance(value, dict) else value
            start = text.find(surface)  # fallback: first occurrence only
            if start != -1:
                spans.append((start, start + len(surface), label))

    # Replace from right to left so earlier offsets stay valid.
    spans.sort(key=lambda s: s[0], reverse=True)
    redacted = text
    for start, end, label in spans:
        redacted = redacted[:start] + f"[{label.upper()}]" + redacted[end:]
    return redacted


text = "Please contact Maria Jensen at maria.jensen@example.dk or +45 20 12 34 56."
labels = ["person", "email", "phone_number"]
print(redact(text, labels))
# Expected: Please contact [PERSON] at [EMAIL] or [PHONE_NUMBER].
```
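
When several related labels are requested together (for example `person`, `first_name`, and `last_name`), the returned spans can overlap, and replacing them one by one may corrupt the text. The sketch below shows one possible policy, keeping the longest span wherever spans overlap, before masking from right to left. It works on plain `(start, end, label)` tuples and does not call the model; the helper names and the sample spans are illustrative assumptions.

```python
Span = tuple[int, int, str]  # (start, end, label)


def resolve_overlaps(spans: list[Span]) -> list[Span]:
    """Greedily keep the longest spans and drop any span that overlaps them."""
    kept: list[Span] = []
    # Longest first; ties broken by earliest start.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        start, end, _ = span
        if all(end <= k[0] or start >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)


def apply_redaction(text: str, spans: list[Span]) -> str:
    """Mask non-overlapping spans, working right to left."""
    for start, end, label in sorted(resolve_overlaps(spans), reverse=True):
        text = text[:start] + f"[{label.upper()}]" + text[end:]
    return text


text = "Please contact Maria Jensen at maria.jensen@example.dk."
spans = [
    (15, 27, "person"),      # "Maria Jensen"
    (15, 20, "first_name"),  # "Maria"
    (21, 27, "last_name"),   # "Jensen"
    (31, 54, "email"),       # "maria.jensen@example.dk"
]

print(apply_redaction(text, spans))
# Please contact [PERSON] at [EMAIL].
```

Other policies, such as keeping the highest-confidence span or the most specific label, fit the same structure by changing the sort key.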

---

## Training Details

| Detail | Value |
|---|---|
| Base model | GLiNER2 (205M parameters) |
| Training data | 4,910 synthetic annotated texts |
| PII mentions | 129,951 total (mean 26.5 per example) |
| Generator | GPT-5.4 (temperature 0.01) |
| Data framework | Constraint-driven generation (same framework as [Pioneer Agent](https://arxiv.org/abs/2604.09791)) |
| Languages | English, French, Spanish, German, Italian, Portuguese, Dutch |
| Label types | 42 PII entity types across 7 semantic groups |

---

## Limitations

- **Precision** (0.35–0.37 on SPY) leaves room for improvement; the model tends to over-predict `name` entities, sometimes confusing common nouns, organisation names, and product names with personal names.
- Evaluated on a **single benchmark** (SPY) covering two domains. Broader multilingual and fine-grained evaluation is ongoing.
- Training data is **fully synthetic** and has not been validated by human annotators.
- Performance on **non-European** locales and scripts has not been measured.

### Improving precision

For production use, consider:
- Per-label confidence thresholds (raise the threshold for `person` / `full_name`)
- Dictionary-based filtering for common false positives
- Calibration on a small domain-specific development set
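
The first two suggestions can be combined in a small post-processing step. The sketch below assumes the entities have already been extracted with a low global threshold and flattened into dicts with `label`, `text`, and `confidence` keys; that input shape, the threshold values, and the deny-list entries are illustrative assumptions and have not been tuned on any data.

```python
DEFAULT_THRESHOLD = 0.5

# Hypothetical per-label thresholds: stricter for name-like labels.
LABEL_THRESHOLDS = {"person": 0.8, "full_name": 0.8}

# Hypothetical deny list of known false positives, compared case-insensitively.
DENYLIST = {"person": {"apple", "support"}}


def filter_entities(entities: list[dict]) -> list[dict]:
    """Drop entities below their label's threshold or present in the deny list."""
    kept = []
    for entity in entities:
        label = entity["label"]
        if entity["confidence"] < LABEL_THRESHOLDS.get(label, DEFAULT_THRESHOLD):
            continue
        if entity["text"].casefold() in DENYLIST.get(label, set()):
            continue
        kept.append(entity)
    return kept


# Invented model output for illustration.
entities = [
    {"label": "person", "text": "Maria Jensen", "confidence": 0.93},
    {"label": "person", "text": "Apple", "confidence": 0.91},     # deny-listed
    {"label": "person", "text": "Helpdesk", "confidence": 0.62},  # below 0.8
    {"label": "email", "text": "maria.jensen@example.dk", "confidence": 0.55},
]

for entity in filter_entities(entities):
    print(entity["label"], entity["text"])
```

Because every dropped span is a potential leak in a redaction setting, thresholds like these are best chosen against a labelled development set rather than by intuition.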

---

## Citation

```bibtex
@misc{fastino2026gliner2pii,
  title = {GLiNER2-PII: Multilingual PII Extraction via Synthetic Fine-Tuning},
  author = {{Fastino AI Team}},
  year = {2026},
  url = {https://huggingface.co/fastino/gliner2-pii-v1}
}
```

### Related work

```bibtex
@inproceedings{zaratiana-etal-2025-gliner2,
  title = {GLiNER2: Schema-Driven Multi-Task Learning for Structured Information Extraction},
  author = {Zaratiana, Urchade and Pasternak, Gil and Boyd, Oliver and Hurn-Maloney, George and Lewis, Ash},
  booktitle = {Proceedings of EMNLP 2025: System Demonstrations},
  year = {2025}
}

@inproceedings{zaratiana-etal-2024-gliner,
  title = {GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer},
  author = {Zaratiana, Urchade and Tomeh, Nadi and Holat, Pierre and Charnois, Thierry},
  booktitle = {Proceedings of NAACL 2024},
  year = {2024}
}

@misc{atreja2026pioneeragent,
  title = {Pioneer Agent: Continual Improvement of Small Language Models in Production},
  author = {Atreja, Dhruv and White, Julia and Nayak, Nikhil and Zhang, Kelton and Princis, Henrijs and Hurn-Maloney, George and Lewis, Ash and Zaratiana, Urchade},
  year = {2026},
  url = {https://arxiv.org/abs/2604.09791}
}
```

---

## License

Apache 2.0