# GLiNER PII Polish: Fine-tuned Model for Polish Personally Identifiable Information Detection
## Model Description

This model is a fine-tuned version of urchade/gliner_multi-v2.1, optimized specifically for detecting Personally Identifiable Information (PII) in Polish text. It was trained to recognize 13 core PII entity types commonly found in Polish documents, forms, and administrative texts; the full label inventory, including extended types, is listed below.
**Key Features:**

- Specialized for Polish-language PII detection
- Detects 13+ entity types, including names, PESEL, NIP, addresses, and more
- Fine-tuned on domain-specific Polish PII data
- High precision on both structured and unstructured Polish text
## Model Details

### Base Model

- **Base model:** urchade/gliner_multi-v2.1
- **Architecture:** GLiNER (Generalist and Lightweight model for Named Entity Recognition)
- **Task:** Named Entity Recognition (NER) / PII detection
### Entity Labels
The model detects the following PII entity types:
| Category | Entity Types |
|---|---|
| Personal Information | name, surname, age, sex |
| Location | city, address, postal_code |
| Contact | phone, email |
| Identity Documents | pesel, document_number, id_number |
| Financial | bank_account, contract_number |
| Organizational | company |
| Temporal | date |
| Digital | username, social_media_handle |
| Sensitive | medical_condition, religion, nationality, political_view, sexual_orientation |
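When you only need to scan for a subset of categories, the grouping above can be mirrored in code. This is an illustrative sketch: the `PII_LABELS` name, the category keys, and the `labels_for` helper are not part of the model or the GLiNER API, only a convenience for building the label list passed to `predict_entities`.

```python
# Illustrative grouping of the model's labels by category (mirrors the table above)
PII_LABELS = {
    "personal": ["name", "surname", "age", "sex"],
    "location": ["city", "address", "postal_code"],
    "contact": ["phone", "email"],
    "identity": ["pesel", "document_number", "id_number"],
    "financial": ["bank_account", "contract_number"],
    "organizational": ["company"],
    "temporal": ["date"],
    "digital": ["username", "social_media_handle"],
    "sensitive": ["medical_condition", "religion", "nationality",
                  "political_view", "sexual_orientation"],
}

def labels_for(*categories: str) -> list[str]:
    """Flatten the selected categories into one label list for predict_entities()."""
    return [label for cat in categories for label in PII_LABELS[cat]]

print(labels_for("identity", "contact"))
# -> ['pesel', 'document_number', 'id_number', 'phone', 'email']
```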
## Training Details

### Training Infrastructure
- Platform: Google Compute Engine
- GPU: NVIDIA A100
- Python Version: Python 3
- Training Time: ~1.5 hours
### Training Configuration
| Hyperparameter | Value |
|---|---|
| Epochs | 5 |
| Batch Size | 8 |
| Learning Rate | 5e-6 |
| Max Sequence Length | 384 |
| Validation Split | 10% |
| Optimizer | AdamW (via GLiNER) |
| Evaluation Strategy | Per epoch |
| Save Strategy | Per epoch |
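For anyone reproducing the run, the table above can be captured as a plain configuration object. This is a sketch only: the dictionary keys are illustrative and do not come from the GLiNER API; the values are those reported in the table.

```python
# Illustrative training configuration mirroring the table above
TRAINING_CONFIG = {
    "num_epochs": 5,
    "batch_size": 8,
    "learning_rate": 5e-6,
    "max_seq_length": 384,
    "validation_split": 0.10,
    "optimizer": "AdamW",       # as provided via the GLiNER training utilities
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
}

def total_steps(num_samples: int, cfg: dict = TRAINING_CONFIG) -> int:
    """Rough number of optimizer steps for a dataset of num_samples examples."""
    steps_per_epoch = -(-num_samples // cfg["batch_size"])  # ceiling division
    return steps_per_epoch * cfg["num_epochs"]
```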
### Training Data
- Format: JSONL with tokenized text and entity spans
- Source: Polish administrative and official documents
- Annotation: Manually annotated PII entities with precise span boundaries
- Preprocessing: Tokenization, entity alignment, format conversion to GLiNER training format
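A single training record in GLiNER's token-level JSONL format looks roughly like the following. The field names `tokenized_text` and `ner` (with inclusive token indices) match what recent GLiNER versions expect, but verify against the version you use; the example sentence and spans here are illustrative, not taken from the training set.

```python
import json

# One illustrative JSONL record: tokens plus [start_token, end_token, label] spans
record = {
    "tokenized_text": ["Jan", "Kowalski", "mieszka", "w", "Warszawie", "."],
    "ner": [
        [0, 0, "name"],      # "Jan"
        [1, 1, "surname"],   # "Kowalski"
        [4, 4, "city"],      # "Warszawie"
    ],
}

line = json.dumps(record, ensure_ascii=False)
print(line)
```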
### Training Process

The model was fine-tuned using the GLiNER training API with the following approach:

- **Data preparation:** training samples were converted from aligned original/anonymized text pairs into GLiNER's expected format, with entity spans and labels
- **Model initialization:** loaded the pre-trained gliner_multi-v2.1 weights
- **Fine-tuning:** trained for 5 epochs, with validation after each epoch
- **Checkpointing:** saved a model checkpoint after each epoch for best-model selection
## Usage

### Installation

```bash
pip install gliner
```
### Basic Usage

```python
from gliner import GLiNER

# Load the fine-tuned model
model = GLiNER.from_pretrained("piotrmaciejbednarski/gliner-pii-polish")

# Define entity labels to detect
labels = [
    "name", "surname", "city", "address", "company", "phone", "email",
    "pesel", "date", "document_number", "bank_account", "age", "sex",
]

# Detect entities in Polish text
text = "Jan Kowalski mieszka w Warszawie przy ul. Marszałkowska 15. Jego PESEL to 90010112345."
entities = model.predict_entities(text, labels, threshold=0.5)

# Print results
for entity in entities:
    print(f"{entity['text']} -> {entity['label']} (confidence: {entity['score']:.2f})")
```
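Each prediction is a dict carrying the matched text, its label, character offsets, and a confidence score. A common post-processing step is to keep only high-confidence hits and mask them in the input. The sketch below uses mock predictions shaped like `predict_entities` output (so no model download is needed); the `mask_pii` helper and the mock data are illustrative, not part of the library.

```python
# Mock predictions shaped like GLiNER's predict_entities() output
entities = [
    {"text": "Jan", "label": "name", "start": 0, "end": 3, "score": 0.97},
    {"text": "Warszawie", "label": "city", "start": 23, "end": 32, "score": 0.88},
    {"text": "15", "label": "age", "start": 56, "end": 58, "score": 0.31},  # likely false positive
]

def mask_pii(text: str, entities: list[dict], min_score: float = 0.5) -> str:
    """Replace confident PII spans with [LABEL] placeholders, working right to left."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["score"] >= min_score:
            text = text[:ent["start"]] + f"[{ent['label'].upper()}]" + text[ent["end"]:]
    return text

print(mask_pii("Jan Kowalski mieszka w Warszawie przy ul. Marszałkowska 15.", entities))
# -> [NAME] Kowalski mieszka w [CITY] przy ul. Marszałkowska 15.
```

Replacing spans from right to left keeps the earlier character offsets valid as the string shrinks or grows.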