# GLiNER PII Polish: Fine-tuned Model for Polish Personally Identifiable Information Detection
## Model Description

This model is a fine-tuned version of urchade/gliner_multi-v2.1, optimized specifically for detecting Personally Identifiable Information (PII) in Polish text. It was trained to recognize 13 core PII entity types commonly found in Polish documents, forms, and administrative texts; the full label inventory, including extended types, is listed below.
**Key Features:**

- Specialized for Polish-language PII detection
- Detects 13+ entity types, including names, PESEL, NIP, addresses, and more
- Fine-tuned on domain-specific Polish PII data
- High precision on both structured and unstructured Polish text
## Model Details

### Base Model

- **Base model:** urchade/gliner_multi-v2.1
- **Architecture:** GLiNER (Generalist and Lightweight model for Named Entity Recognition)
- **Task:** Named Entity Recognition (NER) / PII detection
### Entity Labels
The model detects the following PII entity types:
| Category | Entity Types |
|---|---|
| Personal Information | name, surname, age, sex |
| Location | city, address, postal_code |
| Contact | phone, email |
| Identity Documents | pesel, document_number, id_number |
| Financial | bank_account, contract_number |
| Organizational | company |
| Temporal | date |
| Digital | username, social_media_handle |
| Sensitive | medical_condition, religion, nationality, political_view, sexual_orientation |
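When you only need to scan for a subset of categories, the grouping above can be mirrored in code. This is an illustrative sketch: the `PII_LABELS` name, the category keys, and the `labels_for` helper are not part of the model or the GLiNER API, only a convenience for building the label list passed to `predict_entities`.

```python
# Illustrative grouping of the model's labels by category (mirrors the table above)
PII_LABELS = {
    "personal": ["name", "surname", "age", "sex"],
    "location": ["city", "address", "postal_code"],
    "contact": ["phone", "email"],
    "identity": ["pesel", "document_number", "id_number"],
    "financial": ["bank_account", "contract_number"],
    "organizational": ["company"],
    "temporal": ["date"],
    "digital": ["username", "social_media_handle"],
    "sensitive": ["medical_condition", "religion", "nationality",
                  "political_view", "sexual_orientation"],
}

def labels_for(*categories: str) -> list[str]:
    """Flatten the selected categories into one label list for predict_entities()."""
    return [label for cat in categories for label in PII_LABELS[cat]]

print(labels_for("identity", "contact"))
# -> ['pesel', 'document_number', 'id_number', 'phone', 'email']
```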
## Training Details

### Training Infrastructure
- Platform: Google Compute Engine
- GPU: NVIDIA A100
- Python Version: Python 3
- Training Time: ~1.5 hours
### Training Configuration
| Hyperparameter | Value |
|---|---|
| Epochs | 5 |
| Batch Size | 8 |
| Learning Rate | 5e-6 |
| Max Sequence Length | 384 |
| Validation Split | 10% |
| Optimizer | AdamW (via GLiNER) |
| Evaluation Strategy | Per epoch |
| Save Strategy | Per epoch |
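For anyone reproducing the run, the table above can be captured as a plain configuration object. This is a sketch only: the dictionary keys are illustrative and do not come from the GLiNER API; the values are those reported in the table.

```python
# Illustrative training configuration mirroring the table above
TRAINING_CONFIG = {
    "num_epochs": 5,
    "batch_size": 8,
    "learning_rate": 5e-6,
    "max_seq_length": 384,
    "validation_split": 0.10,
    "optimizer": "AdamW",       # as provided via the GLiNER training utilities
    "eval_strategy": "epoch",
    "save_strategy": "epoch",
}

def total_steps(num_samples: int, cfg: dict = TRAINING_CONFIG) -> int:
    """Rough number of optimizer steps for a dataset of num_samples examples."""
    steps_per_epoch = -(-num_samples // cfg["batch_size"])  # ceiling division
    return steps_per_epoch * cfg["num_epochs"]
```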
### Training Data
- Format: JSONL with tokenized text and entity spans
- Source: Polish administrative and official documents
- Annotation: Manually annotated PII entities with precise span boundaries
- Preprocessing: Tokenization, entity alignment, format conversion to GLiNER training format
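A single training record in GLiNER's token-level JSONL format looks roughly like the following. The field names `tokenized_text` and `ner` (with inclusive token indices) match what recent GLiNER versions expect, but verify against the version you use; the example sentence and spans here are illustrative, not taken from the training set.

```python
import json

# One illustrative JSONL record: tokens plus [start_token, end_token, label] spans
record = {
    "tokenized_text": ["Jan", "Kowalski", "mieszka", "w", "Warszawie", "."],
    "ner": [
        [0, 0, "name"],      # "Jan"
        [1, 1, "surname"],   # "Kowalski"
        [4, 4, "city"],      # "Warszawie"
    ],
}

line = json.dumps(record, ensure_ascii=False)
print(line)
```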
### Training Process

The model was fine-tuned using the GLiNER training API with the following approach:

- **Data preparation:** training samples were converted from aligned original/anonymized text pairs into GLiNER's expected format, with entity spans and labels
- **Model initialization:** loaded the pre-trained gliner_multi-v2.1 weights
- **Fine-tuning:** trained for 5 epochs, with validation after each epoch
- **Checkpointing:** saved a model checkpoint after each epoch for best-model selection
## Usage

### Installation

```bash
pip install gliner
```
### Basic Usage

```python
from gliner import GLiNER

# Load the fine-tuned model
model = GLiNER.from_pretrained("piotrmaciejbednarski/gliner-pii-polish")

# Define entity labels to detect
labels = [
    "name", "surname", "city", "address", "company", "phone", "email",
    "pesel", "date", "document_number", "bank_account", "age", "sex",
]

# Detect entities in Polish text
text = "Jan Kowalski mieszka w Warszawie przy ul. Marszałkowska 15. Jego PESEL to 90010112345."
entities = model.predict_entities(text, labels, threshold=0.5)

# Print results
for entity in entities:
    print(f"{entity['text']} -> {entity['label']} (confidence: {entity['score']:.2f})")
```
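Each prediction is a dict carrying the matched text, its label, character offsets, and a confidence score. A common post-processing step is to keep only high-confidence hits and mask them in the input. The sketch below uses mock predictions shaped like `predict_entities` output (so no model download is needed); the `mask_pii` helper and the mock data are illustrative, not part of the library.

```python
# Mock predictions shaped like GLiNER's predict_entities() output
entities = [
    {"text": "Jan", "label": "name", "start": 0, "end": 3, "score": 0.97},
    {"text": "Warszawie", "label": "city", "start": 23, "end": 32, "score": 0.88},
    {"text": "15", "label": "age", "start": 56, "end": 58, "score": 0.31},  # likely false positive
]

def mask_pii(text: str, entities: list[dict], min_score: float = 0.5) -> str:
    """Replace confident PII spans with [LABEL] placeholders, working right to left."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["score"] >= min_score:
            text = text[:ent["start"]] + f"[{ent['label'].upper()}]" + text[ent["end"]:]
    return text

print(mask_pii("Jan Kowalski mieszka w Warszawie przy ul. Marszałkowska 15.", entities))
# -> [NAME] Kowalski mieszka w [CITY] przy ul. Marszałkowska 15.
```

Replacing spans from right to left keeps the earlier character offsets valid as the string shrinks or grows.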