GLiNER PII Polish - Fine-tuned Model for Polish Personally Identifiable Information (PII) Detection

Model Description

This model is a fine-tuned version of urchade/gliner_multi-v2.1, optimized for detecting Personally Identifiable Information (PII) in Polish text. It has been trained to recognize 13 core PII entity types, plus additional sensitive categories, commonly found in Polish documents, forms, and administrative texts.

Key Features:

  • Specialized for Polish language PII detection
  • Detects 13+ entity types including names, PESEL, NIP, addresses, and more
  • Fine-tuned on domain-specific Polish PII data
  • High precision on structured and unstructured Polish text

Model Details

Base Model

  • Base Model: urchade/gliner_multi-v2.1
  • Architecture: GLiNER (Generalist and Lightweight model for Named Entity Recognition)
  • Task: Named Entity Recognition (NER) / PII Detection

Entity Labels

The model detects the following PII entity types:

  • Personal Information: name, surname, age, sex
  • Location: city, address, postal_code
  • Contact: phone, email
  • Identity Documents: pesel, document_number, id_number
  • Financial: bank_account, contract_number
  • Organizational: company
  • Temporal: date
  • Digital: username, social_media_handle
  • Sensitive: medical_condition, religion, nationality, political_view, sexual_orientation

Training Details

Training Infrastructure

  • Platform: Google Compute Engine
  • GPU: NVIDIA A100
  • Python Version: Python 3
  • Training Time: ~1.5 hours

Training Configuration

  • Epochs: 5
  • Batch Size: 8
  • Learning Rate: 5e-6
  • Max Sequence Length: 384
  • Validation Split: 10%
  • Optimizer: AdamW (via GLiNER)
  • Evaluation Strategy: per epoch
  • Save Strategy: per epoch

Training Data

  • Format: JSONL with tokenized text and entity spans
  • Source: Polish administrative and official documents
  • Annotation: Manually annotated PII entities with precise span boundaries
  • Preprocessing: Tokenization, entity alignment, format conversion to GLiNER training format
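
As an illustration, a single JSONL line in GLiNER's token-level training format might look like the sketch below. The field names follow the format commonly used by the gliner library ("tokenized_text" plus inclusive [start_token, end_token, label] spans under "ner"); the exact schema of this dataset is not published here, so treat the sample as an assumption.

```python
import json

# Hypothetical training sample in GLiNER's token-level format.
# Each "ner" entry is an inclusive [start_token, end_token, label]
# span over the tokens in "tokenized_text".
sample = {
    "tokenized_text": ["Jan", "Kowalski", "mieszka", "w", "Warszawie", "."],
    "ner": [
        [0, 0, "name"],
        [1, 1, "surname"],
        [4, 4, "city"],
    ],
}

# One line of the JSONL training file:
line = json.dumps(sample, ensure_ascii=False)
print(line)
```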

Training Process

The model was fine-tuned using the GLiNER training API with the following approach:

  1. Data Preparation: Training samples were converted from aligned original/anonymized text pairs to GLiNER's expected format with entity spans and labels
  2. Model Initialization: Loaded pre-trained gliner_multi-v2.1 weights
  3. Fine-tuning: Trained for 5 epochs with validation after each epoch
  4. Checkpointing: Saved model checkpoints after each epoch for best model selection
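
The span-alignment part of step 1 can be sketched as follows. This is a minimal illustration assuming whitespace tokenization and character-level annotation offsets; the actual pipeline is more involved.

```python
def char_span_to_token_span(text, char_start, char_end):
    """Map a character-level [char_start, char_end) annotation to an
    inclusive (start_token, end_token) span over whitespace tokens."""
    tokens, offsets, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        tokens.append(tok)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)
    start_tok = end_tok = None
    for i, (s, e) in enumerate(offsets):
        if s < char_end and e > char_start:  # token overlaps the annotation
            if start_tok is None:
                start_tok = i
            end_tok = i
    return tokens, start_tok, end_tok

text = "Jan Kowalski mieszka w Warszawie."
# "Kowalski" covers characters 4..12 of the original text
tokens, start, end = char_span_to_token_span(text, 4, 12)
print(tokens[start:end + 1])  # ['Kowalski']
```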

Usage

Installation

pip install gliner

Basic Usage

from gliner import GLiNER

# Load the fine-tuned model
model = GLiNER.from_pretrained("piotrmaciejbednarski/gliner-pii-polish")

# Define the entity labels to detect
labels = [
    "name", "surname", "city", "address", "company", "phone", "email",
    "pesel", "date", "document_number", "bank_account", "age", "sex",
]

# Detect entities in Polish text
text = "Jan Kowalski mieszka w Warszawie przy ul. Marszałkowska 15. Jego PESEL to 90010112345."
entities = model.predict_entities(text, labels, threshold=0.5)

# Print the results
for entity in entities:
    print(f"{entity['text']} -> {entity['label']} (confidence: {entity['score']:.2f})")
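
The detected spans can be used directly for anonymization. Below is a minimal pure-Python sketch that assumes each entity dict carries the character-level "start"/"end" offsets that GLiNER typically includes; the example spans are hand-written stand-ins for predict_entities output.

```python
def redact(text, entities):
    """Replace each detected entity with a [LABEL] placeholder,
    working right-to-left so earlier offsets stay valid."""
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['label'].upper()}]" + text[ent["end"]:]
    return text

# Hand-written entity spans standing in for predict_entities output:
text = "Jan Kowalski mieszka w Warszawie."
entities = [
    {"start": 0, "end": 3, "label": "name"},
    {"start": 4, "end": 12, "label": "surname"},
    {"start": 23, "end": 32, "label": "city"},
]
print(redact(text, entities))  # [NAME] [SURNAME] mieszka w [CITY].
```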
