PII Detection — DeBERTa-v3-base (Multi-Dataset)

A token classification (NER) model for Personally Identifiable Information (PII) detection, fine-tuned on a combination of ai4privacy/pii-masking-200k and nvidia/Nemotron-PII datasets.

Model Description

  • Base model: microsoft/deberta-v3-base
  • Task: Token Classification (BIO tagging)
  • Loss: Focal Loss (alpha=1.0, gamma=2.0)
  • Parameters: 183M

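Focal loss down-weights tokens the model already classifies confidently, which helps with the heavy class imbalance of PII tagging (most tokens are `O`). A minimal sketch of the loss with the card's settings (alpha=1.0, gamma=2.0); the exact training implementation may differ:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, labels, alpha=1.0, gamma=2.0, ignore_index=-100):
    """Token-level focal loss: alpha * (1 - p_t)^gamma * CE."""
    # Per-token cross-entropy, no reduction, ignoring padded/special positions
    ce = F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1),
        reduction="none", ignore_index=ignore_index,
    )
    pt = torch.exp(-ce)                    # model probability of the true class
    loss = alpha * (1 - pt) ** gamma * ce  # easy tokens (pt -> 1) are down-weighted
    mask = labels.view(-1) != ignore_index
    return loss[mask].mean()
```

With gamma=0 this reduces to plain cross-entropy, which is a quick sanity check.
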
Supported Entity Types (7 types, Kaggle PII standard)

| Entity         | Description                                  | F1    |
|----------------|----------------------------------------------|-------|
| NAME_STUDENT   | Person names                                 | 0.979 |
| EMAIL          | Email addresses                              | 0.992 |
| USERNAME       | Usernames                                    | 0.980 |
| ID_NUM         | ID numbers (SSN, credit card, passport, etc.)| 0.980 |
| PHONE_NUM      | Phone numbers                                | 0.992 |
| URL_PERSONAL   | Personal URLs                                | 0.992 |
| STREET_ADDRESS | Street addresses                             | 0.958 |

Usage

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model_name = "seongyeon1/pii-deberta-v3-base-multi"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "My name is John Smith and my email is john@example.com"
inputs = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
offset_mapping = inputs.pop("offset_mapping")

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)[0]

for pred, (start, end) in zip(predictions, offset_mapping[0]):
    label = model.config.id2label[pred.item()]
    if label != "O" and end > start:  # skip special tokens, which map to (0, 0)
        print(f"{text[start:end]} -> {label}")
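The loop above prints one fragment per subword token. To recover whole entities, adjacent B-/I- tokens of the same type can be merged into character spans. A small sketch assuming the standard BIO label scheme (e.g. `B-EMAIL`, `I-EMAIL`):

```python
def decode_spans(labels, offsets, text):
    """Merge BIO token labels into (entity_type, text_span) pairs.

    labels:  per-token label strings, e.g. ["O", "B-EMAIL", "I-EMAIL"]
    offsets: per-token (start, end) character offsets into `text`
    """
    spans = []
    for label, (start, end) in zip(labels, offsets):
        if label == "O" or end == start:  # skip non-entities and special tokens
            continue
        prefix, _, etype = label.partition("-")
        if prefix == "I" and spans and spans[-1][0] == etype:
            spans[-1] = (etype, spans[-1][1], end)  # extend the previous span
        else:
            spans.append((etype, start, end))       # start a new span
    return [(etype, text[s:e]) for etype, s, e in spans]
```
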

Training Details

| Setting       | Value                                    |
|---------------|------------------------------------------|
| Epochs        | 3                                        |
| Learning Rate | 2e-5                                     |
| Batch Size    | 8                                        |
| Max Length    | 256                                      |
| Warmup Steps  | 200                                      |
| Optimizer     | AdamW                                    |
| Training Data | ~9,000 samples (ai4privacy 5K + Nemotron 5K) |
| Training Time | ~29 minutes (Apple MPS)                  |
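For reproduction, the table above maps onto a `transformers` `TrainingArguments` roughly as follows. This is an illustrative sketch, not the exact training script; the output directory name is hypothetical:

```python
from transformers import TrainingArguments

# Hyperparameters taken from the Training Details table above.
args = TrainingArguments(
    output_dir="pii-deberta-v3-base",  # hypothetical path
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    warmup_steps=200,
    # AdamW is the Trainer's default optimizer, so no extra setting is needed.
)
```
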

Label Merge Strategy

54+ entity types from the source datasets are merged into the 7 standard Kaggle PII types:

  • FIRSTNAME, LASTNAME, GIVENNAME, SURNAME → NAME_STUDENT
  • SSN, CREDITCARDNUMBER, IBAN, PASSPORT, ... → ID_NUM
  • PHONENUMBER, TEL, TELEPHONENUM → PHONE_NUM
  • CITY, STATE, ZIPCODE, STREET → STREET_ADDRESS
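
The merge can be sketched as a lookup table applied to each source label before training. This covers only the examples listed above (the full mapping spans 54+ source types), and the BIO prefix is preserved:

```python
# Partial merge table: source dataset label -> Kaggle PII type.
# Only the examples from the list above are shown here.
LABEL_MERGE = {
    "FIRSTNAME": "NAME_STUDENT", "LASTNAME": "NAME_STUDENT",
    "GIVENNAME": "NAME_STUDENT", "SURNAME": "NAME_STUDENT",
    "SSN": "ID_NUM", "CREDITCARDNUMBER": "ID_NUM",
    "IBAN": "ID_NUM", "PASSPORT": "ID_NUM",
    "PHONENUMBER": "PHONE_NUM", "TEL": "PHONE_NUM", "TELEPHONENUM": "PHONE_NUM",
    "CITY": "STREET_ADDRESS", "STATE": "STREET_ADDRESS",
    "ZIPCODE": "STREET_ADDRESS", "STREET": "STREET_ADDRESS",
}

def merge_label(bio_label):
    """Map a source BIO label (e.g. 'B-FIRSTNAME') to the merged scheme."""
    if bio_label == "O":
        return "O"
    prefix, _, etype = bio_label.partition("-")
    return f"{prefix}-{LABEL_MERGE.get(etype, etype)}"
```
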

Evaluation Results

| Metric              | Score  |
|---------------------|--------|
| F1 (entity-level)   | 0.9766 |
| Precision           | 0.9703 |
| Recall              | 0.9830 |
| F5 (recall-weighted)| 0.9825 |
| Eval Loss           | 0.0014 |
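
F5 is the generalized F-beta score with beta=5, which weights recall 25x as heavily as precision (the metric used in the Kaggle PII detection competition). A minimal sketch, checked against the precision and recall reported above:

```python
def f_beta(precision, recall, beta=5.0):
    """F_beta = (1 + b^2) * P * R / (b^2 * P + R); beta > 1 favors recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Plugging in the table's precision (0.9703) and recall (0.9830) reproduces the reported F5 of about 0.9825.
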

Training Loss Curve

| Epoch | Train Loss | Eval F1 | Eval F5 |
|-------|------------|---------|---------|
| 1     | 0.0039     | 0.9434  | 0.9560  |
| 2     | 0.0023     | 0.9738  | 0.9777  |
| 3     | 0.0012     | 0.9766  | 0.9825  |

Framework

Built with PolyBed-Pipeline PII Channel.
