# Qwen3-4B PII NER — LoRA Fine-tuned for PII Entity Extraction
A fine-tuned version of Qwen/Qwen3-4B-Instruct-2507 trained to extract Personally Identifiable Information (PII) from unstructured text. The model outputs a structured JSON object containing detected entities, organized by type.
Trained on the DAXAAI-Research/synthetic-pii-dataset-nemotron-split dataset using LoRA adapters via TRL's SFTTrainer.
## Model Overview
| Field | Details |
|---|---|
| Base Model | Qwen/Qwen3-4B-Instruct-2507 |
| Task | Named Entity Recognition (NER) — PII Detection |
| Method | Supervised Fine-Tuning (SFT) with LoRA (PEFT) |
| Framework | TRL SFTTrainer + HuggingFace Transformers + PEFT |
| Language | English |
| License | Apache 2.0 |
| Dataset | DAXAAI-Research/synthetic-pii-dataset-nemotron-split |
| W&B Run | [View training run](https://wandb.ai/daxa/qwen-dft-ner/runs/l36cn9id) |
## Target Entity Types (21)
BBAN_CODE, CREDIT_CARD, DATE_OF_BIRTH, EMAIL_ADDRESS, HEALTH_INSURANCE_NUMBER, HONG_KONG_ID, IBAN_CODE, INDIA_AADHAAR, INDIA_PAN, IP_ADDRESS, LICENSE_PLATE_NUMBER, MEDICAL_RECORD_NUMBER, PHONE_NUMBER, ROUTING_NUMBER, SWIFT_CODE, US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_PASSPORT, US_SSN, VEHICLE_VIN
## Training Configuration

### Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | ~1.32 (380 steps) |
| Batch Size (per device) | 10 |
| Gradient Accumulation Steps | 3 |
| Effective Batch Size | 30 |
| Learning Rate | 1e-5 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Weight Decay | 0.01 |
| Precision | BF16 |
| Max Sequence Length | 8,192 |
| Optimizer | AdamW (β1=0.9, β2=0.999, ε=1e-8) |
| Loss | Assistant-only (SFT) |
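For reference, the hyperparameters above can be expressed as a TRL `SFTConfig`. This is a hedged sketch, not the original training script: the `output_dir` value is illustrative, and some argument names (`max_length`, `assistant_only_loss`) vary across TRL versions.

```python
from trl import SFTConfig

# Sketch of the hyperparameter table as a TRL SFTConfig.
# output_dir is a placeholder; argument names follow recent TRL conventions.
config = SFTConfig(
    output_dir="qwen-pii-ner-adapters",  # illustrative
    max_steps=380,                       # ~1.32 epochs on this dataset
    per_device_train_batch_size=10,
    gradient_accumulation_steps=3,       # effective batch size: 10 * 3 = 30
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    bf16=True,
    max_length=8192,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    assistant_only_loss=True,            # compute loss on assistant tokens only
)
```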
### LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 256 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target Modules | q_proj, k_proj, v_proj, o_proj |
| Task Type | CAUSAL_LM |
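The same table as a PEFT `LoraConfig` (a sketch reconstructed from the values above; argument names follow the PEFT API):

```python
from peft import LoraConfig

# LoRA adapter configuration matching the table above.
lora_config = LoraConfig(
    r=256,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```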
## Training Results
| Metric | Train | Eval |
|---|---|---|
| Loss | 0.0144 | 0.0239 |
| Mean Token Accuracy | 99.6% | 99.4% |
| Entropy | 1.086 | 1.072 |
## System Prompt

The model expects the following system prompt at inference time:
```
You are a Named Entity Recognition assistant. Extract the following entities from the input text and output as JSON.

Output format: a JSON object with entity types as keys and arrays of extracted values. Do NOT include character positions, start/end indices, or spans — only entity types and their values.

Entity types to extract:
- BBAN_CODE
- CREDIT_CARD
- DATE_OF_BIRTH
- EMAIL_ADDRESS
- HEALTH_INSURANCE_NUMBER
- HONG_KONG_ID
- IBAN_CODE
- INDIA_AADHAAR
- INDIA_PAN
- IP_ADDRESS
- LICENSE_PLATE_NUMBER
- MEDICAL_RECORD_NUMBER
- PHONE_NUMBER
- ROUTING_NUMBER
- SWIFT_CODE
- US_BANK_NUMBER
- US_DRIVER_LICENSE
- US_ITIN
- US_PASSPORT
- US_SSN
- VEHICLE_VIN

IMPORTANT RULES:
- Always include ALL entity keys in your output
- Use empty arrays [] for entity types not found in the text
- Extract the exact entity values as they appear in the text
- Do not infer or guess entities that are not explicitly present
- Output valid JSON only (entity types + values, no positions or indices)

Output schema (always include all keys, use empty arrays for missing entities):
{
  "BBAN_CODE": [],
  "CREDIT_CARD": [],
  "DATE_OF_BIRTH": [],
  "EMAIL_ADDRESS": [],
  "HEALTH_INSURANCE_NUMBER": [],
  "HONG_KONG_ID": [],
  "IBAN_CODE": [],
  "INDIA_AADHAAR": [],
  "INDIA_PAN": [],
  "IP_ADDRESS": [],
  "LICENSE_PLATE_NUMBER": [],
  "MEDICAL_RECORD_NUMBER": [],
  "PHONE_NUMBER": [],
  "ROUTING_NUMBER": [],
  "SWIFT_CODE": [],
  "US_BANK_NUMBER": [],
  "US_DRIVER_LICENSE": [],
  "US_ITIN": [],
  "US_PASSPORT": [],
  "US_SSN": [],
  "VEHICLE_VIN": []
}
```
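Downstream code should enforce this schema defensively, since a generative model can occasionally omit keys or add unexpected ones. A minimal stdlib sketch (the function name `normalize_entities` is illustrative, not part of the released code):

```python
import json

# The 21 entity keys from the output schema above.
ENTITY_TYPES = [
    "BBAN_CODE", "CREDIT_CARD", "DATE_OF_BIRTH", "EMAIL_ADDRESS",
    "HEALTH_INSURANCE_NUMBER", "HONG_KONG_ID", "IBAN_CODE", "INDIA_AADHAAR",
    "INDIA_PAN", "IP_ADDRESS", "LICENSE_PLATE_NUMBER", "MEDICAL_RECORD_NUMBER",
    "PHONE_NUMBER", "ROUTING_NUMBER", "SWIFT_CODE", "US_BANK_NUMBER",
    "US_DRIVER_LICENSE", "US_ITIN", "US_PASSPORT", "US_SSN", "VEHICLE_VIN",
]

def normalize_entities(raw: str) -> dict:
    """Parse model output and enforce the schema: every expected key is
    present, every value is a list of strings, unknown keys are dropped."""
    parsed = json.loads(raw)
    out = {}
    for key in ENTITY_TYPES:
        value = parsed.get(key, [])
        out[key] = [str(v) for v in value] if isinstance(value, list) else [str(value)]
    return out
```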
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

model_id = "DAXAAI-Research/qwen-pii-ner-adapters-v4-sparse"
base_model_id = "Qwen/Qwen3-4B-Instruct-2507"

# Load tokenizer from the adapter repo (contains any vocab changes made during fine-tuning)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load base model, then overlay the LoRA adapter weights via PeftModel
model = AutoModelForCausalLM.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(model, model_id)
model.eval()

text = "Contact John at john.doe@example.com or 555-123-4567. His SSN is 123-45-6789."
messages = [
    {"role": "system", "content": "<system prompt above>"},
    {"role": "user", "content": text},
]

inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs.to(model.device), max_new_tokens=1500, do_sample=False)  # greedy decoding
result = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(result)
# {"EMAIL_ADDRESS": ["john.doe@example.com"], "PHONE_NUMBER": ["555-123-4567"], "US_SSN": ["123-45-6789"], ...}
```
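Greedy decoding plus the strict system prompt usually yields bare JSON, but in a pipeline it is worth parsing defensively. A minimal stdlib sketch (the fence-stripping behavior is a precaution, not something the model is documented to need):

```python
import json
import re

def extract_json(generated: str) -> dict:
    """Pull the first JSON object out of model output, tolerating
    markdown code fences or stray text around it."""
    match = re.search(r"\{.*\}", generated, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))
```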
## Links
- W&B: https://wandb.ai/daxa/qwen-dft-ner/runs/l36cn9id
- Dataset: https://huggingface.co/datasets/DAXAAI-Research/synthetic-pii-dataset-nemotron-split
## Framework Versions
| Library | Version |
|---|---|
| TRL | 0.29.1 |
| Transformers | 5.3.0 |
| PyTorch | 2.10.0 |
| Datasets | 4.8.4 |
| Tokenizers | 0.22.2 |
| PEFT | (bundled with TRL) |