# Qwen3-4B PII NER — LoRA Fine-tuned for PII Entity Extraction
A fine-tuned version of Qwen/Qwen3-4B-Instruct-2507 trained to extract Personally Identifiable Information (PII) from unstructured text. The model outputs a structured JSON object containing detected entities, organized by type.
Trained on the DAXAAI-Research/synthetic-pii-dataset-nemotron-split dataset using LoRA adapters via TRL's SFTTrainer.
## Model Overview
| Field | Details |
|---|---|
| Base Model | Qwen/Qwen3-4B-Instruct-2507 |
| Task | Named Entity Recognition (NER) — PII Detection |
| Method | Supervised Fine-Tuning (SFT) with LoRA (PEFT) |
| Framework | TRL SFTTrainer + HuggingFace Transformers + PEFT |
| Language | English |
| License | Apache 2.0 |
| Dataset | DAXAAI-Research/synthetic-pii-dataset-nemotron-split |
| W&B Run | [View training run](https://wandb.ai/daxa/qwen-dft-ner/runs/l36cn9id) |
## Target Entity Types (21)
BBAN_CODE, CREDIT_CARD, DATE_OF_BIRTH, EMAIL_ADDRESS, HEALTH_INSURANCE_NUMBER, HONG_KONG_ID, IBAN_CODE, INDIA_AADHAAR, INDIA_PAN, IP_ADDRESS, LICENSE_PLATE_NUMBER, MEDICAL_RECORD_NUMBER, PHONE_NUMBER, ROUTING_NUMBER, SWIFT_CODE, US_BANK_NUMBER, US_DRIVER_LICENSE, US_ITIN, US_PASSPORT, US_SSN, VEHICLE_VIN
## Training Configuration

### Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | ~1.32 (380 steps) |
| Batch Size (per device) | 10 |
| Gradient Accumulation Steps | 3 |
| Effective Batch Size | 30 |
| Learning Rate | 1e-5 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Weight Decay | 0.01 |
| Precision | BF16 |
| Max Sequence Length | 8,192 |
| Optimizer | AdamW (β1=0.9, β2=0.999, ε=1e-8) |
| Loss | Assistant-only (SFT) |
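For reference, the hyperparameters above can be expressed as a TRL `SFTConfig`. This is a hedged sketch, not the original training script: the `output_dir` value is illustrative, and some argument names (`max_length`, `assistant_only_loss`) vary across TRL versions.

```python
from trl import SFTConfig

# Sketch of the hyperparameter table as a TRL SFTConfig.
# output_dir is a placeholder; argument names follow recent TRL conventions.
config = SFTConfig(
    output_dir="qwen-pii-ner-adapters",  # illustrative
    max_steps=380,                       # ~1.32 epochs on this dataset
    per_device_train_batch_size=10,
    gradient_accumulation_steps=3,       # effective batch size: 10 * 3 = 30
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    bf16=True,
    max_length=8192,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    assistant_only_loss=True,            # compute loss on assistant tokens only
)
```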
### LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 256 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target Modules | q_proj, k_proj, v_proj, o_proj |
| Task Type | CAUSAL_LM |
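The same table as a PEFT `LoraConfig` (a sketch reconstructed from the values above; argument names follow the PEFT API):

```python
from peft import LoraConfig

# LoRA adapter configuration matching the table above.
lora_config = LoraConfig(
    r=256,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```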
## Training Results
| Metric | Train | Eval |
|---|---|---|
| Loss | 0.0144 | 0.0239 |
| Mean Token Accuracy | 99.6% | 99.4% |
| Entropy | 1.086 | 1.072 |
## System Prompt

The model expects the following system prompt at inference time:
```
You are a Named Entity Recognition assistant. Extract the following entities from the input text and output as JSON.

Output format: a JSON object with entity types as keys and arrays of extracted values. Do NOT include character positions, start/end indices, or spans — only entity types and their values.

Entity types to extract:
- BBAN_CODE
- CREDIT_CARD
- DATE_OF_BIRTH
- EMAIL_ADDRESS
- HEALTH_INSURANCE_NUMBER
- HONG_KONG_ID
- IBAN_CODE
- INDIA_AADHAAR
- INDIA_PAN
- IP_ADDRESS
- LICENSE_PLATE_NUMBER
- MEDICAL_RECORD_NUMBER
- PHONE_NUMBER
- ROUTING_NUMBER
- SWIFT_CODE
- US_BANK_NUMBER
- US_DRIVER_LICENSE
- US_ITIN
- US_PASSPORT
- US_SSN
- VEHICLE_VIN

IMPORTANT RULES:
- Always include ALL entity keys in your output
- Use empty arrays [] for entity types not found in the text
- Extract the exact entity values as they appear in the text
- Do not infer or guess entities that are not explicitly present
- Output valid JSON only (entity types + values, no positions or indices)

Output schema (always include all keys, use empty arrays for missing entities):
{
  "BBAN_CODE": [],
  "CREDIT_CARD": [],
  "DATE_OF_BIRTH": [],
  "EMAIL_ADDRESS": [],
  "HEALTH_INSURANCE_NUMBER": [],
  "HONG_KONG_ID": [],
  "IBAN_CODE": [],
  "INDIA_AADHAAR": [],
  "INDIA_PAN": [],
  "IP_ADDRESS": [],
  "LICENSE_PLATE_NUMBER": [],
  "MEDICAL_RECORD_NUMBER": [],
  "PHONE_NUMBER": [],
  "ROUTING_NUMBER": [],
  "SWIFT_CODE": [],
  "US_BANK_NUMBER": [],
  "US_DRIVER_LICENSE": [],
  "US_ITIN": [],
  "US_PASSPORT": [],
  "US_SSN": [],
  "VEHICLE_VIN": []
}
```
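Downstream code should enforce this schema defensively, since a generative model can occasionally omit keys or add unexpected ones. A minimal stdlib sketch (the function name `normalize_entities` is illustrative, not part of the released code):

```python
import json

# The 21 entity keys from the output schema above.
ENTITY_TYPES = [
    "BBAN_CODE", "CREDIT_CARD", "DATE_OF_BIRTH", "EMAIL_ADDRESS",
    "HEALTH_INSURANCE_NUMBER", "HONG_KONG_ID", "IBAN_CODE", "INDIA_AADHAAR",
    "INDIA_PAN", "IP_ADDRESS", "LICENSE_PLATE_NUMBER", "MEDICAL_RECORD_NUMBER",
    "PHONE_NUMBER", "ROUTING_NUMBER", "SWIFT_CODE", "US_BANK_NUMBER",
    "US_DRIVER_LICENSE", "US_ITIN", "US_PASSPORT", "US_SSN", "VEHICLE_VIN",
]

def normalize_entities(raw: str) -> dict:
    """Parse model output and enforce the schema: every expected key is
    present, every value is a list of strings, unknown keys are dropped."""
    parsed = json.loads(raw)
    out = {}
    for key in ENTITY_TYPES:
        value = parsed.get(key, [])
        out[key] = [str(v) for v in value] if isinstance(value, list) else [str(value)]
    return out
```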
## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

model_id = "DAXAAI-Research/qwen-pii-ner-adapters-v4-sparse"
base_model_id = "Qwen/Qwen3-4B-Instruct-2507"

# Load tokenizer from the adapter repo (contains any vocab changes made during fine-tuning)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load base model, then overlay the LoRA adapter weights via PeftModel
model = AutoModelForCausalLM.from_pretrained(base_model_id)
model = PeftModel.from_pretrained(model, model_id)
model.eval()

text = "Contact John at john.doe@example.com or 555-123-4567. His SSN is 123-45-6789."
messages = [
    {"role": "system", "content": "<system prompt above>"},
    {"role": "user", "content": text},
]

inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
outputs = model.generate(inputs.to(model.device), max_new_tokens=1500, do_sample=False)  # greedy decoding
result = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(result)
# {"EMAIL_ADDRESS": ["john.doe@example.com"], "PHONE_NUMBER": ["555-123-4567"], "US_SSN": ["123-45-6789"], ...}
```
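Greedy decoding plus the strict system prompt usually yields bare JSON, but in a pipeline it is worth parsing defensively. A minimal stdlib sketch (the fence-stripping behavior is a precaution, not something the model is documented to need):

```python
import json
import re

def extract_json(generated: str) -> dict:
    """Pull the first JSON object out of model output, tolerating
    markdown code fences or stray text around it."""
    match = re.search(r"\{.*\}", generated, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))
```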
## Links
- W&B: https://wandb.ai/daxa/qwen-dft-ner/runs/l36cn9id
- Dataset: https://huggingface.co/datasets/DAXAAI-Research/synthetic-pii-dataset-nemotron-split
## Framework Versions
| Library | Version |
|---|---|
| TRL | 0.29.1 |
| Transformers | 5.3.0 |
| PyTorch | 2.10.0 |
| Datasets | 4.8.4 |
| Tokenizers | 0.22.2 |
| PEFT | (bundled with TRL) |