daxa-ai/qwen-synthetic-v1-ckpt-400 – Fine-tuned for PII Named Entity Recognition
Model Description
This model is a fine-tuned version of Qwen/Qwen3-4B-Instruct-2507 trained to extract PII entities from unstructured text. Given an input text, the model outputs a structured JSON object containing all detected PII entities, organized by entity type. Entity types not present in the input are returned as empty arrays.
The model was trained on daxa-ai/synthetic-pii-dataset, processed into a conversational (chat-template) format that uses Qwen's <|im_start|> / <|im_end|> instruction markers.
Model Overview
| Field | Details |
|---|---|
| Base Model | Qwen/Qwen3-4B-Instruct-2507 |
| Task | Named Entity Recognition (NER) – Personally Identifiable Information (PII) |
| Fine-tuning Method | Supervised Fine-Tuning (SFT) with LoRA (PEFT) |
| Training Framework | TRL SFTTrainer + HuggingFace Transformers + PEFT |
Target Entity Types
The model is trained to detect the following PII entity types:
CREDIT_CARD, US_SSN, EMAIL, PHONE, DATE_OF_BIRTH, IP_ADDRESS, MEDICAL_RECORD_NUMBER, BANK_ROUTING_NUMBER, LICENSE_PLATE, IBAN, SWIFT, BBAN, US_BANK_ACCOUNT, VEHICLE_VIN, US_PASSPORT, US_DRIVERS_LICENSE, HEALTH_INSURANCE_NUMBER, INDIA_AADHAAR, AADHAR_ID, INDIA_PAN, US_ITIN, GITHUB_TOKEN, AWS_ACCESS_KEY, AZURE_KEY_ID, SLACK_TOKEN, HONG_KONG_ID
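Since the model must always emit every entity key (with empty arrays for absent types), the expected output schema can be built programmatically. A minimal sketch, where `ENTITY_TYPES` and `empty_schema` are illustrative names not taken from the training code:

```python
# The 26 entity types listed above, in the same order.
ENTITY_TYPES = [
    "CREDIT_CARD", "US_SSN", "EMAIL", "PHONE", "DATE_OF_BIRTH", "IP_ADDRESS",
    "MEDICAL_RECORD_NUMBER", "BANK_ROUTING_NUMBER", "LICENSE_PLATE", "IBAN",
    "SWIFT", "BBAN", "US_BANK_ACCOUNT", "VEHICLE_VIN", "US_PASSPORT",
    "US_DRIVERS_LICENSE", "HEALTH_INSURANCE_NUMBER", "INDIA_AADHAAR",
    "AADHAR_ID", "INDIA_PAN", "US_ITIN", "GITHUB_TOKEN", "AWS_ACCESS_KEY",
    "AZURE_KEY_ID", "SLACK_TOKEN", "HONG_KONG_ID",
]

def empty_schema() -> dict:
    # Every key present, each defaulting to an empty list, matching the
    # "always include ALL entity keys" rule below.
    return {etype: [] for etype in ENTITY_TYPES}
```

A schema like this is also handy for validating or repairing model outputs that drop a key.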
Input / Output Format
The model expects a system prompt followed by a user message containing the raw text to analyze.
System Prompt:
You are a Named Entity Recognition assistant. Extract the following entities from the input text and output as JSON:
- <entity_type_1>
- <entity_type_2>
- ...
IMPORTANT RULES:
- Always include ALL entity keys in your output
- Use empty arrays [] for entity types that are not found in the text
- Extract the exact span for each entity (only the entity value – no start/end offsets)
- Output valid JSON only
Example Input:
John Smith lives at 123 Main Street and his email is john.smith@email.com.
Example Output:
{
"EMAIL": ["john.smith@email.com"],
"PHONE": [],
"US_SSN": [],
...
}
Training Details
Dataset
| Split | Samples |
|---|---|
| Train | 7,600 |
| Test / Eval | 2,000 |
Source: daxa-ai/synthetic-pii-dataset
Each sample is converted to a 3-turn messages format: system → user → assistant (JSON string).
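The conversion above can be sketched as follows. This is an illustrative helper, not the actual preprocessing code; the field names `system_prompt`, `text`, and `entities` are assumptions about the dataset's columns:

```python
import json

def to_messages(system_prompt: str, text: str, entities: dict) -> list:
    # entities is the ground-truth dict, e.g. {"EMAIL": ["a@b.com"], "PHONE": []}.
    # The assistant turn is the JSON string the model learns to emit.
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": text},
        {"role": "assistant", "content": json.dumps(entities)},
    ]
```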
Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 2e-4 |
| Epochs | 3 |
| Per-device Train Batch Size | 2 |
| Gradient Accumulation Steps | 8 |
| Effective Batch Size | 16 |
| Max Sequence Length | 2048 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Weight Decay | 0.01 |
| Optimizer | adamw_torch_fused |
| Max Grad Norm | 1.0 |
| Precision | BF16 |
| Gradient Checkpointing | Enabled |
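As a config fragment, the table above maps onto TRL's `SFTConfig` roughly as follows. This is a sketch assuming a recent TRL release; argument names (e.g. `max_seq_length`) can differ slightly between TRL versions, and `output_dir` is a hypothetical path:

```python
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="qwen-pii-sft",          # hypothetical output path
    learning_rate=2e-4,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,      # effective batch size: 2 * 8 = 16
    max_seq_length=2048,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    optim="adamw_torch_fused",
    max_grad_norm=1.0,
    bf16=True,
    gradient_checkpointing=True,
)
```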
LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.1 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Bias | None |
| Task Type | CAUSAL_LM |
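The LoRA table corresponds to a PEFT `LoraConfig` along these lines (a config fragment mirroring the values above, not the verbatim training script):

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
```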
Evaluation
Evaluated on the 2,000-sample test split using seqeval.
| Metric | Score |
|---|---|
| Precision | 0.9658 |
| Recall | 0.9272 |
| F1 | 0.9461 |
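For intuition, span-level precision, recall, and F1 can be computed by matching (entity type, value) pairs between gold and predicted outputs. The sketch below is a simplified multiset-matching version, not the exact seqeval pipeline used for the scores above:

```python
from collections import Counter

def span_prf(gold: dict, pred: dict):
    # Flatten each {entity_type: [values]} dict into a multiset of
    # (type, value) pairs; the intersection counts true positives.
    g = Counter((t, v) for t, vals in gold.items() for v in vals)
    p = Counter((t, v) for t, vals in pred.items() for v in vals)
    tp = sum((g & p).values())
    precision = tp / max(sum(p.values()), 1)
    recall = tp / max(sum(g.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1
```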
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import json
model_id = "daxa-ai/qwen-synthetic-v1-ckpt-400"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
system_prompt = """You are a Named Entity Recognition assistant. Extract the following entities from the input text and output as JSON:
- CREDIT_CARD
- US_SSN
- EMAIL
- PHONE
- DATE_OF_BIRTH
- IP_ADDRESS
- MEDICAL_RECORD_NUMBER
- BANK_ROUTING_NUMBER
- LICENSE_PLATE
- IBAN
- SWIFT
- BBAN
- US_BANK_ACCOUNT
- VEHICLE_VIN
- US_PASSPORT
- US_DRIVERS_LICENSE
- HEALTH_INSURANCE_NUMBER
- INDIA_AADHAAR
- AADHAR_ID
- INDIA_PAN
- US_ITIN
- GITHUB_TOKEN
- AWS_ACCESS_KEY
- AZURE_KEY_ID
- SLACK_TOKEN
- HONG_KONG_ID
IMPORTANT RULES:
- Always include ALL entity keys in your output
- Use empty arrays [] for entity types that are not found in the text
- Extract the exact span for each entity (only the entity value – no start/end offsets)
- Output valid JSON only"""
text = "Your input text here."
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": text},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)
output = model.generate(input_ids, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
result = json.loads(response)
print(result)
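The `json.loads(response)` call above assumes the model emits bare JSON. If generations occasionally wrap the object in prose or markdown fences, a small post-processing step can make parsing more robust. This is a hypothetical helper, not part of the model's official API:

```python
import json
import re

def parse_model_json(text: str) -> dict:
    # Extract the outermost {...} block so surrounding prose or ``` fences
    # do not break json.loads.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))
```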
Acknowledgements
- Base model: Qwen3-4B-Instruct-2507 by Alibaba Cloud
- Training data: daxa-ai/synthetic-pii-dataset