daxa-ai/qwen-synthetic-v1-ckpt-400 – Fine-tuned for PII Named Entity Recognition
Model Description
This model is a fine-tuned version of Qwen/Qwen3-4B-Instruct-2507 trained to extract PII entities from unstructured text. Given an input text, the model outputs a structured JSON object containing all detected PII entities, organized by entity type. Entity types not present in the input are returned as empty arrays.
The model was trained on daxa-ai/synthetic-pii-dataset, processed into a conversational (chat-template) format that uses Qwen's <|im_start|> / <|im_end|> instruction markers.
Model Overview
| Field | Details |
|---|---|
| Base Model | Qwen/Qwen3-4B-Instruct-2507 |
| Task | Named Entity Recognition (NER) – Personally Identifiable Information (PII) |
| Fine-tuning Method | Supervised Fine-Tuning (SFT) with LoRA (PEFT) |
| Training Framework | TRL SFTTrainer + HuggingFace Transformers + PEFT |
Target Entity Types
The model is trained to detect the following PII entity types:
CREDIT_CARD, US_SSN, EMAIL, PHONE, DATE_OF_BIRTH, IP_ADDRESS, MEDICAL_RECORD_NUMBER, BANK_ROUTING_NUMBER, LICENSE_PLATE, IBAN, SWIFT, BBAN, US_BANK_ACCOUNT, VEHICLE_VIN, US_PASSPORT, US_DRIVERS_LICENSE, HEALTH_INSURANCE_NUMBER, INDIA_AADHAAR, AADHAR_ID, INDIA_PAN, US_ITIN, GITHUB_TOKEN, AWS_ACCESS_KEY, AZURE_KEY_ID, SLACK_TOKEN, HONG_KONG_ID
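Since the model must always emit every entity key (with empty arrays for absent types), the expected output schema can be built programmatically. A minimal sketch, where `ENTITY_TYPES` and `empty_schema` are illustrative names not taken from the training code:

```python
# The 26 entity types listed above, in the same order.
ENTITY_TYPES = [
    "CREDIT_CARD", "US_SSN", "EMAIL", "PHONE", "DATE_OF_BIRTH", "IP_ADDRESS",
    "MEDICAL_RECORD_NUMBER", "BANK_ROUTING_NUMBER", "LICENSE_PLATE", "IBAN",
    "SWIFT", "BBAN", "US_BANK_ACCOUNT", "VEHICLE_VIN", "US_PASSPORT",
    "US_DRIVERS_LICENSE", "HEALTH_INSURANCE_NUMBER", "INDIA_AADHAAR",
    "AADHAR_ID", "INDIA_PAN", "US_ITIN", "GITHUB_TOKEN", "AWS_ACCESS_KEY",
    "AZURE_KEY_ID", "SLACK_TOKEN", "HONG_KONG_ID",
]

def empty_schema() -> dict:
    # Every key present, each defaulting to an empty list, matching the
    # "always include ALL entity keys" rule below.
    return {etype: [] for etype in ENTITY_TYPES}
```

A schema like this is also handy for validating or repairing model outputs that drop a key.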
Input / Output Format
The model expects a system prompt followed by a user message containing the raw text to analyze.
System Prompt:
You are a Named Entity Recognition assistant. Extract the following entities from the input text and output as JSON:
- <entity_type_1>
- <entity_type_2>
- ...
IMPORTANT RULES:
- Always include ALL entity keys in your output
- Use empty arrays [] for entity types that are not found in the text
- Extract the exact span for each entity (only the entity value – no start/end offsets)
- Output valid JSON only
Example Input:
John Smith lives at 123 Main Street and his email is john.smith@email.com.
Example Output:
{
"EMAIL": ["john.smith@email.com"],
"PHONE": [],
"US_SSN": [],
...
}
Training Details
Dataset
| Split | Samples |
|---|---|
| Train | 7,600 |
| Test / Eval | 2,000 |
Source: daxa-ai/synthetic-pii-dataset
Each sample is converted to a 3-turn messages format: system → user → assistant (JSON string).
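The conversion above can be sketched as follows. This is an illustrative helper, not the actual preprocessing code; the field names `system_prompt`, `text`, and `entities` are assumptions about the dataset's columns:

```python
import json

def to_messages(system_prompt: str, text: str, entities: dict) -> list:
    # entities is the ground-truth dict, e.g. {"EMAIL": ["a@b.com"], "PHONE": []}.
    # The assistant turn is the JSON string the model learns to emit.
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": text},
        {"role": "assistant", "content": json.dumps(entities)},
    ]
```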
Hyperparameters
| Parameter | Value |
|---|---|
| Learning Rate | 2e-4 |
| Epochs | 3 |
| Per-device Train Batch Size | 2 |
| Gradient Accumulation Steps | 8 |
| Effective Batch Size | 16 |
| Max Sequence Length | 2048 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Weight Decay | 0.01 |
| Optimizer | adamw_torch_fused |
| Max Grad Norm | 1.0 |
| Precision | BF16 |
| Gradient Checkpointing | Enabled |
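As a config fragment, the table above maps onto TRL's `SFTConfig` roughly as follows. This is a sketch assuming a recent TRL release; argument names (e.g. `max_seq_length`) can differ slightly between TRL versions, and `output_dir` is a hypothetical path:

```python
from trl import SFTConfig

training_args = SFTConfig(
    output_dir="qwen-pii-sft",          # hypothetical output path
    learning_rate=2e-4,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,      # effective batch size: 2 * 8 = 16
    max_seq_length=2048,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    optim="adamw_torch_fused",
    max_grad_norm=1.0,
    bf16=True,
    gradient_checkpointing=True,
)
```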
LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.1 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Bias | None |
| Task Type | CAUSAL_LM |
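The LoRA table corresponds to a PEFT `LoraConfig` along these lines (a config fragment mirroring the values above, not the verbatim training script):

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
```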
Evaluation
Evaluated on the 2,000-sample test split using seqeval.
| Metric | Score |
|---|---|
| Precision | 0.9658 |
| Recall | 0.9272 |
| F1 | 0.9461 |
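For intuition, span-level precision, recall, and F1 can be computed by matching (entity type, value) pairs between gold and predicted outputs. The sketch below is a simplified multiset-matching version, not the exact seqeval pipeline used for the scores above:

```python
from collections import Counter

def span_prf(gold: dict, pred: dict):
    # Flatten each {entity_type: [values]} dict into a multiset of
    # (type, value) pairs; the intersection counts true positives.
    g = Counter((t, v) for t, vals in gold.items() for v in vals)
    p = Counter((t, v) for t, vals in pred.items() for v in vals)
    tp = sum((g & p).values())
    precision = tp / max(sum(p.values()), 1)
    recall = tp / max(sum(g.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1
```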
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
import json
model_id = "daxa-ai/qwen-synthetic-v1-ckpt-400"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
system_prompt = """You are a Named Entity Recognition assistant. Extract the following entities from the input text and output as JSON:
- CREDIT_CARD
- US_SSN
- EMAIL
- PHONE
- DATE_OF_BIRTH
- IP_ADDRESS
- MEDICAL_RECORD_NUMBER
- BANK_ROUTING_NUMBER
- LICENSE_PLATE
- IBAN
- SWIFT
- BBAN
- US_BANK_ACCOUNT
- VEHICLE_VIN
- US_PASSPORT
- US_DRIVERS_LICENSE
- HEALTH_INSURANCE_NUMBER
- INDIA_AADHAAR
- AADHAR_ID
- INDIA_PAN
- US_ITIN
- GITHUB_TOKEN
- AWS_ACCESS_KEY
- AZURE_KEY_ID
- SLACK_TOKEN
- HONG_KONG_ID
IMPORTANT RULES:
- Always include ALL entity keys in your output
- Use empty arrays [] for entity types that are not found in the text
- Extract the exact span for each entity (only the entity value – no start/end offsets)
- Output valid JSON only"""
text = "Your input text here."
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": text},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)
output = model.generate(input_ids, max_new_tokens=512, do_sample=False)
response = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
result = json.loads(response)
print(result)
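The `json.loads(response)` call above assumes the model emits bare JSON. If generations occasionally wrap the object in prose or markdown fences, a small post-processing step can make parsing more robust. This is a hypothetical helper, not part of the model's official API:

```python
import json
import re

def parse_model_json(text: str) -> dict:
    # Extract the outermost {...} block so surrounding prose or ``` fences
    # do not break json.loads.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))
```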
Acknowledgements
- Base model: Qwen3-4B-Instruct-2507 by Alibaba Cloud
- Training data: daxa-ai/synthetic-pii-dataset