
# daxa-ai/qwen-synthetic-v1-ckpt-400: Fine-tuned for PII Named Entity Recognition

## Model Description

This model is a fine-tuned version of Qwen/Qwen3-4B-Instruct-2507 trained to extract PII entities from unstructured text. Given an input text, the model outputs a structured JSON object containing all detected PII entities, organized by entity type. Entity types not present in the input are returned as empty arrays.

The model was trained on the daxa-ai/synthetic-pii-dataset dataset, processed into a conversational (chat-template) format compatible with Qwen's `<|im_start|>` / `<|im_end|>` instruction format.


## Model Overview

| Field | Details |
|---|---|
| Base Model | Qwen/Qwen3-4B-Instruct-2507 |
| Task | Named Entity Recognition (NER) – Personally Identifiable Information (PII) |
| Fine-tuning Method | Supervised Fine-Tuning (SFT) with LoRA (PEFT) |
| Training Framework | TRL `SFTTrainer` + Hugging Face Transformers + PEFT |

## Target Entity Types

The model is trained to detect the following PII entity types:

CREDIT_CARD, US_SSN, EMAIL, PHONE, DATE_OF_BIRTH, IP_ADDRESS, MEDICAL_RECORD_NUMBER, BANK_ROUTING_NUMBER, LICENSE_PLATE, IBAN, SWIFT, BBAN, US_BANK_ACCOUNT, VEHICLE_VIN, US_PASSPORT, US_DRIVERS_LICENSE, HEALTH_INSURANCE_NUMBER, INDIA_AADHAAR, AADHAR_ID, INDIA_PAN, US_ITIN, GITHUB_TOKEN, AWS_ACCESS_KEY, AZURE_KEY_ID, SLACK_TOKEN, HONG_KONG_ID
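Since the model must return every entity key on every call, the expected output schema can be initialized as a dictionary mapping each of the 26 entity types above to an empty list, as a sketch:

```python
# All entity types the model is trained to detect (from the list above)
ENTITY_TYPES = [
    "CREDIT_CARD", "US_SSN", "EMAIL", "PHONE", "DATE_OF_BIRTH", "IP_ADDRESS",
    "MEDICAL_RECORD_NUMBER", "BANK_ROUTING_NUMBER", "LICENSE_PLATE", "IBAN",
    "SWIFT", "BBAN", "US_BANK_ACCOUNT", "VEHICLE_VIN", "US_PASSPORT",
    "US_DRIVERS_LICENSE", "HEALTH_INSURANCE_NUMBER", "INDIA_AADHAAR",
    "AADHAR_ID", "INDIA_PAN", "US_ITIN", "GITHUB_TOKEN", "AWS_ACCESS_KEY",
    "AZURE_KEY_ID", "SLACK_TOKEN", "HONG_KONG_ID",
]

# An "empty" result: every key present, every value an empty array
empty_result = {entity: [] for entity in ENTITY_TYPES}
```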


## Input / Output Format

The model expects a system prompt followed by a user message containing the raw text to analyze.

**System Prompt:**

```
You are a Named Entity Recognition assistant. Extract the following entities from the input text and output as JSON:

- <entity_type_1>
- <entity_type_2>
- ...

IMPORTANT RULES:
- Always include ALL entity keys in your output
- Use empty arrays [] for entity types that are not found in the text
- Extract the exact span for each entity (only the entity value, no start/end offsets)
- Output valid JSON only
```

**Example Input:**

```
John Smith lives at 123 Main Street and his email is john.smith@email.com.
```

**Example Output:**

```json
{
  "EMAIL": ["john.smith@email.com"],
  "PHONE": [],
  "US_SSN": [],
  ...
}
```

## Training Details

### Dataset

| Split | Samples |
|---|---|
| Train | 7,600 |
| Test / Eval | 2,000 |

Source: daxa-ai/synthetic-pii-dataset

Each sample is converted to a 3-turn `messages` format: system → user → assistant (JSON string).
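As a minimal sketch, one dataset sample could be converted into this 3-turn chat format roughly as follows (the field names `text` and `entities` are assumptions about the dataset schema, and the system prompt is abbreviated):

```python
import json

def to_messages(sample: dict, system_prompt: str) -> list[dict]:
    """Convert one dataset sample into the 3-turn chat format:
    system -> user -> assistant (assistant content is a JSON string)."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": sample["text"]},
        {"role": "assistant", "content": json.dumps(sample["entities"])},
    ]

# Hypothetical sample illustrating the conversion
sample = {
    "text": "Email me at jane@example.com",
    "entities": {"EMAIL": ["jane@example.com"]},
}
messages = to_messages(sample, "You are a Named Entity Recognition assistant...")
```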

### Hyperparameters

| Parameter | Value |
|---|---|
| Learning Rate | 2e-4 |
| Epochs | 3 |
| Per-device Train Batch Size | 2 |
| Gradient Accumulation Steps | 8 |
| Effective Batch Size | 16 |
| Max Sequence Length | 2048 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Weight Decay | 0.01 |
| Optimizer | adamw_torch_fused |
| Max Grad Norm | 1.0 |
| Precision | BF16 |
| Gradient Checkpointing | Enabled |
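The effective batch size follows from the per-device batch size times the gradient accumulation steps (assuming single-device training), and together with the 7,600 train samples and warmup ratio the rough step schedule can be derived:

```python
# Hyperparameters from the table above
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
num_devices = 1  # assumption: single-device training

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Derived schedule (assuming all 7,600 train samples are used each epoch)
train_samples, epochs, warmup_ratio = 7_600, 3, 0.1
steps_per_epoch = train_samples // effective_batch_size
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_ratio)
```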

### LoRA Configuration

| Parameter | Value |
|---|---|
| Rank (r) | 64 |
| Alpha | 128 |
| Dropout | 0.1 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Bias | None |
| Task Type | CAUSAL_LM |
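The table above maps onto a PEFT `LoraConfig` roughly as follows (a sketch assuming the `peft` library is installed; argument names follow the PEFT API):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,            # LoRA rank
    lora_alpha=128,  # scaling factor: lora_alpha / r = 2.0
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    task_type="CAUSAL_LM",
)
```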

## Evaluation

Evaluated on the 2,000-sample test split using seqeval.

| Metric | Score |
|---|---|
| Precision | 0.9658 |
| Recall | 0.9272 |
| F1 | 0.9461 |
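F1 is the harmonic mean of precision and recall, and the reported scores are internally consistent (a quick sanity check, not the seqeval implementation):

```python
precision, recall = 0.9658, 0.9272

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
```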

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import json

model_id = "daxa-ai/qwen-synthetic-v1-ckpt-400"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

system_prompt = """You are a Named Entity Recognition assistant. Extract the following entities from the input text and output as JSON:

- CREDIT_CARD
- US_SSN
- EMAIL
- PHONE
- DATE_OF_BIRTH
- IP_ADDRESS
- MEDICAL_RECORD_NUMBER
- BANK_ROUTING_NUMBER
- LICENSE_PLATE
- IBAN
- SWIFT
- BBAN
- US_BANK_ACCOUNT
- VEHICLE_VIN
- US_PASSPORT
- US_DRIVERS_LICENSE
- HEALTH_INSURANCE_NUMBER
- INDIA_AADHAAR
- AADHAR_ID
- INDIA_PAN
- US_ITIN
- GITHUB_TOKEN
- AWS_ACCESS_KEY
- AZURE_KEY_ID
- SLACK_TOKEN
- HONG_KONG_ID

IMPORTANT RULES:
- Always include ALL entity keys in your output
- Use empty arrays [] for entity types that are not found in the text
- Extract the exact span for each entity (only the entity value, no start/end offsets)
- Output valid JSON only"""

text = "Your input text here."

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": text},
]

# Build the prompt with Qwen's chat template and append the generation prompt
input_ids = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

# Greedy decoding keeps the JSON output deterministic
output = model.generate(input_ids, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, then parse the JSON
response = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
result = json.loads(response)
print(result)
```
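Greedy decoding usually yields clean JSON here, but a small fallback parser makes the pipeline more robust if the model ever emits surrounding text (a defensive sketch, not part of the original recipe):

```python
import json

def parse_pii_response(response: str) -> dict:
    """Parse the model's JSON output, falling back to the outermost {...} span."""
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        start, end = response.find("{"), response.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(response[start:end + 1])
```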
