# Qwen Guardrail SFT (LoRA)
## Model Description
This model is a safety-aligned LoRA fine-tuning of Qwen3-0.6B, trained to improve refusal behavior and safe responses using the Anthropic HH-RLHF dataset.
The goals of this model are to:
- Refuse harmful or unsafe user requests
- Provide safer alternative guidance
- Improve conversational safety alignment
## Base Model
- Qwen/Qwen3-0.6B
## Training Method
- Supervised Fine-Tuning (SFT)
- Framework: Unsloth + TRL
- Parameter-efficient tuning using LoRA
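LoRA keeps tuning cheap by learning a low-rank update (an `A: d_in x r` and `B: r x d_out` factor pair) instead of a full weight delta. A quick sketch of the arithmetic, using illustrative layer dimensions rather than the exact Qwen3-0.6B shapes:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one LoRA adapter pair (A: d_in x r, B: r x d_out)."""
    return rank * (d_in + d_out)

# Illustrative numbers only (not the real layer shapes or training rank):
full = 1024 * 1024                      # full-rank weight update
lora = lora_params(1024, 1024, 16)      # low-rank update at rank 16
print(f"LoRA trains {lora} params vs {full} full params "
      f"({100 * lora / full:.2f}% of the layer)")
```

At rank 16 the adapter is roughly 3% of the size of a full update for this layer, which is why LoRA fits on modest hardware.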
## Dataset
- Anthropic HH-RLHF (chosen responses only)
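HH-RLHF stores each "chosen" example as a single transcript string of alternating `\n\nHuman:` / `\n\nAssistant:` turns, which needs to be split into chat messages before SFT. A minimal parsing sketch (the function name is illustrative, and it assumes strictly alternating turns starting with a Human turn, as in standard HH-RLHF examples):

```python
import re


def parse_hh_transcript(chosen: str) -> list[dict]:
    """Split an HH-RLHF 'chosen' transcript into chat-format messages.

    Assumes turns strictly alternate Human -> Assistant -> Human -> ...,
    which holds for standard HH-RLHF examples.
    """
    turns = re.split(r"\n\nHuman: |\n\nAssistant: ", chosen)
    turns = [t for t in turns if t.strip()]  # drop the leading empty chunk
    return [
        {"role": "user" if i % 2 == 0 else "assistant", "content": t.strip()}
        for i, t in enumerate(turns)
    ]
```

The resulting `messages` list can then be fed through the tokenizer's chat template to build SFT training text.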
## Intended Use
- Guardrail assistant
- Safety layer for chatbots
- Research on alignment and refusal behavior
## Limitations
- May produce weak or indirect refusals
- Not production-grade safety without a subsequent DPO/RLHF stage
- Should not be used as the sole safety mechanism
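Because refusals can be weak or indirect, a deployment typically pairs the model with an independent output check. A hypothetical keyword heuristic, purely as a sketch (not part of this model; production systems should use a trained classifier or a dedicated moderation model instead):

```python
# Illustrative refusal phrases; a real deployment would use a classifier.
REFUSAL_MARKERS = (
    "i can't", "i cannot", "i won't",
    "i'm not able", "i am not able", "i'm sorry",
)


def looks_like_refusal(response: str) -> bool:
    """Crude heuristic: does the response open with a refusal phrase?

    Checks only the first ~80 characters, since refusals usually
    lead the response. This is a sketch, not a safety guarantee.
    """
    head = response.strip().lower()[:80]
    return any(marker in head for marker in REFUSAL_MARKERS)
```

A check like this can flag non-refusals on prompts that were expected to be refused, so they can be logged or routed to a stronger safety layer.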
## How to Load

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "Piyu12/qwen-guardrail-sft",
    max_seq_length=2048,
    load_in_4bit=True,
)
```
## Framework Versions
- PEFT 0.18.1