# Qwen Guardrail SFT (LoRA)
## Model Description
This model is a safety-aligned LoRA fine-tuning of Qwen3-0.6B, trained to improve refusal behavior and safe responses using the Anthropic HH-RLHF dataset.
The goals of this model are to:
- Refuse harmful or unsafe user requests
- Provide safer alternative guidance
- Improve conversational safety alignment
## Base Model
- Qwen/Qwen3-0.6B
## Training Method
- Supervised Fine-Tuning (SFT)
- Framework: Unsloth + TRL
- Parameter-efficient tuning using LoRA
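LoRA keeps tuning cheap by learning a low-rank update (an `A: d_in x r` and `B: r x d_out` factor pair) instead of a full weight delta. A quick sketch of the arithmetic, using illustrative layer dimensions rather than the exact Qwen3-0.6B shapes:

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one LoRA adapter pair (A: d_in x r, B: r x d_out)."""
    return rank * (d_in + d_out)

# Illustrative numbers only (not the real layer shapes or training rank):
full = 1024 * 1024                      # full-rank weight update
lora = lora_params(1024, 1024, 16)      # low-rank update at rank 16
print(f"LoRA trains {lora} params vs {full} full params "
      f"({100 * lora / full:.2f}% of the layer)")
```

At rank 16 the adapter is roughly 3% of the size of a full update for this layer, which is why LoRA fits on modest hardware.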
## Dataset
- Anthropic HH-RLHF (chosen responses only)
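HH-RLHF stores each "chosen" example as a single transcript string of alternating `\n\nHuman:` / `\n\nAssistant:` turns, which needs to be split into chat messages before SFT. A minimal parsing sketch (the function name is illustrative, and it assumes strictly alternating turns starting with a Human turn, as in standard HH-RLHF examples):

```python
import re


def parse_hh_transcript(chosen: str) -> list[dict]:
    """Split an HH-RLHF 'chosen' transcript into chat-format messages.

    Assumes turns strictly alternate Human -> Assistant -> Human -> ...,
    which holds for standard HH-RLHF examples.
    """
    turns = re.split(r"\n\nHuman: |\n\nAssistant: ", chosen)
    turns = [t for t in turns if t.strip()]  # drop the leading empty chunk
    return [
        {"role": "user" if i % 2 == 0 else "assistant", "content": t.strip()}
        for i, t in enumerate(turns)
    ]
```

The resulting `messages` list can then be fed through the tokenizer's chat template to build SFT training text.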
## Intended Use
- Guardrail assistant
- Safety layer for chatbots
- Research on alignment and refusal behavior
## Limitations
- May produce weak or indirect refusals
- Not production-grade safety without a subsequent DPO/RLHF stage
- Should not be used as the sole safety mechanism
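Because refusals can be weak or indirect, a deployment typically pairs the model with an independent output check. A hypothetical keyword heuristic, purely as a sketch (not part of this model; production systems should use a trained classifier or a dedicated moderation model instead):

```python
# Illustrative refusal phrases; a real deployment would use a classifier.
REFUSAL_MARKERS = (
    "i can't", "i cannot", "i won't",
    "i'm not able", "i am not able", "i'm sorry",
)


def looks_like_refusal(response: str) -> bool:
    """Crude heuristic: does the response open with a refusal phrase?

    Checks only the first ~80 characters, since refusals usually
    lead the response. This is a sketch, not a safety guarantee.
    """
    head = response.strip().lower()[:80]
    return any(marker in head for marker in REFUSAL_MARKERS)
```

A check like this can flag non-refusals on prompts that were expected to be refused, so they can be logged or routed to a stronger safety layer.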
## How to Load

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "Piyu12/qwen-guardrail-sft",
    max_seq_length=2048,
    load_in_4bit=True,
)
```
## Framework Versions
- PEFT 0.18.1