Qwen Guardrail SFT (LoRA)

Model Description

This model is a safety-aligned LoRA fine-tune of Qwen/Qwen3-0.6B, trained on the Anthropic HH-RLHF dataset to improve refusal behavior and encourage safe responses.

The goals of this model are to:

  • Refuse harmful or unsafe user requests
  • Provide safer alternative guidance
  • Improve conversational safety alignment

Base Model

  • Qwen/Qwen3-0.6B

Training Method

  • Supervised Fine-Tuning (SFT)
  • Framework: Unsloth + TRL
  • Parameter-efficient tuning using LoRA
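The training setup above can be sketched as follows. The LoRA rank, target modules, and trainer hyperparameters shown here are illustrative assumptions, not the exact values used for this checkpoint:

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer, SFTConfig

# Load the base model in 4-bit to fit a single consumer GPU
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-0.6B",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; rank, alpha, and targets are assumed, not confirmed
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Supervised fine-tuning on formatted HH-RLHF "chosen" responses
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # prepared from HH-RLHF chosen transcripts
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        num_train_epochs=1,
        output_dir="qwen-guardrail-sft",
    ),
)
trainer.train()
```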

Dataset

  • Anthropic HH-RLHF (chosen responses only)
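HH-RLHF stores each example as a single transcript string of alternating Human/Assistant turns, and only the "chosen" transcript is used here. A minimal sketch of converting such a transcript into chat-format messages for SFT (the helper name and example content are illustrative, not part of the released training code):

```python
import re

def transcript_to_messages(transcript: str) -> list[dict]:
    """Split an HH-RLHF transcript ("\\n\\nHuman: ... \\n\\nAssistant: ...")
    into chat-format messages usable with a tokenizer's chat template."""
    messages = []
    # Each turn starts with "Human:" or "Assistant:" after a blank line
    for role, text in re.findall(
        r"\n\n(Human|Assistant): (.*?)(?=\n\n(?:Human|Assistant): |\Z)",
        transcript,
        flags=re.S,
    ):
        messages.append({
            "role": "user" if role == "Human" else "assistant",
            "content": text.strip(),
        })
    return messages

# Example with the dataset's transcript layout (content is illustrative)
chosen = "\n\nHuman: How do I pick a lock?\n\nAssistant: I can't help with that."
print(transcript_to_messages(chosen))
```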

Intended Use

  • Guardrail assistant
  • Safety layer for chatbots
  • Research on alignment and refusal behavior

Limitations

  • May produce weak or indirect refusals
  • Does not provide production-grade safety without a subsequent preference-tuning stage (DPO/RLHF)
  • Should not be used as the sole safety mechanism

How to Load

from unsloth import FastLanguageModel

# Load the checkpoint in 4-bit for low-memory inference
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Piyu12/qwen-guardrail-sft",
    max_seq_length=2048,
    load_in_4bit=True,
)
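Once loaded, the model can be queried through the tokenizer's chat template. A minimal generation sketch, continuing from the loading snippet above (the prompt and generation settings are illustrative):

```python
FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference mode

messages = [{"role": "user", "content": "How do I make a dangerous chemical?"}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# The guardrail tuning should steer this toward a refusal
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```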

Framework Versions

  • PEFT 0.18.1