Extended-refusal LoRA ablations (Qwen2.5-3B-Instruct)

This repository holds four LoRA adapters trained on the extended-refusal dataset. Each subfolder is one run; all share the same base model and LoRA architecture but differ in which part of the model response was supervised (see Training).

Base model

Load the adapters on top of:

Qwen/Qwen2.5-3B-Instruct

These are PEFT LoRA weights only (~479MB per adapter), not merged full weights. At inference you always load the base model first, then attach the adapter from the subfolder you want.

Subfolders (which adapter to use)

| Subfolder on the Hub | Training target (response_aspect) | Description |
|---|---|---|
| full-lora | full | Supervise the full extended-refusal style response. |
| explanation-only-lora | explanation_only | Supervise only the explanation segment. |
| justification-only-lora | justification_only | Supervise only the justification segment. |
| refusal-only-lora | refusal_only | Supervise only the refusal segment. |

Each subfolder contains at least: adapter_model.safetensors, adapter_config.json, and tokenizer files (tokenizer.json, tokenizer_config.json, chat_template.jinja, etc.) aligned with the base instruct model.

Installation

pip install "transformers>=4.43" "peft>=0.11" torch accelerate

Use a recent transformers / peft pair that supports Qwen2.5 and PeftModel.from_pretrained(..., subfolder=...).

How to load (Python)

Pick SUBFOLDER from the table above (e.g. full-lora).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL_ID = "Qwen/Qwen2.5-3B-Instruct"
ADAPTER_REPO_ID = "CSMaya/er_ablations_qwen2.5_3b"
SUBFOLDER = "full-lora"  # or explanation-only-lora, justification-only-lora, refusal-only-lora

tokenizer = AutoTokenizer.from_pretrained(ADAPTER_REPO_ID, subfolder=SUBFOLDER, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    torch_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, ADAPTER_REPO_ID, subfolder=SUBFOLDER)

# Example chat turn (match Qwen2.5 Instruct chat format)
messages = [{"role": "user", "content": "Your prompt here."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
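Note that decoding outputs[0] prints the prompt and the reply together, since generate returns the prompt tokens followed by the new tokens. To keep only the model's reply, slice off the prompt length before decoding. A minimal sketch with stand-in token lists (the real code would use inputs["input_ids"][0] and outputs[0]):

```python
# Stand-ins for the prompt ids and the full generated sequence.
prompt_ids = [1, 2, 3]
generated = [1, 2, 3, 42, 43, 44]

# generate() echoes the prompt, so the reply starts at len(prompt_ids).
new_tokens = generated[len(prompt_ids):]
# new_tokens == [42, 43, 44]
```

In the snippet above this corresponds to tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).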

Training

Training script and configs live in the course project (train_sft.py + YAML under configs/). Summary for these Qwen runs:

| Setting | Value |
|---|---|
| Base model | Qwen/Qwen2.5-3B-Instruct |
| Dataset | HarethahMo/extended-refusal (train split) |
| Method | LoRA (not QLoRA) on attention + MLP projections |
| LoRA r / alpha / dropout | 64 / 128 / 0.05 |
| Epochs | 3 |
| Learning rate | 1e-6 |
| Max sequence length | 2048 |
| Per-device batch / grad accumulation | 1 / 8 |
| Precision | bf16 when supported |
| Gradient checkpointing | on |

response_aspect selects which segment of the formatted assistant output contributes to the SFT loss (full, explanation_only, justification_only, refusal_only). See the project train_sft.py flag --response_aspect and dataset loader for details.
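The usual way to restrict SFT loss to one segment is to mask the labels outside that segment to -100, the index PyTorch's cross-entropy loss ignores. The following is an illustrative sketch of that mechanism, not the project's actual implementation; the segment boundaries here are hypothetical token spans:

```python
IGNORE_INDEX = -100  # label value ignored by torch cross-entropy loss


def mask_to_segment(labels, keep_start, keep_end):
    """Keep loss only on labels[keep_start:keep_end]; mask everything else."""
    return [
        tok if keep_start <= i < keep_end else IGNORE_INDEX
        for i, tok in enumerate(labels)
    ]


# e.g. a refusal_only-style run supervising tokens 2..4 of a 6-token response
masked = mask_to_segment([11, 12, 13, 14, 15, 16], 2, 4)
# masked == [-100, -100, 13, 14, -100, -100]
```

With response_aspect=full the span would cover the whole assistant response; the three *_only variants would narrow it to the corresponding segment.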

Citation

If you use these adapters, cite the extended-refusal dataset and the Qwen2.5 base model per their respective model/dataset cards.
