# Extended-refusal LoRA ablations (Qwen2.5-3B-Instruct)
This repository holds four LoRA adapters trained on the extended-refusal dataset. Each subfolder is one run; all share the same base model and LoRA architecture but differ in which part of the model response was supervised (see Training).
## Base model

Load the adapters on top of `Qwen/Qwen2.5-3B-Instruct`.

These are PEFT LoRA weights only (~479 MB per adapter), not merged full weights. At inference you always load the base model first, then attach the adapter from the subfolder you want.
## Subfolders (which adapter to use)

| Subfolder on the Hub | Training target (`response_aspect`) | Description |
|---|---|---|
| `full-lora` | `full` | Supervise the full extended-refusal style response. |
| `explanation-only-lora` | `explanation_only` | Supervise only the explanation segment. |
| `justification-only-lora` | `justification_only` | Supervise only the justification segment. |
| `refusal-only-lora` | `refusal_only` | Supervise only the refusal segment. |
Each subfolder contains at least: `adapter_model.safetensors`, `adapter_config.json`, and tokenizer files (`tokenizer.json`, `tokenizer_config.json`, `chat_template.jinja`, etc.) aligned with the base instruct model.
## Installation

```shell
pip install "transformers>=4.43" "peft>=0.11" torch accelerate
```

Use a recent `transformers` / `peft` pair that supports Qwen2.5 and `PeftModel.from_pretrained(..., subfolder=...)`.
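As a quick sanity check before loading anything, a snippet along these lines prints the installed versions against the minimums suggested above (a minimal sketch; the `meets_minimum` helper is just for illustration and only compares major/minor components):

```python
from importlib.metadata import version, PackageNotFoundError

def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare dotted version strings numerically on major/minor (ignores pre-release tags)."""
    def parse(v: str):
        return tuple(int(p) for p in v.split(".")[:2] if p.isdigit())
    return parse(installed) >= parse(minimum)

# Assumed minimums taken from the install line above.
for pkg, minimum in [("transformers", "4.43"), ("peft", "0.11")]:
    try:
        installed = version(pkg)
        status = "ok" if meets_minimum(installed, minimum) else "too old"
        print(f"{pkg} {installed}: {status}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```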
## How to load (Python)

Pick `SUBFOLDER` from the table above (e.g. `full-lora`).
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL_ID = "Qwen/Qwen2.5-3B-Instruct"
ADAPTER_REPO_ID = "CSMaya/er_ablations_qwen2.5_3b"
SUBFOLDER = "full-lora"  # or explanation-only-lora, justification-only-lora, refusal-only-lora

# Tokenizer shipped with the adapter (aligned with the base instruct model)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_REPO_ID, subfolder=SUBFOLDER, trust_remote_code=True)

# Load the base model first ...
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    torch_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
# ... then attach the LoRA adapter from the chosen subfolder
model = PeftModel.from_pretrained(model, ADAPTER_REPO_ID, subfolder=SUBFOLDER)

# Example chat turn (match the Qwen2.5 Instruct chat format)
messages = [{"role": "user", "content": "Your prompt here."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
## Training

Training script and configs live in the course project (`train_sft.py` + YAML under `configs/`). Summary for these Qwen runs:
| Setting | Value |
|---|---|
| Base model | Qwen/Qwen2.5-3B-Instruct |
| Dataset | HarethahMo/extended-refusal (train split) |
| Method | LoRA (not QLoRA) on attention + MLP projections |
| LoRA r / alpha / dropout | 64 / 128 / 0.05 |
| Epochs | 3 |
| Learning rate | 1e-6 |
| Max sequence length | 2048 |
| Per-device batch / grad accumulation | 1 / 8 |
| Precision | bf16 when supported |
| Gradient checkpointing | on |
`response_aspect` selects which segment of the formatted assistant output contributes to the SFT loss (`full`, `explanation_only`, `justification_only`, `refusal_only`). See the project's `train_sft.py` flag `--response_aspect` and dataset loader for details.
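Conceptually, segment-restricted SFT losses are typically implemented by setting label positions outside the supervised span to the ignore index, so cross-entropy skips them. A minimal sketch of that idea (illustrative only; `mask_labels` is a hypothetical helper, not code from `train_sft.py`):

```python
IGNORE_INDEX = -100  # the standard ignore value for PyTorch cross-entropy loss

def mask_labels(input_ids, segment_start, segment_end):
    """Keep loss only on token positions in [segment_start, segment_end); ignore the rest."""
    return [
        tok if segment_start <= i < segment_end else IGNORE_INDEX
        for i, tok in enumerate(input_ids)
    ]

# Toy example: positions 0-3 are the prompt, 4-6 the supervised segment
# (e.g. the refusal for refusal_only), 7-9 the rest of the response.
labels = mask_labels(list(range(10)), 4, 7)
print(labels)  # → [-100, -100, -100, -100, 4, 5, 6, -100, -100, -100]
```

Under `full`, the supervised span would cover the entire assistant response; the three `*_only` settings restrict it to one segment.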
## Citation
If you use these adapters, cite the extended-refusal dataset and the Qwen2.5 base model per their respective model/dataset cards.