# Extended-refusal LoRA ablations (Qwen2.5-3B-Instruct)
This repository holds four LoRA adapters trained on the extended-refusal dataset. Each subfolder is one run; all share the same base model and LoRA architecture but differ in which part of the model response was supervised (see Training).
## Base model

Load the adapters on top of `Qwen/Qwen2.5-3B-Instruct`.

These are PEFT LoRA weights only (~479 MB per adapter), not merged full weights. At inference you always load the base model first, then attach the adapter from the subfolder you want.
## Subfolders (which adapter to use)

| Subfolder on the Hub | Training target (`response_aspect`) | Description |
|---|---|---|
| `full-lora` | `full` | Supervise the full extended-refusal style response. |
| `explanation-only-lora` | `explanation_only` | Supervise only the explanation segment. |
| `justification-only-lora` | `justification_only` | Supervise only the justification segment. |
| `refusal-only-lora` | `refusal_only` | Supervise only the refusal segment. |
Each subfolder contains at least: `adapter_model.safetensors`, `adapter_config.json`, and tokenizer files (`tokenizer.json`, `tokenizer_config.json`, `chat_template.jinja`, etc.) aligned with the base instruct model.
## Installation

```shell
pip install "transformers>=4.43" "peft>=0.11" torch accelerate
```

Use a recent `transformers` / `peft` pair that supports Qwen2.5 and `PeftModel.from_pretrained(..., subfolder=...)`.
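As a quick sanity check before loading anything, a snippet along these lines prints the installed versions against the minimums suggested above (a minimal sketch; the `meets_minimum` helper is just for illustration and only compares major/minor components):

```python
from importlib.metadata import version, PackageNotFoundError

def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare dotted version strings numerically on major/minor (ignores pre-release tags)."""
    def parse(v: str):
        return tuple(int(p) for p in v.split(".")[:2] if p.isdigit())
    return parse(installed) >= parse(minimum)

# Assumed minimums taken from the install line above.
for pkg, minimum in [("transformers", "4.43"), ("peft", "0.11")]:
    try:
        installed = version(pkg)
        status = "ok" if meets_minimum(installed, minimum) else "too old"
        print(f"{pkg} {installed}: {status}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```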
## How to load (Python)

Pick `SUBFOLDER` from the table above (e.g. `full-lora`).
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_MODEL_ID = "Qwen/Qwen2.5-3B-Instruct"
ADAPTER_REPO_ID = "CSMaya/er_ablations_qwen2.5_3b"
SUBFOLDER = "full-lora"  # or explanation-only-lora, justification-only-lora, refusal-only-lora

# Tokenizer shipped with the adapter (aligned with the base instruct model)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_REPO_ID, subfolder=SUBFOLDER, trust_remote_code=True)

# Load the base model first ...
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    torch_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
# ... then attach the LoRA adapter from the chosen subfolder
model = PeftModel.from_pretrained(model, ADAPTER_REPO_ID, subfolder=SUBFOLDER)

# Example chat turn (match the Qwen2.5 Instruct chat format)
messages = [{"role": "user", "content": "Your prompt here."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
## Training

Training script and configs live in the course project (`train_sft.py` + YAML under `configs/`). Summary for these Qwen runs:
| Setting | Value |
|---|---|
| Base model | Qwen/Qwen2.5-3B-Instruct |
| Dataset | HarethahMo/extended-refusal (train split) |
| Method | LoRA (not QLoRA) on attention + MLP projections |
| LoRA r / alpha / dropout | 64 / 128 / 0.05 |
| Epochs | 3 |
| Learning rate | 1e-6 |
| Max sequence length | 2048 |
| Per-device batch / grad accumulation | 1 / 8 |
| Precision | bf16 when supported |
| Gradient checkpointing | on |
`response_aspect` selects which segment of the formatted assistant output contributes to the SFT loss (`full`, `explanation_only`, `justification_only`, `refusal_only`). See the project's `train_sft.py` flag `--response_aspect` and dataset loader for details.
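Conceptually, segment-restricted SFT losses are typically implemented by setting label positions outside the supervised span to the ignore index, so cross-entropy skips them. A minimal sketch of that idea (illustrative only; `mask_labels` is a hypothetical helper, not code from `train_sft.py`):

```python
IGNORE_INDEX = -100  # the standard ignore value for PyTorch cross-entropy loss

def mask_labels(input_ids, segment_start, segment_end):
    """Keep loss only on token positions in [segment_start, segment_end); ignore the rest."""
    return [
        tok if segment_start <= i < segment_end else IGNORE_INDEX
        for i, tok in enumerate(input_ids)
    ]

# Toy example: positions 0-3 are the prompt, 4-6 the supervised segment
# (e.g. the refusal for refusal_only), 7-9 the rest of the response.
labels = mask_labels(list(range(10)), 4, 7)
print(labels)  # → [-100, -100, -100, -100, 4, 5, 6, -100, -100, -100]
```

Under `full`, the supervised span would cover the entire assistant response; the three `*_only` settings restrict it to one segment.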
## Citation
If you use these adapters, cite the extended-refusal dataset and the Qwen2.5 base model per their respective model/dataset cards.