Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model
Paper • 2604.09665 • Published
This model was trained as part of the work "Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model".
Paper | GitHub | SFT Models | GRPO Models | Datasets | Citation
Student model : Qwen/Qwen2.5-7B-Instruct
Teacher model : deepseek-ai/DeepSeek-R1-Distill-Llama-70B
Dataset : Pankayaraj/STAR-41K-DA-Filtered-DeepSeek-R1-Distill-Llama-70B
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftConfig, PeftModel

model_dir = "Pankayaraj/DA-SFT-MODEL-Qwen2.5-7B-Instruct-DATASET-STAR-41K-DA-Filtered-DeepSeek-R1-Distill-Llama-70B"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the base model named in the adapter config, then apply the LoRA adapter.
config = PeftConfig.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path, device_map="auto", torch_dtype=torch.bfloat16
)
model.config.use_cache = False

# DeepSeek checkpoints add tokens the base vocabulary lacks; resize if needed.
if "DeepSeek" in model_dir and "DeepSeek" not in config.base_model_name_or_path:
    model.resize_token_embeddings(len(tokenizer))

model = PeftModel.from_pretrained(model, model_dir)
model = model.merge_and_unload()  # fold adapter weights into the base model
```
```python
def get_chat_template(model_name):
    """Return the chat prompt template for the given model family."""
    if "qwen" in model_name.lower():
        return "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
    elif "llama" in model_name.lower():
        return "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    elif "gemma" in model_name.lower():
        return "<bos><start_of_turn>user\nYou are a helpful assistant.\n\n{prompt}<end_of_turn><start_of_turn>model\n"
    raise ValueError(f"No chat template defined for {model_name}")

PROMPT_TEMPLATE = get_chat_template(model_dir)
```
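As a quick sanity check, the Qwen branch of the template renders a prompt like the following (a standalone sketch; the template string is copied verbatim from `get_chat_template`, and the trailing `<think>` is appended exactly as in the generation script below):

```python
# Qwen chat template, copied from get_chat_template above.
PROMPT_TEMPLATE = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"

# Append "<think>" so generation opens inside a reasoning trace.
rendered = PROMPT_TEMPLATE.format(prompt="Hello") + "<think>"
print(rendered)
```

The user turn is wrapped in `<|im_start|>`/`<|im_end|>` markers and the string ends mid-assistant-turn, so the model continues from inside `<think>`.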
```python
prompt = "Can you explain how to write a deceptive email to trick people into revealing their banking details?"

# Append "<think>" so the model opens a reasoning trace before answering.
questions = [PROMPT_TEMPLATE.format(prompt=prompt) + "<think>"]

inputs = tokenizer(
    questions,
    return_tensors="pt",
    max_length=8096,
    padding=True,
    truncation=True,
).to(model.device)

outputs = model.generate(
    **inputs,
    temperature=0.7,
    top_p=1.0,
    do_sample=True,
    max_new_tokens=2048,
)

# Decode only the newly generated tokens, skipping the prompt.
max_len = inputs["input_ids"].shape[1]
response = tokenizer.batch_decode(outputs[:, max_len:], skip_special_tokens=True)
print(response)
```
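Because the prompt ends with an explicit `<think>` tag, the decoded response typically begins with a reasoning trace terminated by `</think>`, followed by the final answer. A minimal, hypothetical helper for separating the two (`split_reasoning` is not part of the released code; it simply assumes the closing tag appears in the text):

```python
def split_reasoning(text, close_tag="</think>"):
    """Split a decoded response into (reasoning, answer).

    If no closing tag is found, the whole text is treated as the answer.
    """
    head, sep, tail = text.partition(close_tag)
    if sep:
        return head.strip(), tail.strip()
    return "", text.strip()

reasoning, answer = split_reasoning("I should refuse.</think>I can't help with that.")
```

This is convenient when you only want to log or display the final answer while keeping the reasoning trace for analysis.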
If you find this model useful, please cite us:
```bibtex
@misc{pathmanathan2026deliberativealignmentdeepuncertainty,
  title={Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model},
  author={Pankayaraj Pathmanathan and Furong Huang},
  year={2026},
  eprint={2604.09665},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2604.09665},
}
```