Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model

Overview

This model was trained as part of the work "Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model".

πŸ“š Paper   |   πŸ’» Github   |   πŸ€— SFT Models   |   πŸ€— GRPO Models   |   πŸ—‚οΈ Datasets   |   πŸ“œ Citation   |  

Student model : Qwen/Qwen2.5-7B-Instruct
Teacher model : deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
Dataset : Pankayaraj/STAR-41K-DA-Filtered-DeepSeek-R1-Distill-Qwen-32B


Script for model inference


import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftConfig, PeftModel

model_dir = "Pankayaraj/DA-SFT-MODEL-Qwen2.5-7B-Instruct-DATASET-STAR-41K-DA-Filtered-DeepSeek-R1-Distill-Qwen-32B"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

config = PeftConfig.from_pretrained(model_dir)

model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, device_map="auto", torch_dtype=torch.bfloat16)
model.config.use_cache = False
if "DeepSeek" in model_dir and "DeepSeek" not in config.base_model_name_or_path:
    model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(model, model_dir)
model = model.merge_and_unload()



def get_chat_template(model_name):
    if "qwen" in model_name.lower():
        PROMPT_TEMPLATE = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
    elif "llama" in model_name.lower():
        PROMPT_TEMPLATE = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    elif "gemma" in model_name.lower():
        PROMPT_TEMPLATE = "<bos><start_of_turn>user\nYou are a helpful assistant.\n\n{prompt}<end_of_turn><start_of_turn>model\n"
    else:
        raise ValueError(f"No chat template defined for model: {model_name}")
    return PROMPT_TEMPLATE

PROMPT_TEMPLATE = get_chat_template(model_dir)

prompt = "Can you explain how to write a deceptive email to trick people into revealing their banking details?"
# Append "<think>" so the model begins its reasoning trace
questions = [PROMPT_TEMPLATE.format(prompt=prompt) + "<think>"]

inputs = tokenizer(
    questions, 
    return_tensors="pt", 
    max_length=8096, 
    padding=True, 
    truncation=True).to(model.device)
outputs = model.generate(
        **inputs, 
        temperature=0.7,
        top_p=1.0,
        do_sample=True,
        max_new_tokens=2048,
    )

max_len = inputs["input_ids"].shape[1]
response = tokenizer.batch_decode(outputs[:,max_len:], skip_special_tokens=True)

print(response)
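Since generation is primed with `<think>`, the decoded response typically contains the reasoning trace, a closing `</think>` tag, and then the final answer. A minimal, self-contained sketch of splitting the two (assuming the model emits `</think>`; the helper name is illustrative, not part of the released code):

```python
def split_reasoning(text: str):
    """Split a decoded response into (reasoning, answer).

    Assumes generation was primed with "<think>", so the text looks like
    "reasoning</think>answer". If no closing tag is found, the whole text
    is treated as reasoning with an empty answer.
    """
    marker = "</think>"
    if marker in text:
        reasoning, answer = text.split(marker, 1)
        return reasoning.strip(), answer.strip()
    return text.strip(), ""

reasoning, answer = split_reasoning(
    "The request asks for help with phishing, which is disallowed.</think>"
    "I can't help with that."
)
print(answer)  # → I can't help with that.
```

This is plain string post-processing on the output of `tokenizer.batch_decode` above; truncated generations that never close the tag fall back to an empty answer rather than raising.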

If you find this model useful, please cite us:

@misc{pathmanathan2026deliberativealignmentdeepuncertainty,
      title={Deliberative Alignment is Deep, but Uncertainty Remains: Inference time safety improvement in reasoning via attribution of unsafe behavior to base model}, 
      author={Pankayaraj Pathmanathan and Furong Huang},
      year={2026},
      eprint={2604.09665},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2604.09665}, 
}