Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models
Paper: arXiv:2601.03079
This model is a Supervised Fine-Tuned (SFT) version of meta-llama/Llama-3.2-1B-Instruct, trained to diagnose moral violations in conversational replies and rewrite them to align with Moral Foundations Theory (MFT).
It was trained on MIC (Moral Integrity Corpus) using the pragmatic experimental setting — the core proposed method from the paper "Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models" (ACL 2026).
| Property | Value |
|---|---|
| Base model | meta-llama/Llama-3.2-1B-Instruct |
| Fine-tuning method | Full SFT (no LoRA — model is 1B) |
| Training benchmark | MIC (Moral Integrity Corpus) |
| Experimental setting | pragmatic |
| Train size | 2000 examples |
| Task | Correct — diagnose + rewrite jailbreak and morally problematic replies |
| Checkpoint step | 375 |
The model reasons over six moral foundations:
| Foundation | Definition |
|---|---|
| Care | Wanting someone or something to be safe, healthy, and happy |
| Fairness | Wanting individuals or groups to be treated equally or equitably |
| Liberty | Wanting people to be free to make their own decisions |
| Loyalty | Wanting unity and seeing people keep promises to an in-group |
| Authority | Wanting to respect social roles, duties, privacy, peace, and order |
| Sanctity | Wanting people and things to be clean, pure, innocent, and holy |
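For programmatic use, the six foundations and their one-line definitions can be kept in a small lookup table. This is a sketch; the dictionary name is illustrative and not part of the released code.

```python
# Six moral foundations from Moral Foundations Theory (MFT), as in the table above.
# The name MORAL_FOUNDATIONS is our own, not from the released code.
MORAL_FOUNDATIONS = {
    "Care": "wanting someone or something to be safe, healthy, and happy",
    "Fairness": "wanting individuals or groups to be treated equally or equitably",
    "Liberty": "wanting people to be free to make their own decisions",
    "Loyalty": "wanting unity and seeing people keep promises to an in-group",
    "Authority": "wanting to respect social roles, duties, privacy, peace, and order",
    "Sanctity": "wanting people and things to be clean, pure, innocent, and holy",
}
```

Such a table is handy for validating model output, e.g. checking that a diagnosed foundation is one of the six known names.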
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_id = "MoralMachine/moral-awareness-MIC-llama3.2-1B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, torch_dtype=torch.bfloat16)

# Verbatim prefix used during training; keep the wording unchanged.
MFT_PREFIX = (
    "There are the six moral foundations. "
    "Care: wanting someone or something to be safe, healthy, and happy. "
    "Fairness: wanting to see individuals or groups treated equally or equitably. "
    "Liberty: wanting people to be free to make their own decisions. "
    "Loyalty: wanting unity and seeing people keep promises or obligations to an in-group. "
    "Authority: wanting to respect social roles, duties, privacy, peace, and order. "
    "Sanctity: wanting people and things to be clean, pure, innocent, and holy."
)

question = "How should I deal with my neighbor who plays loud music at night?"
answer = "Just call the cops and get them arrested, they deserve it."

prompt = (
    f"{MFT_PREFIX} "
    f'<Prompt>: "{question}"; <Reply>: "{answer}". '
    "###Diagnosis: "
)

output = pipe(prompt, max_new_tokens=512, do_sample=False)[0]["generated_text"]
print(output)
```
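Because the generated continuation embeds both the diagnosis and (for problematic replies) a corrected answer, you may want to split the two. Below is a minimal parsing sketch, assuming the training format described under the Agree/Disagree cases; the helper name is our own, not part of the released code.

```python
import re

def parse_correction(generated: str) -> dict:
    """Split a generated continuation into diagnosis and revised reply.

    Assumes the training format: the diagnosis follows '###Diagnosis:',
    and a corrected reply, when present, appears as
    'Therefore, the <Revised Reply> is "..."'. For acceptable replies the
    model stops after the diagnosis, so 'revised_reply' is None.
    Sketch only; not part of the released code.
    """
    diagnosis = generated.split("###Diagnosis:", 1)[-1]
    match = re.search(r'<Revised Reply> is "(.*?)"', diagnosis, re.DOTALL)
    return {
        "diagnosis": diagnosis.strip(),
        "revised_reply": match.group(1) if match else None,
    }

example = (
    '... ###Diagnosis: The reply violates Care. '
    'Therefore, the <Revised Reply> is "Try talking to your neighbor first."'
)
print(parse_correction(example)["revised_reply"])
```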
Training sequences follow two formats, depending on whether the original reply was judged acceptable.

Disagree (problematic) cases, where the model learns to diagnose the violation and rewrite the reply:

```
[MFT_PREFIX] <Prompt>: "{question}"; <Reply>: "{answer}". ###Diagnosis: {pragmatic_reasoning} Therefore, the <Revised Reply> is "{worker_answer}"
```

Agree (acceptable) cases, where the model learns to recognize the reply as acceptable and stop:

```
[MFT_PREFIX] <Prompt>: "{question}"; <Reply>: "{answer}". ###Diagnosis: {pragmatic_reasoning up to step (5)}
```
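The two formats above can be assembled with a single helper. This is a sketch under stated assumptions: the function name and parameters are our own, and `MFT_PREFIX` is abbreviated here; use the full foundation-definition prefix from the usage example.

```python
# Abbreviated stand-in; use the full six-foundation prefix from the usage example.
MFT_PREFIX = "There are the six moral foundations. ..."

def build_target(question, answer, reasoning, worker_answer=None):
    """Assemble one SFT training sequence in the format above.

    If worker_answer is given (Disagree case), append the revised reply;
    otherwise (Agree case) the sequence ends after the reasoning.
    Sketch only; not part of the released code.
    """
    base = (
        f'{MFT_PREFIX} <Prompt>: "{question}"; <Reply>: "{answer}". '
        f'###Diagnosis: {reasoning}'
    )
    if worker_answer is not None:
        base += f' Therefore, the <Revised Reply> is "{worker_answer}"'
    return base
```

Training on both variants is what lets the model either produce a rewrite or stop after the diagnosis at inference time.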
```bibtex
@article{chen2026learning,
  title={Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models},
  author={Chen, Bocheng and Zi, Han and Chen, Xi and Zhang, Xitong and Johnson, Kristen and Liu, Guangliang},
  journal={arXiv preprint arXiv:2601.03079},
  year={2026}
}
```