# Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models

Paper: arXiv:2601.03079
This model is a Supervised Fine-Tuned (SFT) version of `meta-llama/Llama-3.2-1B-Instruct`, trained to diagnose moral violations in conversational replies and rewrite them to align with Moral Foundations Theory (MFT).

It was trained on RealToxicityPrompts + MIC under the `pragmatic_mictoxicity` experimental setting, the core method proposed in the paper "Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models" (ACL 2026).
| Property | Value |
|---|---|
| Base model | meta-llama/Llama-3.2-1B-Instruct |
| Fine-tuning method | Full SFT (no LoRA; the model is only 1B) |
| Training benchmark | RealToxicityPrompts + MIC |
| Experimental setting | `pragmatic_mictoxicity` |
| Train size | 2000 examples |
| Task | Correction: diagnose and rewrite toxic replies |
| Checkpoint step | 375 |
The model reasons over six moral foundations:
| Foundation | Definition |
|---|---|
| Care | Wanting someone or something to be safe, healthy, and happy |
| Fairness | Wanting individuals or groups to be treated equally or equitably |
| Liberty | Wanting people to be free to make their own decisions |
| Loyalty | Wanting unity and seeing people keep promises to an in-group |
| Authority | Wanting to respect social roles, duties, privacy, peace, and order |
| Sanctity | Wanting people and things to be clean, pure, innocent, and holy |
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_id = "MoralMachine/moral-awareness-toxicity-llama3.2-1B"

# Left padding is required for batched generation with decoder-only models.
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, torch_dtype=torch.bfloat16)

# The model expects the exact moral-foundations prefix it was trained with.
MFT_PREFIX = (
    "There are the six moral foundations. "
    "Care: wanting someone or something to be safe, healthy, and happy. "
    "Fairness: wanting to see individuals or groups treated equally or equitably. "
    "Liberty: wanting people to be free to make their own decisions. "
    "Loyalty: wanting unity and seeing people keep promises or obligations to an in-group. "
    "Authority: wanting to respect social roles, duties, privacy, peace, and order. "
    "Sanctity: wanting people and things to be clean, pure, innocent, and holy."
)

question = "How should I deal with my neighbor who plays loud music at night?"
answer = "Just call the cops and get them arrested, they deserve it."

# Prompt format: [MFT_PREFIX] <Prompt>: "..."; <Reply>: "...". ###Diagnosis:
prompt = (
    f"{MFT_PREFIX} "
    f'<Prompt>: "{question}"; <Reply>: "{answer}". '
    "###Diagnosis: "
)

output = pipe(prompt, max_new_tokens=512, do_sample=False)[0]["generated_text"]
print(output)
```
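Since generation returns a single string in the training template, the diagnosis and the revised reply have to be split out afterwards. The helper below is a minimal sketch that assumes the model reproduces the `###Diagnosis: ... Therefore, the <Revised Reply> is "..."` template exactly; `parse_correction` and the sample string are illustrative, not part of the released code.

```python
import re


def parse_correction(generated: str) -> dict:
    """Split a generated string into the diagnosis and the revised reply.

    Returns revised_reply=None when no rewrite is present (Agree cases).
    """
    diag_match = re.search(r"###Diagnosis:\s*(.*)", generated, re.DOTALL)
    diagnosis = diag_match.group(1).strip() if diag_match else ""

    revised = None
    # Disagree cases end with: Therefore, the <Revised Reply> is "..."
    m = re.search(r'Therefore, the <Revised Reply> is "([^"]*)"', diagnosis)
    if m:
        revised = m.group(1)
        diagnosis = diagnosis[:m.start()].strip()

    return {"diagnosis": diagnosis, "revised_reply": revised}


# Hypothetical model output for the example above.
example = (
    '... ###Diagnosis: The reply violates Care. '
    'Therefore, the <Revised Reply> is "Try talking to your neighbor first."'
)
result = parse_correction(example)
```

If the model stops after the reasoning (an Agree case), `revised_reply` comes back `None` and the original reply can be kept as-is.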
Disagree (problematic) cases, where the model learns to diagnose and rewrite:

```
[MFT_PREFIX] <Prompt>: "{question}"; <Reply>: "{answer}". ###Diagnosis: {pragmatic_reasoning} Therefore, the <Revised Reply> is "{worker_answer}"
```

Agree (acceptable) cases, where the model learns to recognize the reply as acceptable and stop:

```
[MFT_PREFIX] <Prompt>: "{question}"; <Reply>: "{answer}". ###Diagnosis: {pragmatic_reasoning up to step (5)}
```
```bibtex
@article{chen2026learning,
  title={Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models},
  author={Chen, Bocheng and Zi, Han and Chen, Xi and Zhang, Xitong and Johnson, Kristen and Liu, Guangliang},
  journal={arXiv preprint arXiv:2601.03079},
  year={2026}
}
```