Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models
Paper: arXiv:2601.03079
This model is a Supervised Fine-Tuned (SFT) version of meta-llama/Llama-3.2-1B-Instruct, trained to diagnose moral violations in conversational replies and rewrite them to align with Moral Foundations Theory (MFT).
It was trained on MIC (Moral Integrity Corpus) using the pragmatic experimental setting — the core proposed method from the paper "Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models" (ACL 2026).
| Property | Value |
|---|---|
| Base model | meta-llama/Llama-3.2-1B-Instruct |
| Fine-tuning method | Full SFT (no LoRA — model is 1B) |
| Training benchmark | MIC (Moral Integrity Corpus) |
| Experimental setting | pragmatic |
| Train size | 2000 examples |
| Task | Correct — diagnose + rewrite jailbreak and morally problematic replies |
| Checkpoint step | 375 |
The model reasons over six moral foundations:
| Foundation | Definition |
|---|---|
| Care | Wanting someone or something to be safe, healthy, and happy |
| Fairness | Wanting individuals or groups to be treated equally or equitably |
| Liberty | Wanting people to be free to make their own decisions |
| Loyalty | Wanting unity and seeing people keep promises to an in-group |
| Authority | Wanting to respect social roles, duties, privacy, peace, and order |
| Sanctity | Wanting people and things to be clean, pure, innocent, and holy |
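For programmatic use, the six foundations and their one-line definitions can be kept in a small lookup table. This is a sketch; the dictionary name is illustrative and not part of the released code.

```python
# Six moral foundations from Moral Foundations Theory (MFT), as in the table above.
# The name MORAL_FOUNDATIONS is our own, not from the released code.
MORAL_FOUNDATIONS = {
    "Care": "wanting someone or something to be safe, healthy, and happy",
    "Fairness": "wanting individuals or groups to be treated equally or equitably",
    "Liberty": "wanting people to be free to make their own decisions",
    "Loyalty": "wanting unity and seeing people keep promises to an in-group",
    "Authority": "wanting to respect social roles, duties, privacy, peace, and order",
    "Sanctity": "wanting people and things to be clean, pure, innocent, and holy",
}
```

Such a table is handy for validating model output, e.g. checking that a diagnosed foundation is one of the six known names.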
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_id = "MoralMachine/moral-awareness-MIC-llama3.2-1B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, torch_dtype=torch.bfloat16)

# Verbatim prefix used during training; keep the wording unchanged.
MFT_PREFIX = (
    "There are the six moral foundations. "
    "Care: wanting someone or something to be safe, healthy, and happy. "
    "Fairness: wanting to see individuals or groups treated equally or equitably. "
    "Liberty: wanting people to be free to make their own decisions. "
    "Loyalty: wanting unity and seeing people keep promises or obligations to an in-group. "
    "Authority: wanting to respect social roles, duties, privacy, peace, and order. "
    "Sanctity: wanting people and things to be clean, pure, innocent, and holy."
)

question = "How should I deal with my neighbor who plays loud music at night?"
answer = "Just call the cops and get them arrested, they deserve it."

prompt = (
    f"{MFT_PREFIX} "
    f'<Prompt>: "{question}"; <Reply>: "{answer}". '
    "###Diagnosis: "
)

output = pipe(prompt, max_new_tokens=512, do_sample=False)[0]["generated_text"]
print(output)
```
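Because the generated continuation embeds both the diagnosis and (for problematic replies) a corrected answer, you may want to split the two. Below is a minimal parsing sketch, assuming the training format described under the Agree/Disagree cases; the helper name is our own, not part of the released code.

```python
import re

def parse_correction(generated: str) -> dict:
    """Split a generated continuation into diagnosis and revised reply.

    Assumes the training format: the diagnosis follows '###Diagnosis:',
    and a corrected reply, when present, appears as
    'Therefore, the <Revised Reply> is "..."'. For acceptable replies the
    model stops after the diagnosis, so 'revised_reply' is None.
    Sketch only; not part of the released code.
    """
    diagnosis = generated.split("###Diagnosis:", 1)[-1]
    match = re.search(r'<Revised Reply> is "(.*?)"', diagnosis, re.DOTALL)
    return {
        "diagnosis": diagnosis.strip(),
        "revised_reply": match.group(1) if match else None,
    }

example = (
    '... ###Diagnosis: The reply violates Care. '
    'Therefore, the <Revised Reply> is "Try talking to your neighbor first."'
)
print(parse_correction(example)["revised_reply"])
```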
Training sequences follow two formats, depending on whether the original reply was judged acceptable.

Disagree (problematic) cases, where the model learns to diagnose the violation and rewrite the reply:

```
[MFT_PREFIX] <Prompt>: "{question}"; <Reply>: "{answer}". ###Diagnosis: {pragmatic_reasoning} Therefore, the <Revised Reply> is "{worker_answer}"
```

Agree (acceptable) cases, where the model learns to recognize the reply as acceptable and stop:

```
[MFT_PREFIX] <Prompt>: "{question}"; <Reply>: "{answer}". ###Diagnosis: {pragmatic_reasoning up to step (5)}
```
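The two formats above can be assembled with a single helper. This is a sketch under stated assumptions: the function name and parameters are our own, and `MFT_PREFIX` is abbreviated here; use the full foundation-definition prefix from the usage example.

```python
# Abbreviated stand-in; use the full six-foundation prefix from the usage example.
MFT_PREFIX = "There are the six moral foundations. ..."

def build_target(question, answer, reasoning, worker_answer=None):
    """Assemble one SFT training sequence in the format above.

    If worker_answer is given (Disagree case), append the revised reply;
    otherwise (Agree case) the sequence ends after the reasoning.
    Sketch only; not part of the released code.
    """
    base = (
        f'{MFT_PREFIX} <Prompt>: "{question}"; <Reply>: "{answer}". '
        f'###Diagnosis: {reasoning}'
    )
    if worker_answer is not None:
        base += f' Therefore, the <Revised Reply> is "{worker_answer}"'
    return base
```

Training on both variants is what lets the model either produce a rewrite or stop after the diagnosis at inference time.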
```bibtex
@article{chen2026learning,
  title={Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models},
  author={Chen, Bocheng and Zi, Han and Chen, Xi and Zhang, Xitong and Johnson, Kristen and Liu, Guangliang},
  journal={arXiv preprint arXiv:2601.03079},
  year={2026}
}
```