Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models
Paper: arXiv:2601.03079
This model is a Supervised Fine-Tuned (SFT) version of meta-llama/Llama-3.2-1B-Instruct, trained to diagnose social bias violations in conversational replies and rewrite them to align with Moral Foundations Theory (MFT).
It was trained on the Moral Integrity Corpus (MIC) using the pragmatic experimental setting, the core method proposed in "Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models" (ACL 2026).
| Property | Value |
|---|---|
| Base model | meta-llama/Llama-3.2-1B-Instruct |
| Fine-tuning method | Full SFT (no LoRA — model is 1B) |
| Training benchmark | MIC (Moral Integrity Corpus) |
| Experimental setting | pragmatic |
| Train size | 2000 examples |
| Task | Correct — diagnose + rewrite socially biased replies |
| Checkpoint step | 375 |
The model reasons over six moral foundations:
| Foundation | Definition |
|---|---|
| Care | Wanting someone or something to be safe, healthy, and happy |
| Fairness | Wanting individuals or groups to be treated equally or equitably |
| Liberty | Wanting people to be free to make their own decisions |
| Loyalty | Wanting unity and seeing people keep promises to an in-group |
| Authority | Wanting to respect social roles, duties, privacy, peace, and order |
| Sanctity | Wanting people and things to be clean, pure, innocent, and holy |
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_id = "MoralMachine/moral-awareness-socialbias-llama3.2-1B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# The model is already loaded in bfloat16, so no dtype argument is needed here.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# The exact moral-foundations prefix used at training time; keep it verbatim.
MFT_PREFIX = (
    "There are the six moral foundations. "
    "Care: wanting someone or something to be safe, healthy, and happy. "
    "Fairness: wanting to see individuals or groups treated equally or equitably. "
    "Liberty: wanting people to be free to make their own decisions. "
    "Loyalty: wanting unity and seeing people keep promises or obligations to an in-group. "
    "Authority: wanting to respect social roles, duties, privacy, peace, and order. "
    "Sanctity: wanting people and things to be clean, pure, innocent, and holy."
)

question = "What do you think about women in leadership roles?"
answer = "Women are too emotional to be good leaders. Men are just better at making tough decisions."

# Training-format prompt: MFT prefix, the prompt/reply pair, then the diagnosis trigger.
prompt = (
    f"{MFT_PREFIX} "
    f'<Prompt>: "{question}"; <Reply>: "{answer}". '
    "###Diagnosis: "
)

output = pipe(prompt, max_new_tokens=512, do_sample=False)[0]["generated_text"]
print(output)
```
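Because the generation follows the training template, the diagnosis and any revised reply can be pulled out with simple string handling. The helpers below are an illustrative sketch, not part of the released code; they assume the model emits the `###Diagnosis:` marker and, for biased inputs, the `Therefore, the <Revised Reply> is "..."` pattern from the training format.

```python
import re
from typing import Optional


def extract_revised_reply(generated: str) -> Optional[str]:
    """Pull the rewritten reply out of the model's generation.

    Returns None when the model stopped after the diagnosis,
    i.e. it judged the original reply unbiased (Agree case).
    """
    match = re.search(r'Therefore, the <Revised Reply> is "(.*?)"', generated, re.DOTALL)
    return match.group(1) if match else None


def extract_diagnosis(generated: str) -> str:
    """Everything after the ###Diagnosis: marker, before any revised reply."""
    tail = generated.split("###Diagnosis:", 1)[-1]
    return tail.split("Therefore, the <Revised Reply>", 1)[0].strip()
```

Note that the non-greedy capture stops at the first closing quote, so a revised reply containing embedded double quotes would be truncated; for a quick demo this is usually acceptable.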
Disagree (biased) cases, where the model learns to diagnose and rewrite:

```
[MFT_PREFIX] <Prompt>: "{question}"; <Reply>: "{answer}". ###Diagnosis: {pragmatic_reasoning} Therefore, the <Revised Reply> is "{worker_answer}"
```

Agree (unbiased) cases, where the model learns to recognize the reply is acceptable and stop:

```
[MFT_PREFIX] <Prompt>: "{question}"; <Reply>: "{answer}". ###Diagnosis: {pragmatic_reasoning up to step (5)}
```
The pragmatic diagnosis itself follows a five-step reasoning chain.
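The two target formats above can be assembled with a single helper. This is a minimal sketch whose function name and signature are illustrative, not the authors' released preprocessing code; it only mirrors the field names shown in the templates.

```python
from typing import Optional


def build_target(mft_prefix: str, question: str, answer: str,
                 reasoning: str, worker_answer: Optional[str] = None) -> str:
    """Assemble a training target in the pragmatic MIC format.

    Disagree (biased) cases pass a worker_answer and get the full
    diagnose-then-rewrite target; Agree (unbiased) cases pass None,
    so the target stops after the reasoning chain.
    """
    base = (
        f'{mft_prefix} <Prompt>: "{question}"; <Reply>: "{answer}". '
        f"###Diagnosis: {reasoning}"
    )
    if worker_answer is None:
        return base  # Agree: recognize the reply is fine and stop
    return f'{base} Therefore, the <Revised Reply> is "{worker_answer}"'
```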
```bibtex
@article{chen2026learning,
  title={Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models},
  author={Chen, Bocheng and Zi, Han and Chen, Xi and Zhang, Xitong and Johnson, Kristen and Liu, Guangliang},
  journal={arXiv preprint arXiv:2601.03079},
  year={2026}
}
```