Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models
Paper: arXiv:2601.03079
This model is a Supervised Fine-Tuned (SFT) version of meta-llama/Llama-3.2-1B-Instruct, trained to diagnose social bias violations in conversational replies and rewrite them to align with Moral Foundations Theory (MFT).
It was trained on the Moral Integrity Corpus (MIC) using the pragmatic experimental setting, the core method proposed in "Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models" (ACL 2026).
| Property | Value |
|---|---|
| Base model | meta-llama/Llama-3.2-1B-Instruct |
| Fine-tuning method | Full SFT (no LoRA — model is 1B) |
| Training benchmark | MIC (Moral Integrity Corpus) |
| Experimental setting | pragmatic |
| Train size | 2000 examples |
| Task | Correct — diagnose + rewrite socially biased replies |
| Checkpoint step | 375 |
The model reasons over six moral foundations:
| Foundation | Definition |
|---|---|
| Care | Wanting someone or something to be safe, healthy, and happy |
| Fairness | Wanting individuals or groups to be treated equally or equitably |
| Liberty | Wanting people to be free to make their own decisions |
| Loyalty | Wanting unity and seeing people keep promises to an in-group |
| Authority | Wanting to respect social roles, duties, privacy, peace, and order |
| Sanctity | Wanting people and things to be clean, pure, innocent, and holy |
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_id = "MoralMachine/moral-awareness-socialbias-llama3.2-1B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# The model is already loaded in bfloat16, so no dtype argument is needed here.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# The exact moral-foundations prefix used at training time; keep it verbatim.
MFT_PREFIX = (
    "There are the six moral foundations. "
    "Care: wanting someone or something to be safe, healthy, and happy. "
    "Fairness: wanting to see individuals or groups treated equally or equitably. "
    "Liberty: wanting people to be free to make their own decisions. "
    "Loyalty: wanting unity and seeing people keep promises or obligations to an in-group. "
    "Authority: wanting to respect social roles, duties, privacy, peace, and order. "
    "Sanctity: wanting people and things to be clean, pure, innocent, and holy."
)

question = "What do you think about women in leadership roles?"
answer = "Women are too emotional to be good leaders. Men are just better at making tough decisions."

# Training-format prompt: MFT prefix, the prompt/reply pair, then the diagnosis trigger.
prompt = (
    f"{MFT_PREFIX} "
    f'<Prompt>: "{question}"; <Reply>: "{answer}". '
    "###Diagnosis: "
)

output = pipe(prompt, max_new_tokens=512, do_sample=False)[0]["generated_text"]
print(output)
```
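Because the generation follows the training template, the diagnosis and any revised reply can be pulled out with simple string handling. The helpers below are an illustrative sketch, not part of the released code; they assume the model emits the `###Diagnosis:` marker and, for biased inputs, the `Therefore, the <Revised Reply> is "..."` pattern from the training format.

```python
import re
from typing import Optional


def extract_revised_reply(generated: str) -> Optional[str]:
    """Pull the rewritten reply out of the model's generation.

    Returns None when the model stopped after the diagnosis,
    i.e. it judged the original reply unbiased (Agree case).
    """
    match = re.search(r'Therefore, the <Revised Reply> is "(.*?)"', generated, re.DOTALL)
    return match.group(1) if match else None


def extract_diagnosis(generated: str) -> str:
    """Everything after the ###Diagnosis: marker, before any revised reply."""
    tail = generated.split("###Diagnosis:", 1)[-1]
    return tail.split("Therefore, the <Revised Reply>", 1)[0].strip()
```

Note that the non-greedy capture stops at the first closing quote, so a revised reply containing embedded double quotes would be truncated; for a quick demo this is usually acceptable.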
Disagree (biased) cases, where the model learns to diagnose and rewrite:

```
[MFT_PREFIX] <Prompt>: "{question}"; <Reply>: "{answer}". ###Diagnosis: {pragmatic_reasoning} Therefore, the <Revised Reply> is "{worker_answer}"
```

Agree (unbiased) cases, where the model learns to recognize the reply is acceptable and stop:

```
[MFT_PREFIX] <Prompt>: "{question}"; <Reply>: "{answer}". ###Diagnosis: {pragmatic_reasoning up to step (5)}
```
The pragmatic diagnosis itself follows a five-step reasoning chain.
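The two target formats above can be assembled with a single helper. This is a minimal sketch whose function name and signature are illustrative, not the authors' released preprocessing code; it only mirrors the field names shown in the templates.

```python
from typing import Optional


def build_target(mft_prefix: str, question: str, answer: str,
                 reasoning: str, worker_answer: Optional[str] = None) -> str:
    """Assemble a training target in the pragmatic MIC format.

    Disagree (biased) cases pass a worker_answer and get the full
    diagnose-then-rewrite target; Agree (unbiased) cases pass None,
    so the target stops after the reasoning chain.
    """
    base = (
        f'{mft_prefix} <Prompt>: "{question}"; <Reply>: "{answer}". '
        f"###Diagnosis: {reasoning}"
    )
    if worker_answer is None:
        return base  # Agree: recognize the reply is fine and stop
    return f'{base} Therefore, the <Revised Reply> is "{worker_answer}"'
```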
```bibtex
@article{chen2026learning,
  title={Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models},
  author={Chen, Bocheng and Zi, Han and Chen, Xi and Zhang, Xitong and Johnson, Kristen and Liu, Guangliang},
  journal={arXiv preprint arXiv:2601.03079},
  year={2026}
}
```