Diagnose-and-Correct for Toxicity · Llama 3.2-1B

This model is a Supervised Fine-Tuned (SFT) version of meta-llama/Llama-3.2-1B-Instruct, trained to diagnose moral violations in conversational replies and rewrite them to align with Moral Foundations Theory (MFT).

It was trained on RealToxicityPrompts + MIC using the pragmatic_mictoxicity experimental setting — the core proposed method from the paper "Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models" (ACL 2026).


Model Details

| Property | Value |
|----------|-------|
| Base model | meta-llama/Llama-3.2-1B-Instruct |
| Fine-tuning method | Full SFT (no LoRA; the model is only 1B) |
| Training benchmark | RealToxicityPrompts + MIC |
| Experimental setting | pragmatic_mictoxicity |
| Train size | 2,000 examples |
| Task | Correct (diagnose and rewrite toxic replies) |
| Checkpoint step | 375 |

Moral Foundations Theory (MFT)

The model reasons over six moral foundations:

| Foundation | Definition |
|------------|------------|
| Care | Wanting someone or something to be safe, healthy, and happy |
| Fairness | Wanting individuals or groups to be treated equally or equitably |
| Liberty | Wanting people to be free to make their own decisions |
| Loyalty | Wanting unity and seeing people keep promises to an in-group |
| Authority | Wanting to respect social roles, duties, privacy, peace, and order |
| Sanctity | Wanting people and things to be clean, pure, innocent, and holy |

How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_id = "MoralMachine/moral-awareness-toxicity-llama3.2-1B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# The foundations preamble, used verbatim in the training prompts.
MFT_PREFIX = (
    "There are the six moral foundations. "
    "Care: wanting someone or something to be safe, healthy, and happy. "
    "Fairness: wanting to see individuals or groups treated equally or equitably. "
    "Liberty: wanting people to be free to make their own decisions. "
    "Loyalty: wanting unity and seeing people keep promises or obligations to an in-group. "
    "Authority: wanting to respect social roles, duties, privacy, peace, and order. "
    "Sanctity: wanting people and things to be clean, pure, innocent, and holy."
)

question = "How should I deal with my neighbor who plays loud music at night?"
answer = "Just call the cops and get them arrested, they deserve it."

# Completion-style prompt matching the training format (no chat template).
prompt = (
    f"{MFT_PREFIX} "
    f'<Prompt>: "{question}"; <Reply>: "{answer}". '
    "###Diagnosis: "
)

# Greedy decoding; the model appends its diagnosis and, if needed, a revised reply.
output = pipe(prompt, max_new_tokens=512, do_sample=False)[0]["generated_text"]
print(output)
```
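The completion follows the training prompt format documented in the next section, so it can be split into a diagnosis and a revised reply. The helper below is a hypothetical sketch, not part of the released code; it assumes the marker string from the training format and falls back gracefully when the model emits no rewrite (the "agree" case).

```python
# Hypothetical helper (not part of the released code): split a generated
# completion into (diagnosis, revised_reply). The marker string follows the
# documented training format; if it is absent (the "agree" case, or any
# deviation), the full diagnosis is returned with revised_reply = None.
def parse_diagnosis(generated_text, prompt):
    completion = generated_text[len(prompt):].strip()
    marker = "Therefore, the <Revised Reply> is"
    diagnosis, sep, revised = completion.partition(marker)
    if not sep:
        return completion, None
    return diagnosis.strip(), revised.strip().strip('"')

sample = (
    'The reply violates Care by wishing harm on the neighbor. '
    'Therefore, the <Revised Reply> is "Try talking to them calmly first."'
)
diagnosis, revised = parse_diagnosis(sample, "")
```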

Training Prompt Format

Disagree (problematic) cases, where the model learns to diagnose and rewrite:

```
[MFT_PREFIX] <Prompt>: "{question}"; <Reply>: "{answer}". ###Diagnosis: {pragmatic_reasoning} Therefore, the <Revised Reply> is "{worker_answer}"
```

Agree (acceptable) cases, where the model learns to recognize the reply as acceptable and stop:

```
[MFT_PREFIX] <Prompt>: "{question}"; <Reply>: "{answer}". ###Diagnosis: {pragmatic_reasoning up to step (5)}
```
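The two templates above can be assembled programmatically. The sketch below is illustrative only; the argument names are placeholders, not the authors' released data pipeline, and `mft_prefix` is the foundations preamble shown in the usage example.

```python
# Illustrative sketch (not the authors' released code) of assembling the two
# SFT target formats above. All argument names are placeholders.
def build_sft_example(mft_prefix, question, answer, reasoning, worker_answer=None):
    base = (
        f'{mft_prefix} <Prompt>: "{question}"; <Reply>: "{answer}". '
        f'###Diagnosis: {reasoning}'
    )
    if worker_answer is None:
        # Agree case: diagnosis only; the model learns to stop here.
        return base
    # Disagree case: diagnosis followed by the corrected reply.
    return f'{base} Therefore, the <Revised Reply> is "{worker_answer}"'
```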

Citation

```bibtex
@article{chen2026learning,
  title={Learning to Diagnose and Correct Moral Errors: Towards Enhancing Moral Sensitivity in Large Language Models},
  author={Chen, Bocheng and Zi, Han and Chen, Xi and Zhang, Xitong and Johnson, Kristen and Liu, Guangliang},
  journal={arXiv preprint arXiv:2601.03079},
  year={2026}
}
```