Moral Judgment β€” Fusion Β· Llama 3.2 3B

This model is a Supervised Fine-Tuned (SFT) version of meta-llama/Llama-3.2-3B (base), trained to perform six-step pragmatic moral reasoning over conversational replies and produce a moral judgment: agree (morally acceptable), disagree (morally problematic), or neutral.

It is the best-performing model from the paper "Moral Pragmatics in Language Models" (TACL 2025), trained under the fusion experimental setting on 23,500 examples from the Moral Integrity Corpus (MIC).


Model Details

| Property | Value |
|---|---|
| Base model | meta-llama/Llama-3.2-3B (base, not instruct) |
| Fine-tuning method | Full SFT (no LoRA) |
| Training benchmark | MIC (Moral Integrity Corpus) |
| Experimental setting | fusion |
| Train size | 23,500 examples |
| Task | Judgment β€” classify reply as agree / disagree / neutral |
| Epochs | 5 |
| Checkpoint step | 4,900 |

Moral Foundations Theory (MFT)

The model reasons over six moral foundations, which are provided as a prefix to every input:

| Foundation | Definition |
|---|---|
| Care | Wanting someone or something to be safe, healthy, and happy |
| Fairness | Wanting individuals or groups to be treated equally or equitably |
| Liberty | Wanting people to be free to make their own decisions |
| Loyalty | Wanting unity and seeing people keep promises to an in-group |
| Authority | Wanting to respect social roles, duties, privacy, peace, and order |
| Sanctity | Wanting people and things to be clean, pure, innocent, and holy |
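For programmatic use, the six foundations can be kept in a dictionary and joined back into the prose prefix used in the prompts below. This is a convenience sketch, not part of the released code; the wordings mirror the `MFT_PREFIX` string in the usage example.

```python
# Convenience mapping of the six MFT foundations (illustrative, not shipped
# code; definitions copied from the model's training prefix).
MFT_FOUNDATIONS = {
    "care": "wanting someone or something to be safe, healthy, and happy",
    "fairness": "wanting to see individuals or groups treated equally or equitably",
    "liberty": "wanting people to be free to make their own decisions",
    "loyalty": "wanting unity and seeing people keep promises or obligations to an in-group",
    "authority": "wanting to respect social roles, duties, privacy, peace, and order",
    "sanctity": "wanting people and things to be clean, pure, innocent, and holy",
}

# Reassemble the prose prefix that is prepended to every model input.
mft_prefix = "There are the six moral foundations. " + " ".join(
    f"{name.capitalize()}: {definition}."
    for name, definition in MFT_FOUNDATIONS.items()
)
```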

Judgment Labels

| Label | Meaning |
|---|---|
| agree | The reply is morally acceptable β€” its actions align with moral foundations |
| disagree | The reply is morally problematic β€” its actions violate moral foundations |
| neutral | The reply is morally neutral β€” no clear moral valence |
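When post-processing generations, it helps to guard against stray casing, whitespace, or punctuation around the label. A minimal sketch (the `neutral` fallback for unparseable output is our assumption, not specified by the paper):

```python
VALID_LABELS = {"agree", "disagree", "neutral"}

def normalize_judgment(raw: str) -> str:
    """Lowercase, strip whitespace and a trailing period, and validate the label.

    Falls back to "neutral" for unparseable output (an illustrative choice,
    not something the paper prescribes).
    """
    token = raw.strip().lower().rstrip(".").strip()
    return token if token in VALID_LABELS else "neutral"

print(normalize_judgment(" Disagree. "))  # disagree
```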

How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_id = "MoralMachine/moral-judgment-fusion-llama3.2-3B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

# The model is already loaded in bfloat16, so no dtype needs to be repeated here.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Exact foundation definitions prepended to every input during training.
MFT_PREFIX = (
    "There are the six moral foundations. "
    "Care: wanting someone or something to be safe, healthy, and happy. "
    "Fairness: wanting to see individuals or groups treated equally or equitably. "
    "Liberty: wanting people to be free to make their own decisions. "
    "Loyalty: wanting unity and seeing people keep promises or obligations to an in-group. "
    "Authority: wanting to respect social roles, duties, privacy, peace, and order. "
    "Sanctity: wanting people and things to be clean, pure, innocent, and holy."
)

question = "Is it okay to ghost someone you've been dating for a month?"
reply = "Absolutely, just stop replying β€” they'll get the hint."
rot = "It is wrong to ignore people without explanation."  # optional

prompt = (
    f"{MFT_PREFIX} "
    f'There is a conversation "Prompt: {question} Reply: {reply}" '
    f'There is a Rule-of-Thumb (RoT): "{rot}". '
    "###Inference: "
)

output = pipe(prompt, max_new_tokens=512, do_sample=False)[0]["generated_text"]

# Extract the judgment
generated = output[len(prompt):]
judgment = generated.split("###Judgment:")[-1].strip().rstrip(".")
# judgment ∈ {"agree", "disagree", "neutral"}
print("Judgment:", judgment)
print("Reasoning:", generated)
```

Six-Step Reasoning Chain

Before producing a judgment, the model generates a structured reasoning chain:

  1. Actions β€” What actions does the reply describe or imply?
  2. Consequences β€” What are the potential consequences of those actions?
  3. Moral Foundations β€” Which MFT foundations do those actions engage?
  4. Regulation β€” Do the actions up-regulate or down-regulate those foundations?
  5. Sentiment β€” What is the reply's sentiment toward those consequences?
  6. Judgment β€” Final moral verdict: agree / disagree / neutral
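The numbered steps can be recovered from a generation by splitting on the `(n)` markers. The following `parse_chain` helper is a hypothetical sketch, not released code:

```python
import re

def parse_chain(generated: str) -> list[str]:
    """Split a generated reasoning chain into its numbered steps.

    Text after "###Judgment:" is dropped; markers like "(1)" delimit the
    pieces. Illustrative only; the paper does not ship a parser.
    """
    body = generated.split("###Judgment:")[0]
    steps = re.split(r"\(\d\)\s*", body)
    return [s.strip() for s in steps if s.strip()]

sample = "(1) The Actions are lying. (2) The consequence is lost trust. ###Judgment: disagree."
print(parse_chain(sample))
# ['The Actions are lying.', 'The consequence is lost trust.']
```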

Example output

Input prompt:

```
[MFT_PREFIX] There is a conversation "Prompt: Is it okay to lie to protect
someone's feelings? Reply: Sometimes a little white lie is totally fine if
it avoids hurting someone unnecessarily." There is a Rule-of-Thumb (RoT):
"It is wrong to deceive people even for good reasons.". ###Inference:
```

Model output:

```
(1) The Actions mentioned in the Reply are telling a little white lie to
avoid hurting someone unnecessarily.
(2) The potential consequence ... the lie could protect the person's
feelings, but could also damage trust later.
(3) The underlying moral foundations are care, fairness, loyalty ...
(4) The Actions up-regulate care by prioritizing emotional safety ...
They down-regulate fairness and loyalty by using deception ...
(5) The sentiment of the Reply is positive, framing the action as
"totally fine" and protective.
(6) The Reply endorses an action with negative moral consequences ...
###Judgment: disagree.
```

Training Prompt Format

Each training example is formatted as a single text sequence:

```
[MFT_PREFIX] There is a conversation "Prompt: {question} Reply: {answer}"
There is a Rule-of-Thumb (RoT): "{rot}".
###Inference: {six_step_reasoning_chain}
###Judgment: {agree|disagree|neutral}.
```
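A formatter for this sequence might look like the sketch below. The function name and signature are ours, and the exact whitespace between segments is an assumption; the wordings match the template above.

```python
def format_training_example(question, reply, rot, chain, label, mft_prefix):
    """Assemble one training sequence in the card's template (illustrative)."""
    return (
        f"{mft_prefix} "
        f'There is a conversation "Prompt: {question} Reply: {reply}" '
        f'There is a Rule-of-Thumb (RoT): "{rot}". '
        f"###Inference: {chain} "
        f"###Judgment: {label}."
    )

example = format_training_example(
    "Is it okay to lie?", "Sometimes.", "Lying is wrong.",
    "(1) ... (6) ...", "disagree", "[MFT_PREFIX]",
)
```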

The fusion inference chain (`{six_step_reasoning_chain}`) was generated by an external LLM that combines MFT-grounded and Judgment-specific reasoning signals; this fusion is the key difference from the `ours` setting.

At test time, the model receives only the prefix up to ###Inference: and generates the full chain autoregressively. The Rule-of-Thumb is optional; the model produces valid reasoning without it.
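Building the test-time prefix, with the RoT sentence dropped when no rule is supplied, can be sketched as follows. The helper name and the exact handling of a missing RoT are our assumptions:

```python
def build_inference_prompt(question, reply, mft_prefix, rot=None):
    """Return the prompt up to "###Inference: ".

    Omits the Rule-of-Thumb sentence entirely when rot is None (the RoT is
    optional at test time).
    """
    prompt = (
        f"{mft_prefix} "
        f'There is a conversation "Prompt: {question} Reply: {reply}" '
    )
    if rot is not None:
        prompt += f'There is a Rule-of-Thumb (RoT): "{rot}". '
    return prompt + "###Inference: "

p = build_inference_prompt("Is it okay to ghost someone?", "Sure.", "[MFT_PREFIX]")
```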


Experimental Settings

| Setting | Description |
|---|---|
| baseline0 | No MFT context, direct judgment only |
| baseline+ | MFT foundation names listed, no reasoning chain |
| ours | Full MFT prefix + LLM-generated Judgment inference chain |
| fusion β˜… | Best model. Full MFT prefix + fused MFT+Judgment chain |
| COT | Chain-of-Thought without MFT grounding |

Citation

```bibtex
@article{moral-pragmatics-tacl-2025,
  title   = {Moral Pragmatics in Language Models},
  journal = {Transactions of the Association for Computational Linguistics (TACL)},
  year    = {2025}
}
```