# ReframeBot-DPO-Llama3.1-8B

A LoRA adapter for meta-llama/Meta-Llama-3.1-8B-Instruct, further aligned with Direct Preference Optimization (DPO) on top of the SFT checkpoint (ReframeBot-SFT-Llama3.1-8B).

DPO training steered the model towards empathetic, open-ended Socratic responses and away from direct advice, dismissiveness, or unsafe content. This is the production adapter used in the ReframeBot system.

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch

base = "meta-llama/Meta-Llama-3.1-8B-Instruct"
adapter = "Nhatminh1234/ReframeBot-DPO-Llama3.1-8B"

# 4-bit quantization so the 8B base model fits in consumer-GPU VRAM
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)
model.eval()

# Example generation via the Llama 3.1 chat template (sampling settings are illustrative)
messages = [{"role": "user", "content": "I feel like I always mess things up."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
with torch.no_grad():
    out = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```

## Training Details

| Hyperparameter | Value |
|---|---|
| Starting checkpoint | ReframeBot-SFT-Llama3.1-8B |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Learning rate | 5e-6 |
| Optimizer | paged_adamw_8bit |
| Effective batch size | 48 (per-device batch 2 × gradient accumulation 24) |
| Epochs | 3 |
| Beta (KL penalty) | 0.1 |
| Max sequence length | 512 |
| Quantization | 4-bit NF4, bfloat16 compute |
| Hardware | NVIDIA RTX 5070 Laptop GPU (8 GB VRAM) |
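The beta value above is the β in the standard DPO objective, which penalises divergence from the frozen SFT reference policy π_ref. For a preference pair (x, y_w, y_l) with chosen response y_w and rejected response y_l:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)
$$

Larger β keeps the policy closer to the reference; 0.1 is the default from the original DPO paper.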

Dataset: 1,400 preference pairs `{prompt, chosen, rejected}` generated with GPT-4. Chosen responses demonstrate empathy and open-ended questioning; rejected responses contain direct advice, dismissiveness, or unsafe content.
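Each chosen/rejected pair enters a logistic loss on the difference of β-scaled policy-vs-reference log-ratios. A minimal pure-Python sketch with β = 0.1; the log-probabilities below are made-up numbers for illustration, not values from training:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one {prompt, chosen, rejected} pair.

    Each argument is the summed token log-probability of the response under
    the policy (logp_*) or the frozen reference model (ref_logp_*).
    """
    # Implicit rewards: beta-scaled log-ratios against the reference model
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    # -log sigmoid(margin): small when the chosen response outscores the rejected one
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Hypothetical log-probs: the policy already prefers the chosen response
loss = dpo_loss(logp_chosen=-40.0, logp_rejected=-55.0,
                ref_logp_chosen=-45.0, ref_logp_rejected=-50.0)
print(round(loss, 4))  # → 0.3133
```

Widening the margin between chosen and rejected drives the loss toward zero, which is exactly the preference the training steers toward.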

## Evaluation

| Metric | Score |
|---|---|
| BERTScore Relevance (F1) | 0.832 |
| BERTScore Faithfulness (F1) | 0.849 |
| Response Consistency | 0.732 |
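BERTScore F1 is the harmonic mean of greedy-matched precision and recall over pairwise token-embedding similarities. A self-contained sketch of that computation, using a toy similarity matrix in place of real cosine similarities between BERT embeddings (and omitting IDF weighting):

```python
def bertscore_f1(sim):
    """F1 from sim[i][j] = cosine similarity between candidate token i
    and reference token j (greedy matching, no IDF weights)."""
    # Precision: each candidate token matched to its most similar reference token
    precision = sum(max(row) for row in sim) / len(sim)
    # Recall: each reference token matched to its most similar candidate token
    n_ref = len(sim[0])
    recall = sum(max(sim[i][j] for i in range(len(sim))) for j in range(n_ref)) / n_ref
    return 2 * precision * recall / (precision + recall)

# Toy matrix: 2 candidate tokens x 3 reference tokens
sim = [[0.9, 0.2, 0.1],
       [0.3, 0.8, 0.4]]
print(round(bertscore_f1(sim), 3))  # → 0.768
```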

## Intended Use

Designed as a component of the ReframeBot system, not as a standalone mental-health tool. It must not be used for clinical intervention or crisis support without human oversight.

## Project

GitHub: ReframeBot
