Finance Llama 3.1 8B - TRUE DPO Trained
This model was trained using genuine Direct Preference Optimization (DPO) on financial question-answering data. This is TRUE DPO training, not supervised fine-tuning.
What Makes This Special
TRUE Preference Optimization
- Trained with BOTH chosen AND rejected responses
- DPO loss function for preference learning (see the loss sketch after this list)
- Learned to prefer high-quality over low-quality responses
- NOT just supervised fine-tuning on good examples
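For reference, this is the standard DPO objective (from the original DPO paper), where π_θ is the policy being trained, π_ref is the frozen reference model, β is the preference-strength coefficient, and (x, y_w, y_l) are the prompt, chosen response, and rejected response:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\;
      \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]
```

Minimizing this loss pushes the policy to assign relatively higher likelihood to the chosen response than to the rejected one, anchored to the reference model.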
Training Details
- Method: Direct Preference Optimization (DPO)
- Base Model: gandhiraketla277/finance-llama-3.1-8b-merged
- Dataset: gandhiraketla277/finance-dpo-dataset
- Training Framework: TRL 0.21.0 (latest)
- Training Date: 2025-08-24
- Epochs: 1 (optimal for DPO)
- PEFT: LoRA adapters for efficient training
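The dataset schema isn't reproduced here, but TRL's DPOTrainer consumes preference pairs with prompt, chosen, and rejected columns. A minimal illustrative record (contents are hypothetical, not taken from gandhiraketla277/finance-dpo-dataset) looks like:

```python
# Hypothetical example of one preference pair in the column format DPOTrainer expects.
# The actual rows in gandhiraketla277/finance-dpo-dataset may be worded differently.
example = {
    "prompt": "What is dollar-cost averaging?",
    "chosen": "Dollar-cost averaging means investing a fixed amount at regular intervals, "
              "which spreads purchases across market conditions and reduces timing risk.",
    "rejected": "Just buy whenever you feel like it; timing does not matter at all.",
}
```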
Training Process
- Loaded the base finance model with its existing domain knowledge
- Applied LoRA adapters for parameter-efficient training
- Used DPOTrainer with preference pairs (chosen vs. rejected); a minimal sketch follows this list
- Trained for 1 epoch with the preference-optimization loss
- The model learned to discriminate between good and bad responses
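The actual training script isn't included in this repository; the following is a minimal sketch of the setup described above, assuming TRL 0.21-style APIs. The hyperparameters (beta, learning rate, LoRA rank/alpha, batch sizes) are illustrative guesses, not the values actually used:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "gandhiraketla277/finance-llama-3.1-8b-merged"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Preference pairs: each row has "prompt", "chosen", "rejected"
dataset = load_dataset("gandhiraketla277/finance-dpo-dataset", split="train")

# LoRA adapters so only a small set of weights is updated (rank/alpha are illustrative)
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# DPO hyperparameters (illustrative). beta controls how strongly the policy is
# pulled toward the chosen responses relative to the reference model.
training_args = DPOConfig(
    output_dir="finance-llama-3.1-8b-dpo",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    beta=0.1,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,              # with a PEFT config, TRL uses the frozen base as the implicit reference
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
trainer.save_model()
```

With peft_config set, only the LoRA adapter weights are trained and saved, which matches what this repository hosts.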
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# Load the base model in 4-bit to keep memory usage low
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "gandhiraketla277/finance-llama-3.1-8b-merged",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Load the TRUE DPO trained LoRA adapters on top of the base model
model = PeftModel.from_pretrained(base_model, "gandhiraketla277/finance-llama-3.1-8b-dpo-trained")
tokenizer = AutoTokenizer.from_pretrained("gandhiraketla277/finance-llama-3.1-8b-merged")

# Generate a response
prompt = "What is the best investment strategy for beginners?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
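If you prefer a standalone checkpoint without the PEFT dependency at inference time, the adapters can be merged into a full-precision copy of the base model. This is a sketch, assuming enough memory to hold the model in bfloat16; the output directory name is arbitrary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model without quantization so the LoRA weights can be merged cleanly
base = AutoModelForCausalLM.from_pretrained(
    "gandhiraketla277/finance-llama-3.1-8b-merged",
    torch_dtype=torch.bfloat16,
)
merged = PeftModel.from_pretrained(base, "gandhiraketla277/finance-llama-3.1-8b-dpo-trained")
merged = merged.merge_and_unload()  # folds the adapter weights into the base weights

merged.save_pretrained("finance-llama-3.1-8b-dpo-merged")  # output path is arbitrary
tokenizer = AutoTokenizer.from_pretrained("gandhiraketla277/finance-llama-3.1-8b-merged")
tokenizer.save_pretrained("finance-llama-3.1-8b-dpo-merged")
```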
Expected Improvements
Thanks to TRUE DPO training, this model should provide:
- Better financial advice quality
- More structured and helpful responses
- Improved preference for accurate information
- Enhanced discrimination between good and bad advice
Important Notes
- Educational purposes only - not financial advice
- Consult professionals for investment decisions
- Trained on preference data to improve response quality
- Uses TRUE DPO methodology for preference optimization
Related Resources
- Base Model: gandhiraketla277/finance-llama-3.1-8b-merged
- Training Dataset: gandhiraketla277/finance-dpo-dataset
- Training Method: Direct Preference Optimization (DPO)
This model represents TRUE DPO training with preference optimization, not supervised fine-tuning!