Finance LLaMA 3.1 8B - TRUE DPO Trained 🎯

This model was trained using genuine Direct Preference Optimization (DPO) on financial question-answering data. This is TRUE DPO training, not supervised fine-tuning.

🔬 What Makes This Special

TRUE Preference Optimization

  • ✅ Trained with BOTH chosen AND rejected responses
  • ✅ DPO loss function for preference learning
  • ✅ Learned to prefer high-quality over low-quality responses
  • ✅ NOT just supervised fine-tuning on good examples
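
At its core, the DPO loss rewards the policy for increasing the log-probability margin of the chosen response over the rejected one, relative to a frozen reference model. A minimal sketch of that objective for a single preference pair (the log-probability values are hypothetical, purely for illustration):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares policy vs. reference log-ratios."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Toy log-probabilities (hypothetical): the policy already prefers the
# chosen response, so the loss falls below log(2) ≈ 0.693.
loss = dpo_loss(-10.0, -14.0, -11.0, -12.0)
print(round(loss, 4))  # about 0.5544
```

Minimizing this loss pushes the model toward the chosen responses without ever training on the rejected ones as targets, which is what distinguishes DPO from plain supervised fine-tuning.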

📊 Training Details

  • Method: Direct Preference Optimization (DPO)
  • Base Model: gandhiraketla277/finance-llama-3.1-8b-merged
  • Dataset: gandhiraketla277/finance-dpo-dataset
  • Training Framework: TRL 0.21.0
  • Training Date: 2025-08-24
  • Epochs: 1 (a single epoch is common for DPO to limit overfitting)
  • PEFT: LoRA adapters for efficient training
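
A DPO dataset like the one above consists of preference pairs. One record typically follows the prompt/chosen/rejected layout that TRL's DPOTrainer expects; the text below is a hypothetical illustration, not an actual record from the dataset:

```python
# One hypothetical preference pair in the prompt/chosen/rejected format;
# the wording is illustrative, not taken from finance-dpo-dataset.
pair = {
    "prompt": "What is dollar-cost averaging?",
    "chosen": (
        "Dollar-cost averaging means investing a fixed amount at regular "
        "intervals regardless of price, which smooths out your average "
        "purchase price over time."
    ),
    "rejected": "It means buying stocks whenever you have spare cash.",
}

# Every record needs all three fields for preference optimization.
assert set(pair) == {"prompt", "chosen", "rejected"}
```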

🎯 Training Process

  1. Loaded base finance model with existing knowledge
  2. Applied LoRA adapters for efficient training
  3. Used DPOTrainer with preference pairs (chosen vs rejected)
  4. Trained for 1 epoch using preference optimization loss
  5. Model learned to discriminate between good and bad responses
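
The steps above can be sketched roughly as follows with TRL 0.21 and PEFT. This is a configuration sketch, not the actual training script: the LoRA rank, beta, and output path are illustrative assumptions.

```python
# Sketch of the DPO training setup described above (assumes TRL 0.21, peft,
# datasets); hyperparameters and paths are illustrative, not the author's.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("gandhiraketla277/finance-llama-3.1-8b-merged")
tokenizer = AutoTokenizer.from_pretrained("gandhiraketla277/finance-llama-3.1-8b-merged")
dataset = load_dataset("gandhiraketla277/finance-dpo-dataset", split="train")

# LoRA adapters instead of full fine-tuning (illustrative rank/alpha)
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

args = DPOConfig(
    output_dir="finance-llama-dpo",
    num_train_epochs=1,   # one epoch, as in the training details above
    beta=0.1,             # illustrative DPO temperature
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,        # prompt / chosen / rejected columns
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```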

πŸš€ Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

# 4-bit quantization config for memory-efficient loading
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load the base finance model
base_model = AutoModelForCausalLM.from_pretrained(
    "gandhiraketla277/finance-llama-3.1-8b-merged",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Load the DPO-trained LoRA adapters on top of the base model
model = PeftModel.from_pretrained(base_model, "gandhiraketla277/finance-llama-3.1-8b-dpo-trained")
tokenizer = AutoTokenizer.from_pretrained("gandhiraketla277/finance-llama-3.1-8b-merged")

# Generate a response (do_sample=True so temperature takes effect)
prompt = "What is the best investment strategy for beginners?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

✨ Expected Improvements

Thanks to TRUE DPO training, this model should provide:

  • Better financial advice quality
  • More structured and helpful responses
  • Improved preference for accurate information
  • Enhanced discrimination between good and bad advice

⚠️ Important Notes

  • Educational purposes only - not financial advice
  • Consult professionals for investment decisions
  • Trained on preference data to improve response quality
  • Uses TRUE DPO methodology for preference optimization

This model represents TRUE DPO training with preference optimization, not supervised fine-tuning! 🎯
