# MiniCPM-1B-sft-bf16 - KTO

## Model Description

This model is a merged standalone model fine-tuned from openbmb/MiniCPM-1B-sft-bf16 using the KTO training method.

KTO (Kahneman-Tversky Optimization) is a binary preference optimization method based on Prospect Theory.

This model was developed as part of thesis research on LLM alignment using preference optimization methods.
## Model Details
| Property | Value |
|---|---|
| Base Model | openbmb/MiniCPM-1B-sft-bf16 |
| Training Method | KTO |
| Model Type | Merged Standalone Model |
| Training Date | December 2025 |
| Framework | PyTorch + Transformers + PEFT |
## Benchmark Results

Benchmark evaluation is pending or encountered errors during execution; results will be added when available.

## Comparative Analysis

The following chart compares this method against other training approaches on the same base model:
## Training Configuration
| Parameter | Value |
|---|---|
| Epochs | 1 |
| Batch Size | 2 |
| Gradient Accumulation | 8 |
| Effective Batch Size | 16 |
| Learning Rate | 2e-4 |
| Max Sequence Length | 512 |
| LoRA Rank | 16 |
| LoRA Alpha | 32 |
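The hyperparameters above would map onto a TRL + PEFT setup roughly as follows. This is a hedged sketch, not the exact training script used; field names follow the `trl` and `peft` libraries, and `output_dir` is an illustrative placeholder.

```python
from peft import LoraConfig
from trl import KTOConfig

# LoRA adapter settings from the table above
peft_config = LoraConfig(
    r=16,             # LoRA rank
    lora_alpha=32,    # LoRA alpha
    task_type="CAUSAL_LM",
)

# KTO training settings; effective batch size = 2 * 8 = 16
training_args = KTOConfig(
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    max_length=512,
    output_dir="kto_output",  # placeholder path
)
```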
### Combined Preference Dataset (kto_combined)

Training uses a combined preference dataset built via round-robin sampling from three sources:

| Source | Total Samples | Interactions |
|---|---|---|
| Anthropic HH-RLHF | 321,600 | 61,568 |
| Stanford Human Preferences (SHP) | 697,436 | 38,984 |
| OpenAssistant Conversations v1 | 16,810 | 8,904 |
| Total | 1,035,846 | 109,456 |

Evidence: training log lines 106-121.

Actual training statistics (subset split train_prefs[:32090]):

- Training samples: 13,300 (paired examples)
- Validation samples: 700 (5%)
- Round-robin distribution: 1,130 interactions per source
- Seed: 42 (for reproducibility)
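The round-robin interleaving described above can be sketched in plain Python. This is an illustrative toy version, not the actual data pipeline; the `round_robin_sample` helper and the stand-in source lists are hypothetical.

```python
import random

def round_robin_sample(sources, per_source, seed=42):
    """Draw an equal number of examples from each source and interleave
    them one at a time, so no single dataset dominates any stretch."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    picks = [rng.sample(items, per_source) for items in sources]
    # All picks have the same length, so a plain zip interleaves cleanly
    return [ex for group in zip(*picks) for ex in group]

# Toy stand-ins for the real sources (HH-RLHF, SHP, OASST1)
hh = [f"hh_{i}" for i in range(10)]
shp = [f"shp_{i}" for i in range(10)]
oasst = [f"oasst_{i}" for i in range(10)]

mixed = round_robin_sample([hh, shp, oasst], per_source=3)
print(len(mixed))  # 9 examples, cycling hh -> shp -> oasst
```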
## Usage

### Direct Loading (Merged Model)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Nishef/MiniCPM-1B-sft-bf16-Full_KTO_20251225_185339-merged"

# MiniCPM ships custom modeling code, so trust_remote_code is required
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)

# Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Methodology

### KTO

KTO (Kahneman-Tversky Optimization) is a binary preference optimization method based on Prospect Theory.

Key features:

- Binary feedback signals (thumbs up / thumbs down)
- No need for paired preference data
- Reference model for KL-divergence regularization
- Prospect Theory-inspired loss function
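As a rough illustration of the loss described above, here is a simplified per-example sketch (not the training code used for this model). The `beta`, `lambda_d`, `lambda_u`, and fixed `z_ref` arguments are illustrative stand-ins for the KTO paper's hyperparameters, where `z_ref` is estimated from the KL divergence between policy and reference model.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def kto_loss(log_ratio: float, desirable: bool,
             z_ref: float = 0.0, beta: float = 0.1,
             lambda_d: float = 1.0, lambda_u: float = 1.0) -> float:
    """Per-example KTO loss sketch.

    log_ratio: log pi_theta(y|x) - log pi_ref(y|x)
    z_ref:     reference point (a KL(pi_theta || pi_ref) estimate in the paper)

    Desirable outputs are rewarded for rising above the reference point,
    undesirable ones for falling below it; the sigmoid saturates the value,
    mirroring Prospect Theory's diminishing sensitivity to gains and losses.
    """
    if desirable:
        value = lambda_d * sigmoid(beta * (log_ratio - z_ref))
        return lambda_d - value
    value = lambda_u * sigmoid(beta * (z_ref - log_ratio))
    return lambda_u - value

# A desirable output the policy made more likely incurs lower loss
print(kto_loss(2.0, True) < kto_loss(-2.0, True))    # True
# An undesirable output the policy made less likely incurs lower loss
print(kto_loss(-2.0, False) < kto_loss(2.0, False))  # True
```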
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{minicpm_1b_sft_bf16_kto_2025,
  title     = {MiniCPM-1B-sft-bf16 Fine-tuned with KTO},
  author    = {Thesis Research},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Nishef/MiniCPM-1B-sft-bf16-Full_KTO_20251225_185339}
}
```
## Repository Structure

```text
.
├── adapter_config.json        # LoRA configuration
├── adapter_model.safetensors  # Model weights
├── tokenizer files            # Tokenizer configuration
├── eval_summary.csv           # Evaluation results
├── thesis_plots/              # Visualization assets
│   ├── benchmark_results.png
│   └── training_loss.png
└── README.md                  # This file
```
## Acknowledgments
- Base Model: openbmb/MiniCPM-1B-sft-bf16
- Training Framework: Hugging Face Transformers
- Fine-tuning Library: PEFT
## License

This model is released under the Apache 2.0 license.