---
license: apache-2.0
tags:
- trl
- ppo
- lora
- alignment
- reward-modeling
- ultrafeedback
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
---
# Aligned TinyLlama on UltraFeedback (fixed-1k prompt pool)

This model was aligned with **TRL PPO** using a reward model:

- **payelb/UltraFeedback_openbmb_deberta_1k_fixed_WoN** (tag: `won`)
Key settings:

- Prompt pool: restricted to the same fixed/selected 1k subset used for RM training (loaded from CSV)
- PPO updates: 200
- Batch size: 4
- Learning rate: 1e-05
- LoRA: r=16, alpha=32, dropout=0.05
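The settings above can be sketched as a PEFT `LoraConfig` plus a TRL `PPOConfig`. This is an illustrative reconstruction, not the exact training script: it assumes the classic `PPOTrainer` API (TRL < 0.8), and the exact `PPOConfig` keyword arguments vary by TRL version.

```python
from peft import LoraConfig
from trl import PPOConfig

# LoRA adapter settings listed on the card
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# PPO settings listed on the card (classic TRL PPOConfig;
# the 200 PPO updates correspond to the training-loop step count)
ppo_config = PPOConfig(
    model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    learning_rate=1e-5,
    batch_size=4,
    steps=200,
)
```

These configs would then be passed to a `PPOTrainer` together with the base model, tokenizer, and the fixed 1k prompt pool.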