---
license: apache-2.0
tags:
- trl
- ppo
- lora
- alignment
- reward-modeling
- ultrafeedback
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
---
# Aligned TinyLlama on UltraFeedback (fixed-1k prompt pool)

This model was aligned with **TRL PPO** using a reward model:

- **payelb/UltraFeedback_openbmb_deberta_1k_fixed_WoN** (tag: `won`)
Key settings:

- Prompt pool: restricted to the same fixed/selected 1k subset used for RM training (loaded from CSV)
- PPO updates: 200
- Batch size: 4
- Learning rate: 1e-05
- LoRA: r=16, alpha=32, dropout=0.05
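The settings above can be sketched as a PEFT `LoraConfig` plus a TRL `PPOConfig`. This is an illustrative reconstruction, not the exact training script: it assumes the classic `PPOTrainer` API (TRL < 0.8), and the exact `PPOConfig` keyword arguments vary by TRL version.

```python
from peft import LoraConfig
from trl import PPOConfig

# LoRA adapter settings listed on the card
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# PPO settings listed on the card (classic TRL PPOConfig;
# the 200 PPO updates correspond to the training-loop step count)
ppo_config = PPOConfig(
    model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    learning_rate=1e-5,
    batch_size=4,
    steps=200,
)
```

These configs would then be passed to a `PPOTrainer` together with the base model, tokenizer, and the fixed 1k prompt pool.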