Preference datasets
- trl-lib/hh-rlhf-helpful-base • Viewer • Updated Jan 8, 2025 • 46.2k • 1.07k • 3
- trl-lib/lm-human-preferences-descriptiveness • Viewer • Updated Jan 8, 2025 • 6.26k • 44 • 1
- trl-lib/lm-human-preferences-sentiment • Viewer • Updated Jan 8, 2025 • 6.26k • 16
- trl-lib/rlaif-v • Viewer • Updated Jan 8, 2025 • 83.1k • 409 • 3
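A preference dataset pairs each prompt with a preferred ("chosen") and a dispreferred ("rejected") completion. A minimal sketch of one record in TRL's standard (non-conversational) preference format; the field names follow the TRL dataset-format docs, and the example text is invented:

```python
# One preference record: a prompt plus a chosen and a rejected completion.
# Trainers such as DPO consume the (chosen, rejected) pair for each prompt.
preference_example = {
    "prompt": "What color is the sky?",
    "chosen": "The sky appears blue because air scatters short wavelengths of sunlight.",
    "rejected": "Green.",
}
```

In the conversational variant, "chosen" and "rejected" instead hold lists of chat messages rather than plain strings.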
Prompt-completion datasets
- trl-lib/tldr • Viewer • Updated Jan 8, 2025 • 130k • 2.82k • 30
- trl-lib/OpenMathReasoning • Viewer • Updated Apr 26, 2025 • 3.2M • 294
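A prompt-completion dataset holds one prompt and one target completion per record, as used for supervised fine-tuning. A minimal sketch (field names per the TRL dataset-format docs; the example text is invented):

```python
# One prompt-completion record: a single prompt and its target completion.
prompt_completion_example = {
    "prompt": "Summarize: The cat sat on the mat all afternoon.\nTL;DR:",
    "completion": " A cat lounged on a mat for the whole afternoon.",
}
```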
Unpaired preference datasets
- trl-lib/ultrafeedback-gpt-3.5-turbo-helpfulness • Viewer • Updated Jan 8, 2025 • 16.6k • 40 • 4
- trl-lib/kto-mix-14k • Viewer • Updated Mar 25, 2024 • 15k • 90 • 9
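Unlike paired preference data, an unpaired preference dataset attaches a standalone desirability label to each single completion. A minimal sketch of one record (field names per the TRL dataset-format docs; the example text is invented):

```python
# One unpaired preference record: a single completion with a boolean label
# marking it desirable or undesirable (the format consumed by, e.g., KTO).
unpaired_example = {
    "prompt": "What color is the sky?",
    "completion": "The sky is blue.",
    "label": True,
}
```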
Online-DPO
- trl-lib/pythia-1b-deduped-tldr-online-dpo • 1B • Updated Aug 2, 2024 • 7
- trl-lib/pythia-1b-deduped-tldr-sft • 1B • Updated Aug 2, 2024 • 162
- trl-lib/pythia-6.9b-deduped-tldr-online-dpo • 7B • Updated Aug 2, 2024
- trl-lib/pythia-2.8b-deduped-tldr-sft • Updated Aug 2, 2024 • 5
Stepwise supervision datasets
- trl-lib/math_shepherd • Viewer • Updated Jan 8, 2025 • 445k • 4.57k • 12
- trl-lib/prm800k • Viewer • Updated Jan 8, 2025 • 41.2k • 441 • 3
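Stepwise supervision datasets label each intermediate reasoning step rather than only the final answer, which is how process reward models are trained. A minimal sketch of one record (field names per the TRL dataset-format docs; the example steps are invented, and both are deliberately labeled incorrect):

```python
# One stepwise supervision record: a prompt, a list of reasoning steps,
# and one boolean correctness label per step.
stepwise_example = {
    "prompt": "Which number is larger, 9.8 or 9.11?",
    "completions": [
        "9.11 has more digits after the decimal point, so it is bigger.",
        "Therefore 9.11 is larger than 9.8.",
    ],
    "labels": [False, False],  # each step judged individually
}
```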
Prompt-only datasets
- trl-lib/ultrafeedback-prompt • Viewer • Updated Jan 8, 2025 • 39.8k • 524 • 9
- trl-lib/DeepMath-103K • Viewer • Updated Nov 14, 2025 • 103k • 4.26k • 9
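Prompt-only datasets contain just the prompts; completions are sampled from the policy during training, as in online RL-style trainers. A minimal sketch of one record (field name per the TRL dataset-format docs; the example text is invented):

```python
# One prompt-only record: the prompt alone, with completions generated
# on the fly at training time.
prompt_only_example = {"prompt": "What color is the sky?"}
```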
Comparing DPO with IPO and KTO
A collection of chat models to explore the differences between three alignment techniques: DPO, IPO, and KTO.
- teknium/OpenHermes-2.5-Mistral-7B • Text Generation • Updated Feb 19, 2024 • 144k • 895
- Intel/orca_dpo_pairs • Viewer • Updated Nov 29, 2023 • 12.9k • 2.25k • 321
- trl-lib/OpenHermes-2-Mistral-7B-ipo-beta-0.1-steps-200 • Updated Dec 20, 2023 • 6
- trl-lib/OpenHermes-2-Mistral-7B-ipo-beta-0.2-steps-200 • Updated Dec 20, 2023 • 5
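The core difference between DPO and IPO is the per-pair loss applied to the same preference margin. A rough sketch under stated assumptions: `margin` stands for the chosen-minus-rejected difference of (policy minus reference) log-prob ratios, and `beta` matches the beta-0.1 / beta-0.2 variants in the model names above; KTO is omitted because it operates on unpaired, labeled completions rather than this pairwise margin:

```python
import math

def dpo_loss(margin: float, beta: float = 0.1) -> float:
    """DPO: negative log-sigmoid of the scaled preference margin."""
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

def ipo_loss(margin: float, beta: float = 0.1) -> float:
    """IPO: squared regression of the margin toward the target 1/(2*beta)."""
    return (margin - 1.0 / (2.0 * beta)) ** 2
```

Intuitively, DPO keeps pushing the margin up (the loss only saturates asymptotically), while IPO regresses it toward a fixed target, which reduces overfitting to the preference pairs.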