- Monitored Markov Decision Processes
  Paper • 2402.06819 • Published
- Generalization in Monitored Markov Decision Processes (Mon-MDPs)
  Paper • 2505.08988 • Published
- Bayesian Risk Markov Decision Processes
  Paper • 2106.02558 • Published
- Sotopia-RL: Reward Design for Social Intelligence
  Paper • 2508.03905 • Published • 23
Collections
Collections including paper arxiv:1706.03741
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
  Paper • 2305.18290 • Published • 64
- ICDPO: Effectively Borrowing Alignment Capability of Others via In-context Direct Preference Optimization
  Paper • 2402.09320 • Published • 6
- sDPO: Don't Use Your Data All at Once
  Paper • 2403.19270 • Published • 41
- Dueling RL: Reinforcement Learning with Trajectory Preferences
  Paper • 2111.04850 • Published • 2
- Deep reinforcement learning from human preferences
  Paper • 1706.03741 • Published • 4
- Training language models to follow instructions with human feedback
  Paper • 2203.02155 • Published • 24
- Direct Preference-based Policy Optimization without Reward Modeling
  Paper • 2301.12842 • Published
- Woodpecker: Hallucination Correction for Multimodal Large Language Models
  Paper • 2310.16045 • Published • 17
- Dueling RL: Reinforcement Learning with Trajectory Preferences
  Paper • 2111.04850 • Published • 2
- Learning Trajectory Preferences for Manipulators via Iterative Improvement
  Paper • 1306.6294 • Published • 3
- Deep reinforcement learning from human preferences
  Paper • 1706.03741 • Published • 4
- Learning Dynamic Robot-to-Human Object Handover from Human Feedback
  Paper • 1603.06390 • Published • 2
- Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning
  Paper • 2310.20587 • Published • 18
- SELF: Language-Driven Self-Evolution for Large Language Model
  Paper • 2310.00533 • Published • 2
- QLoRA: Efficient Finetuning of Quantized LLMs
  Paper • 2305.14314 • Published • 61
- QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
  Paper • 2309.14717 • Published • 46