---
title: Sim-OPRL
emoji: 🎯
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: false
---

# Sim-OPRL: Preference Elicitation for Offline RL

Reproduction of **[Sim-OPRL (ICLR 2025)](https://arxiv.org/abs/2406.18450)** by Pace, Schölkopf, Rätsch & Ramponi.

## What this demo does

Each round, two CartPole trajectories are rolled out in a learned **dynamics model ensemble**; the pair shown to you is chosen by the Sim-OPRL acquisition strategy (sketched in code after the quote):

> maximise **reward uncertainty** (we learn the most here) − λ · **transition uncertainty** (stay in-distribution)
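
A minimal sketch of how that score might be computed. The ensemble disagreement (standard deviation across members) stands in for epistemic uncertainty; `reward_ensemble`, `dynamics_ensemble`, the trajectory format, and `lam` (λ) are illustrative assumptions, not the repo's actual API:

```python
import torch

def acquisition_score(traj, reward_ensemble, dynamics_ensemble, lam=0.1):
    """Score a candidate trajectory: high reward disagreement (informative to
    query) penalised by high transition disagreement (out-of-distribution).
    `traj` is an assumed (states [T, state_dim], actions [T, action_dim]) pair;
    each ensemble is an assumed list of torch modules mapping (s, a) to a
    per-step reward or next state."""
    states, actions = traj
    with torch.no_grad():
        r_preds = torch.stack([m(states, actions) for m in reward_ensemble])   # [E, T]
        s_preds = torch.stack([m(states, actions) for m in dynamics_ensemble]) # [E, T, state_dim]
    reward_unc = r_preds.std(dim=0).sum()              # total reward uncertainty
    trans_unc = s_preds.std(dim=0).norm(dim=-1).sum()  # total transition uncertainty
    return reward_unc - lam * trans_unc
```

Candidate trajectory pairs would then be ranked by this score and the top pair presented for comparison.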

Click which trajectory keeps the pole balanced longer. Each click trains the **Bradley-Terry reward model** via preference feedback. Every 5 clicks the policy is re-optimised with REINFORCE on the learned reward — using **no ground-truth reward labels at any point**.
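
For reference, a minimal sketch of one Bradley-Terry training step, assuming a hypothetical `reward_model` that maps (states, actions) to per-step rewards; the Bradley-Terry model sets P(a ≻ b) = σ(R(a) − R(b)), where R sums predicted rewards along a trajectory:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_model, traj_a, traj_b, preferred):
    """Negative log-likelihood of one preference under the Bradley-Terry model.
    `traj_a`/`traj_b` are assumed (states, actions) tensor pairs;
    `preferred` is 0 if the user clicked trajectory a, 1 for trajectory b."""
    r_a = reward_model(*traj_a).sum()  # return of trajectory a under learned reward
    r_b = reward_model(*traj_b).sum()
    logits = torch.stack([r_a, r_b]).unsqueeze(0)    # [1, 2]
    target = torch.tensor(preferred).unsqueeze(0)    # [1]
    # Cross-entropy over the pair == -log P(preferred trajectory wins).
    return F.cross_entropy(logits, target)
```

Each click contributes one `(traj_a, traj_b, preferred)` tuple; minimising this loss over all collected preferences fits the reward model that REINFORCE then optimises against.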

## How to run locally

```bash
pip install -r requirements.txt
python train.py        # collect data + train dynamics model + run comparison
python plot_results.py # generate main figure
python app.py          # launch interactive demo
```

## Results

![Main figure](results/main_figure.png)

Sim-OPRL reaches higher policy return with fewer preference queries than both baselines, by asking *informative* questions rather than random ones.