singhanshuman committed on
Commit fb72c8f · verified · 1 Parent(s): 77ed572

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +31 -6
README.md CHANGED
@@ -1,12 +1,37 @@
  ---
- title: Sim Oprl
- emoji: 🏢
- colorFrom: purple
- colorTo: gray
+ title: Sim-OPRL
+ emoji: 🎯
+ colorFrom: blue
+ colorTo: indigo
  sdk: gradio
- sdk_version: 6.12.0
+ sdk_version: 4.44.0
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Sim-OPRL: Preference Elicitation for Offline RL
+
+ Reproduction of **[Sim-OPRL (ICLR 2025)](https://arxiv.org/abs/2406.18450)** by Pace, Schölkopf, Rätsch & Ramponi.
+
+ ## What this demo does
+
+ Two CartPole trajectories are simulated with a learned **dynamics model ensemble** and selected by the Sim-OPRL acquisition strategy:
+
+ > maximise **reward uncertainty** (we learn the most here) − λ · **transition uncertainty** (stay in-distribution)
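+
+ As a rough illustration, the pair-selection score could be computed from ensemble disagreement along the lines of the sketch below. This is a minimal sketch, not the repo's actual code: `acquisition_score`, the trajectory dictionaries, and the trade-off weight `lam` are illustrative names.
+
+ ```python
+ import numpy as np
+
+ def acquisition_score(pair, reward_ensemble, dynamics_ensemble, lam=0.1):
+     """Score one candidate pair of simulated trajectories (illustrative only)."""
+     # Pool the states and actions of both trajectories in the pair.
+     states = np.concatenate([pair[0]["states"], pair[1]["states"]])
+     actions = np.concatenate([pair[0]["actions"], pair[1]["actions"]])
+
+     # Reward uncertainty: spread of the ensemble's return estimates.
+     # A large spread means a preference on this pair is highly informative.
+     returns = np.array([r(states).sum() for r in reward_ensemble])
+     reward_uncertainty = returns.std()
+
+     # Transition uncertainty: disagreement of the dynamics models on the
+     # visited (state, action) pairs, penalised to keep queries in-distribution.
+     next_states = np.stack([d(states, actions) for d in dynamics_ensemble])
+     transition_uncertainty = next_states.std(axis=0).mean()
+
+     return reward_uncertainty - lam * transition_uncertainty
+
+ # The demo would then show the highest-scoring pair to the user, e.g.:
+ # best = max(candidate_pairs, key=lambda p: acquisition_score(p, R_models, D_models))
+ ```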
+
+ Click the trajectory that keeps the pole balanced longer. Each click trains the **Bradley-Terry reward model** via preference feedback. Every 5 clicks the policy is re-optimised with REINFORCE on the learned reward, using **no ground-truth reward labels at any point**.
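+
+ For intuition, a single Bradley-Terry update could look like the sketch below, assuming (purely for illustration) a linear reward model on per-step features; the demo's actual reward model may differ, e.g. a small neural network.
+
+ ```python
+ import numpy as np
+
+ def sigmoid(x):
+     return 1.0 / (1.0 + np.exp(-x))
+
+ def bradley_terry_update(theta, feats_a, feats_b, preferred_a, lr=0.05):
+     """One preference-feedback step for a linear reward model (illustrative only).
+
+     theta       : reward weights; the per-step reward is feats @ theta
+     feats_a/b   : (T, d) per-step feature arrays for the two shown trajectories
+     preferred_a : True if the user clicked trajectory A
+     """
+     # Trajectory returns under the current learned reward.
+     ret_a = (feats_a @ theta).sum()
+     ret_b = (feats_b @ theta).sum()
+
+     # Bradley-Terry likelihood of the click: P(A preferred) = sigmoid(R(A) - R(B)).
+     p_a = sigmoid(ret_a - ret_b)
+     y = 1.0 if preferred_a else 0.0
+
+     # Gradient step on the negative log-likelihood of the observed preference.
+     grad = (p_a - y) * (feats_a.sum(axis=0) - feats_b.sum(axis=0))
+     return theta - lr * grad
+ ```
+
+ The policy itself never touches environment rewards: REINFORCE only ever optimises against the output of this learned reward model.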
+
+ ## How to run locally
+
+ ```bash
+ pip install -r requirements.txt
+ python train.py           # collect data + train dynamics model + run comparison
+ python plot_results.py    # generate main figure
+ python app.py             # launch interactive demo
+ ```
+
+ ## Results
+
+ ![Main figure](results/main_figure.png)
+
+ Sim-OPRL reaches a higher policy return with fewer preference queries than both baselines, by asking *informative* questions rather than random ones.