File size: 974 Bytes
b4ac377
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# DebateFloor HTTP Rollout Evaluation

Base URL: `https://aniketasla-debatefloor.hf.space`

| Policy | Episodes | Mean reward | Mean calibration | Success rate |
|---|---:|---:|---:|---:|
| naive_high_no_investigation | 3 | 0.267 | -0.200 | 33.33% |

| calibrated_scripted_investigator | 3 | 0.554 | 0.567 | 100.00% |



## Per-Episode Rows



| Policy | Task | Seed | Reward | Calibration | Confidence | Steps |

|---|---|---:|---:|---:|---|---:|

| naive_high_no_investigation | clean_claim | 42 | 0.800 | 1.0 | HIGH | 1 |

| naive_high_no_investigation | contradictory_claim | 42 | 0.000 | -0.8 | HIGH | 1 |

| naive_high_no_investigation | distribution_shift_claim | 42 | 0.000 | -0.8 | HIGH | 1 |
| calibrated_scripted_investigator | clean_claim | 42 | 0.762 | 1.0 | HIGH | 4 |

| calibrated_scripted_investigator | contradictory_claim | 42 | 0.547 | 0.6 | MED | 7 |
| calibrated_scripted_investigator | distribution_shift_claim | 42 | 0.352 | 0.1 | LOW | 8 |