Public Benchmark Results
All dates below refer to 2026-04-01 UTC.
Dense Occluded Retrieval Proxy
Benchmark:
- ManiSkill `PickClutterYCB-v1`
Completed runs
- `reports/maniskill_pickclutter_smoke/public_benchmark_package_summary.json`
  - `trunk = 0.04`, `noop = 0.04`, `active = 0.04`
- `reports/maniskill_pickclutter_smoke_v2/public_benchmark_package_summary.json`
  - `trunk = 0.04`, `noop = 0.32`, `active = 0.32`
  - not adapter-specific because `active == noop`
- `reports/maniskill_pickclutter_smoke_v3/public_benchmark_package_summary.json`
  - `trunk = 0.06`, `noop = 0.06`, `active = 0.06`
- `reports/maniskill_pickclutter_smoke_v4/public_benchmark_package_summary.json`
  - `trunk = 0.48`, `noop = 0.04`, `active = 0.04`
  - active intervened but regressed badly
- `reports/maniskill_pickclutter_smoke_v4_evalprobe_fromv3/public_benchmark_package_summary.json`
  - `trunk = 0.06`, `noop = 0.06`, `active = 0.62`, `delta = +0.56`
  - eval-probe only, not a clean retrain
- `reports/maniskill_pickclutter_smoke_v5/public_benchmark_package_summary.json`
  - `trunk = 0.04`, `noop = 0.04`, `active = 0.04`
  - fairness-preserving retrain, but active still failed
- `reports/maniskill_pickclutter_smoke_v5_val_sweep/summary.json`
  - val-only planner sweep
  - `baseline_corrected = 0.00`, `soft_pref = 0.00`, `softer_pref = 0.625`, `retrieve_open = 0.625`
- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json`
  - `trunk = 0.04`, `noop = 0.04`, `active = 0.62`
  - `delta = +0.58`, `95% CI = [0.44, 0.72]`
  - `intervention_rate = 1.0`, `non_base_selection_rate = 1.0`
- `reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/`
  - full rerender of all `50` held-out seeds for `trunk_only_ft` and `adapter_active_ft`
  - includes `index.html`, `INDEX.md`, and `manifest.json`
  - rerender manifest reports `0` success mismatches against the saved benchmark json files
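The two checks applied to these runs (the gain over trunk, and whether the active result differs from the no-op adapter) can be sketched as below. This is a minimal illustration, not the benchmark package's own code: the `RunSummary` class and its property names are hypothetical, and only three of the rows listed above are hard-coded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Hypothetical container for one run's three success rates."""
    name: str
    trunk: float
    noop: float
    active: float

    @property
    def delta(self) -> float:
        # Gain of the active adapter over the trunk-only baseline.
        return round(self.active - self.trunk, 4)

    @property
    def adapter_specific(self) -> bool:
        # If active matches the no-op adapter, the adapter's
        # interventions did not change the outcome.
        return self.active != self.noop


# Values copied from the run list above.
RUNS = [
    RunSummary("smoke_v2", 0.04, 0.32, 0.32),
    RunSummary("smoke_v4_evalprobe_fromv3", 0.06, 0.06, 0.62),
    RunSummary("smoke_v5_eval_tuned_softerpref", 0.04, 0.04, 0.62),
]

for run in RUNS:
    print(f"{run.name}: delta={run.delta:+.2f} "
          f"adapter_specific={run.adapter_specific}")
```

Under this reading, `smoke_v2` shows a gain over trunk but is flagged as not adapter-specific, which matches the note recorded for that run.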
Exact smoke_v5 eval tuning carried to held-out
- `mode_preference_bonus = 0.75`
- `premature_retrieve_penalty = 0.5`
- `premature_insert_penalty = 0.25`
- `premature_maintain_penalty = 1.0`
- `occlusion_maintain_gap_min_access = 0.30`
- `occlusion_maintain_gap_min_visibility = 0.20`
- `retrieve_stage_access_threshold = 0.18`
- `retrieve_stage_reveal_threshold = 0.18`
- `retrieve_stage_support_threshold = 0.18`
Bag Retrieval Proxy
Benchmark:
- ManiSkill public bridge basket retrieval proxy
Completed runs
- `reports/maniskill_bag_bridge_smoke_v1/trunk_only_ft_seed17.json`: `0.32`
- `reports/maniskill_bag_bridge_smoke_v1/adapter_noop_seed17.json`: `0.00`
- `reports/maniskill_bag_bridge_smoke_v1/adapter_active_ft_seed17.json`: `0.48`
- `reports/maniskill_bag_bridge_smoke_v1/trunk_only_ft_seed23.json`: `0.48`
- `reports/maniskill_bag_bridge_smoke_v1/adapter_noop_seed23.json`: `0.08`
- `reports/maniskill_bag_bridge_smoke_v1/adapter_active_ft_seed23.json`: `0.00`
Seed-23 validation sweep
- `reports/maniskill_bag_bridge_val_sweep_seed23/summary.json`
Configs:
- `default`
  - `trunk = 0.125`, `noop = 0.125`, `active = 0.00`
- `less_bonus`
  - `trunk = 0.125`, `noop = 0.125`, `active = 0.125`
  - intervention preserved
- `conservative`
  - `trunk = 0.125`, `noop = 0.125`, `active = 0.125`
  - intervention effectively disabled
- `low_bonus_high_thresh`
  - `trunk = 0.125`, `noop = 0.125`, `active = 0.125`
  - intervention effectively disabled
Corrected held-out evals
- `reports/maniskill_bag_bridge_eval_less_bonus_seed17/public_benchmark_package_summary.json`
  - `trunk = 0.32`, `noop = 0.00`, `active = 0.48`, `delta = +0.16`
- `reports/maniskill_bag_bridge_eval_less_bonus_seed23/public_benchmark_package_summary.json`
  - `trunk = 0.48`, `noop = 0.08`, `active = 0.48`, `delta = +0.00`
- `reports/maniskill_bag_bridge_eval_less_bonus_2seed_manual_summary.json`
  - `trunk = 0.40`, `noop = 0.04`, `active = 0.48`, `delta = +0.08`
  - run-bootstrap CI `[0.00, 0.16]`
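The 2-seed manual summary can be reproduced from the two corrected held-out evals. The sketch below is an illustration under assumptions: the exact run-bootstrap procedure behind the reported interval is not documented here, so the code simply enumerates every resample (with replacement) of the per-seed deltas and reports the range of resampled means. With only two runs those means can take just three values, so this is one plausible reading of the interval, not a confirmed one. The function names are hypothetical.

```python
from itertools import product
from statistics import mean

# Per-seed success rates copied from the corrected held-out evals above.
SEEDS = {
    17: {"trunk": 0.32, "noop": 0.00, "active": 0.48},
    23: {"trunk": 0.48, "noop": 0.08, "active": 0.48},
}


def aggregate_seeds(seeds):
    """Mean success rate per policy across seeds."""
    return {
        key: round(mean(row[key] for row in seeds.values()), 4)
        for key in ("trunk", "noop", "active")
    }


def run_bootstrap_range(deltas):
    """Range of the mean delta over all resamples of the runs.

    Exhaustive enumeration is exact and cheap when there are only a
    handful of runs; it stands in for random resampling here.
    """
    means = [mean(sample) for sample in product(deltas, repeat=len(deltas))]
    return round(min(means), 4), round(max(means), 4)


per_seed_deltas = [
    round(row["active"] - row["trunk"], 4) for row in SEEDS.values()
]
print(aggregate_seeds(SEEDS))
print(round(mean(per_seed_deltas), 4), run_bootstrap_range(per_seed_deltas))
```

The lower bound of the interval sits exactly at zero because seed 23 contributes no gain, which is why the bag result reads as modest rather than strong.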
Cloth Retrieval Proxy
Benchmark:
- ManiSkill public bridge cloth retrieval proxy
Completed held-out seeds
- `seed17`
  - `trunk = 0.04`, `noop = 0.04`, `active = 0.10`
  - `intervention = 0.3369`, `non_base = 0.2674`
- `seed23`
  - `trunk = 0.04`, `noop = 0.02`, `active = 0.02`
  - `intervention = 0.0`, `non_base = 0.0`
- `seed29`
  - `trunk = 0.04`, `noop = 0.04`, `active = 0.04`
  - `intervention = 0.0`, `non_base = 0.0`
3-seed aggregate:
- `trunk = 0.0400`, `noop = 0.0333`, `active = 0.0533`, `delta = +0.0133`
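The 3-seed aggregate follows directly from the per-seed numbers, and breaking it down per seed shows where the small positive delta comes from. The sketch below is illustrative only: the dictionary layout and helper names are hypothetical and are not the format of the saved benchmark json files.

```python
from statistics import mean

# Per-seed cloth results copied from the held-out seed list above.
CLOTH = {
    17: {"trunk": 0.04, "noop": 0.04, "active": 0.10, "intervention": 0.3369},
    23: {"trunk": 0.04, "noop": 0.02, "active": 0.02, "intervention": 0.0},
    29: {"trunk": 0.04, "noop": 0.04, "active": 0.04, "intervention": 0.0},
}


def cloth_aggregate(rows):
    """Mean success rate per policy across the held-out seeds."""
    return {
        key: round(mean(row[key] for row in rows.values()), 4)
        for key in ("trunk", "noop", "active")
    }


def cloth_deltas(rows):
    """Active-minus-trunk delta for each seed."""
    return {
        seed: round(row["active"] - row["trunk"], 4)
        for seed, row in rows.items()
    }


deltas = cloth_deltas(CLOTH)
print(cloth_aggregate(CLOTH))
print(deltas, round(mean(deltas.values()), 4))

# Seeds on which the adapter never intervened.
idle = [seed for seed, row in CLOTH.items() if row["intervention"] == 0.0]
print("no intervention on seeds:", idle)
```

Only seed 17 shows any intervention, and it is also the only seed with a positive delta, so the aggregate gain rests on a single seed.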
Seed-23 cloth validation sweep
- `reports/maniskill_cloth_bridge_val_sweep_seed23/summary.json`
Configs:
- `default`
  - `trunk = 0.25`, `noop = 0.125`, `active = 0.125`, `intervention = 0.0`
- `low_thresh`
  - `active = 0.125`, `intervention = 0.2`, `non_base = 0.0667`
- `low_thresh_less_bonus`
  - `active = 0.125`, `intervention = 0.2`, `non_base = 0.0667`
- `very_low_thresh_less_bonus`
  - `active = 0.125`, `intervention = 1.0`, `non_base = 0.5333`
Interpretation:
- seed23 cloth was not recoverable by eval-side planner tuning alone
Single-Seed Combined Proxy Suite
- `reports/public_proxy_suite_smoke_v1/combined_summary.json`
Single-seed summary:
- occlusion proxy: `+0.58`
- bag proxy: `+0.16`
- cloth proxy: `+0.06`
- macro delta: `+0.267`
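The macro delta is consistent with an unweighted mean of the three single-seed proxy deltas. A minimal check, assuming that is how it was computed (the variable and function names are hypothetical):

```python
from statistics import mean

# Single-seed deltas copied from the summary above.
SINGLE_SEED_DELTAS = {"occlusion": 0.58, "bag": 0.16, "cloth": 0.06}


def macro_delta(deltas):
    """Unweighted mean of the per-proxy deltas."""
    return mean(deltas.values())


print(f"{macro_delta(SINGLE_SEED_DELTAS):+.3f}")  # prints +0.267
```

Because each proxy carries equal weight, the occlusion result accounts for most of the macro figure.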
This combined single-seed picture is useful historically, but the stronger current read is:
- occlusion: strong
- bag: modestly positive across corrected 2-seed evaluation
- cloth: weak/inconclusive across 3 seeds