Public Benchmark Results

All dates below refer to 2026-04-01 UTC.

Dense Occluded Retrieval Proxy

Benchmark:

ManiSkill PickClutterYCB-v1

Completed runs

reports/maniskill_pickclutter_smoke/public_benchmark_package_summary.json
- trunk = 0.04
- noop = 0.04
- active = 0.04
reports/maniskill_pickclutter_smoke_v2/public_benchmark_package_summary.json
- trunk = 0.04
- noop = 0.32
- active = 0.32
- not adapter-specific because active == noop
reports/maniskill_pickclutter_smoke_v3/public_benchmark_package_summary.json
- trunk = 0.06
- noop = 0.06
- active = 0.06
reports/maniskill_pickclutter_smoke_v4/public_benchmark_package_summary.json
- trunk = 0.48
- noop = 0.04
- active = 0.04
- active intervened but regressed badly against trunk (0.04 vs 0.48)
reports/maniskill_pickclutter_smoke_v4_evalprobe_fromv3/public_benchmark_package_summary.json
- trunk = 0.06
- noop = 0.06
- active = 0.62
- delta = +0.56
- eval-probe only, not a clean retrain
reports/maniskill_pickclutter_smoke_v5/public_benchmark_package_summary.json
- trunk = 0.04
- noop = 0.04
- active = 0.04
- fairness-preserving retrain, but active still failed
reports/maniskill_pickclutter_smoke_v5_val_sweep/summary.json
- val-only planner sweep
- baseline_corrected = 0.00
- soft_pref = 0.00
- softer_pref = 0.625
- retrieve_open = 0.625
reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json
- trunk = 0.04
- noop = 0.04
- active = 0.62
- delta = +0.58
- 95% CI = [0.44, 0.72]
- intervention_rate = 1.0
- non_base_selection_rate = 1.0
reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/
- full rerender of all 50 held-out seeds for trunk_only_ft and adapter_active_ft
- includes index.html, INDEX.md, and manifest.json
- rerender manifest reports 0 success mismatches against the saved benchmark json files
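
The run notes above apply two simple readings to each row: `delta` is active minus trunk, and a gain only counts as adapter-specific when active also beats the no-op adapter. A minimal sketch of that bookkeeping follows. The `Run` class and its property names are hypothetical helpers for illustration, not part of the benchmark package, and the "beats both trunk and noop" rule is an assumption generalized from the `smoke_v2` note.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical container for one row of the table above."""
    name: str
    trunk: float
    noop: float
    active: float

    @property
    def delta(self) -> float:
        # Gain of the active adapter over the trunk-only baseline.
        return round(self.active - self.trunk, 2)

    @property
    def adapter_specific(self) -> bool:
        # Assumed rule: active must beat both trunk and the no-op adapter,
        # otherwise the gain cannot be attributed to the adapter's actions.
        return self.active > self.trunk and self.active > self.noop


# Success rates copied from the completed runs listed above.
RUNS = [
    Run("smoke", 0.04, 0.04, 0.04),
    Run("smoke_v2", 0.04, 0.32, 0.32),
    Run("smoke_v3", 0.06, 0.06, 0.06),
    Run("smoke_v4", 0.48, 0.04, 0.04),
    Run("smoke_v4_evalprobe_fromv3", 0.06, 0.06, 0.62),
    Run("smoke_v5", 0.04, 0.04, 0.04),
    Run("smoke_v5_eval_tuned_softerpref", 0.04, 0.04, 0.62),
]

for run in RUNS:
    print(f"{run.name}: delta={run.delta:+.2f} "
          f"adapter_specific={run.adapter_specific}")
```

Under this rule only the two rows with active = 0.62 are flagged as adapter-specific, and `smoke_v2` is not, which matches the notes in the list.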

Exact `smoke_v5` eval tuning carried to held-out

mode_preference_bonus = 0.75
premature_retrieve_penalty = 0.5
premature_insert_penalty = 0.25
premature_maintain_penalty = 1.0
occlusion_maintain_gap_min_access = 0.30
occlusion_maintain_gap_min_visibility = 0.20
retrieve_stage_access_threshold = 0.18
retrieve_stage_reveal_threshold = 0.18
retrieve_stage_support_threshold = 0.18
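
Because the same values are meant to be carried unchanged from validation to held-out, one option is to pin them in an immutable object. The sketch below does that with a frozen dataclass. The class name `PlannerEvalTuning` is hypothetical, and how the planner actually consumes these keys is not described in this file.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PlannerEvalTuning:
    """Hypothetical holder for the smoke_v5 eval tuning listed above."""
    mode_preference_bonus: float = 0.75
    premature_retrieve_penalty: float = 0.5
    premature_insert_penalty: float = 0.25
    premature_maintain_penalty: float = 1.0
    occlusion_maintain_gap_min_access: float = 0.30
    occlusion_maintain_gap_min_visibility: float = 0.20
    retrieve_stage_access_threshold: float = 0.18
    retrieve_stage_reveal_threshold: float = 0.18
    retrieve_stage_support_threshold: float = 0.18


# frozen=True makes accidental edits between the validation sweep and the
# held-out evaluation raise an error instead of silently changing a value.
TUNING = PlannerEvalTuning()

for key, value in asdict(TUNING).items():
    print(f"{key} = {value}")
```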

Bag Retrieval Proxy

Benchmark:

ManiSkill public bridge basket retrieval proxy

Completed runs

reports/maniskill_bag_bridge_smoke_v1/trunk_only_ft_seed17.json
- 0.32
reports/maniskill_bag_bridge_smoke_v1/adapter_noop_seed17.json
- 0.00
reports/maniskill_bag_bridge_smoke_v1/adapter_active_ft_seed17.json
- 0.48
reports/maniskill_bag_bridge_smoke_v1/trunk_only_ft_seed23.json
- 0.48
reports/maniskill_bag_bridge_smoke_v1/adapter_noop_seed23.json
- 0.08
reports/maniskill_bag_bridge_smoke_v1/adapter_active_ft_seed23.json
- 0.00

Seed-23 validation sweep

reports/maniskill_bag_bridge_val_sweep_seed23/summary.json

Configs:

default
- trunk = 0.125
- noop = 0.125
- active = 0.00
less_bonus
- trunk = 0.125
- noop = 0.125
- active = 0.125
- intervention preserved
conservative
- trunk = 0.125
- noop = 0.125
- active = 0.125
- intervention effectively disabled
low_bonus_high_thresh
- trunk = 0.125
- noop = 0.125
- active = 0.125
- intervention effectively disabled
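
Three of the four configs tie on validation success, so the sweep alone does not pick a winner. One plausible selection rule, sketched below, is to discard configs where active regresses below trunk and then discard configs that only avoid regression by never intervening. This rule is an assumption for illustration; the file only records that `less_bonus` was the config used for the corrected held-out evals. `SweepRow` and `pick_config` are hypothetical names.

```python
from typing import NamedTuple, Optional, Sequence


class SweepRow(NamedTuple):
    name: str
    trunk: float
    noop: float
    active: float
    # True/False mirrors the note on each config; None where no note is given.
    intervention_disabled: Optional[bool]


# Values copied from the seed-23 bag validation sweep above.
ROWS = [
    SweepRow("default", 0.125, 0.125, 0.00, None),
    SweepRow("less_bonus", 0.125, 0.125, 0.125, False),
    SweepRow("conservative", 0.125, 0.125, 0.125, True),
    SweepRow("low_bonus_high_thresh", 0.125, 0.125, 0.125, True),
]


def pick_config(rows: Sequence[SweepRow]) -> Optional[SweepRow]:
    # Step 1: drop configs where the active adapter regresses below trunk.
    safe = [r for r in rows if r.active >= r.trunk]
    # Step 2: keep only configs explicitly noted as still intervening.
    live = [r for r in safe if r.intervention_disabled is False]
    # Step 3: highest validation success among the remainder, if any.
    return max(live, key=lambda r: r.active, default=None)


chosen = pick_config(ROWS)
print(chosen.name if chosen else "no config qualifies")
```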

Corrected held-out evals

reports/maniskill_bag_bridge_eval_less_bonus_seed17/public_benchmark_package_summary.json
- trunk = 0.32
- noop = 0.00
- active = 0.48
- delta = +0.16
reports/maniskill_bag_bridge_eval_less_bonus_seed23/public_benchmark_package_summary.json
- trunk = 0.48
- noop = 0.08
- active = 0.48
- delta = +0.00
reports/maniskill_bag_bridge_eval_less_bonus_2seed_manual_summary.json
- trunk = 0.40
- noop = 0.04
- active = 0.48
- delta = +0.08
- run-bootstrap CI = [0.00, 0.16]
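
With only two runs, a run-level bootstrap can only produce resampled means of 0.00, 0.08, or 0.16, so the interval spans the two per-seed deltas. The sketch below shows one way such an interval could be computed with a percentile bootstrap over runs. The resample count, the percentile indexing, and the function name `run_bootstrap_ci` are assumptions; the procedure behind the reported interval is not stated in this file.

```python
import random
from statistics import fmean

# Per-seed held-out deltas (active minus trunk) from the less_bonus evals above.
PER_SEED_DELTA = {"seed17": 0.16, "seed23": 0.00}


def run_bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap over runs (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample runs with replacement and record the mean delta each time.
    means = sorted(fmean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


deltas = list(PER_SEED_DELTA.values())
mean_delta = fmean(deltas)
ci_lo, ci_hi = run_bootstrap_ci(deltas)
print(f"delta = {mean_delta:+.2f}, CI = [{ci_lo:.2f}, {ci_hi:.2f}]")
```

An interval this wide relative to the mean reflects the sample of two runs more than any property of the adapter, which is why the bag result is read as only modestly positive.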

Cloth Retrieval Proxy

Benchmark:

ManiSkill public bridge cloth retrieval proxy

Completed held-out seeds

seed17
- trunk = 0.04
- noop = 0.04
- active = 0.10
- intervention = 0.3369
- non_base = 0.2674
seed23
- trunk = 0.04
- noop = 0.02
- active = 0.02
- intervention = 0.0
- non_base = 0.0
seed29
- trunk = 0.04
- noop = 0.04
- active = 0.04
- intervention = 0.0
- non_base = 0.0

3-seed aggregate:

trunk = 0.0400
noop = 0.0333
active = 0.0533
delta = +0.0133
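
The aggregate is consistent with an unweighted mean over the three seeds, with delta taken as mean active minus mean trunk. A short sketch that reproduces the four numbers, assuming each seed carries equal weight (the `aggregate` helper is hypothetical):

```python
from statistics import fmean

# Per-seed cloth success rates copied from the held-out table above.
CLOTH = {
    "seed17": {"trunk": 0.04, "noop": 0.04, "active": 0.10},
    "seed23": {"trunk": 0.04, "noop": 0.02, "active": 0.02},
    "seed29": {"trunk": 0.04, "noop": 0.04, "active": 0.04},
}


def aggregate(per_seed):
    """Unweighted mean per arm, plus the active-minus-trunk delta."""
    arms = ("trunk", "noop", "active")
    means = {arm: fmean(row[arm] for row in per_seed.values()) for arm in arms}
    means["delta"] = means["active"] - means["trunk"]
    return means


agg = aggregate(CLOTH)
for key, value in agg.items():
    print(f"{key} = {value:.4f}")
```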

Seed-23 cloth validation sweep

reports/maniskill_cloth_bridge_val_sweep_seed23/summary.json

Configs:

default
- trunk = 0.25
- noop = 0.125
- active = 0.125
- intervention = 0.0
low_thresh
- active = 0.125
- intervention = 0.2
- non_base = 0.0667
low_thresh_less_bonus
- active = 0.125
- intervention = 0.2
- non_base = 0.0667
very_low_thresh_less_bonus
- active = 0.125
- intervention = 1.0
- non_base = 0.5333

Interpretation:

seed23 cloth was not recoverable by eval-side planner tuning alone: active stayed at 0.125 in every config, even as the intervention rate rose from 0.0 to 1.0.
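
The reasoning behind this reading can be checked mechanically: if no swept config lifts active success above the default config while the intervention rate varies widely, planner tuning is not the limiting factor. A minimal sketch of that check, using the sweep values above (the tuple layout and the name `tuning_recovers` are hypothetical):

```python
# (config name, active success, intervention rate) from the cloth sweep above.
CLOTH_SWEEP = [
    ("default", 0.125, 0.0),
    ("low_thresh", 0.125, 0.2),
    ("low_thresh_less_bonus", 0.125, 0.2),
    ("very_low_thresh_less_bonus", 0.125, 1.0),
]


def tuning_recovers(sweep, baseline_name="default"):
    """True if any non-baseline config beats the baseline's active success."""
    baseline = next(active for name, active, _ in sweep if name == baseline_name)
    return any(active > baseline
               for name, active, _ in sweep if name != baseline_name)


rates = [rate for _, _, rate in CLOTH_SWEEP]
print(f"intervention rate range: {min(rates)} to {max(rates)}")
print(f"any config beats default: {tuning_recovers(CLOTH_SWEEP)}")
```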

Single-Seed Combined Proxy Suite

reports/public_proxy_suite_smoke_v1/combined_summary.json

Single-seed summary:

occlusion proxy: +0.58
bag proxy: +0.16
cloth proxy: +0.06
macro delta: +0.267

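This combined single-seed summary is kept for historical context. The multi-seed results above support a more cautious reading:

occlusion: strong (delta = +0.58 on held-out after eval tuning)
bag: modestly positive across corrected 2-seed evaluation (delta = +0.08)
cloth: weak/inconclusive across 3 seeds (delta = +0.0133)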
