Public Benchmark Results

All dates below refer to 2026-04-01 UTC.

Dense Occluded Retrieval Proxy

Benchmark:

  • ManiSkill PickClutterYCB-v1

Completed runs

  • reports/maniskill_pickclutter_smoke/public_benchmark_package_summary.json
    • trunk = 0.04
    • noop = 0.04
    • active = 0.04
  • reports/maniskill_pickclutter_smoke_v2/public_benchmark_package_summary.json
    • trunk = 0.04
    • noop = 0.32
    • active = 0.32
    • not adapter-specific because active == noop
  • reports/maniskill_pickclutter_smoke_v3/public_benchmark_package_summary.json
    • trunk = 0.06
    • noop = 0.06
    • active = 0.06
  • reports/maniskill_pickclutter_smoke_v4/public_benchmark_package_summary.json
    • trunk = 0.48
    • noop = 0.04
    • active = 0.04
    • active intervened but regressed badly relative to trunk (0.04 vs 0.48)
  • reports/maniskill_pickclutter_smoke_v4_evalprobe_fromv3/public_benchmark_package_summary.json
    • trunk = 0.06
    • noop = 0.06
    • active = 0.62
    • delta = +0.56
    • eval-probe only, not a clean retrain
  • reports/maniskill_pickclutter_smoke_v5/public_benchmark_package_summary.json
    • trunk = 0.04
    • noop = 0.04
    • active = 0.04
    • fairness-preserving retrain, but active still failed to improve on trunk
  • reports/maniskill_pickclutter_smoke_v5_val_sweep/summary.json
    • val-only planner sweep
    • baseline_corrected = 0.00
    • soft_pref = 0.00
    • softer_pref = 0.625
    • retrieve_open = 0.625
  • reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json
    • trunk = 0.04
    • noop = 0.04
    • active = 0.62
    • delta = +0.58
    • 95% CI = [0.44, 0.72]
    • intervention_rate = 1.0
    • non_base_selection_rate = 1.0
  • reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref_gifs/
    • full rerender of all 50 held-out seeds for trunk_only_ft and adapter_active_ft
    • includes index.html, INDEX.md, and manifest.json
    • rerender manifest reports 0 success mismatches against the saved benchmark json files
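The deltas and the "adapter-specific" note in the list above come down to simple arithmetic on the three arms. Here is a minimal sketch, assuming delta means active minus trunk and that a gain counts as adapter-specific only when active also beats noop. The `classify` helper and its labels are illustrative and are not part of the benchmark package.

```python
# (trunk, noop, active) success rates copied from the PickClutter run list above.
RUNS = {
    "smoke": (0.04, 0.04, 0.04),
    "smoke_v2": (0.04, 0.32, 0.32),
    "smoke_v3": (0.06, 0.06, 0.06),
    "smoke_v4": (0.48, 0.04, 0.04),
    "smoke_v4_evalprobe_fromv3": (0.06, 0.06, 0.62),
    "smoke_v5": (0.04, 0.04, 0.04),
    "smoke_v5_eval_tuned_softerpref": (0.04, 0.04, 0.62),
}


def classify(trunk, noop, active):
    """Return (delta, label); hypothetical helper, labels are assumptions."""
    delta = round(active - trunk, 4)
    if active > trunk and active > noop:
        label = "adapter-specific gain"
    elif active > trunk:
        # Active beats trunk but merely matches noop, so the adapter's
        # active path is not what produced the gain.
        label = "gain, not adapter-specific"
    elif active < trunk:
        label = "regression"
    else:
        label = "no change"
    return delta, label


for name, arms in RUNS.items():
    delta, label = classify(*arms)
    print(f"{name}: delta = {delta:+.2f} ({label})")
```

Under these assumptions, only the two eval-tuned runs register as adapter-specific gains, which matches the deltas reported in the list.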

Exact smoke_v5 eval tuning carried to held-out

  • mode_preference_bonus = 0.75
  • premature_retrieve_penalty = 0.5
  • premature_insert_penalty = 0.25
  • premature_maintain_penalty = 1.0
  • occlusion_maintain_gap_min_access = 0.30
  • occlusion_maintain_gap_min_visibility = 0.20
  • retrieve_stage_access_threshold = 0.18
  • retrieve_stage_reveal_threshold = 0.18
  • retrieve_stage_support_threshold = 0.18

Bag Retrieval Proxy

Benchmark:

  • ManiSkill public bridge basket retrieval proxy

Completed runs

  • reports/maniskill_bag_bridge_smoke_v1/trunk_only_ft_seed17.json
    • 0.32
  • reports/maniskill_bag_bridge_smoke_v1/adapter_noop_seed17.json
    • 0.00
  • reports/maniskill_bag_bridge_smoke_v1/adapter_active_ft_seed17.json
    • 0.48
  • reports/maniskill_bag_bridge_smoke_v1/trunk_only_ft_seed23.json
    • 0.48
  • reports/maniskill_bag_bridge_smoke_v1/adapter_noop_seed23.json
    • 0.08
  • reports/maniskill_bag_bridge_smoke_v1/adapter_active_ft_seed23.json
    • 0.00

Seed-23 validation sweep

  • reports/maniskill_bag_bridge_val_sweep_seed23/summary.json

Configs:

  • default
    • trunk = 0.125
    • noop = 0.125
    • active = 0.00
  • less_bonus
    • trunk = 0.125
    • noop = 0.125
    • active = 0.125
    • intervention preserved
  • conservative
    • trunk = 0.125
    • noop = 0.125
    • active = 0.125
    • intervention effectively disabled
  • low_bonus_high_thresh
    • trunk = 0.125
    • noop = 0.125
    • active = 0.125
    • intervention effectively disabled

Corrected held-out evals

  • reports/maniskill_bag_bridge_eval_less_bonus_seed17/public_benchmark_package_summary.json
    • trunk = 0.32
    • noop = 0.00
    • active = 0.48
    • delta = +0.16
  • reports/maniskill_bag_bridge_eval_less_bonus_seed23/public_benchmark_package_summary.json
    • trunk = 0.48
    • noop = 0.08
    • active = 0.48
    • delta = +0.00
  • reports/maniskill_bag_bridge_eval_less_bonus_2seed_manual_summary.json
    • trunk = 0.40
    • noop = 0.04
    • active = 0.48
    • delta = +0.08
    • run-bootstrap CI [0.00, 0.16]
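The 2-seed summary can be reproduced from the per-seed numbers. The sketch below assumes that "run-bootstrap" means resampling the per-seed deltas with replacement and taking a percentile interval of the resampled mean. The manual summary file may have been produced differently, and the `nearest_rank` helper is illustrative. With only two runs, every resample can be enumerated exactly, so no random sampling is needed.

```python
import math
from itertools import product
from statistics import mean

# (trunk, active) held-out results for the less_bonus config, from the list above.
per_seed = {"seed17": (0.32, 0.48), "seed23": (0.48, 0.48)}

deltas = [round(active - trunk, 4) for trunk, active in per_seed.values()]
mean_delta = round(mean(deltas), 4)

# Enumerate all resamples-with-replacement of the two per-seed deltas.
resampled_means = sorted(
    round(mean(sample), 4) for sample in product(deltas, repeat=len(deltas))
)


def nearest_rank(sorted_values, p):
    """Nearest-rank percentile of an already sorted list (assumed convention)."""
    k = max(1, math.ceil(p * len(sorted_values)))
    return sorted_values[k - 1]


ci = (nearest_rank(resampled_means, 0.025), nearest_rank(resampled_means, 0.975))
print(f"delta = {mean_delta:+.2f}, CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```

The resampled mean can only take the values 0.00, 0.08, and 0.16, so the interval spans the two per-seed deltas. It is better read as the range across runs than as a tight statistical bound.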

Cloth Retrieval Proxy

Benchmark:

  • ManiSkill public bridge cloth retrieval proxy

Completed held-out seeds

  • seed17
    • trunk = 0.04
    • noop = 0.04
    • active = 0.10
    • intervention = 0.3369
    • non_base = 0.2674
  • seed23
    • trunk = 0.04
    • noop = 0.02
    • active = 0.02
    • intervention = 0.0
    • non_base = 0.0
  • seed29
    • trunk = 0.04
    • noop = 0.04
    • active = 0.04
    • intervention = 0.0
    • non_base = 0.0

3-seed aggregate:

  • trunk = 0.0400
  • noop = 0.0333
  • active = 0.0533
  • delta = +0.0133
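The aggregate is consistent with an unweighted mean of the three per-seed rates. A minimal sketch, assuming each seed contributes equally (that is, the same number of held-out episodes per seed):

```python
from statistics import mean

# Per-seed cloth results copied from the held-out list above.
cloth = {
    "seed17": {"trunk": 0.04, "noop": 0.04, "active": 0.10},
    "seed23": {"trunk": 0.04, "noop": 0.02, "active": 0.02},
    "seed29": {"trunk": 0.04, "noop": 0.04, "active": 0.04},
}

# Unweighted mean across seeds for each arm.
aggregate = {
    arm: round(mean(run[arm] for run in cloth.values()), 4)
    for arm in ("trunk", "noop", "active")
}
aggregate["delta"] = round(aggregate["active"] - aggregate["trunk"], 4)

for key, value in aggregate.items():
    print(f"{key} = {value:.4f}")
```

The positive delta comes entirely from seed17. Seeds 23 and 29 show no intervention, and active does not exceed trunk on either.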

Seed-23 cloth validation sweep

  • reports/maniskill_cloth_bridge_val_sweep_seed23/summary.json

Configs:

  • default
    • trunk = 0.25
    • noop = 0.125
    • active = 0.125
    • intervention = 0.0
  • low_thresh
    • active = 0.125
    • intervention = 0.2
    • non_base = 0.0667
  • low_thresh_less_bonus
    • active = 0.125
    • intervention = 0.2
    • non_base = 0.0667
  • very_low_thresh_less_bonus
    • active = 0.125
    • intervention = 1.0
    • non_base = 0.5333

Interpretation:

  • seed23 cloth was not recoverable by eval-side planner tuning alone: active stayed at 0.125 in every config, even when the intervention rate rose to 1.0
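The sweep numbers can be checked against that interpretation directly. In this sketch the trunk value is listed only for the default config, so applying it to every config is an assumption (planner tuning should affect only the active arm).

```python
# Active success and intervention rate per config, from the sweep above.
sweep = {
    "default": {"active": 0.125, "intervention": 0.0},
    "low_thresh": {"active": 0.125, "intervention": 0.2},
    "low_thresh_less_bonus": {"active": 0.125, "intervention": 0.2},
    "very_low_thresh_less_bonus": {"active": 0.125, "intervention": 1.0},
}
TRUNK_VAL = 0.25  # reported for `default`; assumed identical across configs

active_values = {cfg["active"] for cfg in sweep.values()}
intervention_values = sorted({cfg["intervention"] for cfg in sweep.values()})
recovered = any(cfg["active"] >= TRUNK_VAL for cfg in sweep.values())

print("distinct active values:", active_values)
print("intervention rates:", intervention_values)
print("any config reaches trunk:", recovered)
```

Active success is flat while the intervention rate spans 0.0 to 1.0, so forcing the planner to intervene more often did not change outcomes on this validation split.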

Single-Seed Combined Proxy Suite

  • reports/public_proxy_suite_smoke_v1/combined_summary.json

Single-seed summary:

  • occlusion proxy: +0.58
  • bag proxy: +0.16
  • cloth proxy: +0.06
  • macro delta: +0.267
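The macro delta matches an unweighted mean of the three per-proxy deltas, as this short check shows (assuming equal weight per proxy):

```python
# Single-seed deltas copied from the summary above.
single_seed_deltas = {"occlusion": 0.58, "bag": 0.16, "cloth": 0.06}

# Unweighted (macro) average across the three proxies.
macro_delta = round(sum(single_seed_deltas.values()) / len(single_seed_deltas), 3)
print(f"macro delta: {macro_delta:+.3f}")
```

Because the average is unweighted, the occlusion proxy accounts for most of the macro figure.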

This combined single-seed picture is kept for historical reference. The multi-seed results above give the more reliable current read:

  • occlusion: strong
  • bag: modestly positive across corrected 2-seed evaluation
  • cloth: weak/inconclusive across 3 seeds