ManiSkill PickClutter Correction Log (2026-04-01)

Scope

Public benchmark:

  • ManiSkill 3 PickClutterYCB-v1

Frozen public split reused across all runs:

  • train demos: 32 episodes
  • val demos: 8 episodes
  • eval episodes: 50
  • seed: 17
  • data bundle: /workspace/workspace/data/maniskill_pickclutter/smoke_v3

Fair comparison modes:

  • trunk_only_ft
  • adapter_noop
  • adapter_active_ft

Code Changes

Runner changes:

  • enabled candidate rollout supervision from real ManiSkill states
  • enabled adapter transition-model training/eval
  • unfroze adapter.transition_model
  • set non-zero transition loss weight
  • added ManiSkill smoke planner overrides for the occlusion proxy:
    • adapter_confidence_threshold=0.50
    • retrieve_access_threshold=0.08
    • retrieve_persistence_threshold=0.12
    • retrieve_support_threshold=0.08
    • retrieve_reocclusion_threshold=0.92

Planner correction:

  • changed adapter stage rules from hard vetoes to soft penalties in /workspace/workspace/VLAarchtests3_export/code/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/planner.py
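The difference between a hard veto and a soft penalty can be sketched as follows. This is a minimal illustration and not the code in planner.py: the function names, the candidate tuples, the scores, and the penalty value of 0.5 are all hypothetical.

```python
import math

def score_hard_veto(base_score: float, rule_violated: bool) -> float:
    """Hard veto: a violated stage rule removes the candidate outright."""
    return -math.inf if rule_violated else base_score

def score_soft_penalty(base_score: float, rule_violated: bool,
                       penalty: float = 0.5) -> float:
    """Soft penalty: a violated stage rule lowers the score but keeps
    the candidate selectable. The penalty value is a placeholder."""
    return base_score - penalty if rule_violated else base_score

def select(candidates, scorer):
    """Return the name of the best candidate under the given scorer.
    Each candidate is (name, base_score, rule_violated)."""
    return max(candidates, key=lambda c: scorer(c[1], c[2]))[0]

# Hypothetical candidates: the adapter proposal scores higher than the
# base action but trips one stage rule.
candidates = [
    ("base_action", 0.40, False),
    ("retrieve", 1.20, True),
]

hard_choice = select(candidates, score_hard_veto)
soft_choice = select(candidates, score_soft_penalty)
print(hard_choice, soft_choice)
```

Under the hard veto the non-base candidate can never win once any rule fires, which is consistent with the non_base_selection_rate=0.0 reported for the first run below. Under the soft penalty it wins whenever its margin exceeds the penalty.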

Runs

1. smoke_v3 corrected-train baseline

Artifacts:

  • summary: /workspace/workspace/reports/maniskill_pickclutter_smoke_v3/public_benchmark_package_summary.json

Result:

  • trunk_only_ft=0.06
  • adapter_noop=0.06
  • adapter_active_ft=0.06
  • intervention_rate=0.0
  • non_base_selection_rate=0.0

Interpretation:

  • rollout supervision and transition-model training alone were not enough
  • the adapter remained inert: it never intervened (intervention_rate=0.0) and all three modes tied at 0.06

2. smoke_v4_evalprobe_fromv3 corrected-planner eval on smoke_v3 weights

Artifacts:

  • summary: /workspace/workspace/reports/maniskill_pickclutter_smoke_v4_evalprobe_fromv3/public_benchmark_package_summary.json

Result:

  • trunk_only_ft=0.06
  • adapter_noop=0.06
  • adapter_active_ft=0.62
  • delta_active_vs_trunk=+0.56
  • 95% CI=[+0.40, +0.70]
  • intervention_rate=1.0
  • non_base_selection_rate=1.0

Interpretation:

  • this is the first real adapter-specific sign of life on the public benchmark
  • the gain comes from the corrected planner logic: the weights are the unchanged smoke_v3 weights, so only the planner rules differ from run 1
  • the improvement is not coming from the shared trunk, because adapter_noop stayed at 0.06
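The log does not say how the 95% CI on delta_active_vs_trunk was computed. One common choice is a paired bootstrap over the 50 eval episodes, sketched below under that assumption. The per-episode outcome lists are placeholders that match only the aggregate rates (0.62 and 0.06); the real pairing of successes across episodes is not given in the log, and the function name and resample count are hypothetical.

```python
import random

def paired_bootstrap_delta_ci(active, trunk, n_resamples=2000,
                              alpha=0.05, seed=17):
    """Percentile bootstrap CI for mean(active) - mean(trunk).

    `active` and `trunk` are per-episode 0/1 outcomes on the same
    episodes, so each resample draws episode indices once and applies
    them to both lists.
    """
    if len(active) != len(trunk):
        raise ValueError("outcome lists must cover the same episodes")
    n = len(active)
    rng = random.Random(seed)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(active[i] - trunk[i] for i in idx) / n)
    deltas.sort()
    lo = deltas[round((alpha / 2) * n_resamples)]
    hi = deltas[round((1 - alpha / 2) * n_resamples) - 1]
    point = (sum(active) - sum(trunk)) / n
    return point, lo, hi

# Placeholder outcomes: 31/50 successes for active, 3/50 for trunk.
active_outcomes = [1] * 31 + [0] * 19
trunk_outcomes = [1] * 3 + [0] * 47

point, lo, hi = paired_bootstrap_delta_ci(active_outcomes, trunk_outcomes)
print(f"delta={point:+.2f}, 95% CI=[{lo:+.2f}, {hi:+.2f}]")
```

The interval depends on how successes line up across episodes, so this sketch should be rerun on the real per-episode outcomes rather than read as a reproduction of the reported bounds.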

3. smoke_v4 clean retrain with corrected planner active during train and eval

Artifacts:

  • summary: /workspace/workspace/reports/maniskill_pickclutter_smoke_v4/public_benchmark_package_summary.json

Result:

  • trunk_only_ft=0.48
  • adapter_noop=0.04
  • adapter_active_ft=0.04
  • intervention_rate=1.0
  • non_base_selection_rate=1.0
  • delta_active_vs_trunk=-0.44

Interpretation:

  • the clean retrain under corrected planner logic regressed: adapter_active_ft fell to 0.04 while trunk_only_ft reached 0.48
  • the adapter-trained checkpoint collapsed (adapter_noop=0.04 against trunk_only_ft=0.48) even though active mode intervened
  • current evidence supports the corrected planner as a real eval-time model fix, but not yet as a stable retrain recipe

4. smoke_v5 fair retrain with trunk-action supervision preserved inside adapter training

Artifacts:

  • summary: /workspace/workspace/reports/maniskill_pickclutter_smoke_v5/public_benchmark_package_summary.json

Result:

  • trunk_only_ft=0.04
  • adapter_noop=0.04
  • adapter_active_ft=0.04
  • intervention_rate=1.0
  • non_base_selection_rate=1.0
  • delta_active_vs_trunk=0.00

Interpretation:

  • this fixed the fairness problem from smoke_v4: the adapter-trained checkpoint no longer hid a stronger trunk, because adapter_noop matched trunk_only_ft
  • but the active branch still failed because the planner collapsed to maintain_gap on every decision
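A collapse of this kind is easy to miss when intervention_rate and non_base_selection_rate both read 1.0, so a histogram of selected planner modes is a useful companion diagnostic. The sketch below assumes the selected mode per decision is available as a list of strings; the function name, the 0.95 threshold, and the example decision lists are hypothetical.

```python
from collections import Counter

def mode_collapse_report(selected_modes, threshold=0.95):
    """Summarize how concentrated the planner's mode choices are.

    Flags collapse when one mode accounts for at least `threshold`
    of all decisions.
    """
    if not selected_modes:
        raise ValueError("no planner decisions recorded")
    counts = Counter(selected_modes)
    mode, n = counts.most_common(1)[0]
    share = n / len(selected_modes)
    return {"dominant_mode": mode, "share": share,
            "collapsed": share >= threshold}

# Hypothetical decision logs.
collapsed_log = ["maintain_gap"] * 20
mixed_log = ["maintain_gap", "retrieve", "insert", "retrieve"]

collapsed_report = mode_collapse_report(collapsed_log)
mixed_report = mode_collapse_report(mixed_log)
print(collapsed_report)
print(mixed_report)
```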

5. smoke_v5_val_sweep and held-out smoke_v5_eval_tuned_softerpref

Artifacts:

  • val sweep: /workspace/workspace/reports/maniskill_pickclutter_smoke_v5_val_sweep/summary.json
  • held-out summary: /workspace/workspace/reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json

Val-selected planner override:

  • mode_preference_bonus=0.75
  • premature_retrieve_penalty=0.5
  • premature_insert_penalty=0.25
  • premature_maintain_penalty=1.0
  • occlusion_maintain_gap_min_access=0.30
  • occlusion_maintain_gap_min_visibility=0.20
  • retrieve_stage_access_threshold=0.18
  • retrieve_stage_reveal_threshold=0.18
  • retrieve_stage_support_threshold=0.18

Validation result:

  • baseline_corrected=0.00
  • soft_pref=0.00
  • softer_pref=0.625
  • retrieve_open=0.625
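Two points about these numbers are worth making explicit. Assuming the sweep was scored on the 8-episode validation split, 0.625 corresponds to 5 successes out of 8. Also, softer_pref and retrieve_open tie, and the log does not record the rule that picked softer_pref for the held-out run. The sketch below only recovers the counts and lists the tied candidates; the variable names are illustrative.

```python
# Validation success rates copied from the log.
val_success = {
    "baseline_corrected": 0.00,
    "soft_pref": 0.00,
    "softer_pref": 0.625,
    "retrieve_open": 0.625,
}

# Assumption: rates were computed over the 8 validation episodes.
N_VAL = 8
success_counts = {name: round(rate * N_VAL)
                  for name, rate in val_success.items()}

# Find every configuration that reaches the best validation rate.
best_rate = max(val_success.values())
tied_best = [name for name, rate in val_success.items()
             if rate == best_rate]

print(success_counts)
print(tied_best)
```

With only 8 episodes, one episode moves the rate by 0.125, so validation differences smaller than that cannot separate configurations.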

Held-out result:

  • trunk_only_ft=0.04
  • adapter_noop=0.04
  • adapter_active_ft=0.62
  • delta_active_vs_trunk=+0.58
  • 95% CI=[+0.44, +0.72]
  • intervention_rate=1.0
  • non_base_selection_rate=1.0
  • steps_to_retrieve=1.0
  • signs_of_life=true

Interpretation:

  • this is a fair held-out public-benchmark win on the dense-occlusion proxy
  • the gain is adapter-specific because adapter_noop stayed flat with the trunk baseline
  • the fixed checkpoint from smoke_v5 was viable; the missing piece was planner-stage calibration on the frozen validation split

Current Best Public-Benchmark Evidence

Best adapter-specific evidence currently available:

  • /workspace/workspace/reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json

Why this is the strongest result:

  • same frozen public train/val/eval split
  • same trained trunk baseline and adapter checkpoint
  • planner override selected on the frozen validation split before the held-out eval run
  • adapter_noop isolates the shared-trunk effect and stays flat
  • only adapter_active_ft improves, so the gain is caused by live adapter intervention
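The last two criteria can be written as a small check over the three comparison modes. This is a sketch of the reasoning used in this log, not a function from the codebase: the function name and the tolerance of 0.02 (one episode out of 50) are assumptions.

```python
def is_adapter_specific_gain(trunk, noop, active, tol=0.02):
    """True when adapter_noop matches trunk_only_ft within `tol` and
    adapter_active_ft beats trunk_only_ft by more than `tol`."""
    trunk_is_shared = abs(noop - trunk) <= tol
    active_improves = (active - trunk) > tol
    return trunk_is_shared and active_improves

# (trunk_only_ft, adapter_noop, adapter_active_ft) per run, from the log.
runs = {
    "smoke_v3": (0.06, 0.06, 0.06),
    "smoke_v4_evalprobe_fromv3": (0.06, 0.06, 0.62),
    "smoke_v4": (0.48, 0.04, 0.04),
    "smoke_v5": (0.04, 0.04, 0.04),
    "smoke_v5_eval_tuned_softerpref": (0.04, 0.04, 0.62),
}

verdicts = {name: is_adapter_specific_gain(*rates)
            for name, rates in runs.items()}
for name, ok in verdicts.items():
    print(f"{name}: {ok}")
```

Only the two runs with adapter_active_ft=0.62 pass. smoke_v4 fails the first condition because adapter_noop and trunk_only_ft diverge, which is the fairness problem noted for that run.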

Open Problem

The dense-occlusion proxy now has a fair held-out win, but the bag-style and cloth-style public proxy tracks are still missing. The next work item is to bring up the next public proxy benchmark rather than run more occlusion-only sweeps.