ManiSkill PickClutter Correction Log (2026-04-01)
Scope
Public benchmark:
- ManiSkill 3 PickClutterYCB-v1
Frozen public split reused across all runs:
- train demos: 32 episodes
- val demos: 8 episodes
- eval episodes: 50
- seed: 17
- data bundle: /workspace/workspace/data/maniskill_pickclutter/smoke_v3
Fair comparison modes:
- trunk_only_ft
- adapter_noop
- adapter_active_ft
Code Changes
Runner changes:
- enabled candidate rollout supervision from real ManiSkill states
- enabled adapter transition-model training/eval
- unfroze adapter.transition_model
- set non-zero transition loss weight
- added ManiSkill smoke planner overrides for the occlusion proxy:
  - adapter_confidence_threshold=0.50
  - retrieve_access_threshold=0.08
  - retrieve_persistence_threshold=0.12
  - retrieve_support_threshold=0.08
  - retrieve_reocclusion_threshold=0.92
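The overrides listed above can be read as a flat key/value mapping applied on top of a base planner config. The sketch below is illustrative only: the helper name apply_overrides and the placeholder base values are hypothetical and are not taken from the runner code. Only the five keys and their values come from this log.

```python
from typing import Dict

# Override values exactly as listed in this log for the occlusion proxy.
SMOKE_PLANNER_OVERRIDES: Dict[str, float] = {
    "adapter_confidence_threshold": 0.50,
    "retrieve_access_threshold": 0.08,
    "retrieve_persistence_threshold": 0.12,
    "retrieve_support_threshold": 0.08,
    "retrieve_reocclusion_threshold": 0.92,
}


def apply_overrides(base: Dict[str, float], overrides: Dict[str, float]) -> Dict[str, float]:
    """Return a copy of `base` with `overrides` applied.

    Hypothetical helper: unknown keys raise, so a typo cannot silently
    leave a threshold at its default value.
    """
    unknown = set(overrides) - set(base)
    if unknown:
        raise KeyError(f"unknown planner keys: {sorted(unknown)}")
    merged = dict(base)
    merged.update(overrides)
    return merged


# Placeholder defaults for illustration, NOT the real planner defaults.
base_config = {key: 0.0 for key in SMOKE_PLANNER_OVERRIDES}
config = apply_overrides(base_config, SMOKE_PLANNER_OVERRIDES)
```

Rejecting unknown keys is a design choice of the sketch, not a documented behavior of the runner.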
Planner correction:
- changed adapter stage rules from hard vetoes to soft penalties in /workspace/workspace/VLAarchtests3_export/code/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/planner.py
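To illustrate the difference between a hard veto and a soft penalty, here is a minimal sketch. It does not reproduce planner.py: the candidate names, scores, violation counts, and penalty weight are all hypothetical. It only shows why a veto can leave the planner stuck on the base action while a penalty still lets a strong non-base candidate win.

```python
import math
from typing import Dict

# Hypothetical candidate scores and stage-rule violations (not from planner.py).
scores: Dict[str, float] = {"base_action": 0.20, "retrieve": 0.90, "maintain_gap": 0.10}
violations: Dict[str, int] = {"retrieve": 1}  # number of stage rules each candidate breaks


def select_hard_veto(scores: Dict[str, float], violations: Dict[str, int]) -> str:
    """Hard veto: any violated stage rule removes the candidate entirely."""
    masked = {
        name: (-math.inf if violations.get(name, 0) > 0 else score)
        for name, score in scores.items()
    }
    return max(masked, key=masked.get)


def select_soft_penalty(
    scores: Dict[str, float], violations: Dict[str, int], penalty: float = 0.5
) -> str:
    """Soft penalty: each violated rule subtracts a fixed cost from the score."""
    adjusted = {
        name: score - penalty * violations.get(name, 0)
        for name, score in scores.items()
    }
    return max(adjusted, key=adjusted.get)


hard_choice = select_hard_veto(scores, violations)     # "base_action"
soft_choice = select_soft_penalty(scores, violations)  # "retrieve"
```

Under the veto, the retrieve candidate is excluded and the base action is selected. Under the penalty, retrieve drops from 0.90 to 0.40 but still beats the base action at 0.20.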
Runs
1. smoke_v3 corrected-train baseline
Artifacts:
- summary: /workspace/workspace/reports/maniskill_pickclutter_smoke_v3/public_benchmark_package_summary.json
Result:
- trunk_only_ft=0.06
- adapter_noop=0.06
- adapter_active_ft=0.06
- intervention_rate=0.0
- non_base_selection_rate=0.0
Interpretation:
- rollout supervision and transition-model training alone were not enough
- the adapter remained inert
2. smoke_v4_evalprobe_fromv3 corrected-planner eval on smoke_v3 weights
Artifacts:
- summary: /workspace/workspace/reports/maniskill_pickclutter_smoke_v4_evalprobe_fromv3/public_benchmark_package_summary.json
Result:
- trunk_only_ft=0.06
- adapter_noop=0.06
- adapter_active_ft=0.62
- delta_active_vs_trunk=+0.56, 95% CI=[+0.40, +0.70]
- intervention_rate=1.0
- non_base_selection_rate=1.0
Interpretation:
- this is the first real adapter-specific sign of life on the public benchmark
- the corrected planner logic is doing the work
- the improvement is not coming from the shared trunk, because adapter_noop stayed at 0.06
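The reported delta is the difference in success rate over the 50 eval episodes (0.62 - 0.06 = +0.56). This log does not state how the 95% CI was computed. One common choice is a paired percentile bootstrap over episodes, sketched below under that assumption. The per-episode outcomes are synthetic: they reproduce only the reported rates (31/50 and 3/50), and the episode pairing is made up, so the interval this sketch produces is not expected to match the reported one exactly.

```python
import random
from typing import List, Tuple


def paired_bootstrap_delta(
    active: List[int], trunk: List[int], n_resamples: int = 2000, seed: int = 0
) -> Tuple[float, float, float]:
    """Return (delta, ci_low, ci_high) for the paired success-rate difference."""
    if len(active) != len(trunk):
        raise ValueError("both modes must be evaluated on the same episodes")
    n = len(active)
    rng = random.Random(seed)  # arbitrary seed, unrelated to the benchmark seed
    delta = (sum(active) - sum(trunk)) / n

    resampled = []
    for _ in range(n_resamples):
        # Resample episode indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        resampled.append(sum(active[i] - trunk[i] for i in idx) / n)
    resampled.sort()

    low = resampled[int(0.025 * n_resamples)]
    high = resampled[int(0.975 * n_resamples) - 1]
    return delta, low, high


# Synthetic outcomes matching only the reported rates; ordering is an assumption.
active = [1] * 31 + [0] * 19  # adapter_active_ft: 31/50 = 0.62
trunk = [1] * 3 + [0] * 47    # trunk_only_ft:      3/50 = 0.06
delta, lo, hi = paired_bootstrap_delta(active, trunk)
```

Resampling episode indices rather than each mode independently keeps the comparison paired, which matters when both modes are run on the same frozen eval episodes.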
3. smoke_v4 clean retrain with corrected planner active during train and eval
Artifacts:
- summary: /workspace/workspace/reports/maniskill_pickclutter_smoke_v4/public_benchmark_package_summary.json
Result:
- trunk_only_ft=0.48
- adapter_noop=0.04
- adapter_active_ft=0.04
- intervention_rate=1.0
- non_base_selection_rate=1.0
- delta_active_vs_trunk=-0.44
Interpretation:
- the clean retrain under corrected planner logic is unstable / regressive
- the adapter-trained checkpoint collapsed even though active mode intervened
- current evidence supports the corrected planner as a real eval-time model fix, but not yet as a stable retrain recipe
4. smoke_v5 fair retrain with trunk-action supervision preserved inside adapter training
Artifacts:
- summary: /workspace/workspace/reports/maniskill_pickclutter_smoke_v5/public_benchmark_package_summary.json
Result:
- trunk_only_ft=0.04
- adapter_noop=0.04
- adapter_active_ft=0.04
- intervention_rate=1.0
- non_base_selection_rate=1.0
- delta_active_vs_trunk=0.00
Interpretation:
- this fixed the fairness problem from smoke_v4: the adapter-trained checkpoint no longer hid a stronger trunk, because adapter_noop matched trunk_only_ft
- but the active branch still failed because the planner collapsed to maintain_gap on every decision
5. smoke_v5_val_sweep and held-out smoke_v5_eval_tuned_softerpref
Artifacts:
- val sweep: /workspace/workspace/reports/maniskill_pickclutter_smoke_v5_val_sweep/summary.json
- held-out summary: /workspace/workspace/reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json
Val-selected planner override:
- mode_preference_bonus=0.75
- premature_retrieve_penalty=0.5
- premature_insert_penalty=0.25
- premature_maintain_penalty=1.0
- occlusion_maintain_gap_min_access=0.30
- occlusion_maintain_gap_min_visibility=0.20
- retrieve_stage_access_threshold=0.18
- retrieve_stage_reveal_threshold=0.18
- retrieve_stage_support_threshold=0.18
Validation result:
- baseline_corrected=0.00
- soft_pref=0.00
- softer_pref=0.625
- retrieve_open=0.625
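Selecting the override from the validation scores above amounts to an argmax over the sweep. The sketch below assumes the sweep is available as a name-to-success-rate mapping; the function name is hypothetical. Because softer_pref and retrieve_open tie at 0.625 (5 of 8 val episodes) and this log does not say how the tie was broken, the sketch assumes the first-listed candidate wins, which agrees with the name of the held-out run.

```python
from typing import Dict

# Validation success rates as reported in this log, in listed order.
val_success: Dict[str, float] = {
    "baseline_corrected": 0.00,
    "soft_pref": 0.00,
    "softer_pref": 0.625,
    "retrieve_open": 0.625,
}

N_VAL_EPISODES = 8  # size of the frozen validation split


def select_override(val_scores: Dict[str, float]) -> str:
    """Pick the candidate with the highest validation success rate.

    Assumption: ties go to the first-listed candidate. max() returns the
    first maximal key, and dicts preserve insertion order.
    """
    return max(val_scores, key=val_scores.get)


selected = select_override(val_success)
# Success count implied by the reported rate on the 8-episode split.
successes = round(val_success[selected] * N_VAL_EPISODES)
```

With only 8 validation episodes, each episode moves the rate by 0.125, so the selection is coarse and ties like this one are expected.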
Held-out result:
- trunk_only_ft=0.04
- adapter_noop=0.04
- adapter_active_ft=0.62
- delta_active_vs_trunk=+0.58, 95% CI=[+0.44, +0.72]
- intervention_rate=1.0
- non_base_selection_rate=1.0
- steps_to_retrieve=1.0
- signs_of_life=true
Interpretation:
- this is a fair held-out public-benchmark win on the dense-occlusion proxy
- the gain is adapter-specific because adapter_noop stayed flat with the trunk baseline
- the fixed checkpoint from smoke_v5 was viable; the missing piece was planner-stage calibration on the frozen validation split
Current Best Public-Benchmark Evidence
Best adapter-specific evidence currently available:
/workspace/workspace/reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json
Why this is the strongest result:
- same frozen public train/val/eval split
- same trained trunk baseline and adapter checkpoint
- planner override selected on the frozen validation split before the held-out eval run
- adapter_noop isolates the shared-trunk effect and stays flat
- only adapter_active_ft improves, so the gain is caused by live adapter intervention
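The attribution argument above can be written as a simple check over the three comparison modes: the gain counts as adapter-specific only if adapter_noop stays close to trunk_only_ft while adapter_active_ft clearly exceeds it. The sketch below applies that check to the five results in this log. The tolerance and minimum-gain thresholds are hypothetical values chosen for illustration and are not part of the benchmark package.

```python
from typing import Dict, List, NamedTuple


class RunResult(NamedTuple):
    name: str
    trunk_only_ft: float
    adapter_noop: float
    adapter_active_ft: float


def is_adapter_specific_gain(r: RunResult, tol: float = 0.02, min_gain: float = 0.10) -> bool:
    """True when noop matches the trunk baseline and active mode beats it.

    `tol` and `min_gain` are illustrative thresholds, not benchmark settings.
    """
    noop_flat = abs(r.adapter_noop - r.trunk_only_ft) <= tol
    active_gain = (r.adapter_active_ft - r.trunk_only_ft) >= min_gain
    return noop_flat and active_gain


# Success rates copied from the Result sections of this log.
RUNS: List[RunResult] = [
    RunResult("smoke_v3", 0.06, 0.06, 0.06),
    RunResult("smoke_v4_evalprobe_fromv3", 0.06, 0.06, 0.62),
    RunResult("smoke_v4", 0.48, 0.04, 0.04),
    RunResult("smoke_v5", 0.04, 0.04, 0.04),
    RunResult("smoke_v5_eval_tuned_softerpref", 0.04, 0.04, 0.62),
]

flags: Dict[str, bool] = {r.name: is_adapter_specific_gain(r) for r in RUNS}
```

Only smoke_v4_evalprobe_fromv3 and smoke_v5_eval_tuned_softerpref pass the check. smoke_v4 fails on the noop condition, which is the fairness problem noted for that run.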
Open Problem
The dense-occlusion proxy now has a fair held-out win, but bag-style and cloth-style public proxy tracks are still missing. The next work item is to bring up the next public proxy benchmark instead of re-running more occlusion-only sweeps.