ManiSkill PickClutter Correction Log (2026-04-01)

Scope

Public benchmark:

ManiSkill 3 PickClutterYCB-v1

Frozen public split reused across all runs:

train demos: 32 episodes
val demos: 8 episodes
eval episodes: 50
seed: 17
data bundle: /workspace/workspace/data/maniskill_pickclutter/smoke_v3

Fair comparison modes:

trunk_only_ft
adapter_noop
adapter_active_ft

Code Changes

Runner changes:

enabled candidate rollout supervision from real ManiSkill states
enabled adapter transition-model training/eval
unfroze adapter.transition_model
set non-zero transition loss weight
added ManiSkill smoke planner overrides for the occlusion proxy:
- adapter_confidence_threshold=0.50
- retrieve_access_threshold=0.08
- retrieve_persistence_threshold=0.12
- retrieve_support_threshold=0.08
- retrieve_reocclusion_threshold=0.92
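
The overrides above can be expressed as a plain mapping merged over planner defaults. This is a minimal sketch: only the five key names and their values come from this log, while `apply_overrides` and the default values are hypothetical and not taken from the runner or planner.py.

```python
from typing import Dict

# Override values copied from this log; everything else below is illustrative.
MANISKILL_SMOKE_OVERRIDES: Dict[str, float] = {
    "adapter_confidence_threshold": 0.50,
    "retrieve_access_threshold": 0.08,
    "retrieve_persistence_threshold": 0.12,
    "retrieve_support_threshold": 0.08,
    "retrieve_reocclusion_threshold": 0.92,
}


def apply_overrides(defaults: Dict[str, float], overrides: Dict[str, float]) -> Dict[str, float]:
    """Return a new config with overrides applied; reject keys the defaults do not define."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown planner keys: {sorted(unknown)}")
    merged = dict(defaults)
    merged.update(overrides)
    return merged


# Hypothetical defaults (NOT the real planner defaults), used only to exercise the merge.
hypothetical_defaults = {key: 0.5 for key in MANISKILL_SMOKE_OVERRIDES}
planner_config = apply_overrides(hypothetical_defaults, MANISKILL_SMOKE_OVERRIDES)
```

Rejecting unknown keys is a design choice in this sketch: a misspelled threshold name fails loudly instead of silently leaving the default in place.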

Planner correction:

changed adapter stage rules from hard vetoes to soft penalties in /workspace/workspace/VLAarchtests3_export/code/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/planner.py
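
The difference between a hard veto and a soft penalty can be sketched as follows. This is an illustration of the general idea only, not the code in planner.py: the function names, the candidate names other than `maintain_gap`, the scores, and the penalty sizes are all hypothetical.

```python
from typing import Dict, Set


def select_with_hard_veto(scores: Dict[str, float], violating: Set[str], base: str) -> str:
    """Hard veto: a candidate that violates a stage rule is removed outright."""
    allowed = {name: score for name, score in scores.items() if name not in violating}
    if not allowed:
        return base  # every candidate vetoed, fall back to the base action
    return max(allowed, key=allowed.get)


def select_with_soft_penalty(scores: Dict[str, float], penalties: Dict[str, float]) -> str:
    """Soft penalty: a rule violation lowers the score but the candidate can still win."""
    adjusted = {name: score - penalties.get(name, 0.0) for name, score in scores.items()}
    return max(adjusted, key=adjusted.get)


# Illustrative numbers only.
scores = {"base": 0.2, "retrieve": 0.9, "maintain_gap": 0.3}
hard_choice = select_with_hard_veto(scores, violating={"retrieve", "maintain_gap"}, base="base")
soft_choice = select_with_soft_penalty(scores, penalties={"retrieve": 0.5, "maintain_gap": 0.25})
```

With these numbers the hard veto always returns `base`, while the soft penalty still lets a strongly scored `retrieve` through, which is one way a vetoing planner can leave the adapter inert.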

Runs

1. `smoke_v3` corrected-train baseline

Artifacts:

summary: /workspace/workspace/reports/maniskill_pickclutter_smoke_v3/public_benchmark_package_summary.json

Result:

trunk_only_ft=0.06
adapter_noop=0.06
adapter_active_ft=0.06
intervention_rate=0.0
non_base_selection_rate=0.0

Interpretation:

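rollout supervision and transition-model training alone were not enough to change behavior
the adapter remained inert: it never intervened (intervention_rate=0.0), so all three modes scored the same 0.06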
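
One way to read the two rate metrics is as fractions over logged planner decisions. The sketch below assumes that reading; the log does not define the metrics, so the `Decision` record, its field names, and the convention that candidate index 0 is the trunk's base action are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Decision:
    adapter_intervened: bool  # hypothetical: the adapter changed the executed action
    selected_candidate: int   # hypothetical: index 0 is the trunk's base candidate


def intervention_rate(decisions: Sequence[Decision]) -> float:
    """Fraction of decisions in which the adapter intervened (0.0 for an empty log)."""
    if not decisions:
        return 0.0
    return sum(d.adapter_intervened for d in decisions) / len(decisions)


def non_base_selection_rate(decisions: Sequence[Decision]) -> float:
    """Fraction of decisions in which a non-base candidate was selected."""
    if not decisions:
        return 0.0
    return sum(d.selected_candidate != 0 for d in decisions) / len(decisions)


# An inert adapter: it never intervenes and always keeps the base candidate.
inert_log = [Decision(adapter_intervened=False, selected_candidate=0) for _ in range(10)]
```

Under this assumed definition, an inert log yields 0.0 for both rates, matching the pattern reported for this run.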

2. `smoke_v4_evalprobe_fromv3` corrected-planner eval on `smoke_v3` weights

Artifacts:

summary: /workspace/workspace/reports/maniskill_pickclutter_smoke_v4_evalprobe_fromv3/public_benchmark_package_summary.json

Result:

trunk_only_ft=0.06
adapter_noop=0.06
adapter_active_ft=0.62
delta_active_vs_trunk=+0.56
95% CI=[+0.40, +0.70]
intervention_rate=1.0
non_base_selection_rate=1.0

Interpretation:

this is the first adapter-specific sign of life on the public benchmark
the gain comes from the corrected planner logic alone, because the weights are unchanged from smoke_v3
the improvement does not come from the shared trunk, because adapter_noop stayed at 0.06
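
The delta and its interval can be illustrated with a paired bootstrap over the 50 eval episodes. This is a sketch under stated assumptions: the log does not say how its 95% CI was computed, the per-episode pairing below is invented (only the success counts 3/50 and 31/50 follow from the reported rates), and the resample count and RNG seed are arbitrary choices.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    baseline: List[int],
    treated: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 17,
) -> Tuple[float, float, float]:
    """Return (delta, lower, upper) for the mean per-episode difference treated - baseline."""
    if len(baseline) != len(treated):
        raise ValueError("paired outcomes must have equal length")
    rng = random.Random(seed)
    diffs = [t - b for b, t in zip(baseline, treated)]
    n = len(diffs)
    delta = sum(diffs) / n
    # Resample episodes with replacement and collect the resampled mean differences.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return delta, lower, upper


# Hypothetical per-episode outcomes consistent with 0.06 and 0.62 over 50 episodes.
trunk_outcomes = [1] * 3 + [0] * 47
active_outcomes = [1] * 31 + [0] * 19
delta, ci_lower, ci_upper = paired_bootstrap_ci(trunk_outcomes, active_outcomes)
```

The interval produced here depends on the assumed pairing and should not be expected to reproduce the reported bounds exactly.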

3. `smoke_v4` clean retrain with corrected planner active during train and eval

Artifacts:

summary: /workspace/workspace/reports/maniskill_pickclutter_smoke_v4/public_benchmark_package_summary.json

Result:

trunk_only_ft=0.48
adapter_noop=0.04
adapter_active_ft=0.04
intervention_rate=1.0
non_base_selection_rate=1.0
delta_active_vs_trunk=-0.44

Interpretation:

the clean retrain under the corrected planner logic regressed: adapter_noop and adapter_active_ft both scored 0.04 while trunk_only_ft reached 0.48
the adapter-trained checkpoint collapsed even though active mode intervened (intervention_rate=1.0)
current evidence supports the corrected planner as an eval-time fix, but not yet as a stable retrain recipe

4. `smoke_v5` fair retrain with trunk-action supervision preserved inside adapter training

Artifacts:

summary: /workspace/workspace/reports/maniskill_pickclutter_smoke_v5/public_benchmark_package_summary.json

Result:

trunk_only_ft=0.04
adapter_noop=0.04
adapter_active_ft=0.04
intervention_rate=1.0
non_base_selection_rate=1.0
delta_active_vs_trunk=0.00

Interpretation:

this fixed the fairness problem from smoke_v4: adapter_noop matched trunk_only_ft (both 0.04), so the adapter-trained checkpoint no longer hid a stronger trunk
the active branch still failed, because the planner collapsed to maintain_gap on every decision
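
The fairness criterion used here (adapter_noop must match trunk_only_ft) can be written as a small check. A minimal sketch: the function name and the tolerance of 0.02 (one episode out of 50) are assumptions, while the result values are copied from the smoke_v4 and smoke_v5 runs in this log.

```python
from typing import Dict


def noop_matches_trunk(results: Dict[str, float], tolerance: float = 0.02) -> bool:
    """True when the no-op adapter mode reproduces the trunk-only success rate."""
    return abs(results["adapter_noop"] - results["trunk_only_ft"]) <= tolerance


smoke_v4 = {"trunk_only_ft": 0.48, "adapter_noop": 0.04, "adapter_active_ft": 0.04}
smoke_v5 = {"trunk_only_ft": 0.04, "adapter_noop": 0.04, "adapter_active_ft": 0.04}

v4_fair = noop_matches_trunk(smoke_v4)  # gap of 0.44, so the comparison is not fair
v5_fair = noop_matches_trunk(smoke_v5)  # identical rates, so the comparison is fair
```

Only when this check passes can a difference between adapter_active_ft and trunk_only_ft be attributed to the adapter rather than to a changed trunk.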

5. `smoke_v5_val_sweep` and held-out `smoke_v5_eval_tuned_softerpref`

Artifacts:

val sweep: /workspace/workspace/reports/maniskill_pickclutter_smoke_v5_val_sweep/summary.json
held-out summary: /workspace/workspace/reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json

Val-selected planner override:

mode_preference_bonus=0.75
premature_retrieve_penalty=0.5
premature_insert_penalty=0.25
premature_maintain_penalty=1.0
occlusion_maintain_gap_min_access=0.30
occlusion_maintain_gap_min_visibility=0.20
retrieve_stage_access_threshold=0.18
retrieve_stage_reveal_threshold=0.18
retrieve_stage_support_threshold=0.18

Validation result:

baseline_corrected=0.00
soft_pref=0.00
softer_pref=0.625
retrieve_open=0.625
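
Selecting the override from the validation sweep can be sketched as picking the highest validation success rate. The four rates are copied from this log; the function name and the tie rule (keep the first-listed variant) are assumptions, since the log does not say why softer_pref was preferred over the tied retrieve_open.

```python
from typing import Dict, List, Tuple

# Validation success rates from this log (8 val episodes, so 0.625 is 5 of 8).
val_success: Dict[str, float] = {
    "baseline_corrected": 0.00,
    "soft_pref": 0.00,
    "softer_pref": 0.625,
    "retrieve_open": 0.625,
}


def select_override(val_scores: Dict[str, float]) -> Tuple[str, List[str]]:
    """Return the chosen variant and every variant tied for the best score."""
    best = max(val_scores.values())
    tied = [name for name, score in val_scores.items() if score == best]
    # Assumed tie rule: keep the first-listed variant among those tied.
    return tied[0], tied


chosen, tied_variants = select_override(val_success)
```

Returning the tied variants alongside the choice keeps the tie visible, which matters with only 8 validation episodes.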

Held-out result:

trunk_only_ft=0.04
adapter_noop=0.04
adapter_active_ft=0.62
delta_active_vs_trunk=+0.58
95% CI=[+0.44, +0.72]
intervention_rate=1.0
non_base_selection_rate=1.0
steps_to_retrieve=1.0
signs_of_life=true

Interpretation:

this is a fair held-out public-benchmark win on the dense-occlusion proxy
the gain is adapter-specific because adapter_noop stayed flat with the trunk baseline
the fixed checkpoint from smoke_v5 was viable; the missing piece was planner-stage calibration on the frozen validation split

Current Best Public-Benchmark Evidence

Best adapter-specific evidence currently available:

/workspace/workspace/reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json

Why this is the strongest result:

same frozen public train/val/eval split
same trained trunk baseline and adapter checkpoint
planner override selected on the frozen validation split before the held-out eval run
adapter_noop isolates the shared-trunk effect and stays flat
only adapter_active_ft improves, so the gain is caused by live adapter intervention

Open Problem

The dense-occlusion proxy now has a fair held-out win, but bag-style and cloth-style public proxy tracks are still missing. The next work item is to bring up the next public proxy benchmark instead of running further occlusion-only sweeps.
