Elastic Occlusion Iteration Report

Date: 2026-03-31 UTC

Scope

This iteration focused on the trunk + adapter path in:

  • /workspace/workspace/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual

The goal was to check whether the adapter could show a light novelty signal, that is, a small but measurable gain over noop, on the proxy benchmark without breaking the no-op-safe trunk path.

What Was Fixed

1. Proposal-target alignment bug

The original fast adapter runs were trained against teacher shortlist labels, not against the adapter's own proposal set.

Observed failure:

  • candidate_utility in the fast proxy dataset always had oracle argmax at slot 0
  • adapter training therefore learned to prefer base_action

Fixes:

  • train/run_experiment.py
    • now rebuilds adapter datasets when proposal-aligned targets are missing
  • train/build_aligned_proposal_dataset.py
    • now supports adapter-wrapped models
  • tests/test_adapter_dataset_alignment.py
    • added regression tests for missing aligned targets

Result:

  • rebuilt aligned train dataset no longer collapses to slot 0
  • aligned oracle winners are non-base proposals across tasks
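
The slot-0 collapse described above can be checked on a dataset before any training run. The following is a minimal, framework-free sketch, not the repo's actual check: it assumes candidate_utility is available as one row of per-slot utilities per sample and that slot 0 holds base_action. The helper name slot0_argmax_fraction and the example values are hypothetical.

```python
def slot0_argmax_fraction(candidate_utility):
    """Return the fraction of samples whose best-utility slot is slot 0.

    candidate_utility: list of rows, one per sample; each row holds one
    utility per candidate slot. Slot 0 is assumed to be base_action.
    """
    if not candidate_utility:
        raise ValueError("candidate_utility is empty")
    slot0_wins = 0
    for row in candidate_utility:
        # Index of the highest utility; ties resolve to the lowest index.
        best_slot = max(range(len(row)), key=row.__getitem__)
        if best_slot == 0:
            slot0_wins += 1
    return slot0_wins / len(candidate_utility)


# Illustrative values only, not taken from the real dataset.
collapsed = [[0.9, 0.1, 0.2], [0.8, 0.3, 0.1]]
aligned = [[0.2, 0.7, 0.1], [0.1, 0.2, 0.9], [0.6, 0.5, 0.4]]

print(slot0_argmax_fraction(collapsed))  # 1.0: every oracle winner is slot 0
print(slot0_argmax_fraction(aligned))    # below 1.0: non-base winners exist
```

A fraction of exactly 1.0 on the training set is the signature of the bug: the labels can only teach the adapter to prefer base_action.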

2. Proposal-rollout alignment for transition training

The lightweight transition path originally had no aligned rollout supervision for the adapter's own proposal candidates.

Fixes:

  • train/build_aligned_proposal_dataset.py
    • now saves proposal_target_rollout_* tensors
  • sim_reveal/dataset.py
    • now loads proposal rollout targets
  • train/losses.py
    • transition loss now prefers proposal-aligned rollout targets when present
  • tests/test_transition_alignment_targets.py
    • verifies proposal rollout targets are selected over teacher candidate rollouts
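
The "prefer proposal-aligned targets when present" rule can be sketched as a small selection helper. This is an illustration under stated assumptions, not the code in train/losses.py: the proposal_target_rollout_ prefix comes from the report, while the candidate_rollout_ prefix for teacher rollouts, the visibility suffix, and the function names are hypothetical.

```python
PROPOSAL_PREFIX = "proposal_target_rollout_"  # prefix named in the report
TEACHER_PREFIX = "candidate_rollout_"  # hypothetical name for teacher rollouts


def _with_prefix(batch, prefix):
    """Collect non-missing entries under a key prefix, keyed by suffix."""
    return {
        key[len(prefix):]: value
        for key, value in batch.items()
        if key.startswith(prefix) and value is not None
    }


def select_rollout_targets(batch):
    """Return (source, targets), preferring proposal-aligned targets."""
    proposal = _with_prefix(batch, PROPOSAL_PREFIX)
    if proposal:
        return "proposal", proposal
    # Fall back to teacher candidate rollouts for older datasets.
    return "teacher", _with_prefix(batch, TEACHER_PREFIX)


# Illustrative batches with placeholder values.
aligned_batch = {
    "candidate_rollout_visibility": [0.1, 0.2],
    "proposal_target_rollout_visibility": [0.3, 0.4],
}
legacy_batch = {"candidate_rollout_visibility": [0.1, 0.2]}

print(select_rollout_targets(aligned_batch))
print(select_rollout_targets(legacy_batch))
```

Returning the source label alongside the targets makes it easy for a regression test to assert which supervision path was actually taken.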

3. Lightweight transition model bugs

While enabling rollout training, multiple contract bugs surfaced and were fixed:

  • bad clearance_field broadcast in models/world_model.py
  • bad hidden-state expansion across proposal candidates in models/world_model.py
  • unsafe .view() on non-contiguous proposal_mode_ids
  • rollout loss did not resize corridor / spatial rollout targets to lightweight field resolution

Tests added:

  • tests/test_lightweight_transition_contract.py
  • tests/test_transition_rollout_loss_resizing.py
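
The resizing fix in the last bullet of the bug list can be illustrated without any tensor library. The sketch below assumes nearest-neighbor resampling of a 2D field and a mean-squared-error loss. Both are assumptions, since the report does not say which interpolation or loss the repo uses; in a PyTorch codebase the resize would more typically go through torch.nn.functional.interpolate. The function names are hypothetical.

```python
def resize_nearest(field, out_h, out_w):
    """Resize a 2D grid (list of rows) with nearest-neighbor sampling."""
    in_h, in_w = len(field), len(field[0])
    return [
        [field[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def rollout_mse(pred, target):
    """MSE between a low-resolution prediction and a rollout target.

    The target is resized to the prediction's resolution first, so a
    full-resolution target no longer causes a shape mismatch.
    """
    out_h, out_w = len(pred), len(pred[0])
    if (len(target), len(target[0])) != (out_h, out_w):
        target = resize_nearest(target, out_h, out_w)
    squared = [
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ]
    return sum(squared) / len(squared)


# Illustrative 4x4 target and 2x2 lightweight prediction.
full_res_target = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
print(resize_nearest(full_res_target, 2, 2))  # [[0, 2], [8, 10]]
print(rollout_mse([[0, 2], [8, 10]], full_res_target))  # 0.0
```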

Guardrail Test Status

Latest regression slice:

  • 14 passed, 1 warning

This included:

  • no-op equivalence
  • adapter gate behavior
  • task-specific loss masking
  • cloth metric selection
  • eval protocol identity
  • checkpoint remap
  • dataset alignment
  • transition alignment
  • lightweight transition contract
  • rollout target resizing
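
As a rough illustration of what a no-op equivalence guardrail asserts, the sketch below assumes the adapter acts by choosing among proposals whose slot 0 is the trunk's base_action, so that an inactive adapter must return slot 0 whatever its scores are. The function names, the selection rule, and the example values are hypothetical. The repo's actual tests may check equivalence differently.

```python
def select_action(proposals, adapter_scores, adapter_active):
    """Pick an action from the proposal list.

    Slot 0 is assumed to hold the trunk's base_action. With the adapter
    inactive, slot 0 is returned regardless of the adapter scores.
    """
    if not adapter_active:
        return proposals[0]
    best_slot = max(range(len(proposals)), key=adapter_scores.__getitem__)
    return proposals[best_slot]


def check_noop_equivalence(proposals, score_sets):
    """True if the inactive path returns base_action for every score set."""
    return all(
        select_action(proposals, scores, adapter_active=False) == proposals[0]
        for scores in score_sets
    )


# Illustrative proposals and scores, not real model outputs.
proposals = [[0.0, 0.0], [0.1, -0.2], [0.3, 0.0]]
score_sets = [[0.1, 0.9, 0.2], [0.5, 0.1, 0.8]]

print(check_noop_equivalence(proposals, score_sets))    # True
print(select_action(proposals, score_sets[0], True))    # [0.1, -0.2]
```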

Proxy Benchmark Results

Benchmark setup:

  • benchmark mode: sprint
  • episodes per proxy: 8
  • total episodes: 24
  • proxies: foliage_proxy, bag_proxy, cloth_proxy

Rank-only adapter on aligned proposal targets

  • active:
    • mean success: 0.0
    • visibility integral: 0.15931496916649243
    • corridor availability: 0.0015432098880410194
    • disturbance cost: 0.6779018906719011
    • premature retrieve rate: 0.8270833333333334
    • planner regret: 0.0006857388885691762
  • noop:
    • mean success: 0.0
    • visibility integral: 0.159542116879796
    • corridor availability: 0.0015432098880410194
    • disturbance cost: 0.6762562873351642
    • premature retrieve rate: 0.8354166666666667
    • planner regret: 0.046383516304194926

Behavior:

  • non-base proposal usage: about 44.6% of steps
  • families selected: lift_edge, pin_left_rim, sweep_left

Conclusion:

  • selection collapse was fixed
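  • planner regret improved sharply (about 0.0007 active vs. 0.0464 noop)
  • reveal metrics did not improve (visibility integral about 0.1593 active vs. 0.1595 noop; corridor availability identical)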

Base-fast adapter on aligned proposal targets

  • active:
    • mean success: 0.0
    • visibility integral: 0.15862687141634524
    • corridor availability: 0.0015432098880410194
    • disturbance cost: 0.6857880518323441
    • premature retrieve rate: 0.7984375
    • planner regret: 0.0015697095737171672
  • noop:
    • mean success: 0.0
    • visibility integral: 0.159542116879796
    • corridor availability: 0.0015432098880410194
    • disturbance cost: 0.6762562873351642
    • premature retrieve rate: 0.8354166666666667
    • planner regret: 0.046383516304194926

Behavior:

  • non-base proposal usage: 100% of steps
  • per-task collapse:
    • foliage -> sweep_left
    • bag -> pin_left_rim
    • cloth -> lift_edge

Conclusion:

  • proposal set changed aggressively
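  • premature retrieve rate improved (about 0.798 active vs. 0.835 noop)
  • visibility did not improve (about 0.1586 active vs. 0.1595 noop)
  • disturbance cost worsened (about 0.686 active vs. 0.676 noop)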

Transition-fast adapter on aligned proposal + rollout targets

  • active:
    • mean success: 0.0
    • visibility integral: 0.15848870722887418
    • corridor availability: 0.0015432098880410194
    • disturbance cost: 0.6893061758801274
    • premature retrieve rate: 0.8203125
    • planner regret: 0.0012374107202049345
  • noop:
    • mean success: 0.0
    • visibility integral: 0.159542116879796
    • corridor availability: 0.0015432098880410194
    • disturbance cost: 0.6762562873351642
    • premature retrieve rate: 0.8354166666666667
    • planner regret: 0.046383516304194926

Behavior:

  • non-base proposal usage: about 33.3% of steps
  • dominant non-base family: lift_edge

Conclusion:

  • rollout alignment and transition training now work end-to-end
  • they still do not produce a reveal-quality gain on this proxy slice
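
The three active-versus-noop comparisons above can be placed side by side by subtracting the shared noop values from each variant's active values. The sketch below uses only the numbers reported in this section. The dictionary keys and the helper name deltas are ad hoc labels, not identifiers from the repo. Mean success and corridor availability are left out because they are identical across every run.

```python
NOOP = {
    "visibility_integral": 0.159542116879796,
    "disturbance_cost": 0.6762562873351642,
    "premature_retrieve_rate": 0.8354166666666667,
    "planner_regret": 0.046383516304194926,
}

ACTIVE = {
    "rank_only": {
        "visibility_integral": 0.15931496916649243,
        "disturbance_cost": 0.6779018906719011,
        "premature_retrieve_rate": 0.8270833333333334,
        "planner_regret": 0.0006857388885691762,
    },
    "base_fast": {
        "visibility_integral": 0.15862687141634524,
        "disturbance_cost": 0.6857880518323441,
        "premature_retrieve_rate": 0.7984375,
        "planner_regret": 0.0015697095737171672,
    },
    "transition_fast": {
        "visibility_integral": 0.15848870722887418,
        "disturbance_cost": 0.6893061758801274,
        "premature_retrieve_rate": 0.8203125,
        "planner_regret": 0.0012374107202049345,
    },
}


def deltas(active, noop):
    """Return active minus noop for every metric in noop."""
    return {name: active[name] - noop[name] for name in noop}


DELTAS = {variant: deltas(metrics, NOOP) for variant, metrics in ACTIVE.items()}

for variant, delta in DELTAS.items():
    formatted = ", ".join(f"{name}={value:+.4f}" for name, value in delta.items())
    print(f"{variant}: {formatted}")
```

For all three variants the sign pattern is the same: visibility integral slightly lower, disturbance cost higher, premature retrieve rate and planner regret lower than noop.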

Main Conclusion

The current adapter stack is now much better instrumented and several silent training/evaluation bugs were removed. That work was necessary.

However, after fixing:

  • proposal-target alignment,
  • proposal-rollout alignment,
  • transition-model contract bugs,
  • rollout-loss resizing bugs,

the proxy benchmark still does not clear the intended criterion:

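  • no measurable success gain (mean success is 0.0 for every variant, active and noop)
  • no visibility or corridor gain over noop
  • only a modest reduction in premature retrieve rate (about 0.798 to 0.827 active vs. 0.835 noop)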
  • planner regret improves, but execution quality does not

So the current answer is:

  • the no-op-safe adapter path is now valid software
  • the current light adapter variants still do not show a convincing novelty win on the proxy benchmark
  • the likely next research move is not another small tuning pass, but a change in what is being optimized or proposed

RLBench Status

No live RLBench parity is claimed from this machine.

Current blockers on this machine:

  • RLBench / PyRep / Coppelia environment is not installed
  • the local subset3 demo roots are not present
  • earlier repo notes already showed that, on the prior setup, most of the old RLBench tasks were faulty, with dual_push_buttons as the exception

So the general-task no-regression story remains:

  • code-level no-op parity tests are passing
  • historical dual_push_buttons anchor evidence exists in repo artifacts
  • a fresh live pushbuttons rerun was not possible in this environment

Recommended Next Move

If continuing from here, the next useful steps are:

  1. keep the current bug fixes
  2. stop spending time on more short proxy tuning of this exact stack
  3. either:
    • redesign proposal generation so oracle-good reveal candidates are easier to separate early, or
    • shift to a stronger trunk / task-routed adapter variant and re-run the same aligned proxy protocol

The current iteration establishes a clean negative result on the present fast adapter variants, which is still valuable.