Elastic Occlusion Iteration Report

Date: 2026-03-31 UTC

Scope

This iteration focused on the trunk + adapter path in:

/workspace/workspace/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual

The goal was to check whether the adapter could show even a small novelty signal on the proxy benchmark without breaking the no-op-safe trunk path.

What Was Fixed

1. Proposal-target alignment bug

The original fast adapter runs were training against teacher shortlist labels, not the adapter's own proposal set.

Observed failure:

candidate_utility in the fast proxy dataset always had its oracle argmax at slot 0, the base_action slot
adapter training therefore learned to prefer base_action
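
A collapse of this kind can be caught before training with a quick histogram of oracle winners. The following is a minimal sketch only: it assumes candidate_utility can be read as one list of per-slot utilities per sample, and the helper names and toy values are hypothetical, not taken from the repository.

```python
from collections import Counter

def oracle_slot_histogram(candidate_utility):
    """Count how often each slot holds the highest utility (ties go to the lowest slot)."""
    winners = Counter()
    for utilities in candidate_utility:
        best_slot = max(range(len(utilities)), key=utilities.__getitem__)
        winners[best_slot] += 1
    return winners

def collapses_to_slot(candidate_utility, slot=0):
    """True when every sample's oracle winner is the same given slot."""
    hist = oracle_slot_histogram(candidate_utility)
    return len(hist) == 1 and slot in hist

# Toy data (hypothetical): slot 0 stands for base_action.
collapsed = [[0.9, 0.1, 0.2], [0.8, 0.3, 0.1]]
aligned = [[0.2, 0.7, 0.1], [0.1, 0.3, 0.9], [0.6, 0.2, 0.1]]

print(collapses_to_slot(collapsed))
print(oracle_slot_histogram(aligned))
```

Running the same check on a rebuilt dataset is a cheap way to confirm that the winners are spread across non-base slots.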

Fixes:

train/run_experiment.py
- now rebuilds adapter datasets when proposal-aligned targets are missing
train/build_aligned_proposal_dataset.py
- now supports adapter-wrapped models
tests/test_adapter_dataset_alignment.py
- added regression tests for missing aligned targets

Result:

rebuilt aligned train dataset no longer collapses to slot 0
aligned oracle winners are non-base proposals across tasks

2. Proposal-rollout alignment for transition training

The lightweight transition path originally had no aligned rollout supervision for the adapter's own proposal candidates.

Fixes:

train/build_aligned_proposal_dataset.py
- now saves proposal_target_rollout_* tensors
sim_reveal/dataset.py
- now loads proposal rollout targets
train/losses.py
- transition loss now prefers proposal-aligned rollout targets when present
tests/test_transition_alignment_targets.py
- verifies proposal rollout targets are selected over teacher candidate rollouts
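
The selection rule in train/losses.py can be pictured roughly as follows. This is a sketch under assumptions: only the proposal_target_rollout_ prefix comes from the report; the field suffixes, the candidate_rollout_ fallback prefix, and the function name are hypothetical.

```python
def select_rollout_targets(batch, fields=("visibility", "corridor", "disturbance")):
    """Prefer proposal-aligned rollout targets; fall back to teacher candidate rollouts."""
    selected = {}
    for field in fields:
        aligned_key = f"proposal_target_rollout_{field}"  # prefix from the report
        teacher_key = f"candidate_rollout_{field}"        # hypothetical fallback key
        if batch.get(aligned_key) is not None:
            selected[field] = batch[aligned_key]
        elif batch.get(teacher_key) is not None:
            selected[field] = batch[teacher_key]
        # Fields with neither key are skipped, so the caller can drop that loss term.
    return selected

# Toy batch (hypothetical values): visibility has both, corridor only the teacher rollout.
batch = {
    "proposal_target_rollout_visibility": [1],
    "candidate_rollout_visibility": [2],
    "candidate_rollout_corridor": [3],
}
print(select_rollout_targets(batch))
```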

3. Lightweight transition model bugs

While enabling rollout training, multiple contract bugs surfaced and were fixed:

bad clearance_field broadcast in models/world_model.py
bad hidden-state expansion across proposal candidates in models/world_model.py
unsafe .view() on non-contiguous proposal_mode_ids
rollout loss did not resize corridor / spatial rollout targets to lightweight field resolution
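
The last item is a resolution mismatch: rollout targets stored at one grid size were compared against lightweight predictions at another. A dependency-free sketch of the idea, assuming nearest-neighbour sampling (the report does not say which interpolation the fix uses) and nested lists standing in for tensors; all names are hypothetical:

```python
def resize_nearest(field, out_h, out_w):
    """Resize a 2D grid (list of rows) to out_h x out_w by nearest-neighbour sampling."""
    in_h, in_w = len(field), len(field[0])
    return [
        [field[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

def rollout_mse(pred, target):
    """Mean squared error after bringing the target to the prediction's resolution."""
    out_h, out_w = len(pred), len(pred[0])
    if (len(target), len(target[0])) != (out_h, out_w):
        target = resize_nearest(target, out_h, out_w)
    total = sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    return total / (out_h * out_w)

# Toy data (hypothetical): a 4x4 target compared against a 2x2 lightweight prediction.
full_res_target = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
light_pred = [[0.0, 2.0], [8.0, 10.0]]
loss = rollout_mse(light_pred, full_res_target)
print(loss)
```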

Tests added:

tests/test_lightweight_transition_contract.py
tests/test_transition_rollout_loss_resizing.py

Guardrail Test Status

Latest regression slice:

14 passed, 1 warning

This included:

no-op equivalence
adapter gate behavior
task-specific loss masking
cloth metric selection
eval protocol identity
checkpoint remap
dataset alignment
transition alignment
lightweight transition contract
rollout target resizing
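
The first two checks protect the property the rest of the report relies on: with the adapter disabled, the output must match the trunk. One way to picture that, assuming a gated residual form (an assumption; the report does not describe how the gate is implemented) and hypothetical function names:

```python
def apply_adapter(base_action, adapter_delta, gate):
    """Gated residual (assumed form): the adapter's delta is scaled by the gate."""
    return [b + gate * d for b, d in zip(base_action, adapter_delta)]

def is_noop_equivalent(base_action, adapter_delta):
    """True when a closed gate (0.0) leaves the trunk action unchanged."""
    return apply_adapter(base_action, adapter_delta, gate=0.0) == base_action

# Toy values (hypothetical).
base = [0.1, -0.2]
delta = [5.0, 7.0]
print(is_noop_equivalent(base, delta))
```

Note that a purely multiplicative gate only guarantees equivalence for finite deltas; a NaN or infinite delta would still leak through 0.0 * delta, which is one reason to test the property explicitly rather than assume it.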

Proxy Benchmark Results

Benchmark setup:

benchmark mode: sprint
episodes per proxy: 8
total episodes: 24
proxies: foliage_proxy, bag_proxy, cloth_proxy

Rank-only adapter on aligned proposal targets

active:
- mean success: 0.0
- visibility integral: 0.15931496916649243
- corridor availability: 0.0015432098880410194
- disturbance cost: 0.6779018906719011
- premature retrieve rate: 0.8270833333333334
- planner regret: 0.0006857388885691762
noop:
- mean success: 0.0
- visibility integral: 0.159542116879796
- corridor availability: 0.0015432098880410194
- disturbance cost: 0.6762562873351642
- premature retrieve rate: 0.8354166666666667
- planner regret: 0.046383516304194926

Behavior:

non-base proposal usage: about 44.6% of steps
families selected: lift_edge, pin_left_rim, sweep_left

Conclusion:

selection collapse was fixed
planner regret improved sharply (about 0.0464 for noop vs about 0.0007 active)
reveal metrics did not improve

Base-fast adapter on aligned proposal targets

active:
- mean success: 0.0
- visibility integral: 0.15862687141634524
- corridor availability: 0.0015432098880410194
- disturbance cost: 0.6857880518323441
- premature retrieve rate: 0.7984375
- planner regret: 0.0015697095737171672
noop:
- mean success: 0.0
- visibility integral: 0.159542116879796
- corridor availability: 0.0015432098880410194
- disturbance cost: 0.6762562873351642
- premature retrieve rate: 0.8354166666666667
- planner regret: 0.046383516304194926

Behavior:

non-base proposal usage: 100% of steps
per-task collapse:
- foliage -> sweep_left
- bag -> pin_left_rim
- cloth -> lift_edge

Conclusion:

proposal set changed aggressively
premature retrieve rate improved (about 0.835 for noop vs about 0.798 active)
visibility did not improve
disturbance worsened (about 0.676 for noop vs about 0.686 active)

Transition-fast adapter on aligned proposal + rollout targets

active:
- mean success: 0.0
- visibility integral: 0.15848870722887418
- corridor availability: 0.0015432098880410194
- disturbance cost: 0.6893061758801274
- premature retrieve rate: 0.8203125
- planner regret: 0.0012374107202049345
noop:
- mean success: 0.0
- visibility integral: 0.159542116879796
- corridor availability: 0.0015432098880410194
- disturbance cost: 0.6762562873351642
- premature retrieve rate: 0.8354166666666667
- planner regret: 0.046383516304194926

Behavior:

non-base proposal usage: about 33.3% of steps
dominant non-base family: lift_edge

Conclusion:

rollout alignment and transition training now work end-to-end
they still do not produce a reveal-quality gain on this proxy slice

Main Conclusion

The current adapter stack is now much better instrumented and several silent training/evaluation bugs were removed. That work was necessary.

However, after fixing:

proposal-target alignment,
proposal-rollout alignment,
transition-model contract bugs,
rollout-loss resizing bugs,

the proxy benchmark still does not clear the intended criterion:

no measurable success gain
no visibility or corridor gain over noop
only a modest reduction in premature retrieve rate (at best about 0.835 to 0.798, for the base-fast variant)
planner regret improves, but execution quality does not
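
These four points can be re-derived from the figures reported in the benchmark sections. The sketch below copies those numbers verbatim and prints active-minus-noop deltas; the dictionary keys and variant labels are shorthand chosen for illustration, not identifiers from the repository.

```python
# Noop baseline, identical across the three benchmark sections of this report.
NOOP = {
    "visibility_integral": 0.159542116879796,
    "corridor_availability": 0.0015432098880410194,
    "disturbance_cost": 0.6762562873351642,
    "premature_retrieve_rate": 0.8354166666666667,
    "planner_regret": 0.046383516304194926,
}

# Active-adapter results per variant, as reported.
ACTIVE = {
    "rank_only": {
        "visibility_integral": 0.15931496916649243,
        "corridor_availability": 0.0015432098880410194,
        "disturbance_cost": 0.6779018906719011,
        "premature_retrieve_rate": 0.8270833333333334,
        "planner_regret": 0.0006857388885691762,
    },
    "base_fast": {
        "visibility_integral": 0.15862687141634524,
        "corridor_availability": 0.0015432098880410194,
        "disturbance_cost": 0.6857880518323441,
        "premature_retrieve_rate": 0.7984375,
        "planner_regret": 0.0015697095737171672,
    },
    "transition_fast": {
        "visibility_integral": 0.15848870722887418,
        "corridor_availability": 0.0015432098880410194,
        "disturbance_cost": 0.6893061758801274,
        "premature_retrieve_rate": 0.8203125,
        "planner_regret": 0.0012374107202049345,
    },
}

def deltas_vs_noop(active, noop=NOOP):
    """Return active minus noop for every metric in the baseline."""
    return {name: active[name] - noop[name] for name in noop}

for variant, metrics in ACTIVE.items():
    rounded = {k: round(v, 4) for k, v in deltas_vs_noop(metrics).items()}
    print(variant, rounded)
```

For every variant the visibility delta is slightly negative, the corridor delta is zero, the disturbance delta is slightly positive, and only premature retrieve rate and planner regret move in the desired direction.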

So the current answer is:

the no-op-safe adapter path is now valid software
the current light adapter variants still do not show a convincing novelty win on the proxy benchmark
the likely next research move is not another small tuning pass, but a change in what is being optimized or proposed

RLBench Status

No live RLBench parity is claimed from this machine.

Current blockers on this machine:

RLBench / PyRep / Coppelia environment is not installed
the local subset3 demo roots are not present
earlier repo notes already showed that most old RLBench tasks were faulty on the prior setup, with dual_push_buttons the exception

So the general-task no-regression story remains:

code-level no-op parity tests are passing
historical dual_push_buttons anchor evidence exists in repo artifacts
a fresh live dual_push_buttons rerun was not possible in this environment

Recommended Next Move

If work continues from here, the next useful steps are:

keep the current bug fixes
stop spending time on more short proxy tuning of this exact stack
either:
- redesign proposal generation so oracle-good reveal candidates are easier to separate early, or
- shift to a stronger trunk / task-routed adapter variant and re-run the same aligned proxy protocol

The current iteration establishes a clean negative result on the present fast adapter variants, which is still valuable.