Elastic Occlusion Iteration Report
Date: 2026-03-31 UTC
Scope
This iteration focused on the trunk + adapter path in:
/workspace/workspace/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual
The target was to verify whether the adapter could show a light novelty signal on the proxy benchmark without breaking the no-op-safe trunk path.
What Was Fixed
1. Proposal-target alignment bug
The original fast adapter runs were training against teacher shortlist labels, not the adapter's own proposal set.
Observed failure:
- candidate_utility in the fast proxy dataset always had oracle argmax at slot 0
- adapter training therefore learned to prefer base_action
Fixes:
- train/run_experiment.py: now rebuilds adapter datasets when proposal-aligned targets are missing
- train/build_aligned_proposal_dataset.py: now supports adapter-wrapped models
- tests/test_adapter_dataset_alignment.py: added regression tests for missing aligned targets
Result:
- rebuilt aligned train dataset no longer collapses to slot 0
- aligned oracle winners are non-base proposals across tasks
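The collapse described above can be detected with a small histogram over oracle winners. The following is a minimal sketch, assuming candidate_utility can be read as one list of per-slot utilities per sample and that slot 0 holds base_action; the helper names and the toy data are hypothetical and are not taken from the repo.

```python
from collections import Counter

BASE_SLOT = 0  # assumption: slot 0 holds base_action, as the failure above suggests


def oracle_slot_histogram(candidate_utility):
    """Count which proposal slot wins the oracle argmax in each sample.

    candidate_utility: list of per-sample utility lists, one value per slot.
    """
    winners = Counter()
    for utilities in candidate_utility:
        # Argmax over slot indices; ties resolve to the lowest slot index.
        best_slot = max(range(len(utilities)), key=lambda i: utilities[i])
        winners[best_slot] += 1
    return winners


def is_collapsed_to_base(candidate_utility):
    """True when every sample's oracle winner is the base slot."""
    hist = oracle_slot_histogram(candidate_utility)
    return set(hist) == {BASE_SLOT}


# Toy data for illustration only, not values from the real dataset.
misaligned = [[0.9, 0.1, 0.2], [0.8, 0.3, 0.1]]
aligned = [[0.2, 0.7, 0.1], [0.1, 0.2, 0.6]]

print(is_collapsed_to_base(misaligned))
print(is_collapsed_to_base(aligned))
```

A check of this shape could run as a dataset sanity gate before adapter training starts, so a collapsed target set fails fast instead of silently teaching the adapter to prefer the base slot.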
2. Proposal-rollout alignment for transition training
The lightweight transition path originally had no aligned rollout supervision for the adapter's own proposal candidates.
Fixes:
- train/build_aligned_proposal_dataset.py: now saves proposal_target_rollout_* tensors
- sim_reveal/dataset.py: now loads proposal rollout targets
- train/losses.py: transition loss now prefers proposal-aligned rollout targets when present
- tests/test_transition_alignment_targets.py: verifies proposal rollout targets are selected over teacher candidate rollouts
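The "prefer proposal-aligned targets when present" rule can be sketched as a key lookup with a fallback. This is an illustration only: the proposal_target_rollout_ prefix comes from the report, while the teacher-side prefix candidate_rollout_, the field name visibility, and the function itself are assumptions rather than the actual code in train/losses.py.

```python
PROPOSAL_PREFIX = "proposal_target_rollout_"  # prefix named in the report
TEACHER_PREFIX = "candidate_rollout_"         # hypothetical teacher-side prefix


def select_rollout_targets(batch):
    """Return (source, targets) for the transition loss.

    Prefers proposal-aligned rollout entries and falls back to the teacher
    candidate rollouts only when no aligned key is present in the batch.
    """
    aligned = {
        key[len(PROPOSAL_PREFIX):]: value
        for key, value in batch.items()
        if key.startswith(PROPOSAL_PREFIX)
    }
    if aligned:
        return "proposal", aligned
    teacher = {
        key[len(TEACHER_PREFIX):]: value
        for key, value in batch.items()
        if key.startswith(TEACHER_PREFIX)
    }
    return "teacher", teacher


# Toy batches; "visibility" is a hypothetical field name.
batch_with_both = {
    "proposal_target_rollout_visibility": [0.1, 0.2],
    "candidate_rollout_visibility": [0.5, 0.6],
}
batch_teacher_only = {"candidate_rollout_visibility": [0.5, 0.6]}

print(select_rollout_targets(batch_with_both)[0])
print(select_rollout_targets(batch_teacher_only)[0])
```

Returning the source label alongside the targets makes it easy for a regression test to assert which supervision path was actually taken.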
3. Lightweight transition model bugs
While enabling rollout training, multiple contract bugs surfaced and were fixed:
- bad clearance_field broadcast in models/world_model.py
- bad hidden-state expansion across proposal candidates in models/world_model.py
- unsafe .view() on non-contiguous proposal_mode_ids
- rollout loss did not resize corridor / spatial rollout targets to lightweight field resolution
Tests added:
- tests/test_lightweight_transition_contract.py
- tests/test_transition_rollout_loss_resizing.py
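The rollout-target resizing fix can be illustrated with a dependency-free sketch. It assumes the target field is a 2D grid at a higher resolution than the lightweight prediction and uses nearest-neighbour sampling; the report does not state which interpolation the real loss uses, and both function names are hypothetical.

```python
def resize_nearest(grid, out_h, out_w):
    """Nearest-neighbour resize of a 2D list-of-lists to (out_h, out_w)."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[(r * in_h) // out_h][(c * in_w) // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def rollout_mse(pred, target):
    """Mean squared error after matching the target to the prediction's resolution."""
    out_h, out_w = len(pred), len(pred[0])
    if (len(target), len(target[0])) != (out_h, out_w):
        # Without this step the two grids cannot be compared cell by cell.
        target = resize_nearest(target, out_h, out_w)
    errors = [
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    ]
    return sum(errors) / len(errors)


# Toy 4x4 target compared against a 2x2 lightweight prediction.
target_4x4 = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
pred_2x2 = [[1.0, 0.0], [0.0, 1.0]]

print(resize_nearest(target_4x4, 2, 2))
print(rollout_mse(pred_2x2, target_4x4))
```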
Guardrail Test Status
Latest regression slice:
14 passed, 1 warning
This included:
- no-op equivalence
- adapter gate behavior
- task-specific loss masking
- cloth metric selection
- eval protocol identity
- checkpoint remap
- dataset alignment
- transition alignment
- lightweight transition contract
- rollout target resizing
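As an illustration of what a no-op equivalence guardrail can assert, the sketch below uses a gated residual form in which a zero gate returns the trunk action unchanged. The gating form, the function name, and the toy values are assumptions for illustration; the report does not describe how the adapter gate is actually implemented.

```python
def apply_adapter(base_action, adapter_delta, gate):
    """Blend a proposed correction into the trunk action.

    gate == 0.0 returns the trunk action unchanged (the no-op path).
    """
    if gate == 0.0:
        # Short-circuit so the no-op path is identical to the trunk output.
        return list(base_action)
    return [b + gate * d for b, d in zip(base_action, adapter_delta)]


# Toy values for illustration only.
trunk_action = [0.1, -0.2, 0.3]
delta = [5.0, 5.0, 5.0]

noop_action = apply_adapter(trunk_action, delta, gate=0.0)
active_action = apply_adapter(trunk_action, delta, gate=1.0)

print(noop_action == trunk_action)
print(active_action != trunk_action)
```

The useful property to test is exact equality on the no-op path, not approximate closeness, since any drift there would undermine the no-regression claim for the trunk.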
Proxy Benchmark Results
Benchmark setup:
- benchmark mode: sprint
- episodes per proxy: 8
- total episodes: 24
- proxies: foliage_proxy, bag_proxy, cloth_proxy
Rank-only adapter on aligned proposal targets
- active:
  - mean success: 0.0
  - visibility integral: 0.15931496916649243
  - corridor availability: 0.0015432098880410194
  - disturbance cost: 0.6779018906719011
  - premature retrieve rate: 0.8270833333333334
  - planner regret: 0.0006857388885691762
- noop:
  - mean success: 0.0
  - visibility integral: 0.159542116879796
  - corridor availability: 0.0015432098880410194
  - disturbance cost: 0.6762562873351642
  - premature retrieve rate: 0.8354166666666667
  - planner regret: 0.046383516304194926
Behavior:
- non-base proposal usage: about 44.6% of steps
- families selected: lift_edge, pin_left_rim, sweep_left
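Behavior statistics of this kind can be derived from a per-step log of the selected proposal family. The sketch below assumes such a log exists as a list of family names, with base_action marking the base proposal; the function name and the toy log are hypothetical.

```python
from collections import Counter

BASE_FAMILY = "base_action"  # name taken from the report


def proposal_usage(selected_families):
    """Return (non-base usage fraction, Counter of non-base families)."""
    non_base = [f for f in selected_families if f != BASE_FAMILY]
    rate = len(non_base) / len(selected_families) if selected_families else 0.0
    return rate, Counter(non_base)


# Toy per-step selection log, not real benchmark output.
steps = ["base_action", "lift_edge", "base_action", "sweep_left", "lift_edge"]
rate, families = proposal_usage(steps)

print(rate)
print(families.most_common())
```

Grouping the same log by task before counting would also expose a per-task collapse, where each task ends up with a single selected family.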
Conclusion:
- selection collapse was fixed
- planner regret improved sharply (0.0464 -> 0.0007)
- reveal metrics (visibility integral, corridor availability) did not improve over noop
Base-fast adapter on aligned proposal targets
- active:
  - mean success: 0.0
  - visibility integral: 0.15862687141634524
  - corridor availability: 0.0015432098880410194
  - disturbance cost: 0.6857880518323441
  - premature retrieve rate: 0.7984375
  - planner regret: 0.0015697095737171672
- noop:
  - mean success: 0.0
  - visibility integral: 0.159542116879796
  - corridor availability: 0.0015432098880410194
  - disturbance cost: 0.6762562873351642
  - premature retrieve rate: 0.8354166666666667
  - planner regret: 0.046383516304194926
Behavior:
- non-base proposal usage: 100% of steps
- per-task collapse:
  - foliage -> sweep_left
  - bag -> pin_left_rim
  - cloth -> lift_edge
Conclusion:
- proposal set changed aggressively
- premature retrieve rate improved (0.835 -> 0.798)
- visibility did not improve (0.1595 -> 0.1586)
- disturbance worsened (0.676 -> 0.686)
Transition-fast adapter on aligned proposal + rollout targets
- active:
  - mean success: 0.0
  - visibility integral: 0.15848870722887418
  - corridor availability: 0.0015432098880410194
  - disturbance cost: 0.6893061758801274
  - premature retrieve rate: 0.8203125
  - planner regret: 0.0012374107202049345
- noop:
  - mean success: 0.0
  - visibility integral: 0.159542116879796
  - corridor availability: 0.0015432098880410194
  - disturbance cost: 0.6762562873351642
  - premature retrieve rate: 0.8354166666666667
  - planner regret: 0.046383516304194926
Behavior:
- non-base proposal usage: about 33.3% of steps
- dominant non-base family: lift_edge
Conclusion:
- rollout alignment and transition training now work end-to-end
- they still do not produce a reveal-quality gain on this proxy slice
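Across the three variants above, the active-minus-noop deltas can be tabulated directly from the reported numbers. The snippet below is a small worked calculation; the metric values are copied from this report, while the snake_case keys and the helper name are naming assumptions made for the example.

```python
# Shared noop baseline, as reported for all three variants.
NOOP = {
    "mean_success": 0.0,
    "visibility_integral": 0.159542116879796,
    "corridor_availability": 0.0015432098880410194,
    "disturbance_cost": 0.6762562873351642,
    "premature_retrieve_rate": 0.8354166666666667,
    "planner_regret": 0.046383516304194926,
}

# Active-adapter metrics per variant, as reported above.
ACTIVE = {
    "rank_only": {
        "mean_success": 0.0,
        "visibility_integral": 0.15931496916649243,
        "corridor_availability": 0.0015432098880410194,
        "disturbance_cost": 0.6779018906719011,
        "premature_retrieve_rate": 0.8270833333333334,
        "planner_regret": 0.0006857388885691762,
    },
    "base_fast": {
        "mean_success": 0.0,
        "visibility_integral": 0.15862687141634524,
        "corridor_availability": 0.0015432098880410194,
        "disturbance_cost": 0.6857880518323441,
        "premature_retrieve_rate": 0.7984375,
        "planner_regret": 0.0015697095737171672,
    },
    "transition_fast": {
        "mean_success": 0.0,
        "visibility_integral": 0.15848870722887418,
        "corridor_availability": 0.0015432098880410194,
        "disturbance_cost": 0.6893061758801274,
        "premature_retrieve_rate": 0.8203125,
        "planner_regret": 0.0012374107202049345,
    },
}


def deltas_vs_noop(active, noop=NOOP):
    """Active minus noop for every metric; negative means lower than noop."""
    return {name: active[name] - noop[name] for name in noop}


DELTAS = {variant: deltas_vs_noop(metrics) for variant, metrics in ACTIVE.items()}

for variant, delta in DELTAS.items():
    print(variant, {name: round(value, 4) for name, value in delta.items()})
```

On these numbers every variant lowers planner regret and premature retrieve rate relative to noop, while visibility integral is slightly lower and disturbance cost slightly higher; mean success and corridor availability are unchanged.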
Main Conclusion
The current adapter stack is now much better instrumented, and several silent training/evaluation bugs have been removed. That work was necessary.
However, after fixing:
- proposal-target alignment,
- proposal-rollout alignment,
- transition-model contract bugs,
- rollout-loss resizing bugs,
the proxy benchmark still does not clear the intended criterion:
- no measurable success gain
- no visibility or corridor gain over noop
- only a modest reduction in premature retrieve rate (0.835 -> 0.798 at best)
- planner regret improves, but execution quality does not
So the current answer is:
- the no-op-safe adapter path is now valid software
- the current light adapter variants still do not show a convincing novelty win on the proxy benchmark
- the likely next research move is not another small tuning pass, but a change in what is being optimized or proposed
RLBench Status
No live RLBench parity is claimed from this machine.
Current blockers on this machine:
- RLBench / PyRep / Coppelia environment is not installed
- the local subset3 demo roots are not present
- earlier repo notes already showed most old RLBench tasks were faulty on the prior setup except dual_push_buttons
So the general-task no-regression story remains:
- code-level no-op parity tests are passing
- historical dual_push_buttons anchor evidence exists in repo artifacts
- a fresh live pushbuttons rerun was not possible in this environment
Recommended Next Move
If continuing from here, the next useful step is:
- keep the current bug fixes
- stop spending time on more short proxy tuning of this exact stack
- either:
  - redesign proposal generation so oracle-good reveal candidates are easier to separate early, or
  - shift to a stronger trunk / task-routed adapter variant and re-run the same aligned proxy protocol
The current iteration establishes a clean negative result on the present fast adapter variants, which is still valuable.