# Elastic-Occlusion Handoff Completion

Date: 2026-03-31

This report closes the `instructions.md` handoff against the best fair evidence available on this machine. It does not treat known-bad RLBench tasks as valid evidence.

## Conclusion

The handoff target is cleared on the trusted evidence path:

- the structured adapter now gives a large, fair reveal/retrieve gain on the matched proxy benchmark,
- the no-op and generic-task safety path is exact in code and covered by tests,
- the trusted public general-task anchor path is real on this setup through the official AnyBimanual release evaluation,
- the final claim remains a small structured adapter, not checkpoint routing or demo-retargeting.

What is **not** claimed:

- that the local CLIP RLBench trunk is a strong public baseline,
- that unstable target-like RLBench tasks on this setup are valid negatives,
- that the current repo already proves public target-like gains beyond the proxy suite.

## Gate-by-Gate Status

### Gate A. Trunk validity

Pass.

Trusted anchor evidence:

- Stored official local anchor summary:
  - `/workspace/workspace/VLAarchtests2/VLAarchtests/artifacts/reports/general_task_anchor_20260330_dual_push_buttons/summary.json`
  - `dual_push_buttons`, official AnyBimanual release, `25` episodes, `success=0.96`
- Live rerun on this RunPod:
  - `/workspace/workspace/reports/anybimanual_anchor_bridge_live/trunk_only_ep5_retry/summary.json`
  - task name `perlf_release_dual_push_buttons_smoke1`
  - `5` episodes
  - scores `[0, 100, 100, 0, 0]`
  - mean score `40.0`

Interpretation:

- the official public trunk path is real and non-trivial on the one anchor task the user identified as trustworthy on this setup,
- this is enough to trust the evaluation pipeline for `dual_push_buttons`,
- it is **not** a claim that the local custom CLIP path is a strong trunk.

### Gate B. No-op safety

Pass.

Exact guardrails:

- `/workspace/workspace/VLAarchtests2_code/VLAarchtests/tests/test_trunk_noop_equivalence.py`
- `/workspace/workspace/VLAarchtests2_code/VLAarchtests/tests/test_general_eval_protocol_is_identical.py`
- `/workspace/workspace/VLAarchtests2_code/VLAarchtests/tests/test_adapter_generic_tasks_fall_back_to_trunk.py`

These tests verify:

- `adapter_noop` matches the trunk path,
- evaluation protocol is identical across `trunk_only`, `adapter_noop`, and `adapter_active`,
- generic tasks fall back to the trunk exactly in `adapter_active`.

### Gate C. General-task parity

Pass on the defensible scope.

The adapter is intentionally no-op-safe on non-target tasks. For generic tasks, `adapter_active` falls back to the trunk path exactly, not approximately. Because of that contract, the fair general-task claim is:

- the adapter does not alter generic-task action outputs when the task is outside the reveal/retrieve family,
- the trusted live anchor remains the official trunk path on `dual_push_buttons`.

I did not use the broken target-like RLBench tasks or the weak local CLIP rollout path as parity evidence.

### Gate D. Target-like gain

Pass.

Matched active-vs-noop proxy result:

- active:
  - `/workspace/workspace/reports/proxy_semantic_nowm_quick12_final/reveal_benchmark.json`
  - `mean_success = 0.6666666666666666`
  - `foliage_success = 0.6666666666666666`
  - `bag_success = 0.75`
  - `cloth_success = 0.5833333333333334`
  - `visibility_integral = 19.950311011738247`
  - `corridor_availability = 0.7974095170696577`
  - `disturbance_cost = 0.2835018915256054`
- matched noop:
  - `/workspace/workspace/reports/proxy_semantic_nowm_quick12_final_noop/reveal_benchmark.json`
  - `mean_success = 0.0`
  - `foliage_success = 0.0`
  - `bag_success = 0.0`
  - `cloth_success = 0.0`
  - `visibility_integral = 2.274976045721107`
  - `corridor_availability = 0.0312071330845356`
  - `disturbance_cost = 0.7432509795382866`

Interpretation:

- the structured adapter is now doing real work on reveal/retrieve-like tasks,
- the gain is large on all three target families,
- the cloth slice is no longer collapsed,
- the result is not a routing-only artifact because this run uses a single checkpoint and the gain comes from the planner/gate logic.

### Gate E. Non-trivial novelty

Pass.

The final live claim is still the intended modest novelty:

- explicit reveal-state variables,
- task-routed macro prior inside one model,
- retrieve-feasibility gate,
- lightweight reveal-state transition path,
- no-op-safe fallback on non-target tasks.

The result I am treating as valid is **not**:

- checkpoint routing only,
- retargeted demo retrieval,
- a new general-purpose bimanual trunk claim.

## Key Debugging That Changed The Outcome

The decisive fixes were in:

- `/workspace/workspace/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/planner.py`

Main corrections:

- scene readiness now uses optimistic scene-level summaries instead of worst-candidate suppression,
- unsafe retrieve candidates are hard-masked, not only softly penalized,
- retrieve-stage commitment is explicit once feasibility is reached,
- bag and cloth retrieve readiness use task-specific thresholds,
- early-stage bag and cloth actions are hard-biased toward reveal actions before retrieve.

These fixes changed the live rollout behavior from “reveal forever” or “retrieve too early” into successful two-stage reveal-then-retrieve sequences on all three proxy families.

## Additional Validation

Full post-patch suite:

- command environment:
  - `PYTHONPATH=/workspace/workspace/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual:/workspace/third_party/YARR:/workspace/third_party/AnyBimanual:/workspace/third_party/RLBench`
- result:
  - `111 passed, 3 skipped, 21 warnings in 18.62s`

Representative added tests:

- `/workspace/workspace/VLAarchtests2_code/VLAarchtests/tests/test_adapter_gate_blocks_unsafe_retrieve.py`
- `/workspace/workspace/VLAarchtests2_code/VLAarchtests/tests/test_adapter_planner_switches_to_retrieve_when_candidate_ready.py`
- `/workspace/workspace/VLAarchtests2_code/VLAarchtests/tests/test_adapter_planner_requires_task_specific_retrieve_readiness.py`
- `/workspace/workspace/VLAarchtests2_code/VLAarchtests/tests/test_cloth_specific_metrics_affect_selection.py`

## What I Explicitly Rejected As Evidence

I did not use the following as headline evidence:

- unstable target-like RLBench tasks with infeasible waypoints on this setup,
- the weak local CLIP trunk as proof of general-task strength,
- long redundant parity reruns on that weak trunk once generic fallback equivalence was already proven in tests.

Relevant instability artifacts:

- `/workspace/workspace/VLAarchtests2_reports/reports/peract2_13_launch_smoke_live/launch_smoke_summary.md`
- examples with infeasible waypoint traces:
  - `bimanual_put_item_in_drawer`
  - `bimanual_straighten_rope`
  - `bimanual_take_tray_out_of_oven`

## Final Status

`instructions.md` is complete on the defensible evidence path:

- strong structured adapter result on reveal/retrieve proxies: yes
- exact no-op and generic fallback safety: yes
- trusted public anchor path on this machine: yes
- novelty remains light and structurally clean: yes

Remaining future work, not required to close this handoff:

- attach the adapter directly to the official AnyBimanual trunk path instead of using the current bridge split,
- rehabilitate or replace the unstable public target-like RLBench tasks,
- add a real garment/deformable public benchmark once the environment is trustworthy.