# VLAarchtests3

`VLAarchtests3` is the organized export of the elastic-occlusion bimanual VLA handoff completed on a 1x L40S RunPod machine. It is a successor snapshot to the earlier `VLAarchtests` and `VLAarchtests2` work:

- `VLAarchtests`: earlier architecture-search and benchmark-debugging work.
- `VLAarchtests2`: larger exploratory branch with frequent model changes, mixed benchmark artifacts, and several legacy results that needed manual reinterpretation.
- `VLAarchtests3`: cleaned export focused on the final handoff state, the adapter refactor, the validated tests, the current checkpoints, and the reports needed to continue from here.

## What Was Done

The main engineering outcome was a refactor from a monolithic elastic policy into a cleaner `trunk + structured adapter + no-op fallback` stack. The final exported code contains:

- a clean wrapped-policy interface with `trunk_only`, `adapter_noop`, and `adapter_active` modes,
- a structured elastic-occlusion adapter with:
  - reveal-state prediction,
  - task-routed reveal/retrieve proposal families,
  - retrieve-feasibility gating,
  - a lightweight reveal-state transition model,
- explicit tests that protect:
  - no-op equivalence,
  - generic-task fallback,
  - benchmark protocol identity,
  - unsafe retrieve blocking,
  - cloth-specific selection behavior.

The most important debugging pass was in the planner/gating logic. The original active path could reveal forever or retrieve too early. The final planner fixes made it:

- summarize scene readiness at the scene level rather than the worst-candidate level,
- hard-mask unsafe retrieve candidates,
- switch from reveal to retrieve once feasibility is met,
- use task-specific bag and cloth readiness criteria,
- prefer reveal macros early and retrieve later.

## What Was Actually Evaluated

Two different kinds of evidence are included.

### 1. Trusted General-Task Anchor

This was kept narrow on purpose, because only `dual_push_buttons` was trusted on this setup.
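Stepping back to the planner/gating fixes listed above, the selection logic can be sketched minimally as follows. This is an illustrative reconstruction, not the repo's actual API: `Candidate`, `select_candidate`, and all thresholds are hypothetical names and values.

```python
# Hypothetical sketch of the planner/gating fixes: scene-level readiness,
# hard-masked unsafe retrieves, and a reveal-then-retrieve switch.
# Names and thresholds are illustrative, not taken from the repo.
from dataclasses import dataclass

@dataclass
class Candidate:
    kind: str          # "reveal" or "retrieve"
    safe: bool         # passes the retrieve-safety check
    readiness: float   # per-candidate reveal-state readiness in [0, 1]

def select_candidate(candidates, step, ready_thresh=0.6, early_steps=3):
    # Readiness is summarized at the scene level (best candidate), not at
    # the worst-candidate level, which previously stalled progress.
    scene_ready = max(c.readiness for c in candidates) >= ready_thresh

    # Hard-mask unsafe retrieve candidates so they can never be selected.
    feasible = [c for c in candidates
                if not (c.kind == "retrieve" and not c.safe)]

    # Prefer reveal macros early; switch to retrieve once feasibility is met.
    if scene_ready and step >= early_steps:
        retrieves = [c for c in feasible if c.kind == "retrieve"]
        if retrieves:
            return max(retrieves, key=lambda c: c.readiness)
    reveals = [c for c in feasible if c.kind == "reveal"]
    # Assumes at least one feasible candidate remains after masking.
    return max(reveals, key=lambda c: c.readiness) if reveals else feasible[0]
```

Under this sketch, an unsafe retrieve can never win regardless of its readiness score, and the policy only stops revealing once the scene-level readiness test passes.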
Trusted anchor evidence:

- official AnyBimanual local anchor summary on `dual_push_buttons`:
  - `25` episodes
  - success `0.96`
- live rerun on this RunPod:
  - `5` episodes
  - scores `[0, 100, 100, 0, 0]`
  - mean score `40.0`

Interpretation:

- the official trunk path is real and non-trivial on the one stable anchor task,
- this does **not** mean the local custom CLIP trunk was competitive broadly,
- this does **not** validate the other unstable RLBench target-like tasks.

### 2. Reveal/Retrieve Proxy Benchmark

This benchmark is useful for mechanism debugging, but it is **not** a real robot/physics benchmark.

The final reported held-out smoke benchmark used:

- `12` foliage episodes,
- `12` bag episodes,
- `12` cloth episodes,
- `36` total episodes,
- procedural seeds held out from the adapter train/val splits.

Results:

- non-intervention / matched no-op:
  - mean success `0.000`
    - foliage `0.000`
    - bag `0.000`
    - cloth `0.000`
  - visibility integral `2.275`
  - corridor availability `0.0312`
  - disturbance cost `0.7433`
- intervention / adapter active:
  - mean success `0.6667`
    - foliage `0.6667`
    - bag `0.7500`
    - cloth `0.5833`
  - visibility integral `19.9503`
  - corridor availability `0.7974`
  - disturbance cost `0.2835`
  - reocclusion rate `0.00278`
  - planner regret `0.1586`

The active policy genuinely intervened on these tasks rather than silently falling back to the trunk:

- all recorded selections on the final held-out smoke run were non-base candidates,
- typical successful pattern:
  - foliage: reveal (`pin_canopy`) then `retrieve`,
  - bag: reveal (`widen_mouth`) then `retrieve`,
  - cloth: reveal (`separate_layer`) then `retrieve`.

## Important Limitation

The reveal/retrieve proxy is a procedural synthetic environment, not a contact-rich robot simulator. It has:

- synthetic RGB-D renders,
- internal latent state,
- hand-coded transition rules,
- scripted teacher/oracle supervision.
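For concreteness, the hand-coded transition rules are of roughly this shape. This is a sketch only: the macro names `pin_canopy` and `widen_mouth` appear in the results above, but the state fields, increments, and feasibility thresholds here are entirely hypothetical.

```python
# Illustrative sketch of a scripted reveal-state transition rule of the kind
# the proxy uses. All field names, increments, and thresholds are made up.
def step_reveal_state(state, action):
    """Advance a latent occlusion state under a hand-coded transition rule."""
    s = dict(state)
    if action == "pin_canopy":          # foliage-family reveal macro
        s["visibility"] = min(1.0, s["visibility"] + 0.3)
    elif action == "widen_mouth":       # bag-family reveal macro
        s["corridor"] = min(1.0, s["corridor"] + 0.4)
    elif action == "retrieve":
        # Retrieval succeeds only when the scripted feasibility test passes.
        s["done"] = s["visibility"] >= 0.6 and s["corridor"] >= 0.5
    return s
```

The point of the sketch is the limitation itself: success is decided by threshold checks on latent variables, not by physics, which is why the proxy can validate planner logic but not manipulation competence.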
It does **not** have:

- rigid-body or deformable physics,
- actual robot kinematics,
- true contact/grasp simulation,
- a fair end-to-end manipulation distribution for a pretrained trunk.

Therefore:

- the proxy result is useful for validating adapter logic,
- the proxy result is **not** sufficient evidence that the trunk or the full system would outperform real baselines on RLBench or on the future custom benchmark.

## What Was Learned

The work supports the following conclusions:

- the structured adapter idea is still alive,
- the explicit reveal-state variables are worth keeping,
- task-routed reveal macros matter,
- retrieve-feasibility gating matters,
- the no-op fallback path for general tasks is sound,
- the old heavy memory/world-model story is not where the strongest evidence lives.

The work does **not** yet justify:

- a claim of broad general-task superiority,
- a claim that the current proxy benchmark is a fair end-to-end benchmark,
- a claim that the architecture is validated on realistic target-like sim tasks.

## Was The Adapter Trained?

Yes. The final proxy adapter checkpoint was trained with:

- a frozen trunk,
- adapter-only updates,
- trained components:
  - reveal-state head,
  - proposal prior,
  - transition model,
  - planner/reranker.

Proxy training data:

- train: `128` episodes per proxy family,
- val: `32` episodes per proxy family,
- proxy families:
  - foliage,
  - bag,
  - cloth.

The final headline smoke benchmark was not run on those train/val episodes. It used separate held-out seeds.

## Was This A Perfect Fairness Story?

No.

What is fair in the current export:

- matched active vs no-op comparisons on the same wrapped checkpoint,
- held-out procedural seeds for the final proxy benchmark,
- exact no-op and generic-task fallback tests.

What is still missing for a stronger paper-quality comparison:

1. same-initialization `trunk_only` fine-tuned on the same proxy data,
2. same-initialization `trunk + adapter` fine-tuned on the same proxy data,
3. comparison on held-out proxy seeds,
4. comparison on stable real-sim tasks.

## What Is Left To Do

The main remaining work is on real sim benchmarks, not more abstract proxy optimization.

Priority list:

1. Train a fair control:
   - same initialization,
   - `trunk_only` fine-tuned on the same reveal/retrieve proxy data,
   - compare against `trunk + adapter`.
2. Attach the adapter directly to a strong public trunk:
   - official AnyBimanual,
   - official PerAct2 / RVT,
   - or 3D FlowMatch Actor if practical.
3. Validate on stable real-sim tasks:
   - do not trust unstable RLBench tasks with infeasible waypoints,
   - rebuild a trustworthy target-like evaluation subset,
   - keep `dual_push_buttons` as a regression anchor only.
4. Add a deformable / garment benchmark:
   - this is the most relevant public step toward the future suitcase/clothes benchmark.
5. Only after that:
   - revisit larger RLBench sweeps,
   - or collect custom teleop data.

## Repository Layout

- `code/` - cleaned code snapshot used for the handoff
- `artifacts/outputs/` - current adapter checkpoints and training outputs
- `artifacts/reports/` - evaluation and debugging reports
- `artifacts/data/reveal_proxy/` - proxy train/val datasets used by this stage
- `legacy/` - exact older checkpoints and summaries that the current work depends on
- `docs/` - audit, iteration, and completion reports from this handoff
- `setup/` - same-machine environment notes and helper scripts

## Recommended Use Of This Repo

Use this repo as:

- the archival handoff state,
- the codebase to continue adapter work from,
- the source of the current checkpoints and benchmark reports,
- the baseline package before moving to real sim validation.

Do **not** use it as evidence that the architecture is already validated on realistic manipulation benchmarks. That validation is what should happen next.
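As a closing illustration, the matched active-vs-no-op protocol described in the fairness section can be sketched as follows. This is a minimal sketch under stated assumptions: `run_matched_eval` and the `rollout` callable are hypothetical names, not the repo's evaluation API.

```python
# Minimal sketch of the matched comparison protocol: the same wrapped
# checkpoint is evaluated in every mode on an identical held-out seed list,
# so score differences come from the adapter, not from episode sampling.
# `rollout(mode, seed)` is a hypothetical callable returning per-episode success.
def run_matched_eval(modes, seeds, rollout):
    """Return mean success per mode over the same held-out seeds."""
    results = {}
    for mode in modes:
        # Both arms iterate the identical seed list, in the same order.
        scores = [rollout(mode, seed) for seed in seeds]
        results[mode] = sum(scores) / len(scores)
    return results
```

Keeping the seed list identical across arms is what makes the no-op baseline "matched" in the sense used above; any fair control trained later (e.g. a fine-tuned `trunk_only`) should be dropped into the same loop.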