# VLAarchtests3
VLAarchtests3 is the organized export of the elastic-occlusion bimanual VLA handoff completed on a 1x L40S RunPod machine.
It is a successor snapshot to the earlier VLAarchtests and VLAarchtests2 work:

- VLAarchtests: earlier architecture-search and benchmark-debugging work.
- VLAarchtests2: larger exploratory branch with frequent model changes, mixed benchmark artifacts, and several legacy results that needed manual reinterpretation.
- VLAarchtests3: cleaned export focused on the final handoff state, the adapter refactor, the validated tests, the current checkpoints, and the reports needed to continue from here.
## What Was Done
The main engineering outcome was a refactor from a monolithic elastic policy into a cleaner trunk + structured adapter + no-op fallback stack.
The final exported code contains:
- a clean wrapped-policy interface with `trunk_only`, `adapter_noop`, and `adapter_active` modes,
- a structured elastic-occlusion adapter with:
  - reveal-state prediction,
  - task-routed reveal/retrieve proposal families,
  - retrieve-feasibility gating,
  - a lightweight reveal-state transition model,
- explicit tests that protect:
  - no-op equivalence,
  - generic-task fallback,
  - benchmark protocol identity,
  - unsafe retrieve blocking,
  - cloth-specific selection behavior.
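As a rough illustration of the three-mode interface, a minimal sketch follows. The class and method names (`WrappedPolicy`, `predict`, `propose`) are hypothetical stand-ins, not the repo's actual API:

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    TRUNK_ONLY = "trunk_only"          # ignore the adapter entirely
    ADAPTER_NOOP = "adapter_noop"      # adapter runs but must not change the action
    ADAPTER_ACTIVE = "adapter_active"  # adapter may override with reveal/retrieve macros


@dataclass
class WrappedPolicy:
    trunk: object   # frozen base policy: obs -> action
    adapter: object # structured elastic-occlusion adapter
    mode: Mode = Mode.TRUNK_ONLY

    def act(self, obs):
        base_action = self.trunk.predict(obs)
        if self.mode is Mode.TRUNK_ONLY:
            return base_action
        proposal = self.adapter.propose(obs, base_action)
        if self.mode is Mode.ADAPTER_NOOP or proposal is None:
            # no-op equivalence: fall back to the untouched trunk action
            return base_action
        return proposal
```

The key invariant the tests protect is visible here: in `adapter_noop` mode (or when the adapter declines to propose), the wrapped policy returns the trunk action unchanged.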
The most important debugging pass was in the planner/gating logic. The original active path could reveal forever or retrieve too early. The final planner fixes make the active path:

- summarize readiness at the scene level rather than at the worst-candidate level,
- hard-mask unsafe retrieve candidates,
- switch from reveal to retrieve once feasibility is met,
- use task-specific bag and cloth readiness criteria,
- prefer reveal macros early and retrieve macros later.
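The gating rules above can be sketched as a single selection function. This is a sketch of the control flow only; the names, thresholds, and candidate schema are hypothetical, not the repo's actual planner:

```python
def select_macro(scene_readiness, candidates, step,
                 reveal_bias_steps=5, readiness_threshold=0.5):
    """Pick a reveal or retrieve macro under the fixed planner rules.

    scene_readiness: scene-level readiness summary in [0, 1]
    candidates: dicts with "kind", "score", optional "unsafe"/"feasible"
    """
    # Hard-mask unsafe retrieve candidates before any scoring.
    safe = [c for c in candidates
            if not (c["kind"] == "retrieve" and c.get("unsafe"))]

    feasible_retrieves = [c for c in safe
                          if c["kind"] == "retrieve" and c.get("feasible")]

    # Switch from reveal to retrieve once scene-level readiness and
    # retrieve feasibility are both met, after an early reveal-biased phase.
    if (feasible_retrieves and step >= reveal_bias_steps
            and scene_readiness >= readiness_threshold):
        return max(feasible_retrieves, key=lambda c: c["score"])

    reveals = [c for c in safe if c["kind"] == "reveal"]
    if reveals:
        return max(reveals, key=lambda c: c["score"])
    return None  # no-op: fall back to the base trunk action
```

Because readiness is summarized at the scene level and unsafe retrieves are masked outright, the loop can neither reveal forever (feasibility eventually flips the branch) nor retrieve too early (the readiness and feasibility gates must both pass).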
## What Was Actually Evaluated
Two different kinds of evidence are included.
### 1. Trusted General-Task Anchor

This was kept narrow on purpose because only `dual_push_buttons` was trusted on this setup.
Trusted anchor evidence:

- official AnyBimanual local anchor summary on `dual_push_buttons`:
  - 25 episodes,
  - success 0.96
- live rerun on this RunPod:
  - 5 episodes,
  - scores [0, 100, 100, 0, 0],
  - mean score 40.0
Interpretation:
- the official trunk path is real and non-trivial on the one stable anchor task,
- this does not mean the local custom CLIP trunk was competitive broadly,
- this does not validate the other unstable RLBench target-like tasks.
### 2. Reveal/Retrieve Proxy Benchmark
This benchmark is useful for mechanism debugging, but it is not a real robot/physics benchmark.
The final reported held-out smoke benchmark used:
12foliage episodes,12bag episodes,12cloth episodes,36total episodes,- separate held-out procedural seeds from the adapter train/val splits.
Results:

Non-intervention / matched no-op:

- mean success 0.000
- foliage 0.000
- bag 0.000
- cloth 0.000
- visibility integral 2.275
- corridor availability 0.0312
- disturbance cost 0.7433

Intervention / adapter active:

- mean success 0.6667
- foliage 0.6667
- bag 0.7500
- cloth 0.5833
- visibility integral 19.9503
- corridor availability 0.7974
- disturbance cost 0.2835
- reocclusion rate 0.00278
- planner regret 0.1586
The active policy did in fact intervene on these tasks; it did not silently fall back to the trunk:

- all recorded selections on the final held-out smoke run were non-base candidates,
- typical successful pattern:
  - foliage: reveal (`pin_canopy`) then `retrieve`,
  - bag: reveal (`widen_mouth`) then `retrieve`,
  - cloth: reveal (`separate_layer`) then `retrieve`.
## Important Limitation
The reveal/retrieve proxy is a procedural synthetic environment, not a contact-rich robot simulator.
It has:
- synthetic RGB-D renders,
- internal latent state,
- hand-coded transition rules,
- scripted teacher/oracle supervision.
It does not have:
- rigid-body or deformable physics,
- actual robot kinematics,
- true contact/grasp simulation,
- a fair end-to-end manipulation distribution for a pretrained trunk.
Therefore:
- the proxy result is useful to validate adapter logic,
- the proxy result is not sufficient evidence that the trunk or the full system would outperform real baselines on RLBench or on the future custom benchmark.
## What Was Learned
The work supports the following conclusions:
- the structured adapter idea is still alive,
- the explicit reveal-state variables are worth keeping,
- task-routed reveal macros matter,
- retrieve-feasibility gating matters,
- the no-op fallback path for general tasks is sound,
- the old heavy memory/world-model story is not where the strongest evidence lives.
The work does not yet justify:
- a claim of broad general-task superiority,
- a claim that the current proxy benchmark is a fair end-to-end benchmark,
- a claim that the architecture is validated on realistic target-like sim tasks.
## Was The Adapter Trained?
Yes.
The final proxy adapter checkpoint was trained with:
- frozen trunk,
- adapter-only updates,
- trained components:
  - reveal/state head,
  - proposal prior,
  - transition model,
  - planner/reranker.
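The frozen-trunk, adapter-only update pattern can be sketched in PyTorch. The two `nn.Linear` modules below are stand-ins for the real trunk and adapter; this is a minimal sketch of the optimization setup, not the repo's actual training loop:

```python
import torch
import torch.nn as nn

trunk = nn.Linear(16, 8)   # stand-in for the pretrained trunk
adapter = nn.Linear(8, 8)  # stand-in for the structured adapter heads

# Freeze every trunk parameter; only adapter parameters receive gradients.
for p in trunk.parameters():
    p.requires_grad_(False)
trunk.eval()

opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)

obs = torch.randn(4, 16)
target = torch.randn(4, 8)

with torch.no_grad():  # trunk forward only, no graph through it
    feats = trunk(obs)
loss = nn.functional.mse_loss(adapter(feats), target)

opt.zero_grad()
loss.backward()
opt.step()

# Sanity check: no gradient ever reached the trunk.
assert all(p.grad is None for p in trunk.parameters())
```

Passing only `adapter.parameters()` to the optimizer, plus the `no_grad` trunk forward, is what makes the updates adapter-only.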
Proxy training data:

- train: 128 episodes per proxy family,
- val: 32 episodes per proxy family,
- proxy families: foliage, bag, cloth.
The final headline smoke benchmark was not run on those train/val episodes. It used separate held-out seeds.
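Seed-disjoint splits of this kind can be produced with a simple partition. This is a hypothetical sketch of the idea; the repo's actual seed assignment may differ:

```python
def split_seeds(n_train=128, n_val=32, n_heldout=12, base_seed=0):
    """Partition procedural episode seeds into disjoint train/val/held-out
    sets, so the headline smoke benchmark never sees a training seed."""
    total = n_train + n_val + n_heldout
    seeds = list(range(base_seed, base_seed + total))
    train = seeds[:n_train]
    val = seeds[n_train:n_train + n_val]
    heldout = seeds[n_train + n_val:]
    # Held-out seeds must not overlap the adapter train/val splits.
    assert not (set(train) | set(val)) & set(heldout)
    return train, val, heldout
```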
## Was This A Perfect Fairness Story?
No.
What is fair in the current export:
- matched active vs no-op comparisons on the same wrapped checkpoint,
- held-out procedural seeds for the final proxy benchmark,
- exact no-op and generic-task fallback tests.
What is still missing for a stronger paper-quality comparison:

- same-initialization `trunk_only` fine-tuned on the same proxy data,
- same-initialization `trunk + adapter` fine-tuned on the same proxy data,
- comparison on held-out proxy seeds,
- comparison on stable real-sim tasks.
## What Is Left To Do
The main remaining work is on real sim benchmarks, not more abstract proxy optimization.
Priority list:

1. Train a fair control:
   - same initialization,
   - `trunk_only` fine-tuned on the same reveal/retrieve proxy data,
   - compare against `trunk + adapter`.
2. Attach the adapter directly to a strong public trunk:
   - official AnyBimanual,
   - official PerAct2 / RVT,
   - or 3D FlowMatch Actor if practical.
3. Validate on stable real-sim tasks:
   - do not trust unstable RLBench tasks with infeasible waypoints,
   - rebuild a trustworthy target-like evaluation subset,
   - keep `dual_push_buttons` as a regression anchor only.
4. Add a deformable / garment benchmark:
   - this is the most relevant public step toward the future suitcase/clothes benchmark.
5. Only after that:
   - revisit larger RLBench sweeps,
   - or collect custom teleop data.
## Repository Layout
- `code/` - cleaned code snapshot used for the handoff
- `artifacts/outputs/` - current adapter checkpoints and training outputs
- `artifacts/reports/` - evaluation and debugging reports
- `artifacts/data/reveal_proxy/` - proxy train/val datasets used by this stage
- `legacy/` - exact older checkpoints and summaries that the current work depends on
- `docs/` - audit, iteration, and completion reports from this handoff
- `setup/` - same-machine environment notes and helper scripts
## Recommended Use Of This Repo
Use this repo as:
- the archival handoff state,
- the codebase to continue adapter work from,
- the source of the current checkpoints and benchmark reports,
- the baseline package before moving to real sim validation.
Do not use it as evidence that the architecture is already validated on realistic manipulation benchmarks. That validation is what should happen next.