Elastic-Occlusion Bimanual VLA Audit
Date: 2026-03-31
Repo audited: lsnu/VLAarchtests2
Snapshot used for this audit:
- Hugging Face repo SHA: 42b66a34eab9b7425a3a25003db808e1dd93b905
- Hub last_modified: 2026-03-31T01:19:56+00:00
- Local mirror root: /workspace/workspace/VLAarchtests2
- Code-focused mirror: /workspace/workspace/VLAarchtests2_code
- Reports-focused mirror: /workspace/workspace/VLAarchtests2_reports
This audit follows /workspace/instructions.md, which explicitly says the goal is not to invent a new general-purpose trunk. The goal is to attach a small structured adapter to a strong public bimanual trunk, preserve general-task competence, and make the novelty live in reveal/retrieve structure.
Bottom Line
The repo does not currently show that the latest full architecture is a competitive general bimanual policy.
It does show that the reveal/retrieve decomposition is worth keeping.
My direct recommendation is:
- keep the explicit reveal-state idea,
- keep the task-routed reveal proposal vocabulary,
- keep the retrieve-feasibility gate,
- stop treating the current memory stack and token-heavy world model as default requirements,
- stop treating the current local CLIP/RVT path as the scientific center,
- move to a strong public trunk and make the novelty a small adapter above it.
The last non-zero RLBench-style result is not fake, but it is not the architectural win you need. It is a retrieval/retargeting positive control, not evidence that the current elastic architecture is broadly competitive.
What The Current Code Actually Is
The current latest elastic policy is ElasticRevealBimanualPolicy in:
/workspace/workspace/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/policy.py
At lines 524-531 it instantiates:
- DualObservationMemory
- SymmetricCoordinatedChunkDecoder
- ElasticOcclusionStateHead
- ElasticOcclusionWorldModel
- CascadePlanner
So the latest path is a monolithic stack, not a small adapter.
The strongest part of the repo is the reveal-state representation in:
/workspace/workspace/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/reveal_head.py
The task metrics at lines 12-28 and their derived definitions at lines 78-98 already align unusually well with the intended real tasks:
- insertable_actor_corridor
- layer_separation_quality
- fold_preservation
- top_layer_stability
- lift_too_much_risk
This is the best scientific signal in the whole codebase.
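To make the reveal-state representation concrete, here is a minimal sketch of how the five metrics could be bundled and combined. The field names mirror the metrics listed above from reveal_head.py, but the `RevealState` class and the `retrieve_readiness` aggregation rule are hypothetical illustrations, not the repo's actual derived definitions.

```python
from dataclasses import dataclass

@dataclass
class RevealState:
    """Hypothetical bundle of the five reveal-state metrics named above."""
    insertable_actor_corridor: float  # can a gripper reach the target? [0, 1]
    layer_separation_quality: float   # how cleanly layers are separated
    fold_preservation: float          # how intact existing folds remain
    top_layer_stability: float        # risk the top layer collapses back
    lift_too_much_risk: float         # over-lifting / disturbance risk

    def retrieve_readiness(self) -> float:
        """Toy derived score: high access and stability, low disturbance."""
        positive = (self.insertable_actor_corridor
                    + self.layer_separation_quality
                    + self.top_layer_stability) / 3.0
        return max(0.0, positive - self.lift_too_much_risk)

state = RevealState(0.8, 0.7, 0.9, 0.6, 0.2)
print(round(state.retrieve_readiness(), 3))
```

The point of a typed bundle like this is that downstream components (planner, proposal prior) can consume named, task-meaningful scalars instead of an opaque embedding.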
The action decoder in:
/workspace/workspace/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/action_decoder.py
contains explicit task-routed proposal families. The current repo really does encode task-specific reveal/retrieve macro structure rather than only generic action sampling. This is a good fit for foliage, bag, and cloth/suitcase tasks.
The planner in:
/workspace/workspace/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/planner.py
contains real retrieve-feasibility blocking. At lines 421-434, retrieve-like modes are penalized when access or persistence is too low, support is too low, or reocclusion is too high. This is one of the most defensible pieces of structure in the repo.
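The gating pattern described above can be sketched as a score adjustment over candidate modes. The thresholds, penalty value, and mode names below are made up for illustration; the real constants at planner.py lines 421-434 may differ.

```python
# Illustrative sketch of retrieve-feasibility gating: retrieve-like
# candidates are penalized when access/persistence/support are too low
# or predicted reocclusion is too high. All thresholds are assumptions.
RETRIEVE_MODES = {"retrieve", "retrieve_bridge"}

def gate_candidate_score(mode: str, score: float, *,
                         access: float, persistence: float,
                         support: float, reocclusion: float,
                         min_access: float = 0.3, min_persistence: float = 0.4,
                         min_support: float = 0.3, max_reocclusion: float = 0.6,
                         penalty: float = 5.0) -> float:
    """Push infeasible retrieve candidates below reveal candidates."""
    if mode not in RETRIEVE_MODES:
        return score
    blocked = (access < min_access
               or persistence < min_persistence
               or support < min_support
               or reocclusion > max_reocclusion)
    return score - penalty if blocked else score

# A retrieve candidate with poor access drops below a reveal candidate.
print(gate_candidate_score("retrieve", 1.0, access=0.1, persistence=0.9,
                           support=0.8, reocclusion=0.2))  # blocked: -4.0
print(gate_candidate_score("reveal", 0.6, access=0.1, persistence=0.9,
                           support=0.8, reocclusion=0.2))  # unaffected: 0.6
```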
What The Current Code Does Not Show
The current repo does not show that:
- the latest full elastic policy is a strong general bimanual policy,
- the heavy memory stack helps,
- the heavy world model helps,
- the custom RVT branch is a faithful enough benchmark path to serve as the main scientific trunk.
The default backbone config in:
/workspace/workspace/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/backbones.py
still says:
- backbone_type: "clip" (line 20)
- model_name: "openai/clip-vit-base-patch32" (line 21)
The RVT path exists, but it is a custom adapter with hard-coded scene bounds at lines 39-46. That is useful engineering work, but it is not yet benchmark-faithful enough to support a negative verdict on RVT itself.
Also important: the strongest recent proxy checkpoints are still CLIP-based and were run with the world model disabled. In:
/workspace/workspace/VLAarchtests2/VLAarchtests/artifacts/outputs/r3d_handoff_phase/proxy_interaction_r3d_stage3_clip_rgbd_handoff_compact_phase_v7_selector_finetune_iter6_seed17/config_resolved.yaml
the resolved config shows:
- policy_type: elastic_reveal
- use_world_model: false
- model_name: openai/clip-vit-base-patch32
So the codebase contains a large world-model path, but the best proxy checkpoints were not actually validating that full path.
What The Tests Really Validate
The main fixtures in:
/workspace/workspace/VLAarchtests2_code/VLAarchtests/tests/conftest.py
use tiny settings:
- hidden dim 16
- chunk size 2
- field size 4
- random 16x16 RGB-D input
- dummy backbone
This is good for contract testing, not policy competence.
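The distinction matters: at these sizes a test can only verify interface contracts (shapes, routing, mode handling), never behavior quality. A minimal sketch of that kind of contract check, with hypothetical `DummyBackbone` and `DummyPolicy` stand-ins and the fixture dimensions above (the use of field size as action dimension is an assumption):

```python
import random

HIDDEN_DIM, CHUNK_SIZE, FIELD_SIZE, IMG = 16, 2, 4, 16

class DummyBackbone:
    """Stand-in encoder: ignores pixel content, returns a fixed-size feature."""
    def encode(self, rgbd):
        # contract: 16x16 spatial grid, 4 channels (RGB-D)
        assert len(rgbd) == IMG and len(rgbd[0]) == IMG and len(rgbd[0][0]) == 4
        return [0.0] * HIDDEN_DIM

class DummyPolicy:
    """Contract-level policy: maps a feature to an action chunk."""
    def __init__(self, backbone):
        self.backbone = backbone

    def act(self, rgbd):
        feat = self.backbone.encode(rgbd)
        # one FIELD_SIZE-dim action per chunk step
        return [[sum(feat)] * FIELD_SIZE for _ in range(CHUNK_SIZE)]

rgbd = [[[random.random()] * 4 for _ in range(IMG)] for _ in range(IMG)]
chunk = DummyPolicy(DummyBackbone()).act(rgbd)
print(len(chunk), len(chunk[0]))  # chunk_size x action_dim
```

A test like this passes regardless of how good the policy is, which is exactly why the 15 passing tests below say nothing about benchmark strength.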
My local short validation on the copied snapshot:
- command: pytest -q test_proxy_scripted_bench.py test_candidate_ranking_loss.py test_policy_topk_cascade.py test_task_routed_model_eval.py
- result: 15 passed, 2 warnings in 1.34s
That means the copied snapshot is internally consistent for small contract and proxy checks. It does not mean the policy is benchmark-strong.
What The Proxy Reports Actually Say
The most important proxy report is:
/workspace/workspace/VLAarchtests2_reports/VLAarchtests/artifacts/reports/sprint_v7_summary/reveal_sprint_summary.md
Main numbers:
- random: 0.433
- oracle: 0.407
- base_model: 0.280
- no_planner: 0.200
- no_memory: 0.323
- no_task_conditioning: 0.280
- no_geometry: 0.270
- cloth for base_model: 0.140
Interpretation:
- the learned controller is below random on its own candidate set,
- planner matters,
- memory looks harmful or at least unproven,
- task conditioning is flat in the checkpoint,
- geometry helps only modestly,
- cloth is the clearest ranking/utility failure case.
The follow-up debug report is even more revealing:
/workspace/workspace/VLAarchtests2_reports/VLAarchtests/artifacts/reports/sprint_v7_followup/deep_debug_summary.md
It shows:
- planner on teacher-supplied candidates is healthy,
- the dominant live failure is proposal-logit shortlisting,
- the cloth oracle-best candidate is excluded from the shortlist 85% of the time,
- removing the shortlist or ignoring proposal logits gives a large improvement,
- the cloth oracle ceiling rises sharply after a utility correction.
This is a strong signal that the structural reveal idea is not dead. The selector path is the bigger problem.
The best proxy controller in the repo is the task-routed controller:
/workspace/workspace/VLAarchtests2_reports/VLAarchtests/artifacts/reports/task_routed_proxy_v1/summary.md
Numbers:
- mean success: 0.4867
- foliage: 0.46
- bag: 0.41
- cloth: 0.59
This is useful evidence that task-specific bias matters. It is not evidence that one clean unified model already solved the problem.
What The General-Task Reports Actually Say
The current general-task anchor result is weak:
/workspace/workspace/VLAarchtests2_reports/VLAarchtests/artifacts/reports/general_task_anchor_20260330_dual_push_buttons/summary.md
It shows:
- public AnyBimanual release: success 0.960
- local official AnyBimanual eval: success 0.960
- local CLIP backbone-only: 0.000
- local elastic reveal proxy iter6: 0.000
- local RVT frozen fixed-bounds: 0.000
That is enough to say the current local custom path is not yet a valid scientific base for claims about general bimanual competence.
Was The Non-Zero RLBench Result Real?
The answer is:
- real as a positive control,
- not real as evidence that the elastic architecture is competitive on general RLBench tasks.
The relevant report is:
/workspace/workspace/VLAarchtests2_reports/VLAarchtests/artifacts/reports/dual_push_nonzero_branch_20260330/summary.md
It shows:
- direct rollout smoke: 0.0
- controller sweep: 0.0
- weighted rollout smoke: 0.0
- chunk-supervised probe: 0.0
- retargeted demo variants: 1.0
The later hybrid path makes the mechanism explicit. In:
/workspace/workspace/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/eval/run_rlbench_dual_push_full_arch_hybrid_eval.py
/workspace/workspace/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/eval/dual_push_full_arch_utils.py
the evaluation:
- builds a demo feature bank,
- retrieves the nearest demo,
- retargets demo poses to live button locations,
- creates hybrid candidates including retargeted_demo_base and retargeted_demo_bridge,
- lets the planner choose among these hybrid candidates and residualized controller variants.
So the non-zero line is not "cheating" in the narrow sense. But it is not the architecture you want to publish. It is hybrid demo retrieval plus retargeting.
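The retrieval-plus-retargeting mechanism can be reduced to a few lines. The feature vectors, demo-bank layout, and offset-based retarget rule below are all illustrative stand-ins, not the actual code in dual_push_full_arch_utils.py.

```python
# Toy sketch of the hybrid path: retrieve the nearest demo by feature
# distance, then shift its waypoints by the live button displacement.
def nearest_demo(live_feat, demo_bank):
    """Retrieve the demo whose feature is closest to the live observation."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(demo_bank, key=lambda d: sq_dist(d["feat"], live_feat))

def retarget(demo, live_button_xy):
    """Translate every demo waypoint by the button displacement."""
    dx = live_button_xy[0] - demo["button_xy"][0]
    dy = live_button_xy[1] - demo["button_xy"][1]
    return [(x + dx, y + dy) for x, y in demo["waypoints"]]

demo_bank = [
    {"feat": [0.0, 0.0], "button_xy": (0.1, 0.2),
     "waypoints": [(0.1, 0.2), (0.1, 0.25)]},
    {"feat": [1.0, 1.0], "button_xy": (0.5, 0.5),
     "waypoints": [(0.5, 0.5), (0.5, 0.55)]},
]
best = nearest_demo([0.9, 1.1], demo_bank)
print([(round(x, 3), round(y, 3)) for x, y in retarget(best, (0.6, 0.4))])
```

Notice that nothing here depends on the learned policy: a strong base trajectory comes almost entirely from the demo and the known button pose, which is why the 1.0 result above cannot be credited to the elastic architecture.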
My conclusion: do not treat this as proof that the current elastic policy is ready for a full RLBench sweep.
Direct Answers To The Main Questions
1. Do the tests invalidate the structural idea?
No.
They invalidate some implementation choices, especially:
- current learned shortlist/logit selector,
- current memory stack,
- current validation story for the heavy world model.
They do not invalidate the core reveal/retrieve structure.
2. Should the current architecture be pushed into a full RLBench sweep?
No.
Not before you first show:
- a strong public trunk baseline is reproduced fairly,
- trunk + adapter_noop is no worse than trunk,
- trunk + adapter_active helps on reveal/retrieve-like public tasks or clean proxy tasks.
3. Was the last non-zero RLBench score a real win?
No, not as an architectural claim.
It is a useful positive control showing that the evaluation plumbing can succeed when demo retrieval and retargeting provide a strong base trajectory. That is different from showing the elastic occlusion architecture itself is strong.
4. Is the idea still potentially novel?
Yes, but only if the claim is narrowed.
The claim should not be:
- new general bimanual VLA,
- new general 3D trunk,
- new overall SOTA bimanual foundation model.
The claim should be:
- a structured adapter for reveal/retrieve under elastic occlusion on top of a strong public trunk,
- with explicit reveal-state prediction,
- task-routed reveal macros,
- retrieve-feasibility gating,
- and task-specific disturbance/fold-preservation awareness.
That is modestly novel and scientifically cleaner.
Literature Positioning
The strongest nearby general bimanual references I would use are:
- PerAct2 benchmark and baseline: https://arxiv.org/abs/2407.00278
- AnyBimanual: https://arxiv.org/abs/2412.06779
- 3D FlowMatch Actor: https://arxiv.org/abs/2508.11002
- RDT-1B: https://arxiv.org/abs/2410.07864
- CoFreeVLA: https://arxiv.org/abs/2601.21712
For the target task family, the most relevant references are:
- Vision in Action: https://arxiv.org/abs/2506.15666
- ActiveVLA: https://arxiv.org/abs/2601.08325
- Interactive Perception for Deformable Object Manipulation: https://arxiv.org/abs/2403.05177
- Bimanual Deformable Bag Manipulation Using a Structure-of-Interest Based Neural Dynamics Model: https://arxiv.org/abs/2401.11432
- Occlusion-Aware Search for Object Retrieval in Clutter: https://arxiv.org/abs/2011.03334
- GarmentLab: https://arxiv.org/abs/2411.01200
My synthesis from those sources:
- Active perception under occlusion is already a real literature thread.
- Bag-specific active reveal and bag structure modeling already exist.
- Generic bimanual baselines already include strong public systems.
- What still looks underexplored is disturbance-aware reveal/retrieve with explicit fold-preservation style structure for a suitcase/clothes setting.
That makes the clothes/suitcase task your strongest publication angle.
Recommended Architecture
Do not keep the current monolith as the target system.
Build:
- a strong public trunk,
- plus a small elastic-occlusion adapter.
Trunk choice
Order of preference:
- 3D FlowMatch Actor, if the official path is practical.
- Official PerAct2 or official RVT-style path.
- Official AnyBimanual if it is the fastest stable local path and you want the lowest engineering risk.
Adapter contents
Keep exactly four core pieces:
- reveal-state head,
- task-routed proposal prior,
- retrieve-feasibility gate,
- lightweight reveal-state transition model.
Default removals from the current monolith:
- remove heavy dual memory as a required dependency,
- remove full token-heavy world model as default,
- make both optional ablations rather than the baseline path.
Critical requirement
Add a true no-op mode:
- adapter_off
- adapter_noop
- adapter_active
Without this, you cannot prove that the adapter preserves general competence.
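A minimal sketch of the three-mode wiring, assuming a residual adapter over the trunk's action output. `WrappedPolicy` and the scalar action format are hypothetical; the essential property is that adapter_noop exercises the adapter wiring while leaving the trunk's output unchanged.

```python
# Sketch of the adapter_off / adapter_noop / adapter_active contract.
class WrappedPolicy:
    def __init__(self, trunk_fn, adapter_fn, mode="adapter_off"):
        assert mode in {"adapter_off", "adapter_noop", "adapter_active"}
        self.trunk_fn, self.adapter_fn, self.mode = trunk_fn, adapter_fn, mode

    def act(self, obs):
        action = self.trunk_fn(obs)
        if self.mode == "adapter_off":
            return action  # adapter not even called
        delta = self.adapter_fn(obs, action)
        if self.mode == "adapter_noop":
            return action  # adapter runs (exercising the wiring), output discarded
        return [a + d for a, d in zip(action, delta)]  # adapter_active

trunk = lambda obs: [1.0, 2.0]
adapter = lambda obs, act: [0.1, -0.1]
for mode in ("adapter_off", "adapter_noop", "adapter_active"):
    print(mode, WrappedPolicy(trunk, adapter, mode).act(None))
```

Stage 2 of the benchmark plan below then reduces to checking that adapter_off and adapter_noop produce statistically indistinguishable success rates on the anchor set.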
Recommended Benchmark Strategy
Do not jump straight to a massive RLBench sweep on the current repo.
Use four stages:
Stage 1. Reproduce a strong public trunk
Pick one official trunk path and verify it locally on a small public anchor set.
Minimum anchor set:
- bimanual_push_box
- bimanual_lift_ball
- bimanual_dual_push_buttons
- bimanual_handover_item
- bimanual_lift_tray
Goal:
- official numbers are approximately reproducible,
- your local evaluation path is trustworthy.
Stage 2. Prove no regression
Add adapter wiring with:
- adapter_off
- adapter_noop
Goal:
trunk + adapter_noop matches trunk within noise on the anchor set.
Stage 3. Train only the structured adapter
Use public sim and clean proxy labels for:
- visibility gain,
- access corridor,
- persistence/support,
- reocclusion,
- disturbance,
- cloth fold-preservation style metrics when available.
Train the adapter with the trunk frozen or nearly frozen.
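The frozen-trunk training discipline can be illustrated with a deliberately tiny example: a scalar "trunk" and a scalar residual "adapter", where gradient updates touch only the adapter parameter. The mapping, loss, and learning rate are toy assumptions; in the real system the same rule would apply to the trunk's parameter tensors.

```python
# Toy Stage 3 sketch: trunk weight frozen, only the adapter weight trained
# with plain squared-error gradient descent.
def train_adapter(trunk_w, adapter_w, data, lr=0.1, steps=50):
    for _ in range(steps):
        for x, y in data:
            pred = trunk_w * x + adapter_w * x  # trunk + residual adapter
            grad_adapter = 2 * (pred - y) * x   # d(loss)/d(adapter_w)
            adapter_w -= lr * grad_adapter      # update adapter only
            # trunk_w is never updated: the trunk stays frozen
    return trunk_w, adapter_w

# Target mapping is y = 1.5 * x; the frozen trunk implements y = 1.0 * x,
# so the adapter should learn the residual gain of ~0.5.
data = [(1.0, 1.5), (2.0, 3.0), (-1.0, -1.5)]
trunk_w, adapter_w = train_adapter(1.0, 0.0, data)
print(trunk_w, round(adapter_w, 3))
```

The design point is that the no-regression guarantee from Stage 2 survives training: because the trunk parameters are untouched, adapter_off behavior is bit-identical before and after Stage 3.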
Stage 4. Evaluate on reveal/retrieve stress tasks
Use:
- the current proxy benchmark as a development instrument,
- PerAct2 bimanual tasks that stress containment/opening/retrieval,
- GarmentLab as soon as the stack is runnable.
For the paper story, you do not need to dominate all bimanual tasks. You need:
- same ballpark as strong baselines on general public tasks,
- clear gains on elastic-occlusion reveal/retrieve tasks.
What I Would Not Do Next
I would not:
- run a full RLBench sweep on the current monolithic elastic stack,
- spend more time trying to rescue CLIP as the scientific backbone,
- keep changing memory, planner, world model, and backbone all at once,
- claim the retargeted-demo hybrid result as proof of the full architecture.
What I Would Do Next
In order:
- Pick the public trunk to standardize on.
- Refactor the repo into trunk, adapter, and wrapped policy with a real no-op path.
- Port only the best structural parts:
- reveal-state metrics,
- task-routed proposal vocabulary,
- retrieve-feasibility gate.
- Make memory and world model optional ablations, not default requirements.
- Re-run the proxy benchmark only as a selector/utility-development tool.
- Move quickly to fair public trunk-preservation and reveal-task evaluations.
Final Recommendation
The project is still alive, but the win condition needs to change.
Do not try to prove that the current repo is already a new SOTA general bimanual VLA.
Do try to build a defensible paper around:
- a strong public bimanual trunk,
- plus a small structured elastic-occlusion adapter,
- with explicit reveal-state prediction and retrieve-feasibility control,
- validated by no-regression on public bimanual tasks and gains on reveal/retrieve tasks.
If you make that pivot now, the repo still contains enough good structure to become a credible research system.