
Elastic-Occlusion Bimanual VLA Audit

Date: 2026-03-31

Repo audited: lsnu/VLAarchtests2

Snapshot used for this audit:

  • Hugging Face repo SHA: 42b66a34eab9b7425a3a25003db808e1dd93b905
  • Hub last_modified: 2026-03-31T01:19:56+00:00
  • Local mirror root: /workspace/workspace/VLAarchtests2
  • Code-focused mirror: /workspace/workspace/VLAarchtests2_code
  • Reports-focused mirror: /workspace/workspace/VLAarchtests2_reports

This audit follows /workspace/instructions.md, which explicitly says the goal is not to invent a new general-purpose trunk. The goal is to attach a small structured adapter to a strong public bimanual trunk, preserve general-task competence, and make the novelty live in reveal/retrieve structure.

Bottom Line

The repo does not currently show that the latest full architecture is a competitive general bimanual policy.

It does show that the reveal/retrieve decomposition is worth keeping.

My direct recommendation is:

  • keep the explicit reveal-state idea,
  • keep the task-routed reveal proposal vocabulary,
  • keep the retrieve-feasibility gate,
  • stop treating the current memory stack and token-heavy world model as default requirements,
  • stop treating the current local CLIP/RVT path as the scientific center,
  • move to a strong public trunk and make the novelty a small adapter above it.

The last non-zero RLBench-style result is not fake, but it is not the architectural win you need. It is a retrieval/retargeting positive control, not evidence that the current elastic architecture is broadly competitive.

What The Current Code Actually Is

The current latest elastic policy is ElasticRevealBimanualPolicy in:

  • /workspace/workspace/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/policy.py

At lines 524-531 it instantiates:

  • DualObservationMemory
  • SymmetricCoordinatedChunkDecoder
  • ElasticOcclusionStateHead
  • ElasticOcclusionWorldModel
  • CascadePlanner

So the latest path is a monolithic stack, not a small adapter.

The strongest part of the repo is the reveal-state representation in:

  • /workspace/workspace/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/reveal_head.py

The task metrics at lines 12-28 and their derived definitions at lines 78-98 already align unusually well with the intended real tasks:

  • insertable_actor_corridor
  • layer_separation_quality
  • fold_preservation
  • top_layer_stability
  • lift_too_much_risk

This is the best scientific signal in the whole codebase.
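To make that signal concrete, here is a minimal sketch of a container for those five metrics. The field names come from the list above; the [0, 1] value convention, the docstrings, and the clamp helper are my own illustrative assumptions, not the actual reveal_head.py interface.

```python
from dataclasses import dataclass


@dataclass
class RevealStateMetrics:
    """Hypothetical container mirroring the five reveal-state metrics
    named in reveal_head.py. The real head predicts these as continuous
    values; the [0, 1] convention here is an assumption."""
    insertable_actor_corridor: float  # room for the inserting gripper
    layer_separation_quality: float   # how cleanly layers are separated
    fold_preservation: float          # how intact existing folds remain
    top_layer_stability: float        # stability of the topmost layer
    lift_too_much_risk: float         # risk of over-lifting / disturbance

    def clamp(self) -> "RevealStateMetrics":
        """Clip every metric into [0, 1] so downstream gating is well-defined."""
        c = lambda v: min(1.0, max(0.0, v))
        return RevealStateMetrics(
            c(self.insertable_actor_corridor),
            c(self.layer_separation_quality),
            c(self.fold_preservation),
            c(self.top_layer_stability),
            c(self.lift_too_much_risk),
        )
```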

The action decoder in:

  • /workspace/workspace/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/action_decoder.py

contains explicit task-routed proposal families. The current repo really does encode task-specific reveal/retrieve macro structure rather than only generic action sampling. This is a good fit for foliage, bag, and cloth/suitcase tasks.

The planner in:

  • /workspace/workspace/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/planner.py

contains real retrieve-feasibility blocking. At lines 421-434, retrieve-like modes are penalized when access or persistence is too low, support is too low, or reocclusion is too high. This is one of the most defensible pieces of structure in the repo.
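A minimal reconstruction of that gating logic, with illustrative thresholds and signal names (the repo's actual values at lines 421-434 may differ):

```python
def retrieve_feasibility_penalty(access, persistence, support, reocclusion,
                                 min_access=0.4, min_persistence=0.4,
                                 min_support=0.3, max_reocclusion=0.6,
                                 penalty=1e3):
    """Sketch of the planner's retrieve gate: retrieve-like modes are
    penalized when access or persistence is too low, support is too low,
    or reocclusion is too high. Thresholds are illustrative, not the
    repo's tuned values."""
    blocked = (access < min_access
               or persistence < min_persistence
               or support < min_support
               or reocclusion > max_reocclusion)
    return penalty if blocked else 0.0
```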

What The Current Code Does Not Show

The current repo does not show that:

  • the latest full elastic policy is a strong general bimanual policy,
  • the heavy memory stack helps,
  • the heavy world model helps,
  • the custom RVT branch is a faithful enough benchmark path to serve as the main scientific trunk.

The default backbone config in:

  • /workspace/workspace/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/models/backbones.py

still says:

  • backbone_type: "clip" at line 20
  • model_name: "openai/clip-vit-base-patch32" at line 21

The RVT path exists, but it is a custom adapter with hard-coded scene bounds at lines 39-46. That is useful engineering work, but it is not yet faithful enough to the official benchmark setup to support a negative verdict on RVT itself.

Also important: the strongest recent proxy checkpoints are still CLIP-based and were run with the world model disabled. In:

  • /workspace/workspace/VLAarchtests2/VLAarchtests/artifacts/outputs/r3d_handoff_phase/proxy_interaction_r3d_stage3_clip_rgbd_handoff_compact_phase_v7_selector_finetune_iter6_seed17/config_resolved.yaml

the resolved config shows:

  • policy_type: elastic_reveal
  • use_world_model: false
  • model_name: openai/clip-vit-base-patch32

So the codebase contains a large world-model path, but the best proxy checkpoints were not actually validating that full path.
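A tiny guard of the kind I would add to future audits, using the key names from the config excerpt above (the dict below stands in for the parsed YAML):

```python
def worldmodel_actually_validated(resolved_config: dict) -> bool:
    """Sanity check for audits like this one: a checkpoint only
    validates the world-model path if the resolved config both selects
    the elastic policy and enables the world model. Key names follow
    the config_resolved.yaml excerpt quoted in this report."""
    return (resolved_config.get("policy_type") == "elastic_reveal"
            and bool(resolved_config.get("use_world_model", False)))


# The iter6 checkpoint's resolved settings, as quoted above:
iter6 = {"policy_type": "elastic_reveal",
         "use_world_model": False,
         "model_name": "openai/clip-vit-base-patch32"}
```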

What The Tests Really Validate

The main fixtures in:

  • /workspace/workspace/VLAarchtests2_code/VLAarchtests/tests/conftest.py

use tiny settings:

  • hidden dim 16
  • chunk size 2
  • field size 4
  • random 16x16 RGB-D
  • dummy backbone

This is good for contract testing, not policy competence.
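For reference, a fixture in that spirit looks roughly like the helper below; the function name and dict keys are illustrative, and only the dimensions match conftest.py:

```python
import random


def make_tiny_obs(seed=0):
    """Contract-test style fixture in the spirit of the repo's
    conftest.py: a random 16x16 RGB-D observation plus tiny model dims
    (hidden 16, chunk 2, field 4). Good for shape and interface checks
    only, never for competence claims."""
    rng = random.Random(seed)
    h = w = 16
    # 4 channels: RGB + depth, random values in [0, 1).
    rgbd = [[[rng.random() for _ in range(4)]
             for _ in range(w)] for _ in range(h)]
    return {"rgbd": rgbd, "hidden_dim": 16, "chunk_size": 2, "field_size": 4}
```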

My local short validation on the copied snapshot:

  • command: pytest -q test_proxy_scripted_bench.py test_candidate_ranking_loss.py test_policy_topk_cascade.py test_task_routed_model_eval.py
  • result: 15 passed, 2 warnings in 1.34s

That means the copied snapshot is internally consistent for small contract and proxy checks. It does not mean the policy is benchmark-strong.

What The Proxy Reports Actually Say

The most important proxy report is:

  • /workspace/workspace/VLAarchtests2_reports/VLAarchtests/artifacts/reports/sprint_v7_summary/reveal_sprint_summary.md

Main numbers:

  • random: 0.433
  • oracle: 0.407
  • base_model: 0.280
  • no_planner: 0.200
  • no_memory: 0.323
  • no_task_conditioning: 0.280
  • no_geometry: 0.270
  • cloth for base_model: 0.140

Interpretation:

  • the learned controller is below random on its own candidate set,
  • planner matters,
  • memory looks harmful or at least unproven,
  • task conditioning is flat in the checkpoint,
  • geometry helps only modestly,
  • cloth is the clearest ranking/utility failure case.

The follow-up debug report is even more revealing:

  • /workspace/workspace/VLAarchtests2_reports/VLAarchtests/artifacts/reports/sprint_v7_followup/deep_debug_summary.md

It shows:

  • planner on teacher-supplied candidates is healthy,
  • the dominant live failure is proposal-logit shortlisting,
  • cloth oracle-best candidate is excluded from shortlist 85% of the time,
  • removing shortlist or ignoring proposal logits gives a large improvement,
  • cloth oracle ceiling rises sharply after a utility correction.

This is a strong signal that the structural reveal idea is not dead. The selector path is the bigger problem.
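The shortlisting failure can be quantified with a simple oracle-recall diagnostic. This is my sketch of the metric the debug report implies (how often the logit-ranked top-k contains the oracle-best candidate), not code from the repo:

```python
def shortlist_recall(utilities, proposal_logits, k):
    """Fraction of episodes in which the top-k shortlist (ranked by
    proposal logits) contains the oracle-best candidate (argmax of true
    utility). A low value means the selector, not the planner, is the
    live bottleneck. Inputs are per-episode lists of per-candidate scores."""
    hits = 0
    for utils, logits in zip(utilities, proposal_logits):
        oracle_best = max(range(len(utils)), key=lambda i: utils[i])
        shortlist = sorted(range(len(logits)), key=lambda i: logits[i],
                           reverse=True)[:k]
        hits += oracle_best in shortlist
    return hits / len(utilities)
```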

The best proxy controller in the repo is the task-routed controller:

  • /workspace/workspace/VLAarchtests2_reports/VLAarchtests/artifacts/reports/task_routed_proxy_v1/summary.md

Numbers:

  • mean success 0.4867
  • foliage 0.46
  • bag 0.41
  • cloth 0.59

This is useful evidence that task-specific bias matters. It is not evidence that one clean unified model already solved the problem.

What The General-Task Reports Actually Say

The current general-task anchor result is weak:

  • /workspace/workspace/VLAarchtests2_reports/VLAarchtests/artifacts/reports/general_task_anchor_20260330_dual_push_buttons/summary.md

It shows:

  • public AnyBimanual release: success 0.960
  • local official AnyBimanual eval: success 0.960
  • local clip backbone-only: 0.000
  • local elastic reveal proxy iter6: 0.000
  • local RVT frozen fixed-bounds: 0.000

That is enough to say the current local custom path is not yet a valid scientific base for claims about general bimanual competence.

Was The Non-Zero RLBench Result Real?

The answer is:

  • real as a positive control,
  • not real as evidence that the elastic architecture is competitive on general RLBench tasks.

The relevant report is:

  • /workspace/workspace/VLAarchtests2_reports/VLAarchtests/artifacts/reports/dual_push_nonzero_branch_20260330/summary.md

It shows:

  • direct rollout smoke: 0.0
  • controller sweep: 0.0
  • weighted rollout smoke: 0.0
  • chunk-supervised probe: 0.0
  • retargeted demo variants: 1.0

The later hybrid path makes the mechanism explicit. In:

  • /workspace/workspace/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/eval/run_rlbench_dual_push_full_arch_hybrid_eval.py
  • /workspace/workspace/VLAarchtests2_code/VLAarchtests/code/reveal_vla_bimanual/eval/dual_push_full_arch_utils.py

the evaluation:

  • builds a demo feature bank,
  • retrieves the nearest demo,
  • retargets demo poses to live button locations,
  • creates hybrid candidates including retargeted_demo_base and retargeted_demo_bridge,
  • lets the planner choose among these hybrid candidates and residualized controller variants.

So the non-zero line is not "cheating" in the narrow sense. But it is not the architecture you want to publish. It is hybrid demo retrieval plus retargeting.

My conclusion: do not treat this as proof that the current elastic policy is ready for a full RLBench sweep.
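For clarity, the retrieval-plus-retargeting mechanism reduces to something like the sketch below. The rigid-offset retarget rule and all names are illustrative simplifications of the hybrid eval scripts listed above, which build richer candidates (retargeted_demo_base, retargeted_demo_bridge) and let the planner choose:

```python
import math


def retrieve_and_retarget(live_goal, demo_bank):
    """Minimal sketch of the hybrid mechanism: retrieve the nearest demo
    by goal-feature distance, then retarget its poses by the offset
    between the demo's and the live button location. Each demo is a dict
    with a 'goal' feature vector and a list of waypoint 'poses'."""
    # Nearest-demo retrieval by Euclidean distance in goal space.
    best = min(demo_bank, key=lambda d: math.dist(d["goal"], live_goal))
    # Rigid retarget: shift every demo waypoint by the goal offset.
    offset = [lg - dg for lg, dg in zip(live_goal, best["goal"])]
    return [[wp_i + o_i for wp_i, o_i in zip(wp, offset)]
            for wp in best["poses"]]
```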

Direct Answers To The Main Questions

1. Do the tests invalidate the structural idea?

No.

They invalidate some implementation choices, especially:

  • current learned shortlist/logit selector,
  • current memory stack,
  • current validation story for the heavy world model.

They do not invalidate the core reveal/retrieve structure.

2. Should the current architecture be pushed into a full RLBench sweep?

No.

Not before you first show:

  • a strong public trunk baseline is reproduced fairly,
  • trunk + adapter_noop is no worse than trunk,
  • trunk + adapter_active helps on reveal/retrieve-like public tasks or clean proxy tasks.

3. Was the last non-zero RLBench score a real win?

No, not as an architectural claim.

It is a useful positive control showing that the evaluation plumbing can succeed when demo retrieval and retargeting provide a strong base trajectory. That is different from showing the elastic occlusion architecture itself is strong.

4. Is the idea still potentially novel?

Yes, but only if the claim is narrowed.

The claim should not be:

  • new general bimanual VLA,
  • new general 3D trunk,
  • new overall SOTA bimanual foundation model.

The claim should be:

  • a structured adapter for reveal/retrieve under elastic occlusion on top of a strong public trunk,
  • with explicit reveal-state prediction,
  • task-routed reveal macros,
  • retrieve-feasibility gating,
  • and task-specific disturbance/fold-preservation awareness.

That is modestly novel and scientifically cleaner.

Literature Positioning

The strongest nearby general bimanual references I would use are:

For the target task family, the most relevant references are:

My synthesis from those sources:

  • Active perception under occlusion is already a real literature thread.
  • Bag-specific active reveal and bag structure modeling already exist.
  • Generic bimanual baselines already include strong public systems.
  • What still looks underexplored is disturbance-aware reveal/retrieve with explicit fold-preservation-style structure for a suitcase/clothes setting.

That makes the clothes/suitcase task your strongest publication angle.

Recommended Architecture

Do not keep the current monolith as the target system.

Build:

  • a strong public trunk,
  • plus a small elastic-occlusion adapter.

Trunk choice

Order of preference:

  1. 3D FlowMatch Actor, if the official path is practical.
  2. Official PerAct2 or official RVT-style path.
  3. Official AnyBimanual if it is the fastest stable local path and you want the lowest engineering risk.

Adapter contents

Keep exactly four core pieces:

  • reveal-state head,
  • task-routed proposal prior,
  • retrieve-feasibility gate,
  • lightweight reveal-state transition model.

Default removals from the current monolith:

  • remove heavy dual memory as a required dependency,
  • remove full token-heavy world model as default,
  • make both optional ablations rather than the baseline path.

Critical requirement

Add a true no-op mode:

  • adapter_off
  • adapter_noop
  • adapter_active

Without this, you cannot prove that the adapter preserves general competence.
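The contract I have in mind is simple enough to state in code. This is a sketch, assuming a residual-add composition between trunk and adapter (the real composition may differ):

```python
from enum import Enum


class AdapterMode(Enum):
    OFF = "adapter_off"        # adapter not constructed at all
    NOOP = "adapter_noop"      # adapter runs but its output is ignored
    ACTIVE = "adapter_active"  # adapter output modifies the trunk action


def wrapped_action(trunk_action, adapter_residual, mode):
    """No-op contract: in OFF and NOOP modes the wrapped policy must
    return exactly the trunk action, so any anchor-set regression can
    only come from the ACTIVE path. Residual-add composition is an
    assumption here, not the repo's actual wiring."""
    if mode is AdapterMode.ACTIVE:
        return [t + r for t, r in zip(trunk_action, adapter_residual)]
    return list(trunk_action)  # bit-identical trunk behavior
```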

Recommended Benchmark Strategy

Do not jump straight to a massive RLBench sweep on the current repo.

Use four stages:

Stage 1. Reproduce a strong public trunk

Pick one official trunk path and verify it locally on a small public anchor set.

Minimum anchor set:

  • bimanual_push_box
  • bimanual_lift_ball
  • bimanual_dual_push_buttons
  • bimanual_handover_item
  • bimanual_lift_tray

Goal:

  • official numbers are approximately reproducible,
  • your local evaluation path is trustworthy.

Stage 2. Prove no regression

Add adapter wiring with:

  • adapter_off
  • adapter_noop

Goal:

  • trunk + adapter_noop matches trunk within noise on the anchor set.
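"Within noise" should be operationalized up front. One rough option, assuming per-episode Bernoulli success and an unpaired normal approximation (a paired-seed comparison would be stronger, this is only a sketch):

```python
import math


def within_noise(p_trunk, p_noop, n_episodes, z=1.96):
    """Rough two-proportion check for the Stage 2 goal: treat each
    episode's success as Bernoulli and ask whether the noop-adapter
    success rate sits within a z-sigma interval of the trunk's, given
    n_episodes evaluated per arm."""
    se = math.sqrt(p_trunk * (1 - p_trunk) / n_episodes
                   + p_noop * (1 - p_noop) / n_episodes)
    return abs(p_trunk - p_noop) <= z * se
```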

Stage 3. Train only the structured adapter

Use public sim and clean proxy labels for:

  • visibility gain,
  • access corridor,
  • persistence/support,
  • reocclusion,
  • disturbance,
  • cloth fold-preservation style metrics when available.

Train the adapter with the trunk frozen or nearly frozen.
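Concretely, the freeze can live entirely in optimizer wiring. Below is a pure-Python sketch of the parameter-group split, assuming an `adapter.` name-prefix convention (the groups follow the torch optimizer param-group shape, but nothing here depends on torch):

```python
def adapter_param_groups(named_params, adapter_prefix="adapter.",
                         trunk_lr=0.0, adapter_lr=1e-4):
    """Stage 3 optimizer wiring sketch: freeze (or nearly freeze) the
    trunk by giving its parameters a zero or tiny learning rate, and
    train only adapter-prefixed parameters at full rate. `named_params`
    is any iterable of (name, param) pairs; the prefix convention is an
    assumption, not the repo's."""
    trunk, adapter = [], []
    for name, p in named_params:
        (adapter if name.startswith(adapter_prefix) else trunk).append(p)
    return [{"params": trunk, "lr": trunk_lr},
            {"params": adapter, "lr": adapter_lr}]
```

Setting `trunk_lr` to a small non-zero value gives the "nearly frozen" variant without changing the wiring.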

Stage 4. Evaluate on reveal/retrieve stress tasks

Use:

  • the current proxy benchmark as a development instrument,
  • PerAct2 bimanual tasks that stress containment/opening/retrieval,
  • GarmentLab as soon as the stack is runnable.

For the paper story, you do not need to dominate all bimanual tasks. You need:

  • same ballpark as strong baselines on general public tasks,
  • clear gains on elastic-occlusion reveal/retrieve tasks.

What I Would Not Do Next

I would not:

  • run a full RLBench sweep on the current monolithic elastic stack,
  • spend more time trying to rescue CLIP as the scientific backbone,
  • keep changing memory, planner, world model, and backbone all at once,
  • claim the retargeted-demo hybrid result as proof of the full architecture.

What I Would Do Next

In order:

  1. Pick the public trunk to standardize on.
  2. Refactor the repo into trunk, adapter, and wrapped policy with a real no-op path.
  3. Port only the best structural parts:
    • reveal-state metrics,
    • task-routed proposal vocabulary,
    • retrieve-feasibility gate.
  4. Make memory and world model optional ablations, not default requirements.
  5. Re-run the proxy benchmark only as a selector/utility-development tool.
  6. Move quickly to fair public trunk-preservation and reveal-task evaluations.

Final Recommendation

The project is still alive, but the win condition needs to change.

Do not try to prove that the current repo is already a new SOTA general bimanual VLA.

Do try to build a defensible paper around:

  • a strong public bimanual trunk,
  • plus a small structured elastic-occlusion adapter,
  • with explicit reveal-state prediction and retrieve-feasibility control,
  • validated by no-regression on public bimanual tasks and gains on reveal/retrieve tasks.

If you make that pivot now, the repo still contains enough good structure to become a credible research system.