
VLAarchtests3

VLAarchtests3 is the organized export of the elastic-occlusion bimanual VLA handoff completed on a 1x L40S RunPod machine.

It is a successor snapshot to the earlier VLAarchtests and VLAarchtests2 work:

  • VLAarchtests: earlier architecture-search and benchmark-debugging work.
  • VLAarchtests2: larger exploratory branch with frequent model changes, mixed benchmark artifacts, and several legacy results that needed manual reinterpretation.
  • VLAarchtests3: cleaned export focused on the final handoff state, the adapter refactor, the validated tests, the current checkpoints, and the reports needed to continue from here.

What Was Done

The main engineering outcome was a refactor from a monolithic elastic policy into a cleaner trunk + structured adapter + no-op fallback stack.

The final exported code contains:

  • a clean wrapped-policy interface with trunk_only, adapter_noop, and adapter_active modes,
  • a structured elastic-occlusion adapter with:
    • reveal-state prediction,
    • task-routed reveal/retrieve proposal families,
    • retrieve-feasibility gating,
    • a lightweight reveal-state transition model,
  • explicit tests that protect:
    • no-op equivalence,
    • generic-task fallback,
    • benchmark protocol identity,
    • unsafe retrieve blocking,
    • cloth-specific selection behavior.
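The wrapped-policy modes above can be sketched as a small dispatch wrapper. This is an illustrative sketch only: the class and function names (`WrappedPolicy`, `act`) are assumptions, not the repo's actual API; only the three mode names come from the export.

```python
# Hypothetical sketch of the three-mode wrapped-policy dispatch.
# Names here are illustrative, not the repo's real interface.

class WrappedPolicy:
    MODES = ("trunk_only", "adapter_noop", "adapter_active")

    def __init__(self, trunk, adapter, mode="trunk_only"):
        assert mode in self.MODES
        self.trunk = trunk
        self.adapter = adapter
        self.mode = mode

    def act(self, obs):
        base_action = self.trunk(obs)
        if self.mode == "trunk_only":
            return base_action
        if self.mode == "adapter_noop":
            # Adapter runs but must leave the trunk action unchanged;
            # this is the behavior the no-op equivalence test protects.
            self.adapter(obs, base_action)
            return base_action
        # adapter_active: adapter may override with a reveal/retrieve macro.
        return self.adapter(obs, base_action)
```

The point of keeping all three modes on one wrapper is that the benchmark protocol stays identical across modes; only the action source changes.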

The most important debugging pass was in the planner/gating logic. The original active path could reveal forever or retrieve too early. The final planner fixes were to:

  • summarize readiness at the scene level rather than at the worst-candidate level,
  • hard-mask unsafe retrieve candidates,
  • switch from reveal to retrieve once feasibility is met,
  • use task-specific bag and cloth readiness criteria,
  • prefer reveal macros early and retrieve later.
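The gating fixes above can be condensed into a short selection routine. Everything here is an assumption for illustration (the candidate dict fields, the `feasibility_threshold` default); only the decision order, hard-mask unsafe retrieves, switch to retrieve once the scene is ready, otherwise prefer reveal, mirrors the description.

```python
# Illustrative sketch of the fixed reveal/retrieve gating.
# Candidate fields and the threshold value are invented for the example.

def select_macro(candidates, scene_readiness, feasibility_threshold=0.5):
    """Pick a reveal or retrieve macro from scored candidates.

    candidates: dicts with "kind" ("reveal"/"retrieve"), "score",
    and a "safe" flag for retrieve candidates.
    scene_readiness: scene-level summary in [0, 1], not worst-candidate.
    """
    # Hard-mask unsafe retrieve candidates.
    usable = [c for c in candidates
              if c["kind"] != "retrieve" or c["safe"]]
    # Switch from reveal to retrieve once feasibility is met.
    if scene_readiness >= feasibility_threshold:
        retrieves = [c for c in usable if c["kind"] == "retrieve"]
        if retrieves:
            return max(retrieves, key=lambda c: c["score"])
    # Otherwise prefer reveal macros (early phase).
    reveals = [c for c in usable if c["kind"] == "reveal"]
    if reveals:
        return max(reveals, key=lambda c: c["score"])
    return None
```

Gating on a scene-level readiness summary, rather than the worst candidate, is what prevents the old reveal-forever behavior; the hard mask is what prevents retrieving too early through an unsafe candidate.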

What Was Actually Evaluated

Two different kinds of evidence are included.

1. Trusted General-Task Anchor

This was kept narrow on purpose because only dual_push_buttons was trusted on this setup.

Trusted anchor evidence:

  • official AnyBimanual local anchor summary on dual_push_buttons:
    • 25 episodes
    • success 0.96
  • live rerun on this RunPod:
    • 5 episodes
    • scores [0, 100, 100, 0, 0]
    • mean score 40.0

Interpretation:

  • the official trunk path is real and non-trivial on the one stable anchor task,
  • this does not mean the local custom CLIP trunk was competitive broadly,
  • this does not validate the other unstable RLBench target-like tasks.

2. Reveal/Retrieve Proxy Benchmark

This benchmark is useful for mechanism debugging, but it is not a real robot/physics benchmark.

The final reported held-out smoke benchmark used:

  • 12 foliage episodes,
  • 12 bag episodes,
  • 12 cloth episodes,
  • 36 total episodes,
  • separate held-out procedural seeds from the adapter train/val splits.
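The held-out-seed requirement above is simple to enforce mechanically. A minimal sketch, with invented seed ranges (the real split's seed values are not recorded here); only the disjointness property reflects the benchmark protocol:

```python
# Sketch: pick evaluation seeds guaranteed disjoint from train/val seeds.
# The starting offset is arbitrary; only disjointness matters.

def heldout_seeds(train_seeds, val_seeds, n_eval, start=10_000):
    used = set(train_seeds) | set(val_seeds)
    seeds, s = [], start
    while len(seeds) < n_eval:
        if s not in used:
            seeds.append(s)
        s += 1
    return seeds
```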

Results:

  metric                  non-intervention     intervention
                          (matched no-op)      (adapter active)
  mean success            0.000                0.6667
  foliage success         0.000                0.6667
  bag success             0.000                0.7500
  cloth success           0.000                0.5833
  visibility integral     2.275                19.9503
  corridor availability   0.0312               0.7974
  disturbance cost        0.7433               0.2835
  reocclusion rate        -                    0.00278
  planner regret          -                    0.1586

The active policy genuinely intervened on these tasks instead of silently falling back to the trunk:

  • all recorded selections on the final held-out smoke run were non-base candidates,
  • typical successful pattern:
    • foliage: reveal (pin_canopy) then retrieve,
    • bag: reveal (widen_mouth) then retrieve,
    • cloth: reveal (separate_layer) then retrieve.

Important Limitation

The reveal/retrieve proxy is a procedural synthetic environment, not a contact-rich robot simulator.

It has:

  • synthetic RGB-D renders,
  • internal latent state,
  • hand-coded transition rules,
  • scripted teacher/oracle supervision.

It does not have:

  • rigid-body or deformable physics,
  • actual robot kinematics,
  • true contact/grasp simulation,
  • a fair end-to-end manipulation distribution for a pretrained trunk.

Therefore:

  • the proxy result is useful to validate adapter logic,
  • the proxy result is not sufficient evidence that the trunk or the full system would outperform real baselines on RLBench or on the future custom benchmark.

What Was Learned

The work supports the following conclusions:

  • the structured adapter idea is still alive,
  • the explicit reveal-state variables are worth keeping,
  • task-routed reveal macros matter,
  • retrieve-feasibility gating matters,
  • the no-op fallback path for general tasks is sound,
  • the old heavy memory/world-model story is not where the strongest evidence lives.

The work does not yet justify:

  • a claim of broad general-task superiority,
  • a claim that the current proxy benchmark is a fair end-to-end benchmark,
  • a claim that the architecture is validated on realistic target-like sim tasks.

Was The Adapter Trained?

Yes.

The final proxy adapter checkpoint was trained with:

  • frozen trunk,
  • adapter-only updates,
  • trained components:
    • reveal/state head,
    • proposal prior,
    • transition model,
    • planner/reranker.
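The frozen-trunk, adapter-only update scheme can be sketched framework-agnostically as a parameter filter: only parameters belonging to the four trained components receive gradient steps. The component key names below mirror the list above; the function name and naming convention (`component.subname`) are assumptions for illustration.

```python
# Sketch: select only adapter-component parameters for optimization,
# leaving the trunk frozen. Naming convention is hypothetical.

ADAPTER_COMPONENTS = ("reveal_state_head", "proposal_prior",
                      "transition_model", "planner_reranker")

def trainable_params(named_params):
    """Filter (name, param) pairs down to adapter components only."""
    return [(n, p) for n, p in named_params
            if n.split(".")[0] in ADAPTER_COMPONENTS]
```

In a typical PyTorch setup this filter would feed the optimizer's parameter list while the trunk's parameters have `requires_grad` set to False.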

Proxy training data:

  • train: 128 episodes per proxy family,
  • val: 32 episodes per proxy family,
  • proxy families:
    • foliage,
    • bag,
    • cloth.

The final headline smoke benchmark was not run on those train/val episodes. It used separate held-out seeds.

Was This A Perfect Fairness Story?

No.

What is fair in the current export:

  • matched active vs no-op comparisons on the same wrapped checkpoint,
  • held-out procedural seeds for the final proxy benchmark,
  • exact no-op and generic-task fallback tests.
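The exact no-op fallback check in the list above amounts to a bitwise action comparison between the trunk and the wrapped no-op policy on the same observations. A minimal sketch, with illustrative names:

```python
# Sketch of an exact no-op equivalence check: the adapter_noop path must
# reproduce the trunk's actions exactly on the same observations.

def check_noop_equivalence(trunk_policy, noop_policy, observations):
    """Return True iff both policies emit identical actions everywhere."""
    return all(trunk_policy(obs) == noop_policy(obs) for obs in observations)
```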

What is still missing for a stronger paper-quality comparison:

  1. same-initialization trunk_only fine-tuned on the same proxy data,
  2. same-initialization trunk + adapter fine-tuned on the same proxy data,
  3. comparison on held-out proxy seeds,
  4. comparison on stable real-sim tasks.

What Is Left To Do

The main remaining work is on real sim benchmarks, not more abstract proxy optimization.

Priority list:

  1. Train a fair control:

    • same initialization,
    • trunk_only fine-tuned on the same reveal/retrieve proxy data,
    • compare against trunk + adapter.
  2. Attach the adapter directly to a strong public trunk:

    • official AnyBimanual,
    • official PerAct2 / RVT,
    • or 3D FlowMatch Actor if practical.
  3. Validate on stable real-sim tasks:

    • do not trust unstable RLBench tasks with infeasible waypoints,
    • rebuild a trustworthy target-like evaluation subset,
    • keep dual_push_buttons as a regression anchor only.
  4. Add a deformable / garment benchmark:

    • this is the most relevant public step toward the future suitcase/clothes benchmark.
  5. Only after that:

    • revisit larger RLBench sweeps,
    • or collect custom teleop data.

Repository Layout

  • code/
    • cleaned code snapshot used for the handoff
  • artifacts/outputs/
    • current adapter checkpoints and training outputs
  • artifacts/reports/
    • evaluation and debugging reports
  • artifacts/data/reveal_proxy/
    • proxy train/val datasets used by this stage
  • legacy/
    • exact older checkpoints and summaries that the current work depends on
  • docs/
    • audit, iteration, and completion reports from this handoff
  • setup/
    • same-machine environment notes and helper scripts

Recommended Use Of This Repo

Use this repo as:

  • the archival handoff state,
  • the codebase to continue adapter work from,
  • the source of the current checkpoints and benchmark reports,
  • the baseline package before moving to real sim validation.

Do not use it as evidence that the architecture is already validated on realistic manipulation benchmarks. That validation is what should happen next.
