VLAarchtests2

Bundle staged from /workspace on 2026-03-31 UTC.

This repo is the follow-on organization repo to lsnu/VLAarchtests. It includes:

current code under VLAarchtests/
current third-party baseline code under third_party/
current baseline runs, replay artifacts, demo roots, and released checkpoint material under baselines/
current training outputs and checkpoints under outputs/
current logs under reports/
environment recreation files under environment/
raw results and change/test logs at the repo root
the previous repo README under history/VLAarchtests_previous_README.md
the active handoff file under handoff/instructions4.md

Top-Level Contents

VLAarchtests/
- code, tests, configs, generated configs, reports, checkpoints, and proxy datasets from the current RunPod workspace
third_party/AnyBimanual/
- local AnyBimanual checkout used for the official overlap baseline branch, including local compatibility patches
baselines/
- released AnyBimanual checkpoint material
- overlap replay artifacts
  - HF export packaging note: baselines/AnyBimanual_overlap_replay/multi/ is sharded into subdirectories to stay under the Hub limit of 10,000 files per directory
- overlap run directories
- local subset3 demo roots used by the overlap branch
outputs/
- RLBench training outputs and checkpoints used by the current anchor, RVT, dual-push, and elastic-controller branches
reports/
- training and evaluation logs copied from /workspace/reports
environment/
- machine snapshot, package lists, and setup helpers
history/
- copied previous-repo README
handoff/
- active sprint instruction file
RESULTS_RAW.md
- raw result tables and final official overlap eval outputs
CHANGE_AND_TEST_LOG.md
- file-level change log and executed test commands
MODEL_AND_ARTIFACT_INDEX.md
- staged directory map with main artifact roots
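The packaging note under baselines/ says the overlap replay directory is split into subdirectories so that no single directory exceeds the Hub's 10,000-file limit. A minimal sketch of that kind of sharding, assuming a flat source directory and a hypothetical `shard_NNNN` naming scheme (the real shard layout under `multi/` may differ, and `plan_shards` / `apply_shards` are illustrative names, not repo code):

```python
import shutil
from pathlib import Path

MAX_FILES_PER_DIR = 10_000  # Hub per-directory file limit quoted in the packaging note


def plan_shards(names: list[str], max_files: int = MAX_FILES_PER_DIR) -> dict[str, list[str]]:
    """Hypothetical: group file names into shard_0000, shard_0001, ... with at most max_files each."""
    names = sorted(names)
    return {
        f"shard_{i // max_files:04d}": names[i:i + max_files]
        for i in range(0, len(names), max_files)
    }


def apply_shards(src: Path, dst: Path, max_files: int = MAX_FILES_PER_DIR) -> None:
    """Hypothetical: copy the files of flat directory src into the sharded layout under dst."""
    names = [p.name for p in src.iterdir() if p.is_file()]
    for shard, members in plan_shards(names, max_files).items():
        (dst / shard).mkdir(parents=True, exist_ok=True)
        for name in members:
            shutil.copy2(src / name, dst / shard / name)
```

Planning the layout as a pure function keeps the limit check testable without touching the filesystem.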

Previous Repo Coverage

The earlier lsnu/VLAarchtests repo covered the 2026-03-25/26 work. Its README is copied verbatim at:

history/VLAarchtests_previous_README.md

Previous-repo items explicitly referenced there include:

compact, spatial, compact-phase, and spatial-phase proxy branches
earlier RLBench direct-policy and kNN runs
environment recreation files
prior raw result tables

Current Session Additions

Current-session folders added or expanded in this repo include:

VLAarchtests/artifacts/reports/sprint_v7_summary/
VLAarchtests/artifacts/reports/sprint_v7_followup/
VLAarchtests/artifacts/reports/selector_finetune_v7_iterations/
VLAarchtests/artifacts/reports/selector_finetune_v7_iter6/
VLAarchtests/artifacts/reports/selector_finetune_v7_iter7/
VLAarchtests/artifacts/reports/selector_finetune_v7_iter8/
VLAarchtests/artifacts/reports/task_routed_proxy_v1/
VLAarchtests/artifacts/reports/rlbench_general_debug_20260330/
VLAarchtests/artifacts/reports/anybimanual_overlap_baseline_20260330/
VLAarchtests/artifacts/reports/bag_mode_specialization_20260330/
VLAarchtests/artifacts/reports/general_task_anchor_20260330_dual_push_buttons/
VLAarchtests/artifacts/reports/rvt_overlap_branch_20260330/
VLAarchtests/artifacts/reports/rvt_overlap_branch_fixedbounds_20260330/
VLAarchtests/artifacts/reports/dual_push_nonzero_branch_20260330/
VLAarchtests/artifacts/reports/dual_push_full_arch_hybrid_20260331/

Raw Results Snapshot

Proxy sprint v7

Source:

VLAarchtests/artifacts/reports/sprint_v7_summary/reveal_sprint_summary_compact.json

Raw values:

base model mean success: 0.28
base per-task: foliage 0.39, bag 0.31, cloth 0.14
random mean success: 0.43333333333333335
candidate0 mean success: 0.2
oracle mean success: 0.4066666666666667
scripted mean success: 1.0
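The base mean is consistent with an unweighted average of the three per-task rates, since (0.39 + 0.31 + 0.14) / 3 = 0.28. A short check, assuming equal task weighting (the compact JSON may aggregate differently):

```python
from statistics import fmean

# Per-task success rates of the base model, copied from the raw values above.
base_per_task = {"foliage": 0.39, "bag": 0.31, "cloth": 0.14}

# Assumption: the reported mean is the unweighted mean over the three tasks.
base_mean = fmean(base_per_task.values())
print(f"base mean success: {base_mean:.2f}")  # prints "base mean success: 0.28"
```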

Eval-time ablations

Source:

VLAarchtests/artifacts/reports/sprint_v7_summary/reveal_sprint_summary_compact.json

Raw values:

no_planner: 0.2
no_memory: 0.3233333333333333
no_task_conditioning: 0.28
no_geometry: 0.27
no_camera_pose: 0.29333333333333333
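Reading each ablation against the base model mean of 0.28 from the proxy sprint v7 section gives a per-component delta. A quick sketch, assuming both numbers are means over the same episode set:

```python
# Base mean and ablation means copied from the proxy sprint v7 raw values.
BASE_MEAN = 0.28
ablations = {
    "no_planner": 0.2,
    "no_memory": 0.3233333333333333,
    "no_task_conditioning": 0.28,
    "no_geometry": 0.27,
    "no_camera_pose": 0.29333333333333333,
}

# Negative delta: removing the component lowered mean success relative to the base model.
deltas = {name: round(score - BASE_MEAN, 4) for name, score in ablations.items()}
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{name:22s} {delta:+.4f}")
```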

Selector checkpoints

Sources:

VLAarchtests/artifacts/reports/selector_finetune_v7_iter6/default/reveal_benchmark.json
VLAarchtests/artifacts/reports/selector_finetune_v7_iter7/full_fixed_default/reveal_benchmark.json
VLAarchtests/artifacts/reports/selector_finetune_v7_iter8/bag_fixed_default/reveal_benchmark.json
VLAarchtests/artifacts/reports/task_routed_proxy_v1/summary.md

Raw values:

iter6 mean success: 0.4566666666666667
- foliage 0.46, bag 0.4, cloth 0.51
iter7 mean success: 0.4666666666666666
- foliage 0.4, bag 0.41, cloth 0.59
iter8 bag-only fixed slice: 0.41
routed controller mean success: 0.48666666666666664
- routing rule: foliage -> iter6, bag -> iter8, cloth -> iter8
- per-task: foliage 0.46, bag 0.41, cloth 0.59
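The routed controller is a per-task lookup over fine-tuned checkpoints. A minimal sketch of that dispatch, where `select_checkpoint` is a hypothetical name (not the repo's routing code), plus a check that the routed mean equals the unweighted mean of the routed per-task values:

```python
from statistics import fmean

# Routing rule copied from the raw values above: task name -> checkpoint.
ROUTES = {"foliage": "iter6", "bag": "iter8", "cloth": "iter8"}

# Routed per-task success as reported (not recomputed from the checkpoints).
routed_per_task = {"foliage": 0.46, "bag": 0.41, "cloth": 0.59}


def select_checkpoint(task: str) -> str:
    """Hypothetical dispatch: return the checkpoint for a task, failing loudly on unknown tasks."""
    try:
        return ROUTES[task]
    except KeyError:
        raise ValueError(f"no route for task {task!r}") from None


routed_mean = fmean(routed_per_task.values())
print(select_checkpoint("bag"), f"{routed_mean:.4f}")  # prints "iter8 0.4867"
```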

Real baseline compare on proxy suite

Source:

VLAarchtests/artifacts/reports/real_baseline_compare_v7_full/reveal_benchmark.json

Raw values:

baseline_rgbd_stage3 mean success: 0.31
- foliage 0.21, bag 0.15, cloth 0.57
iter5_selector mean success: 0.45
- foliage 0.44, bag 0.4, cloth 0.51
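Because both entries report the same three tasks, a per-task difference shows where the gap between the two means (0.45 vs 0.31) comes from. A small sketch using the values above:

```python
# Per-task success on the proxy suite, copied from the raw values above.
baseline_rgbd_stage3 = {"foliage": 0.21, "bag": 0.15, "cloth": 0.57}
iter5_selector = {"foliage": 0.44, "bag": 0.4, "cloth": 0.51}

# Positive delta: iter5_selector beats the RGB-D baseline on that task.
per_task_delta = {
    task: round(iter5_selector[task] - baseline_rgbd_stage3[task], 2)
    for task in baseline_rgbd_stage3
}
print(per_task_delta)  # prints {'foliage': 0.23, 'bag': 0.25, 'cloth': -0.06}
```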

RLBench recovered push-box comparator

Sources:

reports/rlbench_general_debug/rlbench_push_box_fair_step1_final_knn_ep10_x99_res224_len180_train80_fixed/bimanual_push_box/rollout_eval.json
reports/rlbench_general_debug/rlbench_push_box_historical_step1_knn_ep10_x99_res224_len180_train80_fixed/bimanual_push_box/rollout_eval.json

Raw values:

current fair-step1 final mean success: 0.7
current fair-step1 final successes:
- [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
historical push-box control mean success: 0.4
historical push-box control successes:
- [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
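Each reported mean success is the fraction of successful episodes over the 10 listed rollouts (7/10 and 4/10). A minimal recomputation from the per-episode lists:

```python
# Per-episode success flags copied from the rollout_eval.json values above.
fair_step1 = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
historical = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]


def mean_success(flags: list[float]) -> float:
    """Mean success over episodes; each flag is 1.0 (success) or 0.0 (failure)."""
    if not flags:
        raise ValueError("no episodes")
    return sum(flags) / len(flags)


print(mean_success(fair_step1), mean_success(historical))  # prints "0.7 0.4"
```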

Official AnyBimanual overlap branch

Sources:

baselines/AnyBimanual_overlap_runs/peract_bc_subset3_overlap_smoke200_fixpretrain_nowandb3/PERACT_BC/seed0/training.log
reports/anybimanual_subset3_overlap_resume1000_eval.log

Raw train milestones:

global step 300: loss 40.91718
global step 400: loss 33.26684
global step 500: loss 36.07054
global step 600: loss 35.32345
global step 700: loss 28.50959
global step 800: loss 23.60169
global step 900: loss 15.28901
the run reached weights/1000 and training exited cleanly
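To track these milestones across runs, a small parser over lines in the format shown above can help. This assumes the milestone text as listed here; the raw training.log is not reproduced in this README and may be formatted differently, so the regex is an assumption:

```python
import re

# Milestone lines as listed above (the exact training.log format may differ).
MILESTONES = """\
global step 300: loss 40.91718
global step 400: loss 33.26684
global step 500: loss 36.07054
global step 600: loss 35.32345
global step 700: loss 28.50959
global step 800: loss 23.60169
global step 900: loss 15.28901
"""

PATTERN = re.compile(r"global step (\d+): loss ([0-9.]+)")


def parse_milestones(text: str) -> dict[int, float]:
    """Map global step -> loss for every line matching PATTERN."""
    return {int(step): float(loss) for step, loss in PATTERN.findall(text)}


losses = parse_milestones(MILESTONES)
best_step = min(losses, key=losses.get)
print(best_step, losses[best_step])  # prints "900 15.28901"
```

The loss is not monotonic (step 500 is above step 400), so the lowest logged loss is taken explicitly rather than assuming the last entry.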

Raw eval outputs:

source log: reports/anybimanual_subset3_overlap_resume1000_eval.log
summary files:
- VLAarchtests/artifacts/reports/anybimanual_overlap_baseline_20260330/resume1000_summary/summary.md
- VLAarchtests/artifacts/reports/anybimanual_overlap_baseline_20260330/resume1000_summary/summary.json
local last complete step: 1000
local mean success: 0.16
local per-task success:
- coordinated_push_box: 0.0
- coordinated_lift_ball: 0.0
- dual_push_buttons: 0.48
local per-task return:
- coordinated_push_box: 0.0
- coordinated_lift_ball: 0.0
- dual_push_buttons: 12.0
public best overlap step in the local summary: 60000
public best mean success in the local summary: 0.6933333333333334

Validated general-task anchor: `dual_push_buttons`

Sources:

VLAarchtests/artifacts/reports/general_task_anchor_20260330_dual_push_buttons/summary.json
baselines/AnyBimanual_release_eval_anchor/perlf_release_dual_push_buttons_ep25/PERACT_BC/seed0/eval_data.csv

Raw values:

public AnyBimanual release, step 60000: success 0.96, return 24.0, length 21.56
local official single-task eval, step 60000, 25 episodes: success 0.96, return 24.0, length 21.84
local clip backbone-only result on same task: success 0.0, return 0.0
local elastic reveal proxy iter6 result on same task: success 0.0, return 0.0
local RVT frozen fixed-bounds result on same task: success 0.0, return 0.0
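The anchor is validated by matching the local official eval against the public release at the same step. A sketch of that comparison as an explicit check; the tolerances and the `reproduces` helper are illustrative assumptions, not a criterion taken from the repo:

```python
# Public release vs. local official single-task eval at step 60000 (values above).
public = {"success": 0.96, "return": 24.0, "length": 21.56}
local = {"success": 0.96, "return": 24.0, "length": 21.84}

# Hypothetical tolerances: success and return must match exactly,
# mean episode length may differ by up to 0.5 steps.
TOLERANCE = {"success": 0.0, "return": 0.0, "length": 0.5}


def reproduces(ref: dict[str, float], ours: dict[str, float], tol: dict[str, float]) -> bool:
    """True if every metric in ref is within tol of the local value."""
    return all(abs(ref[k] - ours[k]) <= tol[k] for k in ref)


print(reproduces(public, local, TOLERANCE))  # prints "True"
```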

RVT overlap branch

Sources:

VLAarchtests/artifacts/reports/rvt_overlap_branch_20260330/summary.md
VLAarchtests/artifacts/reports/rvt_overlap_branch_fixedbounds_20260330/summary.md

Raw values:

frozen RVT stage1 train summary:
- outputs/rlbench_rvt_branch/rlbench_subset3_backbone_only_rvt_100demo_frozen_seed17/summary.json
- final train total 0.043179353826920445
- final val total 0.039591669984665984
frozen RVT overlap eval: mean success 0.0
frozen fixed-bounds RVT overlap eval: mean success 0.0
both branch gates:
- local AnyBimanual overlap floor 0.16
- stage2 run false
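The gate lines read as: stage2 is launched only if the stage1 overlap eval clears the local AnyBimanual floor. A minimal sketch of that decision; whether the repo compares with `>` or `>=` is not stated, so the strict comparison and the `should_run_stage2` name are assumptions:

```python
from dataclasses import dataclass

# Floor taken from the raw values above: local AnyBimanual overlap mean success at step 1000.
ANYBIMANUAL_OVERLAP_FLOOR = 0.16


@dataclass
class Stage1Result:
    name: str
    overlap_mean_success: float


def should_run_stage2(result: Stage1Result, floor: float = ANYBIMANUAL_OVERLAP_FLOOR) -> bool:
    """Assumed gate: run stage2 only if the stage1 overlap eval beats the floor."""
    return result.overlap_mean_success > floor


for r in (Stage1Result("frozen RVT", 0.0), Stage1Result("frozen fixed-bounds RVT", 0.0)):
    print(r.name, should_run_stage2(r))  # both False, matching "stage2 run false"
```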

Dual-push non-privileged retarget branch

Sources:

VLAarchtests/artifacts/reports/dual_push_nonzero_branch_20260330/summary.md

Raw values:

demo replay through absolute_action_from_delta:
- reports/dual_push_nonzero_branch_20260330/demo_replay/replay_summary.json
- mean success 0.8
- mean return 0.8
retargeted demo with checkpoint backbone retrieval and vision-only button localization (ep1 run):
- reports/dual_push_nonzero_branch_20260330/retargeted_demo_backbone_vision_ep1/summary.json
- mean success 1.0
- mean return 1.0
retargeted demo with checkpoint backbone retrieval and vision-only button localization (ep5 run):
- reports/dual_push_nonzero_branch_20260330/retargeted_demo_backbone_vision_ep5/summary.json
- mean success 1.0
- mean return 1.0
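The demo replay routes recorded actions through absolute_action_from_delta, whose signature is not shown here. The sketch below only illustrates the general idea under the assumption that each action is a per-step position delta for a single arm, added to the previous absolute position (rotation and gripper channels omitted); `absolute_from_deltas` is a hypothetical name, not the repo function:

```python
import numpy as np


def absolute_from_deltas(start: np.ndarray, deltas: np.ndarray) -> np.ndarray:
    """Hypothetical: turn per-step position deltas (T, 3) into absolute targets (T, 3).

    Target t is start + deltas[0] + ... + deltas[t], i.e. a running sum.
    """
    start = np.asarray(start, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    if deltas.ndim != 2 or deltas.shape[1] != start.shape[0]:
        raise ValueError("deltas must have shape (T, len(start))")
    return start + np.cumsum(deltas, axis=0)


targets = absolute_from_deltas(np.zeros(3), np.array([[0.1, 0, 0], [0.1, 0, 0], [0, 0.2, 0]]))
print(targets[-1])
```

For a bimanual task the same conversion would be applied per arm.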

Dual-push full-architecture hybrid branch

Sources:

VLAarchtests/artifacts/reports/dual_push_full_arch_hybrid_20260331/summary.md
reports/dual_push_full_arch_probe_iter6_scene_ep1/summary.json
reports/dual_push_full_arch_hybrid_iter6_backbone_ep1/summary.json

Raw values:

elastic checkpoint retargeted-demo probe with scene retrieval and vision-only button localization:
- 1 episode
- mean success 1.0
- mean return 1.0
- steps 94
- retrieved episode index 11
- retrieval similarity 0.9998629689216614
full-architecture hybrid eval with elastic controller checkpoint plus dual-push retrieval checkpoint:
- 1 episode
- mean success 1.0
- mean return 1.0
- steps 116
- path recoveries 0
- noop fallbacks 0
- first selected mode residual::maintain_opening
- last selected mode residual::base_action
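The probe reports a retrieved episode index and a retrieval similarity. How the checkpoint embeds scenes is not described here; a sketch of nearest-neighbor retrieval, assuming scene embeddings scored by cosine similarity (the repo's embedding and metric may differ, and the data below is synthetic):

```python
import numpy as np


def retrieve(query: np.ndarray, bank: np.ndarray) -> tuple[int, float]:
    """Return (index, cosine similarity) of the bank row most similar to query.

    query: (D,) embedding of the current scene; bank: (N, D) demo-episode embeddings.
    """
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = b @ q
    best = int(np.argmax(sims))
    return best, float(sims[best])


# Synthetic bank of 20 episode embeddings; the query is a slightly perturbed row 11.
rng = np.random.default_rng(0)
bank = rng.normal(size=(20, 64))
query = bank[11] + 0.01 * rng.normal(size=64)
idx, sim = retrieve(query, bank)
print(idx, round(sim, 4))
```

A similarity close to 1.0, as in the logged 0.99986, indicates the current scene embedding is nearly identical to the retrieved demo's.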

Environment Recreation

Environment files are under environment/, including:

environment/setup_same_hardware.sh
environment/runtime_env_vars.sh
environment/reconstruct_anybimanual_overlap_replay.sh
environment/hardware_snapshot.txt
environment/env_list.txt
environment/base_python.txt
environment/base_pip_freeze.txt
environment/rlbench_python.txt
environment/rlbench_pip_freeze.txt

Notes On Result Presentation

This repo-level README and the new root docs intentionally keep result text raw:

file paths
exact commands
exact numeric outputs
exact partial status for in-flight runs

Interpretive material already present in older staged artifacts is preserved as part of the historical workspace contents.
