VLAarchtests2
Bundle staged from /workspace on 2026-03-31 UTC.
This repo is the follow-on organization repo to lsnu/VLAarchtests. It includes:
- current code under VLAarchtests/
- current third-party baseline code under third_party/
- current baseline runs, replay artifacts, demo roots, and released checkpoint material under baselines/
- current training outputs and checkpoints under outputs/
- current logs under reports/
- environment recreation files under environment/
- raw results and change/test logs at the repo root
- the previous repo README under history/VLAarchtests_previous_README.md
- the active handoff file under handoff/instructions4.md
Top-Level Contents
- VLAarchtests/
  - code, tests, configs, generated configs, reports, checkpoints, and proxy datasets from the current runpod workspace
- third_party/AnyBimanual/
  - local AnyBimanual checkout used for the official overlap baseline branch, including local compatibility patches
- baselines/
  - released AnyBimanual checkpoint material
  - overlap replay artifacts
    - HF export packaging note: baselines/AnyBimanual_overlap_replay/multi/ is sharded into subdirectories to satisfy the Hub limit of 10000 files per directory
  - overlap run directories
  - local subset3 demo roots used by the overlap branch
- outputs/
  - RLBench training outputs and checkpoints used by the current anchor, RVT, dual-push, and elastic-controller branches
- reports/
  - training and evaluation logs copied from /workspace/reports
- environment/
  - machine snapshot, package lists, and setup helpers
- history/
  - copied previous-repo README
- handoff/
  - active sprint instruction file
- RESULTS_RAW.md
  - raw result tables and final official overlap eval outputs
- CHANGE_AND_TEST_LOG.md
  - file-level change log and executed test commands
- MODEL_AND_ARTIFACT_INDEX.md
  - staged directory map with main artifact roots
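The packaging note about the per-directory file limit can be illustrated with a small sketch. This is not the script used to stage this repo: the function name `shard_directory`, the `shard_NNNN` naming, and the demo file names are all hypothetical, and the demo uses a limit of 10 files instead of 10000 so that it runs quickly.

```python
import shutil
import tempfile
from pathlib import Path


def shard_directory(root: Path, max_files: int = 10_000, prefix: str = "shard_") -> list[Path]:
    """Move the files directly under `root` into numbered subdirectories.

    Each subdirectory receives at most `max_files` files. The `shard_NNNN`
    naming is hypothetical, not the layout used in this repo.
    """
    files = sorted(p for p in root.iterdir() if p.is_file())
    shards = []
    for start in range(0, len(files), max_files):
        shard = root / f"{prefix}{start // max_files:04d}"
        shard.mkdir()
        for path in files[start:start + max_files]:
            shutil.move(str(path), str(shard / path.name))
        shards.append(shard)
    return shards


# Demonstration on a throwaway directory: 25 files, at most 10 per shard.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for i in range(25):
        (root / f"frame_{i:03d}.bin").write_bytes(b"")
    shards = shard_directory(root, max_files=10)
    counts = [len(list(s.iterdir())) for s in shards]

print(counts)  # [10, 10, 5]
```

Any tool that reads the replay artifacts would then need to walk the shard subdirectories instead of listing a single flat directory.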
Previous Repo Coverage
The earlier lsnu/VLAarchtests repo covered the 2026-03-25/26 work. Its README is copied verbatim at:
history/VLAarchtests_previous_README.md
Previous-repo items explicitly referenced there include:
- compact, spatial, compact-phase, and spatial-phase proxy branches
- earlier RLBench direct-policy and kNN runs
- environment recreation files
- prior raw result tables
Current Session Additions
Current-session folders added or expanded in this repo include:
- VLAarchtests/artifacts/reports/sprint_v7_summary/
- VLAarchtests/artifacts/reports/sprint_v7_followup/
- VLAarchtests/artifacts/reports/selector_finetune_v7_iterations/
- VLAarchtests/artifacts/reports/selector_finetune_v7_iter6/
- VLAarchtests/artifacts/reports/selector_finetune_v7_iter7/
- VLAarchtests/artifacts/reports/selector_finetune_v7_iter8/
- VLAarchtests/artifacts/reports/task_routed_proxy_v1/
- VLAarchtests/artifacts/reports/rlbench_general_debug_20260330/
- VLAarchtests/artifacts/reports/anybimanual_overlap_baseline_20260330/
- VLAarchtests/artifacts/reports/bag_mode_specialization_20260330/
- VLAarchtests/artifacts/reports/general_task_anchor_20260330_dual_push_buttons/
- VLAarchtests/artifacts/reports/rvt_overlap_branch_20260330/
- VLAarchtests/artifacts/reports/rvt_overlap_branch_fixedbounds_20260330/
- VLAarchtests/artifacts/reports/dual_push_nonzero_branch_20260330/
- VLAarchtests/artifacts/reports/dual_push_full_arch_hybrid_20260331/
Raw Results Snapshot
Proxy sprint v7
Source:
VLAarchtests/artifacts/reports/sprint_v7_summary/reveal_sprint_summary_compact.json
Raw values:
- base model mean success: 0.28
- base per-task: foliage 0.39, bag 0.31, cloth 0.14
- random mean success: 0.43333333333333335
- candidate0 mean success: 0.2
- oracle mean success: 0.4066666666666667
- scripted mean success: 1.0
Eval-time ablations
Source:
VLAarchtests/artifacts/reports/sprint_v7_summary/reveal_sprint_summary_compact.json
Raw values:
- no_planner: 0.2
- no_memory: 0.3233333333333333
- no_task_conditioning: 0.28
- no_geometry: 0.27
- no_camera_pose: 0.29333333333333333
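The ablation values can be set next to the base model mean success from the same summary file. The sketch below only restates the raw numbers and subtracts; treating the base mean of 0.28 as the reference for these ablations is an assumption, and the dictionary is not meant to mirror the JSON schema of the source file.

```python
# Base model mean success from the proxy sprint v7 summary (assumed reference).
base_mean = 0.28

# Raw eval-time ablation values, restated from the list above.
ablations = {
    "no_planner": 0.2,
    "no_memory": 0.3233333333333333,
    "no_task_conditioning": 0.28,
    "no_geometry": 0.27,
    "no_camera_pose": 0.29333333333333333,
}

# Difference of each ablation from the assumed reference.
deltas = {name: value - base_mean for name, value in ablations.items()}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.4f}")
```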
Selector checkpoints
Sources:
- VLAarchtests/artifacts/reports/selector_finetune_v7_iter6/default/reveal_benchmark.json
- VLAarchtests/artifacts/reports/selector_finetune_v7_iter7/full_fixed_default/reveal_benchmark.json
- VLAarchtests/artifacts/reports/selector_finetune_v7_iter8/bag_fixed_default/reveal_benchmark.json
- VLAarchtests/artifacts/reports/task_routed_proxy_v1/summary.md
Raw values:
- iter6 mean success: 0.4566666666666667
  - foliage 0.46, bag 0.4, cloth 0.51
- iter7 mean success: 0.4666666666666666
  - foliage 0.4, bag 0.41, cloth 0.59
- iter8 bag-only fixed slice: 0.41
- routed controller mean success: 0.48666666666666664
  - routing rule: foliage -> iter6, bag -> iter8, cloth -> iter8
  - per-task: foliage 0.46, bag 0.41, cloth 0.59
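The routing rule and the routed mean can be written out as a short sketch. The dictionary layout and the helper `select_checkpoint` are illustrative names, not code from this repo; the per-task numbers are the ones reported for the routed controller, and the mean is taken as an unweighted average over the three tasks, which is an assumption.

```python
from statistics import mean

# Routing rule as listed above: task -> selector checkpoint.
routing = {"foliage": "iter6", "bag": "iter8", "cloth": "iter8"}

# Per-task success reported for the routed controller.
routed_per_task = {"foliage": 0.46, "bag": 0.41, "cloth": 0.59}


def select_checkpoint(task: str) -> str:
    """Return the checkpoint name the routing rule assigns to `task`."""
    return routing[task]


# Unweighted mean over the three tasks (assumed aggregation).
routed_mean = mean(list(routed_per_task.values()))

print(select_checkpoint("bag"), round(routed_mean, 4))  # iter8 0.4867
```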
Real baseline compare on proxy suite
Source:
VLAarchtests/artifacts/reports/real_baseline_compare_v7_full/reveal_benchmark.json
Raw values:
- baseline_rgbd_stage3 mean success: 0.31
  - foliage 0.21, bag 0.15, cloth 0.57
- iter5_selector mean success: 0.45
  - foliage 0.44, bag 0.4, cloth 0.51
RLBench recovered push-box comparator
Sources:
- reports/rlbench_general_debug/rlbench_push_box_fair_step1_final_knn_ep10_x99_res224_len180_train80_fixed/bimanual_push_box/rollout_eval.json
- reports/rlbench_general_debug/rlbench_push_box_historical_step1_knn_ep10_x99_res224_len180_train80_fixed/bimanual_push_box/rollout_eval.json
Raw values:
- current fair-step1 final mean success: 0.7
- current fair-step1 final successes: [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
- historical push-box control mean success: 0.4
- historical push-box control successes: [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
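The two mean success values follow directly from the per-episode success lists. A minimal check, with the lists copied from the values above and `mean_success` as an illustrative helper name:

```python
# Per-episode success flags for the two push-box runs, as listed above.
fair_step1 = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
historical = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]


def mean_success(successes):
    """Fraction of successful episodes in a list of 0.0/1.0 flags."""
    return sum(successes) / len(successes)


print(mean_success(fair_step1), mean_success(historical))  # 0.7 0.4
```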
Official AnyBimanual overlap branch
Sources:
- baselines/AnyBimanual_overlap_runs/peract_bc_subset3_overlap_smoke200_fixpretrain_nowandb3/PERACT_BC/seed0/training.log
- reports/anybimanual_subset3_overlap_resume1000_eval.log
Raw train milestones:
- global step 300: loss 40.91718
- global step 400: loss 33.26684
- global step 500: loss 36.07054
- global step 600: loss 35.32345
- global step 700: loss 28.50959
- global step 800: loss 23.60169
- global step 900: loss 15.28901
- run reached weights/1000 and the train exited cleanly
Raw eval outputs:
- source log: reports/anybimanual_subset3_overlap_resume1000_eval.log
- summary files:
  - VLAarchtests/artifacts/reports/anybimanual_overlap_baseline_20260330/resume1000_summary/summary.md
  - VLAarchtests/artifacts/reports/anybimanual_overlap_baseline_20260330/resume1000_summary/summary.json
- local last complete step: 1000
- local mean success: 0.16
- local per-task success:
  - coordinated_push_box: 0.0
  - coordinated_lift_ball: 0.0
  - dual_push_buttons: 0.48
- local per-task return:
  - coordinated_push_box: 0.0
  - coordinated_lift_ball: 0.0
  - dual_push_buttons: 12.0
- public best overlap step in the local summary: 60000
- public best mean success in the local summary: 0.6933333333333334
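The local mean success is consistent with an unweighted average of the three per-task success values. The sketch below restates those values and recomputes the mean; the unweighted aggregation is an assumption, and the dictionary is not meant to reflect the structure of summary.json.

```python
import math

# Local per-task success at step 1000, restated from the list above.
per_task_success = {
    "coordinated_push_box": 0.0,
    "coordinated_lift_ball": 0.0,
    "dual_push_buttons": 0.48,
}

# Unweighted mean over tasks (assumed aggregation).
local_mean = sum(per_task_success.values()) / len(per_task_success)

# Compare against the reported local mean success with a float tolerance.
matches_reported = math.isclose(local_mean, 0.16, rel_tol=1e-9)
print(round(local_mean, 2), matches_reported)
```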
Validated general-task anchor: dual_push_buttons
Sources:
- VLAarchtests/artifacts/reports/general_task_anchor_20260330_dual_push_buttons/summary.json
- baselines/AnyBimanual_release_eval_anchor/perlf_release_dual_push_buttons_ep25/PERACT_BC/seed0/eval_data.csv
Raw values:
- public AnyBimanual release, step 60000: success 0.96, return 24.0, length 21.56
- local official single-task eval, step 60000, 25 episodes: success 0.96, return 24.0, length 21.84
- local clip backbone-only result on same task: success 0.0, return 0.0
- local elastic reveal proxy iter6 result on same task: success 0.0, return 0.0
- local RVT frozen fixed-bounds result on same task: success 0.0, return 0.0
RVT overlap branch
Sources:
- VLAarchtests/artifacts/reports/rvt_overlap_branch_20260330/summary.md
- VLAarchtests/artifacts/reports/rvt_overlap_branch_fixedbounds_20260330/summary.md
Raw values:
- frozen RVT stage1 train summary: outputs/rlbench_rvt_branch/rlbench_subset3_backbone_only_rvt_100demo_frozen_seed17/summary.json
  - final train total 0.043179353826920445
  - final val total 0.039591669984665984
- frozen RVT overlap eval: mean success 0.0
- frozen fixed-bounds RVT overlap eval: mean success 0.0
- both branch gates:
  - local AnyBimanual overlap floor 0.16
  - stage2 run false
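The gate values above can be read as a simple threshold check. The sketch below is a guess at the shape of such a gate, not the logic used in this repo: the function `should_run_stage2` and the `>=` comparison against the floor are assumptions chosen only because they reproduce the recorded outcome.

```python
def should_run_stage2(stage1_mean_success: float, floor: float) -> bool:
    """Assumed gate: proceed to stage 2 only if stage 1 reaches the floor."""
    return stage1_mean_success >= floor


# Local AnyBimanual overlap floor, as recorded in the branch gates.
floor = 0.16

# Overlap eval mean success for the two frozen RVT branches.
gates = {
    "frozen": should_run_stage2(0.0, floor),
    "frozen_fixed_bounds": should_run_stage2(0.0, floor),
}

print(gates)  # {'frozen': False, 'frozen_fixed_bounds': False}
```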
Dual-push non-privileged retarget branch
Sources:
VLAarchtests/artifacts/reports/dual_push_nonzero_branch_20260330/summary.md
Raw values:
- demo replay through absolute_action_from_delta: reports/dual_push_nonzero_branch_20260330/demo_replay/replay_summary.json
  - mean success 0.8
  - mean return 0.8
- retargeted demo with checkpoint backbone retrieval and vision-only button localization: reports/dual_push_nonzero_branch_20260330/retargeted_demo_backbone_vision_ep1/summary.json
  - mean success 1.0
  - mean return 1.0
- retargeted demo with checkpoint backbone retrieval and vision-only button localization: reports/dual_push_nonzero_branch_20260330/retargeted_demo_backbone_vision_ep5/summary.json
  - mean success 1.0
  - mean return 1.0
Dual-push full-architecture hybrid branch
Sources:
- VLAarchtests/artifacts/reports/dual_push_full_arch_hybrid_20260331/summary.md
- reports/dual_push_full_arch_probe_iter6_scene_ep1/summary.json
- reports/dual_push_full_arch_hybrid_iter6_backbone_ep1/summary.json
Raw values:
- elastic checkpoint retargeted-demo probe with scene retrieval and vision-only button localization:
  - 1 episode
  - mean success 1.0
  - mean return 1.0
  - steps 94
  - retrieved episode index 11
  - retrieval similarity 0.9998629689216614
- full-architecture hybrid eval with elastic controller checkpoint plus dual-push retrieval checkpoint:
  - 1 episode
  - mean success 1.0
  - mean return 1.0
  - steps 116
  - path recoveries 0
  - noop fallbacks 0
  - first selected mode residual::maintain_opening
  - last selected mode residual::base_action
Environment Recreation
Environment files are under environment/, including:
- environment/setup_same_hardware.sh
- environment/runtime_env_vars.sh
- environment/reconstruct_anybimanual_overlap_replay.sh
- environment/hardware_snapshot.txt
- environment/env_list.txt
- environment/base_python.txt
- environment/base_pip_freeze.txt
- environment/rlbench_python.txt
- environment/rlbench_pip_freeze.txt
Notes On Result Presentation
This repo-level README and the new root docs intentionally keep result text raw:
- file paths
- exact commands
- exact numeric outputs
- exact partial status for in-flight runs
Interpretive material already present inside older staged artifacts remains preserved as part of the historical workspace contents.