# VLAarchtests2

Bundle staged from `/workspace` on `2026-03-31 UTC`.

This repo is the follow-on organization repo to `lsnu/VLAarchtests`.

It includes:

- current code under `VLAarchtests/`
- current third-party baseline code under `third_party/`
- current baseline runs, replay artifacts, demo roots, and released checkpoint material under `baselines/`
- current training outputs and checkpoints under `outputs/`
- current logs under `reports/`
- environment recreation files under `environment/`
- raw results and change/test logs at the repo root
- the previous repo README under `history/VLAarchtests_previous_README.md`
- the active handoff file under `handoff/instructions4.md`

## Top-Level Contents

- `VLAarchtests/`
  - code, tests, configs, generated configs, reports, checkpoints, and proxy datasets from the current runpod workspace
- `third_party/AnyBimanual/`
  - local AnyBimanual checkout used for the official overlap baseline branch, including local compatibility patches
- `baselines/`
  - released AnyBimanual checkpoint material
  - overlap replay artifacts
  - HF export packaging note: `baselines/AnyBimanual_overlap_replay/multi/` is sharded into subdirectories to satisfy the Hub `10000 files per directory` limit
  - overlap run directories
  - local subset3 demo roots used by the overlap branch
- `outputs/`
  - RLBench training outputs and checkpoints used by the current anchor, RVT, dual-push, and elastic-controller branches
- `reports/`
  - training and evaluation logs copied from `/workspace/reports`
- `environment/`
  - machine snapshot, package lists, and setup helpers
- `history/`
  - copied previous-repo README
- `handoff/`
  - active sprint instruction file
- `RESULTS_RAW.md`
  - raw result tables and final official overlap eval outputs
- `CHANGE_AND_TEST_LOG.md`
  - file-level change log and executed test commands
- `MODEL_AND_ARTIFACT_INDEX.md`
  - staged directory map with main artifact roots

## Previous Repo Coverage

The earlier `lsnu/VLAarchtests` repo covered the
`2026-03-25/26` work. Its README is copied verbatim at:

- `history/VLAarchtests_previous_README.md`

Previous-repo items explicitly referenced there include:

- compact, spatial, compact-phase, and spatial-phase proxy branches
- earlier RLBench direct-policy and kNN runs
- environment recreation files
- prior raw result tables

## Current Session Additions

Current-session folders added or expanded in this repo include:

- `VLAarchtests/artifacts/reports/sprint_v7_summary/`
- `VLAarchtests/artifacts/reports/sprint_v7_followup/`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iterations/`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter6/`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter7/`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter8/`
- `VLAarchtests/artifacts/reports/task_routed_proxy_v1/`
- `VLAarchtests/artifacts/reports/rlbench_general_debug_20260330/`
- `VLAarchtests/artifacts/reports/anybimanual_overlap_baseline_20260330/`
- `VLAarchtests/artifacts/reports/bag_mode_specialization_20260330/`
- `VLAarchtests/artifacts/reports/general_task_anchor_20260330_dual_push_buttons/`
- `VLAarchtests/artifacts/reports/rvt_overlap_branch_20260330/`
- `VLAarchtests/artifacts/reports/rvt_overlap_branch_fixedbounds_20260330/`
- `VLAarchtests/artifacts/reports/dual_push_nonzero_branch_20260330/`
- `VLAarchtests/artifacts/reports/dual_push_full_arch_hybrid_20260331/`

## Raw Results Snapshot

### Proxy sprint v7

Source:

- `VLAarchtests/artifacts/reports/sprint_v7_summary/reveal_sprint_summary_compact.json`

Raw values:

- base model mean success: `0.28`
- base per-task: foliage `0.39`, bag `0.31`, cloth `0.14`
- random mean success: `0.43333333333333335`
- candidate0 mean success: `0.2`
- oracle mean success: `0.4066666666666667`
- scripted mean success: `1.0`

### Eval-time ablations

Source:

- `VLAarchtests/artifacts/reports/sprint_v7_summary/reveal_sprint_summary_compact.json`

Raw values:

- `no_planner`: `0.2`
- `no_memory`: `0.3233333333333333`
- `no_task_conditioning`: `0.28`
- `no_geometry`: `0.27`
- `no_camera_pose`: `0.29333333333333333`

### Selector checkpoints

Sources:

- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter6/default/reveal_benchmark.json`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter7/full_fixed_default/reveal_benchmark.json`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter8/bag_fixed_default/reveal_benchmark.json`
- `VLAarchtests/artifacts/reports/task_routed_proxy_v1/summary.md`

Raw values:

- `iter6` mean success: `0.4566666666666667`
  - foliage `0.46`, bag `0.4`, cloth `0.51`
- `iter7` mean success: `0.4666666666666666`
  - foliage `0.4`, bag `0.41`, cloth `0.59`
- `iter8` bag-only fixed slice: `0.41`
- routed controller mean success: `0.48666666666666664`
  - routing rule: `foliage -> iter6`, `bag -> iter8`, `cloth -> iter8`
  - per-task: foliage `0.46`, bag `0.41`, cloth `0.59`

### Real baseline compare on proxy suite

Source:

- `VLAarchtests/artifacts/reports/real_baseline_compare_v7_full/reveal_benchmark.json`

Raw values:

- `baseline_rgbd_stage3` mean success: `0.31`
  - foliage `0.21`, bag `0.15`, cloth `0.57`
- `iter5_selector` mean success: `0.45`
  - foliage `0.44`, bag `0.4`, cloth `0.51`

### RLBench recovered push-box comparator

Sources:

- `reports/rlbench_general_debug/rlbench_push_box_fair_step1_final_knn_ep10_x99_res224_len180_train80_fixed/bimanual_push_box/rollout_eval.json`
- `reports/rlbench_general_debug/rlbench_push_box_historical_step1_knn_ep10_x99_res224_len180_train80_fixed/bimanual_push_box/rollout_eval.json`

Raw values:

- current fair-step1 final mean success: `0.7`
- current fair-step1 final successes:
  - `[1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]`
- historical push-box control mean success: `0.4`
- historical push-box control successes:
  - `[0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]`

### Official AnyBimanual overlap branch

Sources:

- `baselines/AnyBimanual_overlap_runs/peract_bc_subset3_overlap_smoke200_fixpretrain_nowandb3/PERACT_BC/seed0/training.log`
- `reports/anybimanual_subset3_overlap_resume1000_eval.log`

Raw train milestones:

- global step `300`: loss `40.91718`
- global step `400`: loss `33.26684`
- global step `500`: loss `36.07054`
- global step `600`: loss `35.32345`
- global step `700`: loss `28.50959`
- global step `800`: loss `23.60169`
- global step `900`: loss `15.28901`
- run reached `weights/1000` and the training run exited cleanly

Raw eval outputs:

- source log: `reports/anybimanual_subset3_overlap_resume1000_eval.log`
- summary files:
  - `VLAarchtests/artifacts/reports/anybimanual_overlap_baseline_20260330/resume1000_summary/summary.md`
  - `VLAarchtests/artifacts/reports/anybimanual_overlap_baseline_20260330/resume1000_summary/summary.json`
- local last complete step: `1000`
- local mean success: `0.16`
- local per-task success:
  - `coordinated_push_box`: `0.0`
  - `coordinated_lift_ball`: `0.0`
  - `dual_push_buttons`: `0.48`
- local per-task return:
  - `coordinated_push_box`: `0.0`
  - `coordinated_lift_ball`: `0.0`
  - `dual_push_buttons`: `12.0`
- public best overlap step in the local summary: `60000`
- public best mean success in the local summary: `0.6933333333333334`

### Validated general-task anchor: `dual_push_buttons`

Sources:

- `VLAarchtests/artifacts/reports/general_task_anchor_20260330_dual_push_buttons/summary.json`
- `baselines/AnyBimanual_release_eval_anchor/perlf_release_dual_push_buttons_ep25/PERACT_BC/seed0/eval_data.csv`

Raw values:

- public AnyBimanual release, step `60000`: success `0.96`, return `24.0`, length `21.56`
- local official single-task eval, step `60000`, `25` episodes: success `0.96`, return `24.0`, length `21.84`
- local clip backbone-only result on same task: success `0.0`, return `0.0`
- local elastic reveal proxy iter6 result on same task: success `0.0`, return `0.0`
- local RVT frozen fixed-bounds result on same task: success `0.0`, return `0.0`

### RVT overlap branch

Sources:

- `VLAarchtests/artifacts/reports/rvt_overlap_branch_20260330/summary.md`
- `VLAarchtests/artifacts/reports/rvt_overlap_branch_fixedbounds_20260330/summary.md`

Raw values:

- frozen RVT stage1 train summary:
  - `outputs/rlbench_rvt_branch/rlbench_subset3_backbone_only_rvt_100demo_frozen_seed17/summary.json`
  - final train total `0.043179353826920445`
  - final val total `0.039591669984665984`
- frozen RVT overlap eval: mean success `0.0`
- frozen fixed-bounds RVT overlap eval: mean success `0.0`
- both branch gates:
  - local AnyBimanual overlap floor `0.16`
  - stage2 run `false`

### Dual-push non-privileged retarget branch

Sources:

- `VLAarchtests/artifacts/reports/dual_push_nonzero_branch_20260330/summary.md`

Raw values:

- demo replay through `absolute_action_from_delta`:
  - `reports/dual_push_nonzero_branch_20260330/demo_replay/replay_summary.json`
  - mean success `0.8`
  - mean return `0.8`
- retargeted demo with checkpoint backbone retrieval and vision-only button localization:
  - `reports/dual_push_nonzero_branch_20260330/retargeted_demo_backbone_vision_ep1/summary.json`
  - mean success `1.0`
  - mean return `1.0`
- retargeted demo with checkpoint backbone retrieval and vision-only button localization:
  - `reports/dual_push_nonzero_branch_20260330/retargeted_demo_backbone_vision_ep5/summary.json`
  - mean success `1.0`
  - mean return `1.0`

### Dual-push full-architecture hybrid branch

Sources:

- `VLAarchtests/artifacts/reports/dual_push_full_arch_hybrid_20260331/summary.md`
- `reports/dual_push_full_arch_probe_iter6_scene_ep1/summary.json`
- `reports/dual_push_full_arch_hybrid_iter6_backbone_ep1/summary.json`

Raw values:

- elastic checkpoint retargeted-demo probe with scene retrieval and vision-only button localization:
  - `1` episode
  - mean success `1.0`
  - mean return `1.0`
  - steps `94`
  - retrieved episode index `11`
  - retrieval similarity `0.9998629689216614`
- full-architecture hybrid eval with elastic controller checkpoint plus dual-push retrieval checkpoint:
  - `1` episode
  - mean success `1.0`
  - mean return `1.0`
  - steps `116`
  - path recoveries `0`
  - noop fallbacks `0`
  - first selected mode `residual::maintain_opening`
  - last selected mode `residual::base_action`

## Environment Recreation

Environment files are under `environment/`, including:

- `environment/setup_same_hardware.sh`
- `environment/runtime_env_vars.sh`
- `environment/reconstruct_anybimanual_overlap_replay.sh`
- `environment/hardware_snapshot.txt`
- `environment/env_list.txt`
- `environment/base_python.txt`
- `environment/base_pip_freeze.txt`
- `environment/rlbench_python.txt`
- `environment/rlbench_pip_freeze.txt`

## Notes On Result Presentation

This repo-level README and the new root docs intentionally keep result text raw:

- file paths
- exact commands
- exact numeric outputs
- exact partial status for in-flight runs

Interpretive material already present inside older staged artifacts remains preserved as part of the historical workspace contents.
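
The mean-success figures in the Raw Results Snapshot appear to be unweighted averages of the listed per-task or per-episode values. The sketch below, assuming that reading, cross-checks several of them. The helper `check_mean` and the tolerance are illustrative and not part of the repo, and the numbers are copied by hand from this README rather than loaded from the JSON files.

```python
import math

# Values copied by hand from the Raw Results Snapshot section of this README.
# Assumption: each reported mean is an unweighted average of the listed values.
REPORTED = {
    "proxy v7 base model": (0.28, [0.39, 0.31, 0.14]),
    "selector iter6": (0.4566666666666667, [0.46, 0.4, 0.51]),
    "selector iter7": (0.4666666666666666, [0.4, 0.41, 0.59]),
    "routed controller": (0.48666666666666664, [0.46, 0.41, 0.59]),
    "baseline_rgbd_stage3": (0.31, [0.21, 0.15, 0.57]),
    "iter5_selector": (0.45, [0.44, 0.4, 0.51]),
    "AnyBimanual local overlap": (0.16, [0.0, 0.0, 0.48]),
    "push-box fair-step1": (
        0.7,
        [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0],
    ),
    "push-box historical": (
        0.4,
        [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0],
    ),
}


def check_mean(reported, values, tol=1e-9):
    """Return True if `reported` matches the unweighted mean of `values`."""
    return math.isclose(sum(values) / len(values), reported, abs_tol=tol)


# Map each entry name to whether its reported mean matches the recomputed one.
results = {name: check_mean(mean, vals) for name, (mean, vals) in REPORTED.items()}

for name, ok in results.items():
    print(f"{name}: {'consistent' if ok else 'MISMATCH'}")
```

The same check could be pointed at the source JSON files instead of hand-copied values, but their key layout is not documented here, so the sketch does not guess at it.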