# VLAarchtests2
Bundle staged from `/workspace` on `2026-03-31 UTC`.
This repo is the follow-on organization repo to `lsnu/VLAarchtests`. It includes:
- current code under `VLAarchtests/`
- current third-party baseline code under `third_party/`
- current baseline runs, replay artifacts, demo roots, and released checkpoint material under `baselines/`
- current training outputs and checkpoints under `outputs/`
- current logs under `reports/`
- environment recreation files under `environment/`
- raw results and change/test logs at the repo root
- the previous repo README under `history/VLAarchtests_previous_README.md`
- the active handoff file under `handoff/instructions4.md`
## Top-Level Contents
- `VLAarchtests/`
  - code, tests, configs, generated configs, reports, checkpoints, and proxy datasets from the current runpod workspace
- `third_party/AnyBimanual/`
  - local AnyBimanual checkout used for the official overlap baseline branch, including local compatibility patches
- `baselines/`
  - released AnyBimanual checkpoint material
  - overlap replay artifacts
  - HF export packaging note: `baselines/AnyBimanual_overlap_replay/multi/` is sharded into subdirectories to satisfy the Hub `10000 files per directory` limit
  - overlap run directories
  - local subset3 demo roots used by the overlap branch
- `outputs/`
  - RLBench training outputs and checkpoints used by the current anchor, RVT, dual-push, and elastic-controller branches
- `reports/`
  - training and evaluation logs copied from `/workspace/reports`
- `environment/`
  - machine snapshot, package lists, and setup helpers
- `history/`
  - copied previous-repo README
- `handoff/`
  - active sprint instruction file
- `RESULTS_RAW.md`
  - raw result tables and final official overlap eval outputs
- `CHANGE_AND_TEST_LOG.md`
  - file-level change log and executed test commands
- `MODEL_AND_ARTIFACT_INDEX.md`
  - staged directory map with main artifact roots
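
The sharding mentioned in the HF export packaging note can be sketched as follows. This is a minimal illustration, not the script used for this bundle: the `shard_NNNN` naming, the function names, and the sort order are assumptions, and the actual layout under `baselines/AnyBimanual_overlap_replay/multi/` may differ.

```python
from pathlib import Path
import shutil


def plan_shards(names, max_per_dir=10000, prefix="shard_"):
    """Assign each file name to a shard directory holding at most max_per_dir names.

    The shard naming scheme here is hypothetical.
    """
    ordered = sorted(names)
    return {
        name: f"{prefix}{index // max_per_dir:04d}"
        for index, name in enumerate(ordered)
    }


def shard_directory(root, max_per_dir=10000):
    """Move the files directly under root into shard subdirectories."""
    root = Path(root)
    files = [p.name for p in root.iterdir() if p.is_file()]
    for name, shard in plan_shards(files, max_per_dir).items():
        (root / shard).mkdir(exist_ok=True)
        shutil.move(str(root / name), str(root / shard / name))
```

Sorting before assignment keeps the mapping deterministic, so the same input directory always produces the same shard layout.
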
## Previous Repo Coverage
The earlier `lsnu/VLAarchtests` repo covered the `2026-03-25/26` work. Its README is copied verbatim at:
- `history/VLAarchtests_previous_README.md`
Previous-repo items explicitly referenced there include:
- compact, spatial, compact-phase, and spatial-phase proxy branches
- earlier RLBench direct-policy and kNN runs
- environment recreation files
- prior raw result tables
## Current Session Additions
Current-session folders added or expanded in this repo include:
- `VLAarchtests/artifacts/reports/sprint_v7_summary/`
- `VLAarchtests/artifacts/reports/sprint_v7_followup/`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iterations/`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter6/`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter7/`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter8/`
- `VLAarchtests/artifacts/reports/task_routed_proxy_v1/`
- `VLAarchtests/artifacts/reports/rlbench_general_debug_20260330/`
- `VLAarchtests/artifacts/reports/anybimanual_overlap_baseline_20260330/`
- `VLAarchtests/artifacts/reports/bag_mode_specialization_20260330/`
- `VLAarchtests/artifacts/reports/general_task_anchor_20260330_dual_push_buttons/`
- `VLAarchtests/artifacts/reports/rvt_overlap_branch_20260330/`
- `VLAarchtests/artifacts/reports/rvt_overlap_branch_fixedbounds_20260330/`
- `VLAarchtests/artifacts/reports/dual_push_nonzero_branch_20260330/`
- `VLAarchtests/artifacts/reports/dual_push_full_arch_hybrid_20260331/`
## Raw Results Snapshot
### Proxy sprint v7
Source:
- `VLAarchtests/artifacts/reports/sprint_v7_summary/reveal_sprint_summary_compact.json`
Raw values:
- base model mean success: `0.28`
- base per-task: foliage `0.39`, bag `0.31`, cloth `0.14`
- random mean success: `0.43333333333333335`
- candidate0 mean success: `0.2`
- oracle mean success: `0.4066666666666667`
- scripted mean success: `1.0`
### Eval-time ablations
Source:
- `VLAarchtests/artifacts/reports/sprint_v7_summary/reveal_sprint_summary_compact.json`
Raw values:
- `no_planner`: `0.2`
- `no_memory`: `0.3233333333333333`
- `no_task_conditioning`: `0.28`
- `no_geometry`: `0.27`
- `no_camera_pose`: `0.29333333333333333`
### Selector checkpoints
Sources:
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter6/default/reveal_benchmark.json`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter7/full_fixed_default/reveal_benchmark.json`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter8/bag_fixed_default/reveal_benchmark.json`
- `VLAarchtests/artifacts/reports/task_routed_proxy_v1/summary.md`
Raw values:
- `iter6` mean success: `0.4566666666666667`
  - foliage `0.46`, bag `0.4`, cloth `0.51`
- `iter7` mean success: `0.4666666666666666`
  - foliage `0.4`, bag `0.41`, cloth `0.59`
- `iter8` bag-only fixed slice: `0.41`
- routed controller mean success: `0.48666666666666664`
  - routing rule: `foliage -> iter6`, `bag -> iter8`, `cloth -> iter7`
  - per-task: foliage `0.46`, bag `0.41`, cloth `0.59`
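
The reported means in this section are consistent with an unweighted average over the three proxy tasks. The sketch below checks that, assuming equal task weighting (the aggregation is not stated explicitly in this README); `per_task`, `reported`, and `check_means` are illustrative names, not identifiers from the repo.

```python
from statistics import mean

# Per-task success values copied from the raw values listed above.
per_task = {
    "iter6": {"foliage": 0.46, "bag": 0.40, "cloth": 0.51},
    "iter7": {"foliage": 0.40, "bag": 0.41, "cloth": 0.59},
    "routed": {"foliage": 0.46, "bag": 0.41, "cloth": 0.59},
}

# Mean success values copied from the raw values listed above.
reported = {
    "iter6": 0.4566666666666667,
    "iter7": 0.4666666666666666,
    "routed": 0.48666666666666664,
}


def check_means(per_task, reported, tol=1e-9):
    """Return, per entry, whether the unweighted task mean matches the reported mean."""
    return {
        name: abs(mean(scores.values()) - reported[name]) <= tol
        for name, scores in per_task.items()
    }


if __name__ == "__main__":
    print(check_means(per_task, reported))
```
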
### Real baseline compare on proxy suite
Source:
- `VLAarchtests/artifacts/reports/real_baseline_compare_v7_full/reveal_benchmark.json`
Raw values:
- `baseline_rgbd_stage3` mean success: `0.31`
  - foliage `0.21`, bag `0.15`, cloth `0.57`
- `iter5_selector` mean success: `0.45`
  - foliage `0.44`, bag `0.4`, cloth `0.51`
### RLBench recovered push-box comparator
Sources:
- `reports/rlbench_general_debug/rlbench_push_box_fair_step1_final_knn_ep10_x99_res224_len180_train80_fixed/bimanual_push_box/rollout_eval.json`
- `reports/rlbench_general_debug/rlbench_push_box_historical_step1_knn_ep10_x99_res224_len180_train80_fixed/bimanual_push_box/rollout_eval.json`
Raw values:
- current fair-step1 final mean success: `0.7`
- current fair-step1 final successes:
  - `[1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]`
- historical push-box control mean success: `0.4`
- historical push-box control successes:
  - `[0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]`
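
The mean success values above follow directly from the per-episode lists. The sketch below recomputes them and also tabulates per-episode agreement between the two runs. The pairing is an assumption: it is only meaningful if episode index `i` refers to the same evaluation episode in both runs, which this README does not state. The function names are illustrative.

```python
# Per-episode successes copied from the raw values above.
fair_step1 = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
historical = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]


def success_rate(successes):
    """Mean of a list of 0.0/1.0 episode outcomes."""
    return sum(successes) / len(successes)


def paired_outcomes(a, b):
    """Count episodes by joint outcome, ASSUMING a[i] and b[i] are the same episode."""
    if len(a) != len(b):
        raise ValueError("runs must have the same number of episodes")
    counts = {"both": 0, "only_a": 0, "only_b": 0, "neither": 0}
    for x, y in zip(a, b):
        if x and y:
            counts["both"] += 1
        elif x:
            counts["only_a"] += 1
        elif y:
            counts["only_b"] += 1
        else:
            counts["neither"] += 1
    return counts


if __name__ == "__main__":
    print(success_rate(fair_step1), success_rate(historical))
    print(paired_outcomes(fair_step1, historical))
```
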
### Official AnyBimanual overlap branch
Sources:
- `baselines/AnyBimanual_overlap_runs/peract_bc_subset3_overlap_smoke200_fixpretrain_nowandb3/PERACT_BC/seed0/training.log`
- `reports/anybimanual_subset3_overlap_resume1000_eval.log`
Raw train milestones:
- global step `300`: loss `40.91718`
- global step `400`: loss `33.26684`
- global step `500`: loss `36.07054`
- global step `600`: loss `35.32345`
- global step `700`: loss `28.50959`
- global step `800`: loss `23.60169`
- global step `900`: loss `15.28901`
- run reached `weights/1000` and training exited cleanly
Raw eval outputs:
- source log: `reports/anybimanual_subset3_overlap_resume1000_eval.log`
- summary files:
  - `VLAarchtests/artifacts/reports/anybimanual_overlap_baseline_20260330/resume1000_summary/summary.md`
  - `VLAarchtests/artifacts/reports/anybimanual_overlap_baseline_20260330/resume1000_summary/summary.json`
- local last complete step: `1000`
- local mean success: `0.16`
- local per-task success:
  - `coordinated_push_box`: `0.0`
  - `coordinated_lift_ball`: `0.0`
  - `dual_push_buttons`: `0.48`
- local per-task return:
  - `coordinated_push_box`: `0.0`
  - `coordinated_lift_ball`: `0.0`
  - `dual_push_buttons`: `12.0`
- public best overlap step in the local summary: `60000`
- public best mean success in the local summary: `0.6933333333333334`
### Validated general-task anchor: `dual_push_buttons`
Sources:
- `VLAarchtests/artifacts/reports/general_task_anchor_20260330_dual_push_buttons/summary.json`
- `baselines/AnyBimanual_release_eval_anchor/perlf_release_dual_push_buttons_ep25/PERACT_BC/seed0/eval_data.csv`
Raw values:
- public AnyBimanual release, step `60000`: success `0.96`, return `24.0`, length `21.56`
- local official single-task eval, step `60000`, `25` episodes: success `0.96`, return `24.0`, length `21.84`
- local clip backbone-only result on same task: success `0.0`, return `0.0`
- local elastic reveal proxy iter6 result on same task: success `0.0`, return `0.0`
- local RVT frozen fixed-bounds result on same task: success `0.0`, return `0.0`
### RVT overlap branch
Sources:
- `VLAarchtests/artifacts/reports/rvt_overlap_branch_20260330/summary.md`
- `VLAarchtests/artifacts/reports/rvt_overlap_branch_fixedbounds_20260330/summary.md`
Raw values:
- frozen RVT stage1 train summary:
  - `outputs/rlbench_rvt_branch/rlbench_subset3_backbone_only_rvt_100demo_frozen_seed17/summary.json`
  - final train total `0.043179353826920445`
  - final val total `0.039591669984665984`
- frozen RVT overlap eval: mean success `0.0`
- frozen fixed-bounds RVT overlap eval: mean success `0.0`
- both branch gates:
  - local AnyBimanual overlap floor `0.16`
  - stage2 run `false`
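
One plausible reading of the branch gate is that stage 2 is launched only when the frozen stage 1 eval beats the local AnyBimanual overlap floor. The sketch below encodes that reading. The function name and the strict `>` comparison are assumptions; the actual gate logic lives in the branch summaries and is not reproduced here.

```python
def stage2_should_run(stage1_mean_success: float, floor: float) -> bool:
    """Gate sketch: run stage 2 only if stage 1 eval exceeds the floor.

    ASSUMPTION: strict comparison; the real gate may use >= or extra conditions.
    """
    return stage1_mean_success > floor


if __name__ == "__main__":
    # Both RVT variants report mean success 0.0 against a floor of 0.16.
    print(stage2_should_run(0.0, 0.16))
```

Under this reading, both RVT variants fail the gate, which matches the recorded `stage2 run false`.
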
### Dual-push non-privileged retarget branch
Sources:
- `VLAarchtests/artifacts/reports/dual_push_nonzero_branch_20260330/summary.md`
Raw values:
- demo replay through `absolute_action_from_delta`:
  - `reports/dual_push_nonzero_branch_20260330/demo_replay/replay_summary.json`
  - mean success `0.8`
  - mean return `0.8`
- retargeted demo with checkpoint backbone retrieval and vision-only button localization, `ep1` run:
  - `reports/dual_push_nonzero_branch_20260330/retargeted_demo_backbone_vision_ep1/summary.json`
  - mean success `1.0`
  - mean return `1.0`
- retargeted demo with checkpoint backbone retrieval and vision-only button localization, `ep5` run:
  - `reports/dual_push_nonzero_branch_20260330/retargeted_demo_backbone_vision_ep5/summary.json`
  - mean success `1.0`
  - mean return `1.0`
### Dual-push full-architecture hybrid branch
Sources:
- `VLAarchtests/artifacts/reports/dual_push_full_arch_hybrid_20260331/summary.md`
- `reports/dual_push_full_arch_probe_iter6_scene_ep1/summary.json`
- `reports/dual_push_full_arch_hybrid_iter6_backbone_ep1/summary.json`
Raw values:
- elastic checkpoint retargeted-demo probe with scene retrieval and vision-only button localization:
  - `1` episode
  - mean success `1.0`
  - mean return `1.0`
  - steps `94`
  - retrieved episode index `11`
  - retrieval similarity `0.9998629689216614`
- full-architecture hybrid eval with elastic controller checkpoint plus dual-push retrieval checkpoint:
  - `1` episode
  - mean success `1.0`
  - mean return `1.0`
  - steps `116`
  - path recoveries `0`
  - noop fallbacks `0`
  - first selected mode `residual::maintain_opening`
  - last selected mode `residual::base_action`
## Environment Recreation
Environment files are under `environment/`, including:
- `environment/setup_same_hardware.sh`
- `environment/runtime_env_vars.sh`
- `environment/reconstruct_anybimanual_overlap_replay.sh`
- `environment/hardware_snapshot.txt`
- `environment/env_list.txt`
- `environment/base_python.txt`
- `environment/base_pip_freeze.txt`
- `environment/rlbench_python.txt`
- `environment/rlbench_pip_freeze.txt`
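
When recreating an environment, the freeze files can be compared against what is actually installed. The sketch below assumes the `*_pip_freeze.txt` files are standard `pip freeze` output with `name==version` lines; other line forms (editable installs, `name @ url`) are skipped. The function names are illustrative and not part of the repo.

```python
from importlib import metadata


def parse_freeze(lines):
    """Parse `name==version` pins; skip blanks, comments, options, and other forms."""
    pins = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("#", "-")) or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def diff_against_installed(pins):
    """Return {name: (pinned, installed_or_None)} for every pin that does not match."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


# Example usage, run from the repo root inside the environment being checked:
#   with open("environment/base_pip_freeze.txt") as fh:
#       print(diff_against_installed(parse_freeze(fh)))
```

An empty result means every pinned package is installed at the pinned version in the current interpreter.
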
## Notes On Result Presentation
This repo-level README and the new root docs intentionally keep result text raw:
- file paths
- exact commands
- exact numeric outputs
- exact partial status for in-flight runs
Interpretive material already present inside older staged artifacts remains preserved as part of the historical workspace contents.