# VLAarchtests2

Bundle staged from `/workspace` on `2026-03-31 UTC`.

This repo is the follow-on organization repo to `lsnu/VLAarchtests`. It includes:

- current code under `VLAarchtests/`
- current third-party baseline code under `third_party/`
- current baseline runs, replay artifacts, demo roots, and released checkpoint material under `baselines/`
- current training outputs and checkpoints under `outputs/`
- current logs under `reports/`
- environment recreation files under `environment/`
- raw results and change/test logs at the repo root
- the previous repo README under `history/VLAarchtests_previous_README.md`
- the active handoff file under `handoff/instructions4.md`

## Top-Level Contents

- `VLAarchtests/`
  - code, tests, configs, generated configs, reports, checkpoints, and proxy datasets from the current runpod workspace
- `third_party/AnyBimanual/`
  - local AnyBimanual checkout used for the official overlap baseline branch, including local compatibility patches
- `baselines/`
  - released AnyBimanual checkpoint material
  - overlap replay artifacts
    - HF export packaging note: `baselines/AnyBimanual_overlap_replay/multi/` is sharded into subdirectories to satisfy the Hub `10000 files per directory` limit
  - overlap run directories
  - local subset3 demo roots used by the overlap branch
- `outputs/`
  - RLBench training outputs and checkpoints used by the current anchor, RVT, dual-push, and elastic-controller branches
- `reports/`
  - training and evaluation logs copied from `/workspace/reports`
- `environment/`
  - machine snapshot, package lists, and setup helpers
- `history/`
  - copied previous-repo README
- `handoff/`
  - active sprint instruction file
- `RESULTS_RAW.md`
  - raw result tables and final official overlap eval outputs
- `CHANGE_AND_TEST_LOG.md`
  - file-level change log and executed test commands
- `MODEL_AND_ARTIFACT_INDEX.md`
  - staged directory map with main artifact roots
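The packaging note for `baselines/AnyBimanual_overlap_replay/multi/` describes sharding a large flat directory so that no directory holds more than `10000` files. The sketch below shows one way such a plan could be computed. The function names and the `shard_00000/` naming scheme are hypothetical and are not taken from the staged export; the actual shard layout may differ.

```python
from pathlib import PurePosixPath

MAX_FILES_PER_DIR = 10_000  # per-directory limit quoted in the packaging note


def plan_shards(filenames, max_per_dir=MAX_FILES_PER_DIR, prefix="shard_"):
    """Map each flat filename to a path inside a numbered shard subdirectory.

    Assumption: shards are filled in sorted filename order and named
    `shard_00000`, `shard_00001`, ... (hypothetical naming).
    """
    plan = {}
    for index, name in enumerate(sorted(filenames)):
        shard = f"{prefix}{index // max_per_dir:05d}"
        plan[name] = str(PurePosixPath(shard) / name)
    return plan


def files_per_shard(plan):
    """Count planned files per shard directory to confirm the limit holds."""
    counts = {}
    for target in plan.values():
        shard = PurePosixPath(target).parts[0]
        counts[shard] = counts.get(shard, 0) + 1
    return counts


# Example with synthetic filenames (not real artifact names).
example_plan = plan_shards([f"f{i:05d}.pkl" for i in range(25_000)])
example_counts = files_per_shard(example_plan)
```

With `25000` synthetic files, the plan uses three shard directories, none above the limit.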

## Previous Repo Coverage

The earlier `lsnu/VLAarchtests` repo covered the `2026-03-25/26` work. Its README is copied verbatim at:

- `history/VLAarchtests_previous_README.md`

Previous-repo items explicitly referenced there include:

- compact, spatial, compact-phase, and spatial-phase proxy branches
- earlier RLBench direct-policy and kNN runs
- environment recreation files
- prior raw result tables

## Current Session Additions

Current-session folders added or expanded in this repo include:

- `VLAarchtests/artifacts/reports/sprint_v7_summary/`
- `VLAarchtests/artifacts/reports/sprint_v7_followup/`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iterations/`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter6/`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter7/`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter8/`
- `VLAarchtests/artifacts/reports/task_routed_proxy_v1/`
- `VLAarchtests/artifacts/reports/rlbench_general_debug_20260330/`
- `VLAarchtests/artifacts/reports/anybimanual_overlap_baseline_20260330/`
- `VLAarchtests/artifacts/reports/bag_mode_specialization_20260330/`
- `VLAarchtests/artifacts/reports/general_task_anchor_20260330_dual_push_buttons/`
- `VLAarchtests/artifacts/reports/rvt_overlap_branch_20260330/`
- `VLAarchtests/artifacts/reports/rvt_overlap_branch_fixedbounds_20260330/`
- `VLAarchtests/artifacts/reports/dual_push_nonzero_branch_20260330/`
- `VLAarchtests/artifacts/reports/dual_push_full_arch_hybrid_20260331/`

## Raw Results Snapshot

### Proxy sprint v7

Source:

- `VLAarchtests/artifacts/reports/sprint_v7_summary/reveal_sprint_summary_compact.json`

Raw values:

- base model mean success: `0.28`
- base per-task: foliage `0.39`, bag `0.31`, cloth `0.14`
- random mean success: `0.43333333333333335`
- candidate0 mean success: `0.2`
- oracle mean success: `0.4066666666666667`
- scripted mean success: `1.0`

### Eval-time ablations

Source:

- `VLAarchtests/artifacts/reports/sprint_v7_summary/reveal_sprint_summary_compact.json`

Raw values:

- `no_planner`: `0.2`
- `no_memory`: `0.3233333333333333`
- `no_task_conditioning`: `0.28`
- `no_geometry`: `0.27`
- `no_camera_pose`: `0.29333333333333333`

### Selector checkpoints

Sources:

- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter6/default/reveal_benchmark.json`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter7/full_fixed_default/reveal_benchmark.json`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter8/bag_fixed_default/reveal_benchmark.json`
- `VLAarchtests/artifacts/reports/task_routed_proxy_v1/summary.md`

Raw values:

- `iter6` mean success: `0.4566666666666667`
  - foliage `0.46`, bag `0.4`, cloth `0.51`
- `iter7` mean success: `0.4666666666666666`
  - foliage `0.4`, bag `0.41`, cloth `0.59`
- `iter8` bag-only fixed slice: `0.41`
- routed controller mean success: `0.48666666666666664`
  - routing rule: `foliage -> iter6`, `bag -> iter8`, `cloth -> iter7`
  - per-task: foliage `0.46`, bag `0.41`, cloth `0.59`
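The mean success values listed above are consistent with an unweighted average over the three per-task values. The sketch below re-derives them from the numbers in this section; `check_means` is a hypothetical helper and does not read the benchmark JSON files, whose schema is not described here.

```python
import math

# Per-task success values copied from this section.
PER_TASK = {
    "iter6": {"foliage": 0.46, "bag": 0.4, "cloth": 0.51},
    "iter7": {"foliage": 0.4, "bag": 0.41, "cloth": 0.59},
    "routed": {"foliage": 0.46, "bag": 0.41, "cloth": 0.59},
}

# Mean success values as reported in this section.
REPORTED_MEAN = {
    "iter6": 0.4566666666666667,
    "iter7": 0.4666666666666666,
    "routed": 0.48666666666666664,
}


def check_means(per_task, reported, tol=1e-9):
    """Compare each reported mean with the unweighted per-task average."""
    result = {}
    for name, tasks in per_task.items():
        values = list(tasks.values())
        average = sum(values) / len(values)
        result[name] = math.isclose(average, reported[name], abs_tol=tol)
    return result


mean_checks = check_means(PER_TASK, REPORTED_MEAN)
```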

### Real baseline compare on proxy suite

Source:

- `VLAarchtests/artifacts/reports/real_baseline_compare_v7_full/reveal_benchmark.json`

Raw values:

- `baseline_rgbd_stage3` mean success: `0.31`
  - foliage `0.21`, bag `0.15`, cloth `0.57`
- `iter5_selector` mean success: `0.45`
  - foliage `0.44`, bag `0.4`, cloth `0.51`

### RLBench recovered push-box comparator

Sources:

- `reports/rlbench_general_debug/rlbench_push_box_fair_step1_final_knn_ep10_x99_res224_len180_train80_fixed/bimanual_push_box/rollout_eval.json`
- `reports/rlbench_general_debug/rlbench_push_box_historical_step1_knn_ep10_x99_res224_len180_train80_fixed/bimanual_push_box/rollout_eval.json`

Raw values:

- current fair-step1 final mean success: `0.7`
- current fair-step1 final successes:
  - `[1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]`
- historical push-box control mean success: `0.4`
- historical push-box control successes:
  - `[0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]`
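The mean success values above follow directly from the per-episode success lists. A minimal sketch, using the lists as printed in this section rather than loading `rollout_eval.json` (whose key names are not documented here):

```python
# Per-episode successes copied from this section.
FAIR_STEP1_FINAL = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
HISTORICAL_CONTROL = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]


def mean_success(successes):
    """Fraction of successful episodes in a list of 0.0/1.0 outcomes."""
    return sum(successes) / len(successes)


fair_mean = mean_success(FAIR_STEP1_FINAL)          # 7 of 10 episodes
historical_mean = mean_success(HISTORICAL_CONTROL)  # 4 of 10 episodes
```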

### Official AnyBimanual overlap branch

Sources:

- `baselines/AnyBimanual_overlap_runs/peract_bc_subset3_overlap_smoke200_fixpretrain_nowandb3/PERACT_BC/seed0/training.log`
- `reports/anybimanual_subset3_overlap_resume1000_eval.log`

Raw train milestones:

- global step `300`: loss `40.91718`
- global step `400`: loss `33.26684`
- global step `500`: loss `36.07054`
- global step `600`: loss `35.32345`
- global step `700`: loss `28.50959`
- global step `800`: loss `23.60169`
- global step `900`: loss `15.28901`
- the run reached `weights/1000` and training exited cleanly

Raw eval outputs:

- source log: `reports/anybimanual_subset3_overlap_resume1000_eval.log`
- summary files:
  - `VLAarchtests/artifacts/reports/anybimanual_overlap_baseline_20260330/resume1000_summary/summary.md`
  - `VLAarchtests/artifacts/reports/anybimanual_overlap_baseline_20260330/resume1000_summary/summary.json`
- local last complete step: `1000`
- local mean success: `0.16`
- local per-task success:
  - `coordinated_push_box`: `0.0`
  - `coordinated_lift_ball`: `0.0`
  - `dual_push_buttons`: `0.48`
- local per-task return:
  - `coordinated_push_box`: `0.0`
  - `coordinated_lift_ball`: `0.0`
  - `dual_push_buttons`: `12.0`
- public best overlap step in the local summary: `60000`
- public best mean success in the local summary: `0.6933333333333334`

### Validated general-task anchor: `dual_push_buttons`

Sources:

- `VLAarchtests/artifacts/reports/general_task_anchor_20260330_dual_push_buttons/summary.json`
- `baselines/AnyBimanual_release_eval_anchor/perlf_release_dual_push_buttons_ep25/PERACT_BC/seed0/eval_data.csv`

Raw values:

- public AnyBimanual release, step `60000`: success `0.96`, return `24.0`, length `21.56`
- local official single-task eval, step `60000`, `25` episodes: success `0.96`, return `24.0`, length `21.84`
- local CLIP backbone-only result on same task: success `0.0`, return `0.0`
- local elastic reveal proxy iter6 result on same task: success `0.0`, return `0.0`
- local RVT frozen fixed-bounds result on same task: success `0.0`, return `0.0`

### RVT overlap branch

Sources:

- `VLAarchtests/artifacts/reports/rvt_overlap_branch_20260330/summary.md`
- `VLAarchtests/artifacts/reports/rvt_overlap_branch_fixedbounds_20260330/summary.md`

Raw values:

- frozen RVT stage1 train summary:
  - `outputs/rlbench_rvt_branch/rlbench_subset3_backbone_only_rvt_100demo_frozen_seed17/summary.json`
  - final train total `0.043179353826920445`
  - final val total `0.039591669984665984`
- frozen RVT overlap eval: mean success `0.0`
- frozen fixed-bounds RVT overlap eval: mean success `0.0`
- gate values, identical for both branches:
  - local AnyBimanual overlap floor `0.16`
  - stage2 run `false`

### Dual-push non-privileged retarget branch

Sources:

- `VLAarchtests/artifacts/reports/dual_push_nonzero_branch_20260330/summary.md`

Raw values:

- demo replay through `absolute_action_from_delta`:
  - `reports/dual_push_nonzero_branch_20260330/demo_replay/replay_summary.json`
  - mean success `0.8`
  - mean return `0.8`
- retargeted demo with checkpoint backbone retrieval and vision-only button localization (`ep1` run):
  - `reports/dual_push_nonzero_branch_20260330/retargeted_demo_backbone_vision_ep1/summary.json`
  - mean success `1.0`
  - mean return `1.0`
- retargeted demo with checkpoint backbone retrieval and vision-only button localization (`ep5` run):
  - `reports/dual_push_nonzero_branch_20260330/retargeted_demo_backbone_vision_ep5/summary.json`
  - mean success `1.0`
  - mean return `1.0`

### Dual-push full-architecture hybrid branch

Sources:

- `VLAarchtests/artifacts/reports/dual_push_full_arch_hybrid_20260331/summary.md`
- `reports/dual_push_full_arch_probe_iter6_scene_ep1/summary.json`
- `reports/dual_push_full_arch_hybrid_iter6_backbone_ep1/summary.json`

Raw values:

- elastic checkpoint retargeted-demo probe with scene retrieval and vision-only button localization:
  - `1` episode
  - mean success `1.0`
  - mean return `1.0`
  - steps `94`
  - retrieved episode index `11`
  - retrieval similarity `0.9998629689216614`
- full-architecture hybrid eval with elastic controller checkpoint plus dual-push retrieval checkpoint:
  - `1` episode
  - mean success `1.0`
  - mean return `1.0`
  - steps `116`
  - path recoveries `0`
  - noop fallbacks `0`
  - first selected mode `residual::maintain_opening`
  - last selected mode `residual::base_action`

## Environment Recreation

Environment files are under `environment/`, including:

- `environment/setup_same_hardware.sh`
- `environment/runtime_env_vars.sh`
- `environment/reconstruct_anybimanual_overlap_replay.sh`
- `environment/hardware_snapshot.txt`
- `environment/env_list.txt`
- `environment/base_python.txt`
- `environment/base_pip_freeze.txt`
- `environment/rlbench_python.txt`
- `environment/rlbench_pip_freeze.txt`

## Notes On Result Presentation

This repo-level README and the new root docs intentionally keep result text raw, limited to:

- file paths
- exact commands
- exact numeric outputs
- exact partial status for in-flight runs

Interpretive material already present inside older staged artifacts remains preserved as part of the historical workspace contents.