# VLAarchtests2

Bundle staged from `/workspace` on `2026-03-31 UTC`.

This repo is the follow-on organization repo to `lsnu/VLAarchtests`. It includes:

- current code under `VLAarchtests/`
- current third-party baseline code under `third_party/`
- current baseline runs, replay artifacts, demo roots, and released checkpoint material under `baselines/`
- current training outputs and checkpoints under `outputs/`
- current logs under `reports/`
- environment recreation files under `environment/`
- raw results and change/test logs at the repo root
- the previous repo README under `history/VLAarchtests_previous_README.md`
- the active handoff file under `handoff/instructions4.md`

## Top-Level Contents

- `VLAarchtests/`
  - code, tests, configs, generated configs, reports, checkpoints, and proxy datasets from the current RunPod workspace
- `third_party/AnyBimanual/`
  - local AnyBimanual checkout used for the official overlap baseline branch, including local compatibility patches
- `baselines/`
  - released AnyBimanual checkpoint material
  - overlap replay artifacts
    - HF export packaging note: `baselines/AnyBimanual_overlap_replay/multi/` is sharded into subdirectories to satisfy the Hub `10000 files per directory` limit
  - overlap run directories
  - local subset3 demo roots used by the overlap branch
- `outputs/`
  - RLBench training outputs and checkpoints used by the current anchor, RVT, dual-push, and elastic-controller branches
- `reports/`
  - training and evaluation logs copied from `/workspace/reports`
- `environment/`
  - machine snapshot, package lists, and setup helpers
- `history/`
  - copied previous-repo README
- `handoff/`
  - active sprint instruction file
- `RESULTS_RAW.md`
  - raw result tables and final official overlap eval outputs
- `CHANGE_AND_TEST_LOG.md`
  - file-level change log and executed test commands
- `MODEL_AND_ARTIFACT_INDEX.md`
  - staged directory map with main artifact roots
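
The packaging note for `baselines/AnyBimanual_overlap_replay/multi/` mentions sharding a large directory to stay under the Hub per-directory file limit. The sketch below shows one way such sharding could be done for a flat directory of files. The function name `shard_directory` and the `shard_NNNN` naming are hypothetical choices for illustration; the layout actually used in this repo is not documented in this README and may differ.

```python
from pathlib import Path
import shutil

MAX_FILES_PER_DIR = 10_000  # Hub per-directory limit cited in the packaging note


def shard_directory(root: Path, max_files: int = MAX_FILES_PER_DIR) -> list[Path]:
    """Move the files directly under `root` into shard_0000/, shard_0001/, ...

    Each shard receives at most `max_files` files. The `shard_NNNN` naming is a
    hypothetical choice for illustration, not the layout used in this repo.
    """
    # Sort so that the assignment of files to shards is deterministic.
    files = sorted(p for p in root.iterdir() if p.is_file())
    shards: list[Path] = []
    for start in range(0, len(files), max_files):
        shard = root / f"shard_{start // max_files:04d}"
        shard.mkdir(exist_ok=True)
        for path in files[start:start + max_files]:
            shutil.move(str(path), str(shard / path.name))
        shards.append(shard)
    return shards
```

Sorting before chunking keeps the file-to-shard mapping reproducible, so a reverse step can restore the flat layout by moving every file back up one level.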

## Previous Repo Coverage

The earlier `lsnu/VLAarchtests` repo covered the `2026-03-25/26` work. Its README is copied verbatim at:

- `history/VLAarchtests_previous_README.md`

Previous-repo items explicitly referenced there include:

- compact, spatial, compact-phase, and spatial-phase proxy branches
- earlier RLBench direct-policy and kNN runs
- environment recreation files
- prior raw result tables

## Current Session Additions

Current-session folders added or expanded in this repo include:

- `VLAarchtests/artifacts/reports/sprint_v7_summary/`
- `VLAarchtests/artifacts/reports/sprint_v7_followup/`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iterations/`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter6/`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter7/`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter8/`
- `VLAarchtests/artifacts/reports/task_routed_proxy_v1/`
- `VLAarchtests/artifacts/reports/rlbench_general_debug_20260330/`
- `VLAarchtests/artifacts/reports/anybimanual_overlap_baseline_20260330/`
- `VLAarchtests/artifacts/reports/bag_mode_specialization_20260330/`
- `VLAarchtests/artifacts/reports/general_task_anchor_20260330_dual_push_buttons/`
- `VLAarchtests/artifacts/reports/rvt_overlap_branch_20260330/`
- `VLAarchtests/artifacts/reports/rvt_overlap_branch_fixedbounds_20260330/`
- `VLAarchtests/artifacts/reports/dual_push_nonzero_branch_20260330/`
- `VLAarchtests/artifacts/reports/dual_push_full_arch_hybrid_20260331/`

## Raw Results Snapshot

### Proxy sprint v7

Source:

- `VLAarchtests/artifacts/reports/sprint_v7_summary/reveal_sprint_summary_compact.json`

Raw values:

- base model mean success: `0.28`
- base per-task: foliage `0.39`, bag `0.31`, cloth `0.14`
- random mean success: `0.43333333333333335`
- candidate0 mean success: `0.2`
- oracle mean success: `0.4066666666666667`
- scripted mean success: `1.0`

### Eval-time ablations

Source:

- `VLAarchtests/artifacts/reports/sprint_v7_summary/reveal_sprint_summary_compact.json`

Raw values:

- `no_planner`: `0.2`
- `no_memory`: `0.3233333333333333`
- `no_task_conditioning`: `0.28`
- `no_geometry`: `0.27`
- `no_camera_pose`: `0.29333333333333333`

### Selector checkpoints

Sources:

- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter6/default/reveal_benchmark.json`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter7/full_fixed_default/reveal_benchmark.json`
- `VLAarchtests/artifacts/reports/selector_finetune_v7_iter8/bag_fixed_default/reveal_benchmark.json`
- `VLAarchtests/artifacts/reports/task_routed_proxy_v1/summary.md`

Raw values:

- `iter6` mean success: `0.4566666666666667`
  - foliage `0.46`, bag `0.4`, cloth `0.51`
- `iter7` mean success: `0.4666666666666666`
  - foliage `0.4`, bag `0.41`, cloth `0.59`
- `iter8` bag-only fixed slice: `0.41`
- routed controller mean success: `0.48666666666666664`
  - routing rule: `foliage -> iter6`, `bag -> iter8`, `cloth -> iter7`
  - per-task: foliage `0.46`, bag `0.41`, cloth `0.59`
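
The routed controller numbers can be reproduced from the per-checkpoint values listed above with a simple task-to-checkpoint lookup. This is a minimal sketch, not the repo's routing code: the names `SCORES`, `ROUTING`, and `routed_scores` are hypothetical. It assumes cloth is served by `iter7`, since `0.59` is the cloth value listed for `iter7` and `iter8` is reported only on the bag slice.

```python
from statistics import fmean

# Per-task success values copied from the raw values above.
SCORES = {
    "iter6": {"foliage": 0.46, "bag": 0.40, "cloth": 0.51},
    "iter7": {"foliage": 0.40, "bag": 0.41, "cloth": 0.59},
    "iter8": {"bag": 0.41},  # bag-only fixed slice
}

# Task -> checkpoint table. The cloth entry is an assumption (see the lead-in).
ROUTING = {"foliage": "iter6", "bag": "iter8", "cloth": "iter7"}


def routed_scores(scores, routing):
    """Look up, for each task, the score of the checkpoint it is routed to."""
    return {task: scores[ckpt][task] for task, ckpt in routing.items()}


per_task = routed_scores(SCORES, ROUTING)
mean_success = fmean(per_task.values())
print(per_task)
print(mean_success)
```

The unweighted mean over the three routed per-task values agrees with the reported routed controller mean success up to floating-point rounding.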

### Real baseline compare on proxy suite

Source:

- `VLAarchtests/artifacts/reports/real_baseline_compare_v7_full/reveal_benchmark.json`

Raw values:

- `baseline_rgbd_stage3` mean success: `0.31`
  - foliage `0.21`, bag `0.15`, cloth `0.57`
- `iter5_selector` mean success: `0.45`
  - foliage `0.44`, bag `0.4`, cloth `0.51`

### RLBench recovered push-box comparator

Sources:

- `reports/rlbench_general_debug/rlbench_push_box_fair_step1_final_knn_ep10_x99_res224_len180_train80_fixed/bimanual_push_box/rollout_eval.json`
- `reports/rlbench_general_debug/rlbench_push_box_historical_step1_knn_ep10_x99_res224_len180_train80_fixed/bimanual_push_box/rollout_eval.json`

Raw values:

- current fair-step1 final mean success: `0.7`
- current fair-step1 final successes:
  - `[1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]`
- historical push-box control mean success: `0.4`
- historical push-box control successes:
  - `[0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]`
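
The mean success values above follow directly from the per-episode success lists. A small sketch for recomputing them is shown here; the lists are copied from the raw values, and the helper name `summarize` is hypothetical.

```python
from statistics import fmean

# Per-episode 0/1 outcomes copied from the raw values above.
fair_step1 = [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
historical = [0.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0]


def summarize(successes):
    """Return (episode count, success count, mean success) for 0/1 outcomes."""
    n = len(successes)
    wins = int(sum(successes))
    return n, wins, fmean(successes)


for name, run in [("fair-step1 final", fair_step1), ("historical control", historical)]:
    n, wins, mean = summarize(run)
    print(f"{name}: {wins}/{n} episodes, mean success {mean}")
```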

### Official AnyBimanual overlap branch

Sources:

- `baselines/AnyBimanual_overlap_runs/peract_bc_subset3_overlap_smoke200_fixpretrain_nowandb3/PERACT_BC/seed0/training.log`
- `reports/anybimanual_subset3_overlap_resume1000_eval.log`

Raw train milestones:

- global step `300`: loss `40.91718`
- global step `400`: loss `33.26684`
- global step `500`: loss `36.07054`
- global step `600`: loss `35.32345`
- global step `700`: loss `28.50959`
- global step `800`: loss `23.60169`
- global step `900`: loss `15.28901`
- run reached `weights/1000` and training exited cleanly

Raw eval outputs:

- source log: `reports/anybimanual_subset3_overlap_resume1000_eval.log`
- summary files:
  - `VLAarchtests/artifacts/reports/anybimanual_overlap_baseline_20260330/resume1000_summary/summary.md`
  - `VLAarchtests/artifacts/reports/anybimanual_overlap_baseline_20260330/resume1000_summary/summary.json`
- local last complete step: `1000`
- local mean success: `0.16`
- local per-task success:
  - `coordinated_push_box`: `0.0`
  - `coordinated_lift_ball`: `0.0`
  - `dual_push_buttons`: `0.48`
- local per-task return:
  - `coordinated_push_box`: `0.0`
  - `coordinated_lift_ball`: `0.0`
  - `dual_push_buttons`: `12.0`
- public best overlap step in the local summary: `60000`
- public best mean success in the local summary: `0.6933333333333334`

### Validated general-task anchor: `dual_push_buttons`

Sources:

- `VLAarchtests/artifacts/reports/general_task_anchor_20260330_dual_push_buttons/summary.json`
- `baselines/AnyBimanual_release_eval_anchor/perlf_release_dual_push_buttons_ep25/PERACT_BC/seed0/eval_data.csv`

Raw values:

- public AnyBimanual release, step `60000`: success `0.96`, return `24.0`, length `21.56`
- local official single-task eval, step `60000`, `25` episodes: success `0.96`, return `24.0`, length `21.84`
- local CLIP backbone-only result on same task: success `0.0`, return `0.0`
- local elastic reveal proxy iter6 result on same task: success `0.0`, return `0.0`
- local RVT frozen fixed-bounds result on same task: success `0.0`, return `0.0`

### RVT overlap branch

Sources:

- `VLAarchtests/artifacts/reports/rvt_overlap_branch_20260330/summary.md`
- `VLAarchtests/artifacts/reports/rvt_overlap_branch_fixedbounds_20260330/summary.md`

Raw values:

- frozen RVT stage1 train summary:
  - `outputs/rlbench_rvt_branch/rlbench_subset3_backbone_only_rvt_100demo_frozen_seed17/summary.json`
  - final train total `0.043179353826920445`
  - final val total `0.039591669984665984`
- frozen RVT overlap eval: mean success `0.0`
- frozen fixed-bounds RVT overlap eval: mean success `0.0`
- both branch gates:
  - local AnyBimanual overlap floor `0.16`
  - stage2 run `false`

### Dual-push non-privileged retarget branch

Sources:

- `VLAarchtests/artifacts/reports/dual_push_nonzero_branch_20260330/summary.md`

Raw values:

- demo replay through `absolute_action_from_delta`:
  - `reports/dual_push_nonzero_branch_20260330/demo_replay/replay_summary.json`
  - mean success `0.8`
  - mean return `0.8`
- retargeted demo with checkpoint backbone retrieval and vision-only button localization (`ep1` run):
  - `reports/dual_push_nonzero_branch_20260330/retargeted_demo_backbone_vision_ep1/summary.json`
  - mean success `1.0`
  - mean return `1.0`
- retargeted demo with checkpoint backbone retrieval and vision-only button localization (`ep5` run):
  - `reports/dual_push_nonzero_branch_20260330/retargeted_demo_backbone_vision_ep5/summary.json`
  - mean success `1.0`
  - mean return `1.0`

### Dual-push full-architecture hybrid branch

Sources:

- `VLAarchtests/artifacts/reports/dual_push_full_arch_hybrid_20260331/summary.md`
- `reports/dual_push_full_arch_probe_iter6_scene_ep1/summary.json`
- `reports/dual_push_full_arch_hybrid_iter6_backbone_ep1/summary.json`

Raw values:

- elastic checkpoint retargeted-demo probe with scene retrieval and vision-only button localization:
  - `1` episode
  - mean success `1.0`
  - mean return `1.0`
  - steps `94`
  - retrieved episode index `11`
  - retrieval similarity `0.9998629689216614`
- full-architecture hybrid eval with elastic controller checkpoint plus dual-push retrieval checkpoint:
  - `1` episode
  - mean success `1.0`
  - mean return `1.0`
  - steps `116`
  - path recoveries `0`
  - noop fallbacks `0`
  - first selected mode `residual::maintain_opening`
  - last selected mode `residual::base_action`

## Environment Recreation

Environment files are under `environment/`, including:

- `environment/setup_same_hardware.sh`
- `environment/runtime_env_vars.sh`
- `environment/reconstruct_anybimanual_overlap_replay.sh`
- `environment/hardware_snapshot.txt`
- `environment/env_list.txt`
- `environment/base_python.txt`
- `environment/base_pip_freeze.txt`
- `environment/rlbench_python.txt`
- `environment/rlbench_pip_freeze.txt`

## Notes On Result Presentation

This repo-level README and the new root docs intentionally keep result text raw:

- file paths
- exact commands
- exact numeric outputs
- exact partial status for in-flight runs

Interpretive material already present inside older staged artifacts remains preserved as part of the historical workspace contents.