Commit e7a804e (verified), committed by wsagi
1 Parent(s): 018a3a4

Add files using upload-large-folder tool

Files changed (4)
  1. README.md +31 -31
  2. config.json +2 -2
  3. model.safetensors +1 -1
  4. train_config.json +5 -5
README.md CHANGED
@@ -36,41 +36,32 @@ _An [X-VLA](https://arxiv.org/abs/2510.10274) (Florence2 + Soft-Prompted Transfo
36
  _Single-arm SO-101 picks 3 oranges sequentially and places each in a plate._
37
  - **Dataset**: [`LightwheelAI/leisaac-pick-orange`](https://huggingface.co/datasets/LightwheelAI/leisaac-pick-orange), 60 teleoperated demonstration episodes (50 train / 10 val split).
38
  - **Architecture**: X-VLA (Florence2 vision-language encoder + Soft-Prompted Transformer + Rectified-Flow action head, 10 denoising steps). chunk_size=32, n_obs_steps=2.
39
- - **Training**: batch=8 / lr=1e-4 / **15k steps** (10k from base + 5k resumed) / **weak image-aug (brightness ±5% only)** / GRIPPER_SCALE=5 / ~26 min on RTX 4090.
40
- - **Eval**: Isaac Sim 5.1 + LeIsaac, 6 rounds × 60s. **7/18 oranges placed + 1/6 formal env success**; best ep = **3/3 all placed @ 31s** (32% of the time budget).
41
  - **⚠️ Critical inference setting**: `n_action_steps=32` (reuse the full chunk, i.e. chunk_size).
42
  With the default `n_action_steps=8`, this checkpoint scores **0/18 over 6 rounds, a catastrophic failure** (re-planning at every step produces mutually conflicting actions). See [Inference caveat](#-推理关键配置--critical-inference-caveat) below.
43
 
44
  ## Model highlights
45
  _Highlights_
46
 
47
- - **First single-episode formal Isaac Sim env success of the session, completed in 31s**: ep1 = `[True, True, True] @ 31.0s`. None of the other baseline / retrain configurations across the 90+ experiments ever reached an env-confirmed task done.
48
- _First single-episode formal Isaac Sim env-side task completion (env reports `done=True`) in our 90+ experiment hyperparameter sweep: ep1 placed all 3 oranges in 31.0s, well under the 90s wall cap._
49
  - **Exposes the key role of `n_action_steps`**: switching from the default 8 to 32 was the only reliable improvement in the session, about 3.5× over baseline.
50
  _Exposes `n_action_steps` as the single most reliable improvement — switching from default 8 to chunk_size=32 (full chunk reuse) gave ~3.5× over baseline._
51
- - **Weak image-aug is the only retrain that is net positive in aggregate**: lerobot's default ColorJitter+Sharp+Affine over-regularizes on the 50-demo dataset (13% per-ep); keeping only brightness ±5% (max_num_transforms=1) instead beats baseline by a genuine +5.6%.
52
- _Out of 6 retrain experiments (velocity-reweight, L1 loss, default image-aug, weak image-aug, body-desc, L1+aug compound), **only weak image-aug was net positive**. Default aug strength was harmful (-11.1% vs baseline); minimal brightness-only aug + 15k step gave the result above._
53
 
54
  ## Training recipe
55
  _Training recipe_
56
 
57
  ```bash
58
- # 1. Stage 1: 10k steps from lerobot/xvla-base
59
  WEAK_IMAGE_AUG=1 \
60
  BATCH_SIZE=8 \
61
  MAX_STEPS=10000 \
62
  SAVE_FREQ=500 \
63
  OUTPUT_DIR=$LEISAAC/outputs/xvla-leisaac-pick-orange.weakaug \
64
  bash LeIsaac/scripts/finetune/xvla/train.sh
65
-
66
- # 2. Stage 2: resume 10k → 15k (critical: no formal success had appeared yet at 10k)
67
- WEAK_IMAGE_AUG=1 \
68
- BATCH_SIZE=8 \
69
- MAX_STEPS=15000 \
70
- SAVE_FREQ=500 \
71
- RESUME=true \
72
- OUTPUT_DIR=$LEISAAC/outputs/xvla-leisaac-pick-orange.weakaug \
73
- bash LeIsaac/scripts/finetune/xvla/train.sh
74
  ```
75
 
76
  Inside [`train.sh`](https://github.com/vitorcen/LeIsaac-Training/blob/main/scripts/finetune/xvla/train.sh), `WEAK_IMAGE_AUG=1` expands to:
@@ -101,10 +92,10 @@ bash server/serve_xvla.sh --detach
101
  POLICY_PORT=5558 \
102
  POLICY_TIMEOUT_MS=3000 \
103
  ACTION_HORIZON=1 \
104
- EVAL_ROUNDS=6 \
105
- EPISODE_LENGTH=60 \
106
  PROMPT="Pick up the orange and put it in the plate" \
107
- MAX_ROUND_WALL_S=90 \
108
  bash server/eval_pi05.sh
109
  ```
110
 
@@ -116,24 +107,33 @@ bash server/eval_pi05.sh
116
  |---|---|---|---|
117
  | 8 (lerobot default) | **0/18** ❌ | 0% | Re-plans every step; chunk[0]→chunk[0]→... actions fight each other |
118
  | 16 | 4/18 | 22% | Partial chunk reuse |
119
- | **32 (= chunk_size)** | **7/18 + 1 success** ⭐ | **39%** | Full chunk reuse; each chunk is self-consistent |
120
 
121
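  **X-VLA's RF action head generates a 32-step chunk in one shot, and the chunk must be fully played out in the env** for its planning value to show. Re-planning at every step instead misaligns the chunk sequence.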
122
 
123
  ## Evaluation results
124
  _Evaluation_
125
 
126
- 6 rounds × 60s × Isaac Sim PickOrange (`LeIsaac-SO101-PickOrange-v0`), `max_round_wall_s=90`:
127
 
128
  | Episode | oranges placed | wall time | Notes |
129
  |---|---|---|---|
130
- | 1 | **3/3** | **31.0s** | **formal env success** ⭐⭐⭐ |
131
- | 2 | 2/3 | 90.0s | wall_cap |
132
- | 3 | 0/3 | 90.1s | wall_cap |
133
- | 4 | 2/3 | 90.0s | wall_cap |
134
- | 5 | 0/3 | 90.0s | wall_cap |
135
- | 6 | 0/3 | 90.0s | wall_cap |
136
- | **Total** | **7/18 (39%)** | — | **1/6 success rate** |
 
 
 
 
 
 
 
 
 
137
 
138
  ### Full aggregate table of retrain experiments
139
 
@@ -165,9 +165,9 @@ _Negative findings — DO NOT repeat_
165
 
166
  ## Limitations
167
 
168
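- - **Small sample size**: the ~39% per-ep rate is an estimate from 6 rounds (18 ep), with a confidence interval of ±15%. The 15k checkpoint's `1/6 formal success + 3/3@31s` is a single-episode breakthrough; no 12-round re-test was run to confirm stability, so replicate it to see whether the result holds.
169
- - **Only 50 demos in the dataset**: retrains that change the loss / aug are generally too aggressive; expanding to 80-100 demos should break through the current 39% ceiling.
170
- - **The place subtask is multimodal**: after grasping, the model occasionally hovers and jitters in mid-air (frequent in the intermediate 13k/14k ckpts). DAgger or synthetic relabeling may be needed to fix the covariate shift.
171
  - **chunk_size=32 vs. wall clock**: 1 chunk = 32 steps × 33ms ≈ 1s planning period. More responsive than ACT (chunk=100, 3.3s period) but slower than DP DDIM-32 (200ms period).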
172
 
173
  ## Citations
 
36
  _Single-arm SO-101 picks 3 oranges sequentially and places each in a plate._
37
  - **Dataset**: [`LightwheelAI/leisaac-pick-orange`](https://huggingface.co/datasets/LightwheelAI/leisaac-pick-orange), 60 teleoperated demonstration episodes (50 train / 10 val split).
38
  - **Architecture**: X-VLA (Florence2 vision-language encoder + Soft-Prompted Transformer + Rectified-Flow action head, 10 denoising steps). chunk_size=32, n_obs_steps=2.
39
+ - **Training**: batch=8 / lr=1e-4 / **10k steps** / **weak image-aug (brightness ±5% only)** / GRIPPER_SCALE=5 / ~18 min on RTX 4090.
40
+ - **Eval** (benchmark-aligned: 3 rounds × 120s sim × 180s wall_cap, the same conditions as the other leaderboard baselines): **4/9 oranges (44%)**, **ep2 = [T, T, T] 3/3**.
41
  - **⚠️ Critical inference setting**: `n_action_steps=32` (reuse the full chunk, i.e. chunk_size).
42
  With the default `n_action_steps=8`, this checkpoint scores **0/18 over 6 rounds, a catastrophic failure** (re-planning at every step produces mutually conflicting actions). See [Inference caveat](#-推理关键配置--critical-inference-caveat) below.
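
The `config.json` shipped in this commit still records `n_action_steps: 8` next to `chunk_size: 32`, so the setting recommended above has to be applied at load or serve time. A minimal sketch of that override on the raw JSON, assuming you patch the config dict yourself before passing it to whatever loader you use. The helper name `with_full_chunk_reuse` is hypothetical and not part of lerobot:

```python
import json


def with_full_chunk_reuse(config: dict) -> dict:
    """Return a copy of the config with n_action_steps set to chunk_size.

    Hypothetical helper for illustration; it only edits a plain dict.
    """
    patched = dict(config)
    patched["n_action_steps"] = patched["chunk_size"]
    return patched


# Fragment mirroring the fields visible in this checkpoint's config.json.
shipped = json.loads('{"chunk_size": 32, "n_action_steps": 8, "dtype": "bfloat16"}')
patched = with_full_chunk_reuse(shipped)
print(json.dumps(patched, indent=2))
```

The original dict is left untouched, so the shipped default and the patched value can be compared side by side.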
43
 
44
  ## Model highlights
45
  _Highlights_
46
 
47
+ - **Under the benchmark setting (3 rounds × 120s sim × 180s wall_cap), ep2 = 3/3, all oranges placed**. None of the other baselines (ACT, DP, X-VLA-15k) achieved a single-episode 3/3 under the same conditions.
48
+ _Under standardized benchmark conditions (matching leaderboard protocol), ep2 placed all 3 oranges, a feat not achieved by ACT, DP, or X-VLA-15k under the same evaluation._
49
  - **Exposes the key role of `n_action_steps`**: switching from the default 8 to 32 was the only reliable improvement in the session, about 3.5× over baseline.
50
  _Exposes `n_action_steps` as the single most reliable improvement — switching from default 8 to chunk_size=32 (full chunk reuse) gave ~3.5× over baseline._
51
+ - **Weak image-aug is the only retrain that is net positive in aggregate**: lerobot's default ColorJitter+Sharp+Affine over-regularizes on the 50-demo dataset (13% per-ep); keeping only brightness ±5% (max_num_transforms=1) instead beats baseline by a genuine +5.6% and reaches 44% per-ep at 10k.
52
+ _Out of 6 retrain experiments (velocity-reweight, L1 loss, default image-aug, weak image-aug, body-desc, L1+aug compound), **only weak image-aug was net positive**. Default aug strength was harmful (-11.1% vs baseline); minimal brightness-only aug at 10k step gave 44% per-ep on benchmark._
53
 
54
  ## Training recipe
55
  _Training recipe_
56
 
57
  ```bash
58
+ # Single stage: 10k steps from lerobot/xvla-base
59
  WEAK_IMAGE_AUG=1 \
60
  BATCH_SIZE=8 \
61
  MAX_STEPS=10000 \
62
  SAVE_FREQ=500 \
63
  OUTPUT_DIR=$LEISAAC/outputs/xvla-leisaac-pick-orange.weakaug \
64
  bash LeIsaac/scripts/finetune/xvla/train.sh
 
 
 
 
 
 
 
 
 
65
  ```
66
 
67
  Inside [`train.sh`](https://github.com/vitorcen/LeIsaac-Training/blob/main/scripts/finetune/xvla/train.sh), `WEAK_IMAGE_AUG=1` expands to:
 
92
  POLICY_PORT=5558 \
93
  POLICY_TIMEOUT_MS=3000 \
94
  ACTION_HORIZON=1 \
95
+ EVAL_ROUNDS=3 \
96
+ EPISODE_LENGTH=120 \
97
  PROMPT="Pick up the orange and put it in the plate" \
98
+ MAX_ROUND_WALL_S=180 \
99
  bash server/eval_pi05.sh
100
  ```
101
 
 
107
  |---|---|---|---|
108
  | 8 (lerobot default) | **0/18** ❌ | 0% | Re-plans every step; chunk[0]→chunk[0]→... actions fight each other |
109
  | 16 | 4/18 | 22% | Partial chunk reuse |
110
+ | **32 (= chunk_size)** | **6/18 + 3/3 perfect** ⭐ | **33%** | Full chunk reuse; each chunk is self-consistent |
111
 
112
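  **X-VLA's RF action head generates a 32-step chunk in one shot, and the chunk must be fully played out in the env** for its planning value to show. Re-planning at every step instead misaligns the chunk sequence.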
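
To make the re-planning trade-off concrete, here is a generic receding-horizon sketch. It is not lerobot's or X-VLA's actual implementation: it assumes a policy that emits `chunk_size` actions per call and an executor that plays the first `n_action_steps` of them before asking for a new chunk. The 1800-step episode length is an assumed round figure for 60 s at roughly 33 ms per step.

```python
from collections import deque


def count_policy_calls(total_steps: int, chunk_size: int, n_action_steps: int) -> int:
    """Count how often a chunked policy is queried during one episode."""
    if not 1 <= n_action_steps <= chunk_size:
        raise ValueError("n_action_steps must be in [1, chunk_size]")
    queue = deque()
    calls = 0
    for _ in range(total_steps):
        if not queue:
            # Stand-in for one policy forward pass producing a full chunk.
            chunk = list(range(chunk_size))
            # Only the first n_action_steps are executed; the rest is discarded.
            queue.extend(chunk[:n_action_steps])
            calls += 1
        queue.popleft()  # execute one action in the env
    return calls


# Assumed episode length: ~60 s at ~33 ms per step.
calls = {n: count_policy_calls(1800, chunk_size=32, n_action_steps=n) for n in (8, 16, 32)}
for n, c in calls.items():
    print(f"n_action_steps={n}: {c} policy calls, {32 - n} planned actions dropped per call")
```

With `n_action_steps` below `chunk_size`, the tail of every chunk is thrown away and the next chunk starts from a fresh prediction, which is the misalignment the paragraph above describes.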
113
 
114
  ## Evaluation results
115
  _Evaluation_
116
 
117
+ ### Benchmark-aligned (3 rounds × 120s sim × 180s wall_cap), same conditions as the leaderboard
118
 
119
  | Episode | oranges placed | wall time | Notes |
120
  |---|---|---|---|
121
+ | 1 | 1/3 | 180.1s | wall_cap |
122
+ | 2 | **3/3** | **180.0s** | **3/3 perfect** ⭐ |
123
+ | 3 | 0/3 | 180.1s | wall_cap |
124
+ | **Total** | **4/9 (44%)** | | 0/3 strict (env never reported done; the 3 oranges were only placed correctly) |
125
+
126
+ ### Extended 6-round eval (60s sim × 90s wall_cap)
127
+
128
+ | Episode | oranges placed | wall time |
129
+ |---|---|---|
130
+ | 1 | 1/3 | 90.0s |
131
+ | 2 | **3/3** ✅ | 90.0s |
132
+ | 3 | 0/3 | 90.0s |
133
+ | 4 | 1/3 | 90.1s |
134
+ | 5 | 0/3 | 90.0s |
135
+ | 6 | 1/3 | 90.1s |
136
+ | **Total** | **6/18 (33%)** | — |
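
The totals in the two tables above can be re-derived from the per-episode counts. A small sketch, with the counts copied from the tables and 3 oranges per episode as stated in the task description (the function name `tally` is illustrative):

```python
def tally(placed_per_episode, oranges_per_episode=3):
    """Return (placed, possible, rate) for a list of per-episode counts."""
    placed = sum(placed_per_episode)
    possible = oranges_per_episode * len(placed_per_episode)
    return placed, possible, placed / possible


benchmark = [1, 3, 0]          # 3 rounds × 120s sim × 180s wall_cap
extended = [1, 3, 0, 1, 0, 1]  # 6 rounds × 60s sim × 90s wall_cap

for name, counts in (("benchmark", benchmark), ("extended", extended)):
    placed, possible, rate = tally(counts)
    print(f"{name}: {placed}/{possible} ({rate:.0%})")
```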
137
 
138
  ### Full aggregate table of retrain experiments
139
 
 
165
 
166
  ## Limitations
167
 
168
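+ - **Small sample size**: the 44% per-ep rate is an estimate from the 3-round benchmark (9 ep), with a confidence interval of ±20%. The extended 6-round eval gives 33% (18 ep, CI ±15%).
169
+ - **Only 50 demos in the dataset**: retrains that change the loss / aug are generally too aggressive; expanding to 80-100 demos should break through the current ~44% per-ep ceiling.
170
+ - **The place subtask is multimodal**: after grasping, the model occasionally hovers and jitters in mid-air. DAgger or synthetic relabeling may be needed to fix the covariate shift.
171
  - **chunk_size=32 vs. wall clock**: 1 chunk = 32 steps × 33ms ≈ 1s planning period. More responsive than ACT (chunk=100, 3.3s period) but slower than DP DDIM-32 (200ms period).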
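
For readers who want to recompute an interval for rates like those in the first bullet, one common choice for small samples is the Wilson score interval. The sketch below is an assumption about method, not how the ± figures above were derived. It also treats each orange placement as an independent trial, which is optimistic because placements within one episode are correlated.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Counts taken from the evaluation tables: 4/9 (benchmark) and 6/18 (extended).
for placed, total in ((4, 9), (6, 18)):
    lo, hi = wilson_interval(placed, total)
    print(f"{placed}/{total}: {lo:.2f} to {hi:.2f}")
```

Unlike a symmetric ± band, the Wilson interval stays inside [0, 1] and is asymmetric around the point estimate when the sample is this small.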
172
 
173
  ## Citations
config.json CHANGED
@@ -57,7 +57,7 @@
57
  "private": null,
58
  "tags": null,
59
  "license": null,
60
- "pretrained_path": "/home/david/work/isaaclab-experience/LeIsaac/outputs/xvla-leisaac-pick-orange.weakaug/checkpoints/last/pretrained_model",
61
  "chunk_size": 32,
62
  "n_action_steps": 8,
63
  "dtype": "bfloat16",
@@ -208,7 +208,7 @@
208
  224,
209
  224
210
  ],
211
- "num_image_views": 5,
212
  "empty_cameras": 1,
213
  "freeze_vision_encoder": true,
214
  "freeze_language_encoder": true,
 
57
  "private": null,
58
  "tags": null,
59
  "license": null,
60
+ "pretrained_path": "lerobot/xvla-base",
61
  "chunk_size": 32,
62
  "n_action_steps": 8,
63
  "dtype": "bfloat16",
 
208
  224,
209
  224
210
  ],
211
+ "num_image_views": 4,
212
  "empty_cameras": 1,
213
  "freeze_vision_encoder": true,
214
  "freeze_language_encoder": true,
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:5f83abe2909f4110de3c14a852948af9a2c3d32813af00e6cb28c91e0ed9c6d4
3
  size 1759596986
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a2142145a9c4fe618dd35ceaadba6b7e6be3391f2b09379d2d553efe951d718d
3
  size 1759596986
train_config.json CHANGED
@@ -137,7 +137,7 @@
137
  "private": null,
138
  "tags": null,
139
  "license": null,
140
- "pretrained_path": "/home/david/work/isaaclab-experience/LeIsaac/outputs/xvla-leisaac-pick-orange.weakaug/checkpoints/last/pretrained_model",
141
  "chunk_size": 32,
142
  "n_action_steps": 8,
143
  "dtype": "bfloat16",
@@ -288,7 +288,7 @@
288
  224,
289
  224
290
  ],
291
- "num_image_views": 5,
292
  "empty_cameras": 1,
293
  "freeze_vision_encoder": true,
294
  "freeze_language_encoder": true,
@@ -311,14 +311,14 @@
311
  "reward_model": null,
312
  "output_dir": "/home/david/work/isaaclab-experience/LeIsaac/outputs/xvla-leisaac-pick-orange.weakaug",
313
  "job_name": "xvla",
314
- "resume": true,
315
  "seed": 1000,
316
  "cudnn_deterministic": false,
317
  "num_workers": 4,
318
  "batch_size": 8,
319
  "prefetch_factor": 4,
320
  "persistent_workers": true,
321
- "steps": 15000,
322
  "eval_freq": 20000,
323
  "log_freq": 200,
324
  "tolerance_s": 0.0001,
@@ -366,5 +366,5 @@
366
  "observation.images.front": "observation.images.image",
367
  "observation.images.wrist": "observation.images.image2"
368
  },
369
- "checkpoint_path": "/home/david/work/isaaclab-experience/LeIsaac/outputs/xvla-leisaac-pick-orange.weakaug/checkpoints/last"
370
  }
 
137
  "private": null,
138
  "tags": null,
139
  "license": null,
140
+ "pretrained_path": "lerobot/xvla-base",
141
  "chunk_size": 32,
142
  "n_action_steps": 8,
143
  "dtype": "bfloat16",
 
288
  224,
289
  224
290
  ],
291
+ "num_image_views": 4,
292
  "empty_cameras": 1,
293
  "freeze_vision_encoder": true,
294
  "freeze_language_encoder": true,
 
311
  "reward_model": null,
312
  "output_dir": "/home/david/work/isaaclab-experience/LeIsaac/outputs/xvla-leisaac-pick-orange.weakaug",
313
  "job_name": "xvla",
314
+ "resume": false,
315
  "seed": 1000,
316
  "cudnn_deterministic": false,
317
  "num_workers": 4,
318
  "batch_size": 8,
319
  "prefetch_factor": 4,
320
  "persistent_workers": true,
321
+ "steps": 10000,
322
  "eval_freq": 20000,
323
  "log_freq": 200,
324
  "tolerance_s": 0.0001,
 
366
  "observation.images.front": "observation.images.image",
367
  "observation.images.wrist": "observation.images.image2"
368
  },
369
+ "checkpoint_path": null
370
  }