---
license: apache-2.0
library_name: lerobot
pipeline_tag: robotics
tags:
  - xvla
  - x-vla
  - lerobot
  - so101
  - leisaac
  - pick-orange
  - isaac-sim
  - rectified-flow
  - florence2
datasets:
  - LightwheelAI/leisaac-pick-orange
language:
  - en
base_model: lerobot/xvla-base
---

# X-VLA-PickOrange

An [X-VLA](https://arxiv.org/abs/2510.10274) (Florence2 + Soft-Prompted Transformer + Rectified-Flow action head, 0.9B params) policy fine-tuned from [X-VLA-base](https://huggingface.co/lerobot/xvla-base) on the [LeIsaac SO-101 PickOrange](https://github.com/LightwheelAI/leisaac) task.

![X-VLA-PickOrange — SO-101 in Isaac Sim](x-vla-pick-orange.jpg)

**🔗 Project repos**

- [vitorcen/isaaclab-experience](https://github.com/vitorcen/isaaclab-experience): Isaac Lab + LeIsaac multi-policy comparison (parent project)
- [vitorcen/LeIsaac-Training](https://github.com/vitorcen/LeIsaac-Training): LeIsaac fork (training scripts + design docs)

## TL;DR

- **Task**: `Pick up the orange and put it in the plate`. A single-arm SO-101 picks 3 oranges sequentially and places each in a plate.
- **Dataset**: [`LightwheelAI/leisaac-pick-orange`](https://huggingface.co/datasets/LightwheelAI/leisaac-pick-orange), 60 teleoperated demonstration episodes (50 train / 10 val split).
- **Architecture**: X-VLA, i.e. Florence2 vision-language encoder + Soft-Prompted Transformer + Rectified-Flow action head (10 denoising steps). `chunk_size=32`, `n_obs_steps=2`.
- **Training**: batch=8 / lr=1e-4 / **10k steps** / **weak image-aug (brightness ±5% only)** / GRIPPER_SCALE=5 / ~18 min on RTX 4090.
- **Eval** (benchmark-aligned: 3 rounds × 120s sim × 180s wall_cap, same conditions as the other leaderboard baselines): **4/9 oranges (44%)**, including **ep2 = [T, T, T] 3/3** ⭐.
- **⚠️ Critical inference setting**: `n_action_steps=32` (reuse the entire chunk, i.e. equal to `chunk_size`).
  With the default `n_action_steps=8`, this checkpoint fails catastrophically in the 6-round eval: **0/18**, because the frequent replans conflict with each other. See the "Critical inference caveat" section below.

## Highlights

- **Under the benchmark setting (3 rounds × 120s sim × 180s wall_cap, matching the leaderboard protocol), ep2 placed all 3 oranges (3/3).** None of the other baselines (ACT, DP, X-VLA-15k) achieved a single 3/3 episode under the same conditions.
- **Exposes the key role of `n_action_steps`**: switching from the default 8 to 32 (= `chunk_size`, full chunk reuse) was the only reliable improvement found in the session, roughly 3.5× over baseline.
- **Weak image-aug was the only retrain with a net-positive aggregate**: out of 6 retrain experiments (velocity-reweight, L1 loss, default image-aug, weak image-aug, body-desc, L1+aug compound), **only weak image-aug was net positive**. lerobot's default ColorJitter + Sharpness + Affine augmentation over-regularizes on the 50-demo dataset (13% per-ep, -11.1 vs baseline). Keeping only brightness ±5% (`max_num_transforms=1`) gained +5.6 points over baseline, and the 10k-step checkpoint reached 44% per-ep on the benchmark.

## Training recipe

```bash
# Single-stage 10k-step fine-tune from lerobot/xvla-base
WEAK_IMAGE_AUG=1 \
BATCH_SIZE=8 \
MAX_STEPS=10000 \
SAVE_FREQ=500 \
OUTPUT_DIR=$LEISAAC/outputs/xvla-leisaac-pick-orange.weakaug \
bash LeIsaac/scripts/finetune/xvla/train.sh
```

Inside [`train.sh`](https://github.com/vitorcen/LeIsaac-Training/blob/main/scripts/finetune/xvla/train.sh), `WEAK_IMAGE_AUG=1` expands to:

```
--dataset.image_transforms.enable=true
--dataset.image_transforms.max_num_transforms=1
--dataset.image_transforms.tfs={"brightness":{"weight":1.0,"type":"ColorJitter","kwargs":{"brightness":[0.95,1.05]}}}
```

In other words, at most 1 transform is sampled per batch, and only brightness ±5% is allowed (contrast / saturation / hue / SharpnessJitter / RandomAffine are all disabled).
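
The JSON value passed to `--dataset.image_transforms.tfs` is easy to mistype or mis-quote by hand. As a minimal sketch, the three flags shown above can be generated from a plain dictionary. The helper name `weak_aug_flags` is hypothetical and not part of `train.sh` or lerobot; only the flag names and values come from the expansion above.

```python
import json
import shlex

# Transform spec copied from the expansion above; no other keys are assumed.
WEAK_AUG = {
    "brightness": {
        "weight": 1.0,
        "type": "ColorJitter",
        "kwargs": {"brightness": [0.95, 1.05]},
    }
}


def weak_aug_flags(tfs: dict) -> list[str]:
    """Hypothetical helper: build the flags that WEAK_IMAGE_AUG=1 expands to."""
    # Compact separators reproduce the whitespace-free JSON shown above.
    tfs_json = json.dumps(tfs, separators=(",", ":"))
    return [
        "--dataset.image_transforms.enable=true",
        "--dataset.image_transforms.max_num_transforms=1",
        f"--dataset.image_transforms.tfs={tfs_json}",
    ]


flags = weak_aug_flags(WEAK_AUG)
# shlex.quote protects the braces and double quotes when pasted into a shell.
print(" ".join(shlex.quote(flag) for flag in flags))
```

Generating the value with `json.dumps` keeps the brightness range in one place, so changing it to another interval does not require editing the quoted string by hand.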

For a detailed comparison, see the full retrain aggregate table in the Evaluation section.

## Inference

### End-to-end server (compatible with the Isaac Sim ZMQ client)

```bash
# Start the X-VLA inference server (ZMQ REQ/REP + msgpack)
N_ACTION_STEPS=32 \
PROMPT="Pick up the orange and put it in the plate" \
CKPT=<this_repo_dir> \
PORT=5558 \
bash server/serve_xvla.sh --detach

# Run the PickOrange eval from the Isaac Sim client
POLICY_PORT=5558 \
POLICY_TIMEOUT_MS=3000 \
ACTION_HORIZON=1 \
EVAL_ROUNDS=3 \
EPISODE_LENGTH=120 \
PROMPT="Pick up the orange and put it in the plate" \
MAX_ROUND_WALL_S=180 \
bash server/eval_pi05.sh
```

[Server implementation](https://github.com/vitorcen/isaaclab-experience/blob/main/server/xvla_leisaac/server.py). The eval script is shared with π0.5/SmolVLA/ACT/GR00T.

### 🔴 Critical inference caveat

| `n_action_steps` | 6-round oranges | per-ep | Notes |
|---|---|---|---|
| 8 (lerobot default) | **0/18** ❌ | 0% | Replans at every step; chunk[0]→chunk[0]→... fight each other |
| 16 | 4/18 | 22% | Partial chunk reuse |
| **32 (= chunk_size)** | **6/18, incl. one 3/3 perfect episode** ⭐ | **33%** | Full chunk reuse; each chunk is self-consistent |

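**X-VLA's RF action head generates a 32-step chunk in one shot, and the chunk must be fully rolled out in the env** for its planning value to show. Replanning at every step instead misaligns the chunk sequence.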
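
To make the difference between the three settings concrete, here is a toy sketch of an action queue. It assumes the common semantics in which the first `n_action_steps` actions of each chunk are executed before a new chunk is requested; that is an assumption for illustration, not a description of the server code. `fake_policy` and `rollout` are hypothetical names, and the "actions" are just `(call index, offset)` tags.

```python
from collections import deque

CHUNK_SIZE = 32  # actions produced per policy call (chunk_size in this card)


def fake_policy(call_idx: int) -> list[tuple[int, int]]:
    """Stand-in for the RF action head: one chunk of (call index, offset) tags."""
    return [(call_idx, offset) for offset in range(CHUNK_SIZE)]


def rollout(n_action_steps: int, total_steps: int = 64):
    """Execute n_action_steps actions from each chunk, then request a new chunk."""
    queue: deque = deque()
    executed, calls = [], 0
    for _ in range(total_steps):
        if not queue:
            chunk = fake_policy(calls)
            calls += 1
            # Assumed semantics: the tail of the chunk is discarded.
            queue.extend(chunk[:n_action_steps])
        executed.append(queue.popleft())
    return executed, calls


for n in (8, 16, 32):
    executed, calls = rollout(n)
    deepest = max(offset for _, offset in executed)
    print(f"n_action_steps={n:2d}: policy calls={calls}, deepest chunk offset used={deepest}")
```

Under these assumed semantics, smaller `n_action_steps` values never execute the later part of any chunk, so the trajectory is stitched together from the heads of many independently sampled chunks.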

## Evaluation

### Benchmark-aligned (3 rounds × 120s sim × 180s wall_cap), same conditions as the leaderboard

| Episode | oranges placed | wall time | Notes |
|---|---|---|---|
| 1 | 1/3 | 180.1s | wall_cap |
| 2 | **3/3** ✅ | **180.0s** | **3/3 perfect** ⭐ |
| 3 | 0/3 | 180.1s | wall_cap |
| **Total** | **4/9 (44%)** | — | 0/3 strict (the env never reported done; ep2 only placed its 3 oranges correctly) |

### Extended 6-round eval (60s sim × 90s wall_cap)

| Episode | oranges placed | wall time |
|---|---|---|
| 1 | 1/3 | 90.0s |
| 2 | **3/3** ✅ | 90.0s |
| 3 | 0/3 | 90.0s |
| 4 | 1/3 | 90.1s |
| 5 | 0/3 | 90.0s |
| 6 | 1/3 | 90.1s |
| **Total** | **6/18 (33%)** | — |
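
Both totals (4/9 and 6/18) rest on small counts. One standard way to gauge the uncertainty is a Wilson score interval, sketched below. This is not the method behind the figures quoted elsewhere in this card, and it treats every orange as an independent trial, which is an assumption: oranges within one episode are likely correlated. `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


for label, successes, trials in [("benchmark 3-round", 4, 9), ("extended 6-round", 6, 18)]:
    low, high = wilson_interval(successes, trials)
    print(f"{label}: {successes}/{trials} = {successes / trials:.0%}, "
          f"Wilson interval [{low:.0%}, {high:.0%}]")
```

The interval narrows as the number of trials grows, which is the practical argument for running more rounds before comparing checkpoints.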

### Full retrain experiment aggregate table

| Retrain config (5 ckpts × 6-round = 90 ep) | per-ep aggregate | vs baseline |
|---|---|---|
| 🥇 **Weak image-aug (brightness ±5%)** | **30.0%** | **+5.6** ⭐ |
| L1 loss (OFT-lite, [Fine-Tuning VLA 2502.19645](https://arxiv.org/abs/2502.19645)) | 27.8% | +3.4 |
| Baseline (no retrain) | 24.4% | — |
| L1 + weak aug compound | 15.6% | -8.8 (negative interference) |
| Default image-aug (lerobot default strength) | 13.3% | -11.1 |
| Velocity-reweight β=2.0 ([AttenA+ 2605.13548](https://arxiv.org/abs/2605.13548)) | ~11% | -13 |

For details, see the parent project's HTML design doc [`vla_improvement_methods_checklist.html`](https://github.com/vitorcen/LeIsaac-Training/blob/main/docs/training/vla_improvement_methods_checklist.html) (includes CSVs for 90+ hyperparameter sweeps).
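
The "vs baseline" column is the difference in percentage points against the 24.4% baseline. The short check below recomputes it from the table. The "implied successes" figure is an inference that assumes each rate is a count out of the 90 ep stated in the table header; it is not a number reported in the source.

```python
BASELINE_PCT = 24.4  # "Baseline (no retrain)" row
TOTAL_EP = 90        # 5 ckpts x 6-round, from the table header

# Exact per-ep aggregates copied from the table (the "~11%" row is omitted
# because it is only given approximately).
rates_pct = {
    "Weak image-aug (brightness ±5%)": 30.0,
    "L1 loss (OFT-lite)": 27.8,
    "L1 + weak aug compound": 15.6,
    "Default image-aug": 13.3,
}

deltas = {name: round(rate - BASELINE_PCT, 1) for name, rate in rates_pct.items()}
implied = {name: round(rate / 100 * TOTAL_EP) for name, rate in rates_pct.items()}

for name, rate in rates_pct.items():
    print(f"{name:34s} {rate:5.1f}%  {deltas[name]:+5.1f}  (~{implied[name]}/{TOTAL_EP})")
```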

## Negative findings: do not repeat

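Rigorously falsified across 90+ experiments (≥36 ep cumulative):

- **TAE (Temporal Action Ensembling, [ALOHA 2304.13705](https://arxiv.org/abs/2304.13705))**: K∈{2,4,8} × m∈{0.1,0.3} all scored ≤1/9. X-VLA's RF + 10-step denoising is already smooth on its own.
- **EMA action smoothing, α∈[0.2, 0.7]**: the 3-round result for α=0.3 (5/9) was a single-episode outlier; the 12-round retest scored 2/18, so it is actually harmful.
- **"Grasp" verb in the prompt**: 0/18, completely dead. Possibly because "grasp" in the OXE datasets is associated with hand pose rather than a robot reach trajectory.
- **`all <plural>` prompts**: 3/18; they trigger multi-target ambiguity.
- **Short prompts without the "Pick up" preamble**: 1/18; the model cannot ground the task.
- **"on/onto the plate" prepositions**: ≤2/18, far worse than "in the plate" (container semantics).
- **Body-desc retrain (Path 2)**: with Florence2 frozen, a longer prompt is only a small token perturbation and does not change the action conditioning.
- **Offline action-MSE eval**: does not predict closed-loop performance (falsified multiple times). **Only real Isaac Sim runs count.**
- **3-round closed-loop eval**: variance of ±15-30%. **Every decision needs ≥6 rounds (≥18 ep); every comparison needs ≥12 rounds (≥36 ep).**

## Limitations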

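- **Small sample size**: the 44% per-ep figure is an estimate from the 3-round benchmark (9 ep), with a wide confidence interval of ±20%. The extended 6-round eval gives 33% (18 ep, CI ±15%).
- **Only 50 demos in the dataset**: retrains that change the loss or the augmentation are generally too aggressive. Growing the dataset to 80-100 demos should break through the current ~44% per-ep ceiling.
- **The place subtask is multimodal**: the model occasionally hovers and jitters after grasping. DAgger or synthetic relabeling may be needed to fix the covariate shift.
- **chunk_size=32 vs. wall clock**: 1 chunk = 32 steps × 33 ms ≈ 1 s planning period. This is more reactive than ACT (chunk=100, 3.3 s period) but slower than DP DDIM-32 (200 ms period).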
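
The planning-period comparison in the last bullet is simple arithmetic: steps per plan times the control step. The sketch below reproduces it. The 33 ms step and the chunk sizes come from the bullet; the 200 ms DP figure is taken as stated rather than derived, and `planning_period_s` is a hypothetical helper name.

```python
CONTROL_DT_S = 0.033  # 33 ms per env step, as stated above
DP_PERIOD_S = 0.200   # DP DDIM-32 period, taken directly from the text


def planning_period_s(steps_per_plan: int, dt_s: float = CONTROL_DT_S) -> float:
    """Time between two policy calls when a whole chunk is executed open-loop."""
    return steps_per_plan * dt_s


xvla_period = planning_period_s(32)   # X-VLA, chunk_size=32
act_period = planning_period_s(100)   # ACT, chunk=100

print(f"X-VLA: {xvla_period:.2f} s, ACT: {act_period:.2f} s, DP: {DP_PERIOD_S:.2f} s")
```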

## Citations

- **X-VLA**: Zhao et al., [_X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model_](https://arxiv.org/abs/2510.10274), 2025.
- **OFT recipe (L1 loss baseline)**: [_Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success_](https://arxiv.org/abs/2502.19645), 2025.
- **LeIsaac SO-101 PickOrange**: [LightwheelAI/leisaac](https://github.com/LightwheelAI/leisaac).
- **lerobot**: [HuggingFace lerobot](https://github.com/huggingface/lerobot).

## License

Apache-2.0, consistent with lerobot / X-VLA-base.