	---
	license: apache-2.0
	library_name: lerobot
	pipeline_tag: robotics
	tags:
	- xvla
	- x-vla
	- lerobot
	- so101
	- leisaac
	- pick-orange
	- isaac-sim
	- rectified-flow
	- florence2
	datasets:
	- LightwheelAI/leisaac-pick-orange
	language:
	- en
	base_model: lerobot/xvla-base
	---

	# X-VLA-PickOrange

	An [X-VLA](https://arxiv.org/abs/2510.10274) policy (Florence2 + Soft-Prompted Transformer + Rectified-Flow action head, 0.9B params) fine-tuned from [X-VLA-base](https://huggingface.co/lerobot/xvla-base) on the [LeIsaac SO-101 PickOrange](https://github.com/LightwheelAI/leisaac) task.

	![X-VLA-PickOrange — SO-101 in Isaac Sim](x-vla-pick-orange.jpg)

	🔗 Project repos:
	- [vitorcen/isaaclab-experience](https://github.com/vitorcen/isaaclab-experience): Isaac Lab + LeIsaac multi-policy comparison (parent project)
	- [vitorcen/LeIsaac-Training](https://github.com/vitorcen/LeIsaac-Training): LeIsaac fork (training scripts + design docs)

	## TL;DR

	- Task: `Pick up the orange and put it in the plate`. A single-arm SO-101 picks up 3 oranges one after another and places each in a plate.
	- Dataset: [`LightwheelAI/leisaac-pick-orange`](https://huggingface.co/datasets/LightwheelAI/leisaac-pick-orange), 60 teleoperated demonstration episodes (50 train / 10 val split).
	- Architecture: X-VLA, i.e. Florence2 vision-language encoder + Soft-Prompted Transformer + Rectified-Flow action head (10 denoising steps). chunk_size=32, n_obs_steps=2.
	- Training: batch=8 / lr=1e-4 / 10k steps / weak image-aug (brightness ±5% only) / GRIPPER_SCALE=5 / ~18 min on RTX 4090.
	- Eval (benchmark-aligned: 3 rounds × 120s sim × 180s wall_cap, the same conditions as the other leaderboard baselines): 4/9 oranges (44%), ep2 = [T, T, T] 3/3 ⭐.
	- ⚠️ Critical inference setting: `n_action_steps=32` (reuse the whole chunk, since chunk_size=32). With the default `n_action_steps=8`, this checkpoint fails catastrophically, scoring 0/18 over 6 rounds (replanning at every step produces conflicting actions). See the "Critical inference caveat" section below.

	## Highlights

	- Under the benchmark setting (3 rounds × 120s sim × 180s wall_cap, matching the leaderboard protocol), ep2 placed all 3 oranges (3/3). None of the other baselines (ACT, DP, X-VLA-15k) achieved a single 3/3 episode under the same conditions.
	- Exposes the key role of `n_action_steps`: switching from the default 8 to 32 (= chunk_size, full chunk reuse) was the only reliable improvement found in the session, roughly 3.5× over baseline.
	- Weak image-aug was the only retrain that clearly beat the baseline in aggregate. Out of 6 retrain experiments (velocity-reweight, L1 loss, default image-aug, weak image-aug, body-desc, L1+aug compound), only weak image-aug came out clearly net positive. The lerobot default (ColorJitter + SharpnessJitter + RandomAffine) over-regularizes on the 50-demo dataset (13% per-ep, -11.1 points vs baseline). Keeping only brightness ±5% (max_num_transforms=1) instead beat the baseline by +5.6 points, and the 10k-step checkpoint reached 44% per-ep on the benchmark.

	## Training recipe

	```bash
	# Single-stage run: 10k steps from lerobot/xvla-base
	WEAK_IMAGE_AUG=1 \
	BATCH_SIZE=8 \
	MAX_STEPS=10000 \
	SAVE_FREQ=500 \
	OUTPUT_DIR=$LEISAAC/outputs/xvla-leisaac-pick-orange.weakaug \
	bash LeIsaac/scripts/finetune/xvla/train.sh
	```

	Inside [`train.sh`](https://github.com/vitorcen/LeIsaac-Training/blob/main/scripts/finetune/xvla/train.sh), `WEAK_IMAGE_AUG=1` expands to:

	```
	--dataset.image_transforms.enable=true
	--dataset.image_transforms.max_num_transforms=1
	--dataset.image_transforms.tfs={"brightness":{"weight":1.0,"type":"ColorJitter","kwargs":{"brightness":[0.95,1.05]}}}
	```

	In other words, at most 1 transform is sampled per batch, and only brightness ±5% is allowed (contrast / saturation / hue / SharpnessJitter / RandomAffine are all disabled).
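
The `tfs` value above is a compact JSON object, which is easy to mistype by hand. The following is a minimal sketch of generating the three flags programmatically. The helper name `weak_aug_flags` and the constant `WEAK_TFS` are hypothetical and are not part of `train.sh` or lerobot; only the flag names and values are taken from the listing above.

```python
import json
import shlex

# Brightness-only transform spec, copied from the flags listed above.
WEAK_TFS = {
    "brightness": {
        "weight": 1.0,
        "type": "ColorJitter",
        "kwargs": {"brightness": [0.95, 1.05]},
    }
}


def weak_aug_flags(tfs=WEAK_TFS, max_num_transforms=1):
    """Return the three --dataset.image_transforms.* flags as strings (hypothetical helper)."""
    # Compact separators reproduce the whitespace-free JSON used in the listing.
    tfs_json = json.dumps(tfs, separators=(",", ":"))
    return [
        "--dataset.image_transforms.enable=true",
        f"--dataset.image_transforms.max_num_transforms={max_num_transforms}",
        f"--dataset.image_transforms.tfs={tfs_json}",
    ]


if __name__ == "__main__":
    # shlex.join quotes the JSON so that a shell passes it through as one argument.
    print(shlex.join(weak_aug_flags()))
```

Building the JSON from a dict keeps the brightness range in one place if a different range is ever swept.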

	For a detailed comparison, see the full retrain experiment aggregate table in the evaluation section.

	## Inference

	### End-to-end server (compatible with the Isaac Sim ZMQ client)

	```bash
	# Start the X-VLA inference server (ZMQ REQ/REP + msgpack)
	N_ACTION_STEPS=32 \
	PROMPT="Pick up the orange and put it in the plate" \
	CKPT=<this_repo_dir> \
	PORT=5558 \
	bash server/serve_xvla.sh --detach

	# Run the PickOrange eval from the Isaac Sim client
	POLICY_PORT=5558 \
	POLICY_TIMEOUT_MS=3000 \
	ACTION_HORIZON=1 \
	EVAL_ROUNDS=3 \
	EPISODE_LENGTH=120 \
	PROMPT="Pick up the orange and put it in the plate" \
	MAX_ROUND_WALL_S=180 \
	bash server/eval_pi05.sh
	```

	See the [server implementation](https://github.com/vitorcen/isaaclab-experience/blob/main/server/xvla_leisaac/server.py). The eval script is shared with π0.5 / SmolVLA / ACT / GR00T.

	### 🔴 Critical inference caveat

	| `n_action_steps` | Oranges placed (6 rounds) | Per-ep | Notes |
	|---|---|---|---|
	| 8 (lerobot default) | 0/18 ❌ | 0% | Replans every step; successive chunk[0] → chunk[0] → ... outputs fight each other |
	| 16 | 4/18 | 22% | Partial chunk reuse |
	| 32 (= chunk_size) | 6/18, including one perfect 3/3 episode ⭐ | 33% | Full chunk reuse; each chunk is self-consistent |

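	X-VLA's RF action head generates a 32-step chunk in one shot, so the chunk has to be fully unrolled in the env to realize its planning value. Replanning at every step instead misaligns the chunk sequence.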
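
To make the role of `n_action_steps` concrete, here is a minimal sketch of a generic action queue, assuming `n_action_steps` simply sets how many actions of each chunk are executed before the policy is queried again. `ChunkedActionQueue`, `plan_fn`, and `count_plans` are hypothetical names for illustration; this is not the lerobot or server code from this repo, which may behave differently.

```python
from collections import deque

CHUNK_SIZE = 32  # actions generated per policy call (chunk_size from this model card)


class ChunkedActionQueue:
    """Executes the first n_action_steps of each chunk, then asks for a new chunk."""

    def __init__(self, plan_fn, n_action_steps):
        if not 1 <= n_action_steps <= CHUNK_SIZE:
            raise ValueError("n_action_steps must be in [1, CHUNK_SIZE]")
        self.plan_fn = plan_fn  # hypothetical: obs -> list of CHUNK_SIZE actions
        self.n_action_steps = n_action_steps
        self.queue = deque()
        self.num_plans = 0

    def select_action(self, obs):
        if not self.queue:
            # Queue drained: replan from the current observation and
            # discard everything past the first n_action_steps actions.
            chunk = self.plan_fn(obs)
            self.queue.extend(chunk[: self.n_action_steps])
            self.num_plans += 1
        return self.queue.popleft()


def count_plans(n_action_steps, env_steps=96):
    """Count policy calls over env_steps using a dummy planner."""
    dummy_plan = lambda obs: [(obs, i) for i in range(CHUNK_SIZE)]
    q = ChunkedActionQueue(dummy_plan, n_action_steps)
    for t in range(env_steps):
        q.select_action(obs=t)
    return q.num_plans
```

Under this assumption, a smaller `n_action_steps` throws away the tail of every chunk and replans more often, whereas `n_action_steps=32` executes each chunk exactly as it was generated.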

	## Evaluation

	### Benchmark-aligned (3 rounds × 120s sim × 180s wall_cap), same conditions as the leaderboard

	| Episode | Oranges placed | Wall time | Notes |
	|---|---|---|---|
	| 1 | 1/3 | 180.1s | wall_cap |
	| 2 | 3/3 ✅ | 180.0s | 3/3 perfect ⭐ |
	| 3 | 0/3 | 180.1s | wall_cap |
	| Total | 4/9 (44%) | n/a | 0/3 strict (the env never reported done; ep2 only placed all 3 oranges correctly) |

	### 6-round extended eval (60s sim × 90s wall_cap)

	| Episode | Oranges placed | Wall time |
	|---|---|---|
	| 1 | 1/3 | 90.0s |
	| 2 | 3/3 ✅ | 90.0s |
	| 3 | 0/3 | 90.0s |
	| 4 | 1/3 | 90.1s |
	| 5 | 0/3 | 90.0s |
	| 6 | 1/3 | 90.1s |
	| Total | 6/18 (33%) | n/a |
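
The totals in the two tables above can be re-derived from the per-episode counts. The sketch below does that; the function names are hypothetical, and the only inputs are the per-episode numbers listed in the tables.

```python
# Oranges placed per episode, copied from the two tables above (3 oranges per episode).
BENCHMARK = [1, 3, 0]          # 3 rounds, 120s sim / 180s wall_cap
EXTENDED = [1, 3, 0, 1, 0, 1]  # 6 rounds, 60s sim / 90s wall_cap
ORANGES_PER_EPISODE = 3


def summarize(placed):
    """Return (oranges placed, oranges attempted, placement rate in whole percent)."""
    total = sum(placed)
    attempted = ORANGES_PER_EPISODE * len(placed)
    return total, attempted, round(100 * total / attempted)


def perfect_episodes(placed):
    """Count episodes in which every orange was placed."""
    return sum(1 for n in placed if n == ORANGES_PER_EPISODE)


if __name__ == "__main__":
    for name, placed in [("benchmark", BENCHMARK), ("extended", EXTENDED)]:
        total, attempted, pct = summarize(placed)
        print(f"{name}: {total}/{attempted} ({pct}%), perfect episodes: {perfect_episodes(placed)}")
```

Note that the percentages are placement rates over individual oranges, not over episodes: the strict per-episode success count is a separate number.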

	### Full retrain experiment aggregate table

	| Retrain config (5 ckpts × 6 rounds = 90 ep) | Per-ep aggregate | vs baseline (points) |
	|---|---|---|
	| 🥇 Weak image-aug (brightness ±5%) | 30.0% | +5.6 ⭐ |
	| L1 loss (OFT-lite, [Fine-Tuning VLA 2502.19645](https://arxiv.org/abs/2502.19645)) | 27.8% | +3.4 |
	| Baseline (no retrain) | 24.4% | n/a |
	| L1 + weak aug compound | 15.6% | -8.8 (negative interference) |
	| Default image-aug (lerobot default strength) | 13.3% | -11.1 |
	| Velocity-reweight β=2.0 ([AttenA+ 2605.13548](https://arxiv.org/abs/2605.13548)) | ~11% | -13 |

	For details, see the parent project's HTML design doc [`vla_improvement_methods_checklist.html`](https://github.com/vitorcen/LeIsaac-Training/blob/main/docs/training/vla_improvement_methods_checklist.html) (includes 90+ hyperparameter-sweep CSVs).
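
The "vs baseline" column of the aggregate table is a difference in percentage points. The sketch below recomputes it and also converts each rate back into a success count, assuming every percentage is a count out of the 90 attempts named in the table header. That assumption is an inference from the header, not something the table states, and the function names are hypothetical.

```python
ATTEMPTS = 90    # total attempts per config, from the table header (assumed denominator)
BASELINE = 24.4  # baseline per-ep aggregate, in percent

# Per-ep aggregates (percent) for the rows that report an exact value.
AGGREGATE = {
    "weak image-aug": 30.0,
    "L1 loss": 27.8,
    "L1 + weak aug": 15.6,
    "default image-aug": 13.3,
}


def delta_points(rate, baseline=BASELINE):
    """Difference vs baseline in percentage points, rounded to one decimal."""
    return round(rate - baseline, 1)


def implied_successes(rate, attempts=ATTEMPTS):
    """Success count implied by a percentage, under the 90-attempt assumption."""
    return round(rate / 100 * attempts)


if __name__ == "__main__":
    for name, rate in AGGREGATE.items():
        print(f"{name}: {delta_points(rate):+.1f} points, ~{implied_successes(rate)}/{ATTEMPTS}")
```

Reading the gaps as counts makes it easier to judge how many individual placements separate two configurations.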

	## Negative findings: do not retry

	Strictly falsified across 90+ experiments (≥36 ep cumulative):

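	- ❌ TAE (Temporal Action Ensembling, [ALOHA 2304.13705](https://arxiv.org/abs/2304.13705)): K∈{2,4,8} × m∈{0.1,0.3} all scored ≤1/9. X-VLA's RF head with 10-step denoising is already smooth on its own.
	- ❌ EMA action smoothing, α∈[0.2, 0.7]: on 3 rounds, α=0.3 scoring 5/9 was a single-episode outlier; the 12-round retest gave 2/18, so it is actually harmful.
	- ❌ "Grasp" verb in the prompt: 0/18, a complete failure. Possibly "grasp" in the OXE datasets is associated with hand poses rather than robot reach trajectories.
	- ❌ "all <plural>" prompts: 3/18; they trigger multi-target ambiguity.
	- ❌ Short prompts without the "Pick up" preamble: 1/18; the model fails to ground them.
	- ❌ "on/onto the plate" prepositions: ≤2/18, far worse than "in the plate" (container semantics).
	- ❌ Body-desc retrain (Path 2): with Florence2 frozen, a longer prompt is only a token-level perturbation and does not change action conditioning.
	- ❌ Offline action-MSE eval: it does not predict closed-loop performance (falsified multiple times). Only real Isaac Sim rollouts are informative.
	- ❌ 3-round closed-loop eval: variance is ±15-30%. Every decision needs ≥6 rounds (≥18 ep), and every comparison needs ≥12 rounds (≥36 ep).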
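
For readers unfamiliar with the two smoothing schemes listed above, here is a minimal sketch of what they compute. The temporal-ensembling weights follow the ALOHA paper's formula w_i = exp(-m·i), with i = 0 for the oldest prediction. Treating K as the number of overlapping predictions that get blended, and treating α as the weight on the newest action, are assumptions made for this sketch; the source does not spell out either convention, and the function names are hypothetical.

```python
import math


def temporal_ensemble(predictions, m=0.1):
    """Blend several predictions for the same timestep.

    predictions: action vectors for the current timestep, ordered oldest chunk first
                 (len(predictions) plays the role of K under the assumption above).
    m: decay rate; weights are w_i = exp(-m * i), with i = 0 for the oldest prediction.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    total = sum(weights)
    dim = len(predictions[0])
    return [
        sum(w * pred[d] for w, pred in zip(weights, predictions)) / total
        for d in range(dim)
    ]


def ema_smooth(prev_action, new_action, alpha):
    """EMA smoothing; alpha is the weight on the new action (an assumed convention)."""
    return [alpha * n + (1 - alpha) * p for p, n in zip(prev_action, new_action)]
```

Both schemes average across separately generated plans, which is one plausible reason they hurt a policy whose chunks are only self-consistent when executed whole.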

	## Limitations

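	- Small sample size: the 44% per-ep figure is an estimate from the 3-round benchmark (9 ep), with a wide confidence interval of about ±20%. The 6-round extended eval gives 33% (18 ep, CI about ±15%).
	- The dataset has only 50 training demos: retrains that change the loss or the augmentation tend to be too aggressive. Expanding to 80-100 demos would likely break through the current ~44% per-ep ceiling.
	- The place subtask is multimodal: after grasping, the model occasionally hovers and jitters in mid-air. DAgger or synthetic relabeling may be needed to fix the covariate shift.
	- chunk_size=32 vs wall clock: 1 chunk = 32 steps × 33ms ≈ 1s planning period. That is more responsive than ACT (chunk=100, 3.3s period) but slower than DP DDIM-32 (200ms period).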
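
The ± figures quoted for the sample-size limitation are rough. One way to get a more careful interval is the Wilson score interval, sketched below under the assumption that each orange is an independent Bernoulli trial. That assumption is optimistic, since the three oranges within an episode are likely correlated, so the true uncertainty is probably larger and the result may differ from the rough figures above. The function name is hypothetical.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Counts taken from the benchmark (4/9) and extended (6/18) evals.
    for label, k, n in [("benchmark", 4, 9), ("extended", 6, 18)]:
        lo, hi = wilson_interval(k, n)
        print(f"{label}: {k}/{n} -> [{lo:.2f}, {hi:.2f}]")
```

The Wilson interval is used here instead of the plain normal approximation because it stays inside [0, 1] and behaves better at sample sizes this small.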

	## Citations

	- X-VLA: Zheng et al., [_X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model_](https://arxiv.org/abs/2510.10274), 2025.
	- OFT recipe (L1 loss baseline): [_Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success_](https://arxiv.org/abs/2502.19645), 2025.
	- LeIsaac SO-101 PickOrange: [LightwheelAI/leisaac](https://github.com/LightwheelAI/leisaac).
	- lerobot: [HuggingFace lerobot](https://github.com/huggingface/lerobot).

	## License

	Apache-2.0, consistent with lerobot / X-VLA-base.