| license: apache-2.0 | |
| library_name: lerobot | |
| pipeline_tag: robotics | |
| tags: | |
| - xvla | |
| - x-vla | |
| - lerobot | |
| - so101 | |
| - leisaac | |
| - pick-orange | |
| - isaac-sim | |
| - rectified-flow | |
| - florence2 | |
| datasets: | |
| - LightwheelAI/leisaac-pick-orange | |
| language: | |
| - en | |
| base_model: lerobot/xvla-base | |
| # X-VLA-PickOrange | |
| An [X-VLA](https://arxiv.org/abs/2510.10274) policy (Florence2 + Soft-Prompted Transformer + Rectified-Flow action head, 0.9B params) fine-tuned from [X-VLA-base](https://huggingface.co/lerobot/xvla-base) on the [LeIsaac SO-101 PickOrange](https://github.com/LightwheelAI/leisaac) task. | |
|  | |
| **🔗 Project repos**: | |
| - [vitorcen/isaaclab-experience](https://github.com/vitorcen/isaaclab-experience): Isaac Lab + LeIsaac multi-policy comparison (parent project) | |
| - [vitorcen/LeIsaac-Training](https://github.com/vitorcen/LeIsaac-Training): LeIsaac fork (training scripts + design docs) | |
| ## TL;DR | |
| - **Task**: `Pick up the orange and put it in the plate`. A single-arm SO-101 picks up 3 oranges one after another and places each in the plate. | |
| - **Dataset**: [`LightwheelAI/leisaac-pick-orange`](https://huggingface.co/datasets/LightwheelAI/leisaac-pick-orange), 60 teleoperated demonstration episodes (50 train / 10 val split). | |
| - **Architecture**: X-VLA, i.e. Florence2 vision-language encoder + Soft-Prompted Transformer + Rectified-Flow action head (10 denoising steps), with chunk_size=32 and n_obs_steps=2. | |
| - **Training**: batch=8 / lr=1e-4 / **10k steps** / **weak image-aug (brightness ±5% only)** / GRIPPER_SCALE=5 / ~18 min on RTX 4090. | |
| - **Eval** (benchmark-aligned: 3 rounds × 120 s sim × 180 s wall_cap, the same conditions as the other leaderboard baselines): **4/9 oranges (44%)**, with **ep2 = [T, T, T], 3/3** ⭐. | |
| - **⚠️ Critical inference setting**: `n_action_steps=32` (the full chunk is reused, i.e. equal to chunk_size). | |
| With the default `n_action_steps=8`, this checkpoint failed catastrophically in the 6-round eval (**0/18**), because the frequent replans conflict with one another. See the critical inference caveat in the Inference section below. | |
| ## Highlights | |
| - **Under the benchmark setting (3 rounds × 120 s sim × 180 s wall_cap, matching the leaderboard protocol), ep2 placed all 3 oranges (3/3)**. None of the other baselines (ACT, DP, X-VLA-15k) achieved a 3/3 episode under the same conditions. | |
| - **Exposes the key role of `n_action_steps`**: switching from the default 8 to 32 (= chunk_size, full chunk reuse) was the only reliable improvement found in this session, roughly 3.5× over baseline. | |
| - **Weak image-aug was the only retrain judged net positive in aggregate**: of 6 retrain experiments (velocity-reweight, L1 loss, default image-aug, weak image-aug, body-desc, L1+aug compound), only weak image-aug was counted as a real win. lerobot's default ColorJitter + Sharpness + Affine over-regularizes on the 50-demo dataset (13% per-ep, -11.1 points vs baseline). Keeping only brightness ±5% (max_num_transforms=1) beat the baseline by +5.6 points, and the 10k-step checkpoint reached 44% per-ep on the benchmark. | |
| ## Training recipe | |
| ```bash | |
| # Single-stage fine-tune: 10k steps from lerobot/xvla-base | |
| WEAK_IMAGE_AUG=1 \ | |
| BATCH_SIZE=8 \ | |
| MAX_STEPS=10000 \ | |
| SAVE_FREQ=500 \ | |
| OUTPUT_DIR=$LEISAAC/outputs/xvla-leisaac-pick-orange.weakaug \ | |
| bash LeIsaac/scripts/finetune/xvla/train.sh | |
| ``` | |
| Inside [`train.sh`](https://github.com/vitorcen/LeIsaac-Training/blob/main/scripts/finetune/xvla/train.sh), `WEAK_IMAGE_AUG=1` expands to: | |
| ``` | |
| --dataset.image_transforms.enable=true | |
| --dataset.image_transforms.max_num_transforms=1 | |
| --dataset.image_transforms.tfs={"brightness":{"weight":1.0,"type":"ColorJitter","kwargs":{"brightness":[0.95,1.05]}}} | |
| ``` | |
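| In other words, at most one transform is sampled each time the augmentation is applied, and only brightness ±5% is allowed (contrast / saturation / hue / SharpnessJitter / RandomAffine are all disabled). | |
| For a detailed comparison, see the full retrain aggregate table in the evaluation section. | |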
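To make the ±5% range concrete, here is a minimal pure-Python sketch of a brightness-only jitter. It assumes that brightness jitter amounts to multiplying every pixel by one random factor and clamping to the valid range. The function name `weak_brightness_jitter` and the toy image are hypothetical and are not part of `train.sh` or lerobot.

```python
import random


def weak_brightness_jitter(pixels, low=0.95, high=1.05, rng=random):
    """Scale all pixels by one random factor in [low, high], clamped to [0, 1].

    `pixels` is a flat list of floats in [0, 1] (a stand-in for an image tensor).
    Returns the jittered pixels together with the sampled factor.
    """
    factor = rng.uniform(low, high)
    jittered = [min(1.0, max(0.0, p * factor)) for p in pixels]
    return jittered, factor


rng = random.Random(0)            # fixed seed so the sketch is repeatable
image = [0.0, 0.25, 0.5, 0.98]    # hypothetical flattened image
jittered, factor = weak_brightness_jitter(image, rng=rng)
print(f"factor={factor:.3f}", [round(p, 3) for p in jittered])
```

With a factor this close to 1.0 the image is barely changed, which is the point of the weak setting: the policy sees slight lighting variation without the stronger color and geometric distortions of the default pipeline.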
| ## Inference | |
| ### End-to-end server (compatible with the Isaac Sim ZMQ client) | |
| ```bash | |
| # Start the X-VLA inference server (ZMQ REQ/REP + msgpack) | |
| N_ACTION_STEPS=32 \ | |
| PROMPT="Pick up the orange and put it in the plate" \ | |
| CKPT=<this_repo_dir> \ | |
| PORT=5558 \ | |
| bash server/serve_xvla.sh --detach | |
| # Run the PickOrange eval from the Isaac Sim client | |
| POLICY_PORT=5558 \ | |
| POLICY_TIMEOUT_MS=3000 \ | |
| ACTION_HORIZON=1 \ | |
| EVAL_ROUNDS=3 \ | |
| EPISODE_LENGTH=120 \ | |
| PROMPT="Pick up the orange and put it in the plate" \ | |
| MAX_ROUND_WALL_S=180 \ | |
| bash server/eval_pi05.sh | |
| ``` | |
| [Server implementation](https://github.com/vitorcen/isaaclab-experience/blob/main/server/xvla_leisaac/server.py). The eval script is shared with π0.5/SmolVLA/ACT/GR00T. | |
| ### 🔴 Critical inference caveat | |
| | `n_action_steps` | 6-round oranges | per-ep | Notes | | |
| |---|---|---|---| | |
| | 8 (lerobot default) | **0/18** ❌ | 0% | Replans every step; chunk[0]→chunk[0]→... fight each other | | |
| | 16 | 4/18 | 22% | Partial chunk reuse | | |
| | **32 (= chunk_size)** | **6/18 + 3/3 perfect** ⭐ | **33%** | Full chunk reuse; each chunk is self-consistent | | |
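| **X-VLA's RF action head generates a 32-step chunk in one shot, so the whole chunk has to be played out in the env** for its planning to pay off. Replanning at every step instead misaligns the chunk sequence. | |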
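The trade-off behind `n_action_steps` can be sketched with a generic action-queue model. This is only an illustration under a stated assumption: the policy keeps a queue holding the first `n_action_steps` actions of each predicted chunk and replans when the queue runs dry. The function `count_policy_calls` and the 960-step episode length are hypothetical, and the actual lerobot and server behavior may differ.

```python
from collections import deque


def count_policy_calls(episode_steps, chunk_size=32, n_action_steps=32):
    """Count replans when only the first n_action_steps of each chunk are executed."""
    if not 1 <= n_action_steps <= chunk_size:
        raise ValueError("n_action_steps must be in [1, chunk_size]")
    queue = deque()
    calls = 0
    for _ in range(episode_steps):
        if not queue:
            chunk = list(range(chunk_size))       # stand-in for one predicted chunk
            queue.extend(chunk[:n_action_steps])  # the rest of the chunk is discarded
            calls += 1
        queue.popleft()                           # execute one action in the env
    return calls


for n in (8, 16, 32):
    calls = count_policy_calls(episode_steps=960, n_action_steps=n)
    discarded = calls * (32 - n)
    print(f"n_action_steps={n:2d}: {calls} policy calls, {discarded} predicted actions discarded")
```

Under this model, smaller values of `n_action_steps` mean more replans and more of each chunk thrown away, so consecutive executed segments come from different plans; with `n_action_steps=32` every predicted action is executed.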
| ## Evaluation | |
| ### Benchmark-aligned (3 rounds × 120 s sim × 180 s wall_cap), same conditions as the leaderboard | |
| | Episode | oranges placed | wall time | Notes | | |
| |---|---|---|---| | |
| | 1 | 1/3 | 180.1s | wall_cap | | |
| | 2 | **3/3** ✅ | **180.0s** | **3/3 perfect** ⭐ | | |
| | 3 | 0/3 | 180.1s | wall_cap | | |
| | **Total** | **4/9 (44%)** | — | 0/3 strict (the env never reported done; the 3 oranges were only placed correctly) | | |
| ### Extended 6-round eval (60 s sim × 90 s wall_cap) | |
| | Episode | oranges placed | wall time | | |
| |---|---|---| | |
| | 1 | 1/3 | 90.0s | | |
| | 2 | **3/3** ✅ | 90.0s | | |
| | 3 | 0/3 | 90.0s | | |
| | 4 | 1/3 | 90.1s | | |
| | 5 | 0/3 | 90.0s | | |
| | 6 | 1/3 | 90.1s | | |
| | **Total** | **6/18 (33%)** | — | | |
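The totals in both tables follow directly from the per-episode counts. The short sketch below recomputes them, with the lists copied from the tables above; the helper name `per_orange_rate` is hypothetical.

```python
def per_orange_rate(placed_per_episode, oranges_per_episode=3):
    """Return (placed, total, rate) over all episodes of one eval run."""
    placed = sum(placed_per_episode)
    total = oranges_per_episode * len(placed_per_episode)
    return placed, total, placed / total


runs = {
    "benchmark-aligned (3 rounds)": [1, 3, 0],
    "extended (6 rounds)": [1, 3, 0, 1, 0, 1],
}
for name, results in runs.items():
    placed, total, rate = per_orange_rate(results)
    print(f"{name}: {placed}/{total} = {rate:.0%}")
```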
| ### Full retrain experiment aggregate table | |
| | Retrain config (5 ckpts × 6-round = 90 ep) | per-ep aggregate | vs baseline | | |
| |---|---|---| | |
| | 🥇 **Weak image-aug (brightness ±5%)** | **30.0%** | **+5.6** ⭐ | | |
| | L1 loss (OFT-lite, [Fine-Tuning VLA 2502.19645](https://arxiv.org/abs/2502.19645)) | 27.8% | +3.4 | | |
| | Baseline (no retrain) | 24.4% | — | | |
| | L1 + weak aug compound | 15.6% | -8.8 (negative interference) | | |
| | Default image-aug (lerobot default strength) | 13.3% | -11.1 | | |
| | Velocity-reweight β=2.0 ([AttenA+ 2605.13548](https://arxiv.org/abs/2605.13548)) | ~11% | -13 | | |
| For details, see the parent project's HTML design doc [`vla_improvement_methods_checklist.html`](https://github.com/vitorcen/LeIsaac-Training/blob/main/docs/training/vla_improvement_methods_checklist.html) (includes 90+ hyperparameter sweep CSVs). | |
| ## Negative findings: methods not worth retrying | |
| The following were rigorously refuted across 90+ experiments (≥36 ep cumulative): | |
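| - ❌ **TAE (Temporal Action Ensembling, [ALOHA 2304.13705](https://arxiv.org/abs/2304.13705))**: every combination of K∈{2,4,8} × m∈{0.1,0.3} scored ≤1/9. X-VLA's RF head with 10-step denoising is already smooth on its own. | |
| - ❌ **EMA action smoothing α∈[0.2, 0.7]**: the 5/9 at α=0.3 in the 3-round eval was a single-episode outlier; the 12-round retest scored 2/18, so it is actually harmful. | |
| - ❌ **"Grasp" verb in prompt**: 0/18, a complete failure. In the OXE datasets, "grasp" may be associated with hand pose rather than a robot reach trajectory. | |
| - ❌ **"all <plural>" prompts**: 3/18; they trigger multi-target ambiguity. | |
| - ❌ **Short prompts without the "Pick up" preamble**: 1/18; the model cannot ground them. | |
| - ❌ **"on/onto the plate" prepositions**: ≤2/18, far worse than "in the plate" (container semantics). | |
| - ❌ **Body-desc retrain (Path 2)**: with Florence2 frozen, a long prompt is only a small token perturbation and does not change the action conditioning. | |
| - ❌ **Offline action-MSE eval**: it does not predict closed-loop performance (refuted multiple times). **Only real Isaac Sim rollouts are informative.** | |
| - ❌ **3-round closed-loop eval**: variance is ±15-30%. **Every decision needs ≥6 rounds (≥18 ep), and every comparison needs ≥12 rounds (≥36 ep).** | |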
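| ## Limitations | |
| - **Small sample size**: the 44% per-ep figure is an estimate from the 3-round benchmark (9 ep), with a wide confidence interval of about ±20%. The extended 6-round eval gives 33% (18 ep, CI ±15%). | |
| - **Only 50 demos in the dataset**: retrains that change the loss or augmentation tend to be too aggressive. Expanding to 80-100 demos is expected to break through the current ~44% per-ep ceiling. | |
| - **Multimodal place subtask**: the model occasionally hovers and jitters after grasping. DAgger or synthetic relabeling may be needed to fix the covariate shift. | |
| - **chunk_size=32 vs wall clock**: 1 chunk = 32 steps × 33 ms ≈ 1 s planning period. This is more flexible than ACT (chunk=100, 3.3 s period) but slower than DP DDIM-32 (200 ms period). | |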
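To get a feel for how noisy these small-sample rates are, the sketch below computes a normal-approximation standard error for each eval. It assumes every orange is an independent Bernoulli trial, which is optimistic because oranges within one episode are correlated. It is not necessarily how the intervals quoted in the list above were derived, and the helper name `binomial_standard_error` is hypothetical.

```python
import math


def binomial_standard_error(successes, trials):
    """Return (rate, standard error) under an independent-trials assumption."""
    p = successes / trials
    return p, math.sqrt(p * (1.0 - p) / trials)


evals = (("3-round benchmark", 4, 9), ("6-round extended", 6, 18))
for label, successes, trials in evals:
    p, se = binomial_standard_error(successes, trials)
    print(f"{label}: rate={p:.1%}, one standard error = ±{se:.1%}")
```

Even one standard error is a double-digit percentage here, which is consistent with the rule above that decisions should rest on at least 6 rounds and comparisons on at least 12.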
| ## Citations | |
| - **X-VLA**: Zhao et al., [_X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model_](https://arxiv.org/abs/2510.10274), 2025. | |
| - **OFT recipe (L1 loss baseline)**: [_Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success_](https://arxiv.org/abs/2502.19645), 2025. | |
| - **LeIsaac SO-101 PickOrange**: [LightwheelAI/leisaac](https://github.com/LightwheelAI/leisaac). | |
| - **lerobot**: [HuggingFace lerobot](https://github.com/huggingface/lerobot). | |
| ## License | |
| Apache-2.0, consistent with lerobot / X-VLA-base. | |