Add files using upload-large-folder tool

Files changed (4)

README.md +31 -31
config.json +2 -2
model.safetensors +1 -1
train_config.json +5 -5

README.md CHANGED

@@ -36,41 +36,32 @@ _An [X-VLA](https://arxiv.org/abs/2510.10274) (Florence2 + Soft-Prompted Transfo
   _Single-arm SO-101 picks 3 oranges sequentially and places each in a plate._
 - **Dataset**: [`LightwheelAI/leisaac-pick-orange`](https://huggingface.co/datasets/LightwheelAI/leisaac-pick-orange), 60 teleoperated demonstration episodes (50 train / 10 val split).
 - **Architecture**: X-VLA, i.e. Florence2 vision-language encoder + Soft-Prompted Transformer + Rectified-Flow action head (10 denoising steps). chunk_size=32, n_obs_steps=2.
-- **Training**: batch=8 / lr=1e-4 / **15k steps** (10k from base + 5k resumed) / **weak image-aug (brightness ±5% only)** / GRIPPER_SCALE=5 / ~26 min on RTX 4090.
-- **Eval**: Isaac Sim 5.1 + LeIsaac, 6 rounds × 60s. **7/18 oranges placed + 1/6 formal env success**; best single episode = **3/3 completed @ 31s** (32% of the time budget).
 - **⚠️ Critical inference setting**: `n_action_steps=32` (reuse the whole chunk, i.e. equal to chunk_size).
   With the default `n_action_steps=8`, this checkpoint scores **0/18 over 6 rounds, a catastrophic failure** (per-step replans conflict with one another). See [Inference caveat](#-推理关键配置--critical-inference-caveat) below.
 ## Highlights
-- **First single-episode formal Isaac Sim env success of the session, completed in 31s**: ep1 = `[True, True, True] @ 31.0s`. None of the other baseline / retrain configurations across the 90+ experiments ever reached an env-confirmed task done.
-  _First single-episode formal Isaac Sim env-side task completion (env reports `done=True`) in our 90+ hyperparam sweep — ep1 placed all 3 oranges in 31.0s, well under the 90s wall cap._
 - **Exposes the critical role of `n_action_steps`**: changing the default 8 to 32 was the only reliable improvement of the session, about 3.5× over baseline.
   _Exposes `n_action_steps` as the single most reliable improvement — switching from default 8 to chunk_size=32 (full chunk reuse) gave ~3.5× over baseline._
-- **Weak image-aug is the only retrain that is positive in aggregate**: lerobot's default ColorJitter+Sharp+Affine over-regularizes on the 50-demo dataset (13% per-ep); keeping only brightness ±5% (max_num_transforms=1) instead beats the baseline by +5.6%.
-  _Out of 6 retrain experiments (velocity-reweight, L1 loss, default image-aug, weak image-aug, body-desc, L1+aug compound), **only weak image-aug was net positive**. Default aug strength was harmful (-11.1% vs baseline); minimal brightness-only aug + 15k step gave the result above._
 ## Training recipe
 ```bash
-# 1. Stage one: 10k steps from lerobot/xvla-base
 WEAK_IMAGE_AUG=1 \
 BATCH_SIZE=8 \
 MAX_STEPS=10000 \
 SAVE_FREQ=500 \
 OUTPUT_DIR=$LEISAAC/outputs/xvla-leisaac-pick-orange.weakaug \
 bash LeIsaac/scripts/finetune/xvla/train.sh
-# 2. Stage two: resume 10k → 15k (key step; no formal success had appeared yet at 10k)
-WEAK_IMAGE_AUG=1 \
-BATCH_SIZE=8 \
-MAX_STEPS=15000 \
-SAVE_FREQ=500 \
-RESUME=true \
-OUTPUT_DIR=$LEISAAC/outputs/xvla-leisaac-pick-orange.weakaug \
-bash LeIsaac/scripts/finetune/xvla/train.sh
 ```
 `WEAK_IMAGE_AUG=1` expands inside [`train.sh`](https://github.com/vitorcen/LeIsaac-Training/blob/main/scripts/finetune/xvla/train.sh) to:
@@ -101,10 +92,10 @@ bash server/serve_xvla.sh --detach
 POLICY_PORT=5558 \
 POLICY_TIMEOUT_MS=3000 \
 ACTION_HORIZON=1 \
-EVAL_ROUNDS=6 \
-EPISODE_LENGTH=60 \
 PROMPT="Pick up the orange and put it in the plate" \
-MAX_ROUND_WALL_S=90 \
 bash server/eval_pi05.sh
 ```
@@ -116,24 +107,33 @@ bash server/eval_pi05.sh
 |---|---|---|---|
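 | 8 (lerobot default) | **0/18** ❌ | 0% | Replans every step; chunk[0]→chunk[0]→... fight each other |
 | 16 | 4/18 | 22% | Partial chunk reuse |
-| **32 (= chunk_size)** | **7/18 + 1 success** ⭐ | **39%** | Full chunk reuse; each chunk is self-consistent |
 **X-VLA's RF action head generates a 32-step chunk in one shot, and the chunk must be fully played out in the env** for its planning to pay off. Replanning at every step instead misaligns the chunk sequence.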
 ## Evaluation
-6 rounds × 60s × Isaac Sim PickOrange (`LeIsaac-SO101-PickOrange-v0`), `max_round_wall_s=90`:
 | Episode | oranges placed | wall time | Notes |
 |---|---|---|---|
-| 1 | **3/3** ✅ | **31.0s** | **formal env success** ⭐⭐⭐ |
-| 2 | 2/3 | 90.0s | wall_cap |
-| 3 | 0/3 | 90.1s | wall_cap |
-| 4 | 2/3 | 90.0s | wall_cap |
-| 5 | 0/3 | 90.0s | wall_cap |
-| 6 | 0/3 | 90.0s | wall_cap |
-| **Total** | **7/18 (39%)** | — | **1/6 success rate** |
 ### Full retrain experiment summary table
@@ -165,9 +165,9 @@ _Negative findings — DO NOT repeat_
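 ## Limitations
-- **Small sample**: ~39% per-ep is an estimate from 6 rounds (18 orange placements), with a confidence interval of about ±15%. The 15k checkpoint's `1/6 formal success + 3/3@31s` is a single-episode breakthrough; no 12-round rerun was done to confirm stability. Replicate to see whether it holds.
-- **Only 50 demos in the dataset**: retrains that change the loss or augmentation are generally too aggressive; expanding to 80-100 demos should break the current 39% ceiling.
-- **Multimodal place sub-task**: the model occasionally hovers and jitters after grasping (frequent at the intermediate 13k/14k checkpoints). DAgger or synthetic relabeling may be needed to fix the covariate shift.
 - **chunk_size=32 vs. wall clock**: 1 chunk = 32 steps × 33ms ≈ 1s planning period. More flexible than ACT (chunk=100, 3.3s period) but slower than DP DDIM-32 (200ms period).
 ## Citations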

   _Single-arm SO-101 picks 3 oranges sequentially and places each in a plate._
 - **Dataset**: [`LightwheelAI/leisaac-pick-orange`](https://huggingface.co/datasets/LightwheelAI/leisaac-pick-orange), 60 teleoperated demonstration episodes (50 train / 10 val split).
 - **Architecture**: X-VLA, i.e. Florence2 vision-language encoder + Soft-Prompted Transformer + Rectified-Flow action head (10 denoising steps). chunk_size=32, n_obs_steps=2.
+- **Training**: batch=8 / lr=1e-4 / **10k steps** / **weak image-aug (brightness ±5% only)** / GRIPPER_SCALE=5 / ~18 min on RTX 4090.
+- **Eval** (benchmark-aligned: 3 rounds × 120s sim × 180s wall_cap, same conditions as the other leaderboard baselines): **4/9 oranges (44%)**, **ep2 = [T, T, T] 3/3** ⭐.
 - **⚠️ Critical inference setting**: `n_action_steps=32` (reuse the whole chunk, i.e. equal to chunk_size).
   With the default `n_action_steps=8`, this checkpoint scores **0/18 over 6 rounds, a catastrophic failure** (per-step replans conflict with one another). See [Inference caveat](#-推理关键配置--critical-inference-caveat) below.
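
The `config.json` in this commit still records `n_action_steps: 8`, so the setting above has to be applied somewhere at inference time. One possible way, sketched below with only the Python standard library, is to patch a local copy of the config. This is a hedged illustration: `set_n_action_steps` is a hypothetical helper, the flat key layout is assumed from the `config.json` diff, and whether the serving script actually reads this field (rather than overriding it elsewhere) is an assumption.

```python
import json
from pathlib import Path
from typing import Optional


def set_n_action_steps(config_path: str, n_action_steps: Optional[int] = None) -> dict:
    """Rewrite n_action_steps in a local policy config (hypothetical helper).

    If n_action_steps is None, it is set equal to chunk_size (full chunk reuse).
    Assumes "chunk_size" and "n_action_steps" are top-level keys.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))

    chunk_size = config["chunk_size"]
    value = chunk_size if n_action_steps is None else n_action_steps
    if not 1 <= value <= chunk_size:
        raise ValueError(f"n_action_steps must be in [1, {chunk_size}], got {value}")

    config["n_action_steps"] = value
    path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config
```

Calling `set_n_action_steps("config.json")` on a downloaded copy would set the field to the recorded `chunk_size` of 32; the hosted file is left untouched.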
 ## Highlights
+- **Under the benchmark setting (3 rounds × 120s sim × 180s wall_cap), ep2 = a perfect 3/3**. None of the other baselines (ACT, DP, X-VLA-15k) achieved a single-episode 3/3 under the same conditions.
+  _Under standardized benchmark conditions (matching leaderboard protocol), ep2 placed all 3 oranges — a feat not achieved by ACT, DP, or X-VLA-15k under the same evaluation._
 - **Exposes the critical role of `n_action_steps`**: changing the default 8 to 32 was the only reliable improvement of the session, about 3.5× over baseline.
   _Exposes `n_action_steps` as the single most reliable improvement — switching from default 8 to chunk_size=32 (full chunk reuse) gave ~3.5× over baseline._
+- **Weak image-aug is the only retrain that is positive in aggregate**: lerobot's default ColorJitter+Sharp+Affine over-regularizes on the 50-demo dataset (13% per-ep); keeping only brightness ±5% (max_num_transforms=1) instead beats the baseline by +5.6%, reaching 44% per-ep at 10k.
+  _Out of 6 retrain experiments (velocity-reweight, L1 loss, default image-aug, weak image-aug, body-desc, L1+aug compound), **only weak image-aug was net positive**. Default aug strength was harmful (-11.1% vs baseline); minimal brightness-only aug at 10k step gave 44% per-ep on benchmark._
 ## Training recipe
 ```bash
+# Single stage: 10k steps from lerobot/xvla-base
 WEAK_IMAGE_AUG=1 \
 BATCH_SIZE=8 \
 MAX_STEPS=10000 \
 SAVE_FREQ=500 \
 OUTPUT_DIR=$LEISAAC/outputs/xvla-leisaac-pick-orange.weakaug \
 bash LeIsaac/scripts/finetune/xvla/train.sh
 ```
 `WEAK_IMAGE_AUG=1` expands inside [`train.sh`](https://github.com/vitorcen/LeIsaac-Training/blob/main/scripts/finetune/xvla/train.sh) to:
 POLICY_PORT=5558 \
 POLICY_TIMEOUT_MS=3000 \
 ACTION_HORIZON=1 \
+EVAL_ROUNDS=3 \
+EPISODE_LENGTH=120 \
 PROMPT="Pick up the orange and put it in the plate" \
+MAX_ROUND_WALL_S=180 \
 bash server/eval_pi05.sh
 ```
 |---|---|---|---|
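 | 8 (lerobot default) | **0/18** ❌ | 0% | Replans every step; chunk[0]→chunk[0]→... fight each other |
 | 16 | 4/18 | 22% | Partial chunk reuse |
+| **32 (= chunk_size)** | **6/18 + 3/3 perfect** ⭐ | **33%** | Full chunk reuse; each chunk is self-consistent |
 **X-VLA's RF action head generates a 32-step chunk in one shot, and the chunk must be fully played out in the env** for its planning to pay off. Replanning at every step instead misaligns the chunk sequence.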
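
To make the role of `n_action_steps` concrete, here is a minimal sketch of the action-queue pattern commonly used by chunked policies: a chunk is predicted once, its first `n_action_steps` actions are executed, and only then is a new chunk planned. `ChunkedExecutor`, `dummy_planner`, and `count_plans` are hypothetical names for illustration, not lerobot or X-VLA APIs, and it is an assumption that the real inference path follows this pattern.

```python
from collections import deque
from typing import Callable, Deque, List

Action = List[float]


class ChunkedExecutor:
    """Plays the first n_action_steps actions of each predicted chunk before replanning."""

    def __init__(self, plan_chunk: Callable[[object], List[Action]], n_action_steps: int):
        if n_action_steps < 1:
            raise ValueError("n_action_steps must be >= 1")
        self.plan_chunk = plan_chunk
        self.n_action_steps = n_action_steps
        self.queue: Deque[Action] = deque()
        self.num_plans = 0  # how many times the planner was called

    def select_action(self, observation: object) -> Action:
        # Replan only when the queue of pending actions is empty.
        if not self.queue:
            chunk = self.plan_chunk(observation)
            self.num_plans += 1
            self.queue.extend(chunk[: self.n_action_steps])
        return self.queue.popleft()


def dummy_planner(observation: object, chunk_size: int = 32) -> List[Action]:
    # Hypothetical stand-in for the policy: returns chunk_size placeholder actions.
    return [[float(i)] for i in range(chunk_size)]


def count_plans(n_action_steps: int, env_steps: int = 64) -> int:
    """Number of planner calls needed to drive env_steps environment steps."""
    executor = ChunkedExecutor(dummy_planner, n_action_steps)
    for _ in range(env_steps):
        executor.select_action(observation=None)
    return executor.num_plans
```

In this toy setup, smaller `n_action_steps` values discard the tail of every chunk and call the planner more often, which is the regime the table above reports as harmful for this checkpoint.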
 ## Evaluation
+### Benchmark-aligned (3 rounds × 120s sim × 180s wall_cap), same conditions as the leaderboard
 | Episode | oranges placed | wall time | Notes |
 |---|---|---|---|
+| 1 | 1/3 | 180.1s | wall_cap |
+| 2 | **3/3** ✅ | **180.0s** | **3/3 perfect** ⭐ |
+| 3 | 0/3 | 180.1s | wall_cap |
+| **Total** | **4/9 (44%)** | — | 0/3 strict (env did not report done; the 3 oranges were only placed correctly) |
+### 6-round extended eval (60s sim × 90s wall_cap)
+| Episode | oranges placed | wall time |
+|---|---|---|
+| 1 | 1/3 | 90.0s |
+| 2 | **3/3** ✅ | 90.0s |
+| 3 | 0/3 | 90.0s |
+| 4 | 1/3 | 90.1s |
+| 5 | 0/3 | 90.0s |
+| 6 | 1/3 | 90.1s |
+| **Total** | **6/18 (33%)** | — |
 ### Full retrain experiment summary table
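 ## Limitations
+- **Small sample**: 44% per-ep is an estimate from the 3-round benchmark (9 orange placements), with a wide confidence interval of about ±20%. The 6-round extended eval gives 33% (18 orange placements, CI ±15%).
+- **Only 50 demos in the dataset**: retrains that change the loss or augmentation are generally too aggressive; expanding to 80-100 demos should break the current ~44% per-ep ceiling.
+- **Multimodal place sub-task**: the model occasionally hovers and jitters after grasping. DAgger or synthetic relabeling may be needed to fix the covariate shift.
 - **chunk_size=32 vs. wall clock**: 1 chunk = 32 steps × 33ms ≈ 1s planning period. More flexible than ACT (chunk=100, 3.3s period) but slower than DP DDIM-32 (200ms period).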
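
For the small-sample caveat above, one way to put an interval around such counts is a Wilson score interval. The sketch below is an illustration, not the method used to obtain the quoted ± values: it assumes each orange placement is an independent Bernoulli trial (placements within one episode are probably correlated, so the true uncertainty may be larger), and the 95% level (`z = 1.96`) is an assumption since the README does not state one. `wilson_interval` is a hypothetical helper name.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials)) / denom
    return center - half, center + half


# Counts taken from the two evaluation tables (oranges placed / oranges attempted).
for label, placed, attempted in [("benchmark 3-round", 4, 9), ("extended 6-round", 6, 18)]:
    low, high = wilson_interval(placed, attempted)
    print(f"{label}: {placed}/{attempted} = {placed / attempted:.0%}, "
          f"Wilson interval [{low:.0%}, {high:.0%}]")
```

Under these assumptions the intervals for the two evaluations overlap heavily, which is consistent with treating the 44% and 33% figures as estimates of the same underlying rate rather than as a measured difference.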
 ## Citations

config.json CHANGED

@@ -57,7 +57,7 @@
     "private": null,
     "tags": null,
     "license": null,
-    "pretrained_path": "/home/david/work/isaaclab-experience/LeIsaac/outputs/xvla-leisaac-pick-orange.weakaug/checkpoints/last/pretrained_model",
     "chunk_size": 32,
     "n_action_steps": 8,
     "dtype": "bfloat16",
@@ -208,7 +208,7 @@
         224,
         224
     ],
-    "num_image_views": 5,
     "empty_cameras": 1,
     "freeze_vision_encoder": true,
     "freeze_language_encoder": true,

     "private": null,
     "tags": null,
     "license": null,
+    "pretrained_path": "lerobot/xvla-base",
     "chunk_size": 32,
     "n_action_steps": 8,
     "dtype": "bfloat16",
         224,
         224
     ],
+    "num_image_views": 4,
     "empty_cameras": 1,
     "freeze_vision_encoder": true,
     "freeze_language_encoder": true,

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:5f83abe2909f4110de3c14a852948af9a2c3d32813af00e6cb28c91e0ed9c6d4
 size 1759596986

 version https://git-lfs.github.com/spec/v1
+oid sha256:a2142145a9c4fe618dd35ceaadba6b7e6be3391f2b09379d2d553efe951d718d
 size 1759596986

train_config.json CHANGED

@@ -137,7 +137,7 @@
         "private": null,
         "tags": null,
         "license": null,
-        "pretrained_path": "/home/david/work/isaaclab-experience/LeIsaac/outputs/xvla-leisaac-pick-orange.weakaug/checkpoints/last/pretrained_model",
         "chunk_size": 32,
         "n_action_steps": 8,
         "dtype": "bfloat16",
@@ -288,7 +288,7 @@
             224,
             224
         ],
-        "num_image_views": 5,
         "empty_cameras": 1,
         "freeze_vision_encoder": true,
         "freeze_language_encoder": true,
@@ -311,14 +311,14 @@
     "reward_model": null,
     "output_dir": "/home/david/work/isaaclab-experience/LeIsaac/outputs/xvla-leisaac-pick-orange.weakaug",
     "job_name": "xvla",
-    "resume": true,
     "seed": 1000,
     "cudnn_deterministic": false,
     "num_workers": 4,
     "batch_size": 8,
     "prefetch_factor": 4,
     "persistent_workers": true,
-    "steps": 15000,
     "eval_freq": 20000,
     "log_freq": 200,
     "tolerance_s": 0.0001,
@@ -366,5 +366,5 @@
         "observation.images.front": "observation.images.image",
         "observation.images.wrist": "observation.images.image2"
     },
-    "checkpoint_path": "/home/david/work/isaaclab-experience/LeIsaac/outputs/xvla-leisaac-pick-orange.weakaug/checkpoints/last"
 }

         "private": null,
         "tags": null,
         "license": null,
+        "pretrained_path": "lerobot/xvla-base",
         "chunk_size": 32,
         "n_action_steps": 8,
         "dtype": "bfloat16",
             224,
             224
         ],
+        "num_image_views": 4,
         "empty_cameras": 1,
         "freeze_vision_encoder": true,
         "freeze_language_encoder": true,
     "reward_model": null,
     "output_dir": "/home/david/work/isaaclab-experience/LeIsaac/outputs/xvla-leisaac-pick-orange.weakaug",
     "job_name": "xvla",
+    "resume": false,
     "seed": 1000,
     "cudnn_deterministic": false,
     "num_workers": 4,
     "batch_size": 8,
     "prefetch_factor": 4,
     "persistent_workers": true,
+    "steps": 10000,
     "eval_freq": 20000,
     "log_freq": 200,
     "tolerance_s": 0.0001,
         "observation.images.front": "observation.images.image",
         "observation.images.wrist": "observation.images.image2"
     },
+    "checkpoint_path": null
 }