wsagi committed on
Commit
018a3a4
·
verified ·
1 Parent(s): ee0f676

Add files using upload-large-folder tool

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ x-vla-pick-orange.jpg filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,182 @@
1
+ ---
2
+ license: apache-2.0
3
+ library_name: lerobot
4
+ pipeline_tag: robotics
5
+ tags:
6
+ - xvla
7
+ - x-vla
8
+ - lerobot
9
+ - so101
10
+ - leisaac
11
+ - pick-orange
12
+ - isaac-sim
13
+ - rectified-flow
14
+ - florence2
15
+ datasets:
16
+ - LightwheelAI/leisaac-pick-orange
17
+ language:
18
+ - en
19
+ base_model: lerobot/xvla-base
20
+ ---
21
+
22
+ # X-VLA-PickOrange
23
+
24
+ An [X-VLA](https://arxiv.org/abs/2510.10274) (Florence2 + Soft-Prompted Transformer + Rectified-Flow action head, 0.9B params) policy fine-tuned from [X-VLA-base](https://huggingface.co/lerobot/xvla-base) on the [LeIsaac SO-101 PickOrange](https://github.com/LightwheelAI/leisaac) task.
26
+
27
+ ![X-VLA-PickOrange — SO-101 in Isaac Sim](x-vla-pick-orange.jpg)
28
+
29
+ **🔗 Project repos**:
30
+ - [vitorcen/isaaclab-experience](https://github.com/vitorcen/isaaclab-experience): Isaac Lab + LeIsaac multi-policy comparison (parent project)
31
+ - [vitorcen/LeIsaac-Training](https://github.com/vitorcen/LeIsaac-Training): LeIsaac fork (training scripts + design docs)
32
+
33
+ ## TL;DR
34
+
35
+ - **Task**: `Pick up the orange and put it in the plate`. A single-arm SO-101 picks up 3 oranges one after another and places each in the plate.
37
+ - **Dataset**: [`LightwheelAI/leisaac-pick-orange`](https://huggingface.co/datasets/LightwheelAI/leisaac-pick-orange), 60 teleoperated demonstration episodes (50 train / 10 val split).
38
+ - **Architecture**: X-VLA, i.e. Florence2 vision-language encoder + Soft-Prompted Transformer + Rectified-Flow action head (10 denoising steps). chunk_size=32, n_obs_steps=2.
39
+ - **Training**: batch=8 / lr=1e-4 / **15k steps** (10k from base + 5k resumed) / **weak image-aug (brightness ±5% only)** / GRIPPER_SCALE=5 / ~26 min on RTX 4090.
40
+ - **Eval**: Isaac Sim 5.1 + LeIsaac, 6 rounds × 60s. **7/18 oranges placed + 1/6 formal env success**; the best single episode placed **all 3/3 @ 31s** (32% of the time budget).
41
+ - **⚠️ Critical inference setting**: `n_action_steps=32` (reuse the whole chunk, i.e. `chunk_size`).
42
+   With the default `n_action_steps=8`, this checkpoint scores **0/18** over 6 rounds, a catastrophic failure (the re-planned chunks conflict with each other). See the "Critical inference caveat" section below.
43
+
44
+ ## Highlights
46
+
47
+ - **First formal Isaac Sim env success in a single episode, completed in 31s**: ep1 = `[True, True, True] @ 31.0s`. This is the first env-side task completion (env reports `done=True`) in our 90+ experiment hyperparameter sweep; no other baseline or retrain configuration reached it. ep1 placed all 3 oranges in 31.0s, well under the 90s wall cap.
49
+ - **Exposes the key role of `n_action_steps`**: switching from the default 8 to 32 (= chunk_size, full chunk reuse) was the single most reliable improvement in the session, about 3.5× over baseline.
51
+ - **Weak image-aug is the only retrain with a positive aggregate effect**: lerobot's default ColorJitter + Sharpness + Affine over-regularizes on the 50-demo dataset (13% per-ep); keeping only brightness ±5% (max_num_transforms=1) beats the baseline by +5.6 points. Out of 6 retrain experiments (velocity-reweight, L1 loss, default image-aug, weak image-aug, body-desc, L1+aug compound), **only weak image-aug was net positive**. Default aug strength was harmful (-11.1 points vs baseline); minimal brightness-only aug + 15k steps gave the result above.
53
+
54
+ ## Training recipe
56
+
57
+ ```bash
58
+ # 1. Stage one: 10k steps from lerobot/xvla-base
59
+ WEAK_IMAGE_AUG=1 \
60
+ BATCH_SIZE=8 \
61
+ MAX_STEPS=10000 \
62
+ SAVE_FREQ=500 \
63
+ OUTPUT_DIR=$LEISAAC/outputs/xvla-leisaac-pick-orange.weakaug \
64
+ bash LeIsaac/scripts/finetune/xvla/train.sh
65
+
66
+ # 2. Stage two: resume from 10k to 15k (critical: no formal success had appeared at 10k)
67
+ WEAK_IMAGE_AUG=1 \
68
+ BATCH_SIZE=8 \
69
+ MAX_STEPS=15000 \
70
+ SAVE_FREQ=500 \
71
+ RESUME=true \
72
+ OUTPUT_DIR=$LEISAAC/outputs/xvla-leisaac-pick-orange.weakaug \
73
+ bash LeIsaac/scripts/finetune/xvla/train.sh
74
+ ```
75
+
76
+ Inside [`train.sh`](https://github.com/vitorcen/LeIsaac-Training/blob/main/scripts/finetune/xvla/train.sh), `WEAK_IMAGE_AUG=1` expands to:
77
+
78
+ ```
79
+ --dataset.image_transforms.enable=true
80
+ --dataset.image_transforms.max_num_transforms=1
81
+ --dataset.image_transforms.tfs={"brightness":{"weight":1.0,"type":"ColorJitter","kwargs":{"brightness":[0.95,1.05]}}}
82
+ ```
83
+
84
+ That is, each batch samples at most 1 transform, and only brightness ±5% is allowed (contrast, saturation, hue, SharpnessJitter, and RandomAffine are all disabled).
85
+
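The flag expansion described above can be reproduced in a few lines of Python. This is a minimal sketch, assuming the flags are assembled by serializing a plain dict; the helper name `weak_aug_flags` is hypothetical and is not part of `train.sh` or lerobot. Only the flag names and values are taken from the README.

```python
import json


def weak_aug_flags(low: float = 0.95, high: float = 1.05) -> list[str]:
    """Build the three image-transform flags for brightness-only augmentation.

    Hypothetical helper: the flag names and values mirror the README,
    everything else is illustrative.
    """
    tfs = {
        "brightness": {
            "weight": 1.0,
            "type": "ColorJitter",
            "kwargs": {"brightness": [low, high]},
        }
    }
    return [
        "--dataset.image_transforms.enable=true",
        "--dataset.image_transforms.max_num_transforms=1",
        # Compact separators keep the JSON a single shell token.
        "--dataset.image_transforms.tfs=" + json.dumps(tfs, separators=(",", ":")),
    ]


flags = weak_aug_flags()
for flag in flags:
    print(flag)
```

Keeping the transform table as a dict makes it easy to widen or narrow the brightness range in one place when comparing augmentation strengths.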
86
+ For a detailed comparison, see the full retrain experiment aggregate table in the evaluation section.
87
+
88
+ ## Inference
89
+
90
+ ### End-to-end server (compatible with the Isaac Sim ZMQ client)
91
+
92
+ ```bash
93
+ # Start the X-VLA inference server (ZMQ REQ/REP + msgpack)
94
+ N_ACTION_STEPS=32 \
95
+ PROMPT="Pick up the orange and put it in the plate" \
96
+ CKPT=<this_repo_dir> \
97
+ PORT=5558 \
98
+ bash server/serve_xvla.sh --detach
99
+
100
+ # Run the PickOrange eval from the Isaac Sim client
101
+ POLICY_PORT=5558 \
102
+ POLICY_TIMEOUT_MS=3000 \
103
+ ACTION_HORIZON=1 \
104
+ EVAL_ROUNDS=6 \
105
+ EPISODE_LENGTH=60 \
106
+ PROMPT="Pick up the orange and put it in the plate" \
107
+ MAX_ROUND_WALL_S=90 \
108
+ bash server/eval_pi05.sh
109
+ ```
110
+
111
+ [Server implementation](https://github.com/vitorcen/isaaclab-experience/blob/main/server/xvla_leisaac/server.py). The eval script is shared with π0.5 / SmolVLA / ACT / GR00T.
112
+
113
+ ### 🔴 推理关键配置 / Critical inference caveat
114
+
115
+ | `n_action_steps` | 6-round oranges | per-ep | Notes |
116
+ |---|---|---|---|
117
+ | 8 (lerobot default) | **0/18** ❌ | 0% | Re-plans every step; chunk[0]→chunk[0]→... plans fight each other |
118
+ | 16 | 4/18 | 22% | Partial chunk reuse |
119
+ | **32 (= chunk_size)** | **7/18 + 1 success** ⭐ | **39%** | Full chunk reuse; each chunk is self-consistent |
120
+
121
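+ **X-VLA's RF action head generates a 32-step chunk in one shot, and the chunk must be fully played out in the env** for its planning value to show. Re-planning at every step instead misaligns the chunk sequence.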
122
+
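The interaction between `chunk_size` and `n_action_steps` can be illustrated with a toy action queue. This is a sketch of the general chunk-reuse pattern, not the X-VLA or lerobot implementation: `ChunkedPolicy` and `count_plans` are hypothetical names, the planner is a placeholder, and the 1800-step episode length assumes 60s at roughly 33ms per control step.

```python
from collections import deque


class ChunkedPolicy:
    """Toy chunked policy: plan `chunk_size` actions at once, execute the
    first `n_action_steps` of them, then plan again."""

    def __init__(self, chunk_size: int = 32, n_action_steps: int = 32):
        if not 1 <= n_action_steps <= chunk_size:
            raise ValueError("n_action_steps must be in [1, chunk_size]")
        self.chunk_size = chunk_size
        self.n_action_steps = n_action_steps
        self.queue = deque()
        self.num_plans = 0

    def _plan(self) -> list:
        # Placeholder for the action head: tag each action with
        # (plan index, offset inside the chunk) so reuse is easy to inspect.
        self.num_plans += 1
        return [(self.num_plans, i) for i in range(self.chunk_size)]

    def select_action(self):
        # Only plan when the queue of pending actions is empty.
        if not self.queue:
            self.queue.extend(self._plan()[: self.n_action_steps])
        return self.queue.popleft()


def count_plans(n_action_steps: int, env_steps: int = 1800) -> int:
    """Count how many chunks are generated over one episode."""
    policy = ChunkedPolicy(chunk_size=32, n_action_steps=n_action_steps)
    for _ in range(env_steps):
        policy.select_action()
    return policy.num_plans


plans = {n: count_plans(n) for n in (8, 16, 32)}
print(plans)
```

With a smaller `n_action_steps`, the tail of every chunk is discarded and a fresh chunk is sampled more often, which is the regime the table above reports as harmful for this checkpoint.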
123
+ ## Evaluation
125
+
126
+ 6 rounds × 60s on Isaac Sim PickOrange (`LeIsaac-SO101-PickOrange-v0`), `max_round_wall_s=90`:
127
+
128
+ | Episode | oranges placed | wall time | Notes |
129
+ |---|---|---|---|
130
+ | 1 | **3/3** ✅ | **31.0s** | **formal env success** ⭐⭐⭐ |
131
+ | 2 | 2/3 | 90.0s | wall_cap |
132
+ | 3 | 0/3 | 90.1s | wall_cap |
133
+ | 4 | 2/3 | 90.0s | wall_cap |
134
+ | 5 | 0/3 | 90.0s | wall_cap |
135
+ | 6 | 0/3 | 90.0s | wall_cap |
136
+ | **Total** | **7/18 (39%)** | — | **1/6 success rate** |
137
+
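The totals in the table can be recomputed, together with an uncertainty estimate, from the per-episode counts. The sketch below uses a Wilson score interval, which is one common choice and not necessarily the method behind the figures quoted elsewhere in this README; it also treats the 18 orange placements as independent trials, which is an assumption, since oranges within one episode are likely correlated.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Oranges placed per episode, copied from the table above.
placed = [3, 2, 0, 2, 0, 0]
total_placed = sum(placed)
total_oranges = 3 * len(placed)
rate = total_placed / total_oranges
low, high = wilson_interval(total_placed, total_oranges)
print(f"{total_placed}/{total_oranges} = {rate:.1%}, Wilson 95% interval [{low:.1%}, {high:.1%}]")
```

At this sample size the interval is wide, which is why comparisons between configurations need more rounds than a single 6-round run.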
138
+ ### Full retrain experiment aggregate table
139
+
140
+ | Retrain config (5 ckpts × 6-round = 90 ep) | per-ep aggregate | vs baseline |
141
+ |---|---|---|
142
+ | 🥇 **Weak image-aug (brightness ±5%)** | **30.0%** | **+5.6** ⭐ |
143
+ | L1 loss (OFT-lite, [Fine-Tuning VLA 2502.19645](https://arxiv.org/abs/2502.19645)) | 27.8% | +3.4 |
144
+ | Baseline (no retrain) | 24.4% | — |
145
+ | L1 + weak aug compound | 15.6% | -8.8 (negative interference) |
146
+ | Default image-aug (lerobot default strength) | 13.3% | -11.1 |
147
+ | Velocity-reweight β=2.0 ([AttenA+ 2605.13548](https://arxiv.org/abs/2605.13548)) | ~11% | -13 |
148
+
149
+ For details, see the parent project's HTML design doc [`vla_improvement_methods_checklist.html`](https://github.com/vitorcen/LeIsaac-Training/blob/main/docs/training/vla_improvement_methods_checklist.html) (includes 90+ hyperparameter sweep CSVs).
150
+
151
+ ## Negative findings: do not repeat
153
+
154
+ Rigorously falsified across the 90+ experiments (≥36 ep cumulative):
155
+
156
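+ - ❌ **TAE (Temporal Action Ensembling, [ALOHA 2304.13705](https://arxiv.org/abs/2304.13705))**: K∈{2,4,8} × m∈{0.1,0.3} all scored ≤1/9. X-VLA's RF head with 10-step denoising already produces smooth actions on its own.
157
+ - ❌ **EMA action smoothing α∈[0.2, 0.7]**: the 5/9 at α=0.3 in a 3-round run was a single-episode outlier; the 12-round retest gave 2/18, so it is actually harmful.
158
+ - ❌ **"Grasp" verb in prompt**: 0/18, completely dead. Possibly "grasp" in the OXE datasets is associated with hand pose rather than with a robot reach trajectory.
159
+ - ❌ **"all <plural>" prompts**: 3/18; triggers multi-target ambiguity.
160
+ - ❌ **Short prompts without the "Pick up" preamble**: 1/18; the model cannot ground the task.
161
+ - ❌ **"on/onto the plate" prepositions**: ≤2/18, far worse than "in the plate" (container semantics).
162
+ - ❌ **Body-desc retrain (Path 2)**: with Florence2 frozen, a long prompt is only a small token perturbation and does not change the action conditioning.
163
+ - ❌ **Offline action-MSE eval**: does not predict closed-loop performance (falsified repeatedly). **Only real Isaac Sim runs count.**
164
+ - ❌ **3-round closed-loop eval**: variance is ±15-30%. **Every decision needs ≥6 rounds (≥18 ep); every comparison needs ≥12 rounds (≥36 ep).**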
165
+
166
+ ## Limitations
167
+
168
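+ - **Small sample size**: ~39% per-ep is a 6-round (18 ep) estimate with a confidence interval of ±15%. The 15k checkpoint's `1/6 formal success + 3/3@31s` is a single-episode breakthrough; no 12-round retest has been run to confirm stability. Replicate to check whether it holds.
169
+ - **Only 50 demos in the dataset**: retrain changes to loss / aug tend to overshoot; expanding to 80-100 demos is expected to break the current 39% ceiling.
170
+ - **The place sub-task is multimodal**: the model occasionally hovers and jitters after grasping (frequent at the intermediate 13k/14k checkpoints). DAgger or synthetic relabeling may be needed to fix the covariate shift.
171
+ - **chunk_size=32 vs. wall clock**: 1 chunk = 32 steps × 33ms ≈ 1s planning period. More flexible than ACT (chunk=100, 3.3s period) but slower than DP DDIM-32 (200ms period).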
172
+
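The planning-period arithmetic in the last limitation is simply chunk length times control step. A small sketch, assuming the 33ms control step quoted above; `planning_period_s` is an illustrative helper name.

```python
def planning_period_s(chunk_steps: int, step_ms: float = 33.0) -> float:
    """Seconds of open-loop execution covered by one fully reused chunk."""
    return chunk_steps * step_ms / 1000.0


periods = {
    "X-VLA (chunk=32)": planning_period_s(32),
    "ACT (chunk=100)": planning_period_s(100),
}
for name, seconds in periods.items():
    print(f"{name}: {seconds:.2f}s per plan")
```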
173
+ ## Citations
174
+
175
+ - **X-VLA**: Zheng et al., [_X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model_](https://arxiv.org/abs/2510.10274), 2025.
176
+ - **OFT recipe (L1 loss baseline)**: [_Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success_](https://arxiv.org/abs/2502.19645), 2025.
177
+ - **LeIsaac SO-101 PickOrange**: [LightwheelAI/leisaac](https://github.com/LightwheelAI/leisaac).
178
+ - **lerobot**: [HuggingFace lerobot](https://github.com/huggingface/lerobot).
179
+
180
+ ## License
181
+
182
+ Apache-2.0, consistent with lerobot / X-VLA-base.
config.json ADDED
@@ -0,0 +1,230 @@
1
+ {
2
+ "type": "xvla",
3
+ "n_obs_steps": 2,
4
+ "input_features": {
5
+ "observation.images.image": {
6
+ "type": "VISUAL",
7
+ "shape": [
8
+ 3,
9
+ 256,
10
+ 256
11
+ ]
12
+ },
13
+ "observation.images.image2": {
14
+ "type": "VISUAL",
15
+ "shape": [
16
+ 3,
17
+ 256,
18
+ 256
19
+ ]
20
+ },
21
+ "observation.state": {
22
+ "type": "STATE",
23
+ "shape": [
24
+ 8
25
+ ]
26
+ },
27
+ "observation.images.image3": {
28
+ "type": "VISUAL",
29
+ "shape": [
30
+ 3,
31
+ 224,
32
+ 224
33
+ ]
34
+ },
35
+ "observation.images.empty_camera_0": {
36
+ "type": "VISUAL",
37
+ "shape": [
38
+ 3,
39
+ 224,
40
+ 224
41
+ ]
42
+ }
43
+ },
44
+ "output_features": {
45
+ "action": {
46
+ "type": "ACTION",
47
+ "shape": [
48
+ 6
49
+ ]
50
+ }
51
+ },
52
+ "device": "cuda",
53
+ "use_amp": false,
54
+ "use_peft": false,
55
+ "push_to_hub": false,
56
+ "repo_id": null,
57
+ "private": null,
58
+ "tags": null,
59
+ "license": null,
60
+ "pretrained_path": "/home/david/work/isaaclab-experience/LeIsaac/outputs/xvla-leisaac-pick-orange.weakaug/checkpoints/last/pretrained_model",
61
+ "chunk_size": 32,
62
+ "n_action_steps": 8,
63
+ "dtype": "bfloat16",
64
+ "normalization_mapping": {
65
+ "STATE": "MEAN_STD",
66
+ "ACTION": "MIN_MAX",
67
+ "VISUAL": "IDENTITY"
68
+ },
69
+ "florence_config": {
70
+ "model_type": "florence2",
71
+ "bos_token_id": 0,
72
+ "eos_token_id": 2,
73
+ "ignore_index": -100,
74
+ "pad_token_id": 1,
75
+ "projection_dim": 1024,
76
+ "text_config": {
77
+ "vocab_size": 51289,
78
+ "activation_dropout": 0.1,
79
+ "activation_function": "gelu",
80
+ "add_bias_logits": false,
81
+ "add_final_layer_norm": false,
82
+ "attention_dropout": 0.1,
83
+ "bos_token_id": 0,
84
+ "classif_dropout": 0.1,
85
+ "classifier_dropout": 0.0,
86
+ "d_model": 1024,
87
+ "decoder_attention_heads": 16,
88
+ "decoder_ffn_dim": 4096,
89
+ "decoder_layerdrop": 0.0,
90
+ "decoder_layers": 12,
91
+ "decoder_start_token_id": 2,
92
+ "dropout": 0.1,
93
+ "early_stopping": true,
94
+ "encoder_attention_heads": 16,
95
+ "encoder_ffn_dim": 4096,
96
+ "encoder_layerdrop": 0.0,
97
+ "encoder_layers": 12,
98
+ "eos_token_id": 2,
99
+ "forced_eos_token_id": 2,
100
+ "forced_bos_token_id": 0,
101
+ "gradient_checkpointing": false,
102
+ "init_std": 0.02,
103
+ "is_encoder_decoder": true,
104
+ "label2id": {
105
+ "LABEL_0": 0,
106
+ "LABEL_1": 1,
107
+ "LABEL_2": 2
108
+ },
109
+ "max_position_embeddings": 4096,
110
+ "no_repeat_ngram_size": 3,
111
+ "normalize_before": false,
112
+ "num_hidden_layers": 12,
113
+ "pad_token_id": 1,
114
+ "scale_embedding": false,
115
+ "num_beams": 3
116
+ },
117
+ "vision_config": {
118
+ "model_type": "davit",
119
+ "drop_path_rate": 0.1,
120
+ "patch_size": [
121
+ 7,
122
+ 3,
123
+ 3,
124
+ 3
125
+ ],
126
+ "patch_stride": [
127
+ 4,
128
+ 2,
129
+ 2,
130
+ 2
131
+ ],
132
+ "patch_padding": [
133
+ 3,
134
+ 1,
135
+ 1,
136
+ 1
137
+ ],
138
+ "patch_prenorm": [
139
+ false,
140
+ true,
141
+ true,
142
+ true
143
+ ],
144
+ "enable_checkpoint": false,
145
+ "dim_embed": [
146
+ 256,
147
+ 512,
148
+ 1024,
149
+ 2048
150
+ ],
151
+ "num_heads": [
152
+ 8,
153
+ 16,
154
+ 32,
155
+ 64
156
+ ],
157
+ "num_groups": [
158
+ 8,
159
+ 16,
160
+ 32,
161
+ 64
162
+ ],
163
+ "depths": [
164
+ 1,
165
+ 1,
166
+ 9,
167
+ 1
168
+ ],
169
+ "window_size": 12,
170
+ "projection_dim": 1024,
171
+ "visual_temporal_embedding": {
172
+ "type": "COSINE",
173
+ "max_temporal_embeddings": 100
174
+ },
175
+ "image_pos_embed": {
176
+ "type": "learned_abs_2d",
177
+ "max_pos_embeddings": 50
178
+ },
179
+ "image_feature_source": [
180
+ "spatial_avg_pool",
181
+ "temporal_avg_pool"
182
+ ]
183
+ },
184
+ "vocab_size": 51289,
185
+ "torch_dtype": "float32",
186
+ "is_encoder_decoder": true
187
+ },
188
+ "tokenizer_name": "facebook/bart-large",
189
+ "tokenizer_max_length": 1024,
190
+ "tokenizer_padding_side": "right",
191
+ "pad_language_to": "max_length",
192
+ "hidden_size": 1024,
193
+ "depth": 24,
194
+ "num_heads": 16,
195
+ "mlp_ratio": 4.0,
196
+ "num_domains": 30,
197
+ "len_soft_prompts": 32,
198
+ "dim_time": 32,
199
+ "max_len_seq": 512,
200
+ "use_hetero_proj": false,
201
+ "action_mode": "so101_single",
202
+ "num_denoising_steps": 10,
203
+ "use_proprio": true,
204
+ "max_state_dim": 20,
205
+ "max_action_dim": 20,
206
+ "domain_feature_key": null,
207
+ "resize_imgs_with_padding": [
208
+ 224,
209
+ 224
210
+ ],
211
+ "num_image_views": 5,
212
+ "empty_cameras": 1,
213
+ "freeze_vision_encoder": true,
214
+ "freeze_language_encoder": true,
215
+ "train_policy_transformer": true,
216
+ "train_soft_prompts": true,
217
+ "optimizer_lr": 0.0001,
218
+ "optimizer_betas": [
219
+ 0.9,
220
+ 0.95
221
+ ],
222
+ "optimizer_eps": 1e-08,
223
+ "optimizer_weight_decay": 0.0001,
224
+ "optimizer_grad_clip_norm": 10.0,
225
+ "optimizer_soft_prompt_lr_scale": 1.0,
226
+ "optimizer_soft_prompt_warmup_lr_scale": null,
227
+ "scheduler_warmup_steps": 200,
228
+ "scheduler_decay_steps": 30000,
229
+ "scheduler_decay_lr": 2.5e-06
230
+ }
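Note that the shipped `config.json` records `"n_action_steps": 8`, while the README recommends `n_action_steps=32` at inference time. The sketch below shows one way to derive the recommended setting from a loaded config dict; `with_full_chunk_reuse` is a hypothetical helper, and how the actual server applies its `N_ACTION_STEPS` override is not shown in this commit.

```python
import json


def with_full_chunk_reuse(config: dict) -> dict:
    """Return a copy of the policy config with n_action_steps = chunk_size."""
    chunk_size = int(config["chunk_size"])
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    updated = dict(config)  # shallow copy, leaves the loaded config untouched
    updated["n_action_steps"] = chunk_size
    return updated


# Only the two relevant keys from config.json are reproduced here.
shipped = json.loads('{"chunk_size": 32, "n_action_steps": 8}')
patched = with_full_chunk_reuse(shipped)
print(patched)
```

Deriving the value from `chunk_size` rather than hard-coding 32 keeps the override valid if the checkpoint is retrained with a different chunk length.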
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5f83abe2909f4110de3c14a852948af9a2c3d32813af00e6cb28c91e0ed9c6d4
3
+ size 1759596986
policy_postprocessor.json ADDED
@@ -0,0 +1,32 @@
1
+ {
2
+ "name": "policy_postprocessor",
3
+ "steps": [
4
+ {
5
+ "registry_name": "unnormalizer_processor",
6
+ "config": {
7
+ "eps": 1e-08,
8
+ "features": {
9
+ "action": {
10
+ "type": "ACTION",
11
+ "shape": [
12
+ 6
13
+ ]
14
+ }
15
+ },
16
+ "norm_map": {
17
+ "STATE": "MEAN_STD",
18
+ "ACTION": "MIN_MAX",
19
+ "VISUAL": "IDENTITY"
20
+ }
21
+ },
22
+ "state_file": "policy_postprocessor_step_0_unnormalizer_processor.safetensors"
23
+ },
24
+ {
25
+ "registry_name": "device_processor",
26
+ "config": {
27
+ "device": "cpu",
28
+ "float_dtype": null
29
+ }
30
+ }
31
+ ]
32
+ }
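The postprocessor's first step undoes the `MIN_MAX` action normalization. A minimal sketch of that inverse mapping, assuming normalized actions live in [-1, 1] (the convention lerobot is generally described as using; treat it as an assumption here). The joint limits below are made-up illustrative numbers, not the statistics stored in the `.safetensors` state file.

```python
def unnormalize_min_max(normalized, lows, highs):
    """Map values from [-1, 1] back to [low, high], element-wise."""
    return [
        (x + 1.0) / 2.0 * (high - low) + low
        for x, low, high in zip(normalized, lows, highs)
    ]


# Hypothetical per-dimension limits for the 6-dim action.
lows = [-1.5, -1.0, -1.0, -2.0, -3.0, 0.0]
highs = [1.5, 1.0, 1.0, 2.0, 3.0, 1.0]
actions = unnormalize_min_max([-1.0, 0.0, 1.0, 0.5, -0.5, 1.0], lows, highs)
print(actions)
```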
policy_postprocessor_step_0_unnormalizer_processor.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5ac4af145fa293fb9282322bee7c87eb369ba8aca3e09dbf1db7600f46142fd5
3
+ size 7552
policy_preprocessor.json ADDED
@@ -0,0 +1,113 @@
1
+ {
2
+ "name": "policy_preprocessor",
3
+ "steps": [
4
+ {
5
+ "registry_name": "rename_observations_processor",
6
+ "config": {
7
+ "rename_map": {
8
+ "observation.images.front": "observation.images.image",
9
+ "observation.images.wrist": "observation.images.image2"
10
+ }
11
+ }
12
+ },
13
+ {
14
+ "registry_name": "to_batch_processor",
15
+ "config": {}
16
+ },
17
+ {
18
+ "registry_name": "tokenizer_processor",
19
+ "config": {
20
+ "max_length": 50,
21
+ "task_key": "task",
22
+ "padding_side": "right",
23
+ "padding": "max_length",
24
+ "truncation": true,
25
+ "tokenizer_name": "facebook/bart-large"
26
+ }
27
+ },
28
+ {
29
+ "registry_name": "xvla_add_domain_id",
30
+ "config": {
31
+ "domain_id": 0
32
+ }
33
+ },
34
+ {
35
+ "registry_name": "xvla_image_to_float",
36
+ "config": {
37
+ "image_keys": null,
38
+ "validate_range": true
39
+ }
40
+ },
41
+ {
42
+ "registry_name": "xvla_imagenet_normalize",
43
+ "config": {
44
+ "image_keys": null
45
+ }
46
+ },
47
+ {
48
+ "registry_name": "device_processor",
49
+ "config": {
50
+ "device": "cuda",
51
+ "float_dtype": null
52
+ }
53
+ },
54
+ {
55
+ "registry_name": "normalizer_processor",
56
+ "config": {
57
+ "eps": 1e-08,
58
+ "features": {
59
+ "observation.images.image": {
60
+ "type": "VISUAL",
61
+ "shape": [
62
+ 3,
63
+ 256,
64
+ 256
65
+ ]
66
+ },
67
+ "observation.images.image2": {
68
+ "type": "VISUAL",
69
+ "shape": [
70
+ 3,
71
+ 256,
72
+ 256
73
+ ]
74
+ },
75
+ "observation.state": {
76
+ "type": "STATE",
77
+ "shape": [
78
+ 8
79
+ ]
80
+ },
81
+ "observation.images.image3": {
82
+ "type": "VISUAL",
83
+ "shape": [
84
+ 3,
85
+ 224,
86
+ 224
87
+ ]
88
+ },
89
+ "observation.images.empty_camera_0": {
90
+ "type": "VISUAL",
91
+ "shape": [
92
+ 3,
93
+ 224,
94
+ 224
95
+ ]
96
+ },
97
+ "action": {
98
+ "type": "ACTION",
99
+ "shape": [
100
+ 6
101
+ ]
102
+ }
103
+ },
104
+ "norm_map": {
105
+ "STATE": "MEAN_STD",
106
+ "ACTION": "MIN_MAX",
107
+ "VISUAL": "IDENTITY"
108
+ }
109
+ },
110
+ "state_file": "policy_preprocessor_step_7_normalizer_processor.safetensors"
111
+ }
112
+ ]
113
+ }
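The first preprocessor step maps the dataset's camera keys onto the names the policy expects. A sketch of that renaming on a plain dict, using the `rename_map` from the config above; `rename_observations` is an illustrative function, not the lerobot processor class, and the string values stand in for image tensors.

```python
RENAME_MAP = {
    "observation.images.front": "observation.images.image",
    "observation.images.wrist": "observation.images.image2",
}


def rename_observations(obs: dict, rename_map: dict = RENAME_MAP) -> dict:
    """Rename mapped keys and pass every other key through unchanged."""
    return {rename_map.get(key, key): value for key, value in obs.items()}


raw = {
    "observation.images.front": "front_frame",
    "observation.images.wrist": "wrist_frame",
    "observation.state": [0.0] * 8,
}
renamed = rename_observations(raw)
print(sorted(renamed))
```

A client that already sends `observation.images.image` and `observation.images.image2` passes through this step unchanged, since unmapped keys are kept as-is.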
policy_preprocessor_step_7_normalizer_processor.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5ac4af145fa293fb9282322bee7c87eb369ba8aca3e09dbf1db7600f46142fd5
3
+ size 7552
train_config.json ADDED
@@ -0,0 +1,370 @@
1
+ {
2
+ "dataset": {
3
+ "repo_id": "leisaac/pick-orange",
4
+ "root": "/home/david/work/isaaclab-experience/LeIsaac/datasets/raw/leisaac-pick-orange",
5
+ "episodes": [
6
+ 0,
7
+ 1,
8
+ 2,
9
+ 3,
10
+ 4,
11
+ 5,
12
+ 6,
13
+ 7,
14
+ 8,
15
+ 9,
16
+ 10,
17
+ 11,
18
+ 12,
19
+ 13,
20
+ 14,
21
+ 15,
22
+ 16,
23
+ 17,
24
+ 18,
25
+ 19,
26
+ 20,
27
+ 21,
28
+ 22,
29
+ 23,
30
+ 24,
31
+ 25,
32
+ 26,
33
+ 27,
34
+ 28,
35
+ 29,
36
+ 30,
37
+ 31,
38
+ 32,
39
+ 33,
40
+ 34,
41
+ 35,
42
+ 36,
43
+ 37,
44
+ 38,
45
+ 39,
46
+ 40,
47
+ 41,
48
+ 42,
49
+ 43,
50
+ 44,
51
+ 45,
52
+ 46,
53
+ 47,
54
+ 48,
55
+ 49
56
+ ],
57
+ "image_transforms": {
58
+ "enable": true,
59
+ "max_num_transforms": 1,
60
+ "random_order": false,
61
+ "tfs": {
62
+ "brightness": {
63
+ "weight": 1.0,
64
+ "type": "ColorJitter",
65
+ "kwargs": {
66
+ "brightness": [
67
+ 0.95,
68
+ 1.05
69
+ ]
70
+ }
71
+ }
72
+ }
73
+ },
74
+ "revision": null,
75
+ "use_imagenet_stats": true,
76
+ "video_backend": "torchcodec",
77
+ "return_uint8": false,
78
+ "streaming": false
79
+ },
80
+ "env": null,
81
+ "policy": {
82
+ "type": "xvla",
83
+ "n_obs_steps": 2,
84
+ "input_features": {
85
+ "observation.images.image": {
86
+ "type": "VISUAL",
87
+ "shape": [
88
+ 3,
89
+ 256,
90
+ 256
91
+ ]
92
+ },
93
+ "observation.images.image2": {
94
+ "type": "VISUAL",
95
+ "shape": [
96
+ 3,
97
+ 256,
98
+ 256
99
+ ]
100
+ },
101
+ "observation.state": {
102
+ "type": "STATE",
103
+ "shape": [
104
+ 8
105
+ ]
106
+ },
107
+ "observation.images.image3": {
108
+ "type": "VISUAL",
109
+ "shape": [
110
+ 3,
111
+ 224,
112
+ 224
113
+ ]
114
+ },
115
+ "observation.images.empty_camera_0": {
116
+ "type": "VISUAL",
117
+ "shape": [
118
+ 3,
119
+ 224,
120
+ 224
121
+ ]
122
+ }
123
+ },
124
+ "output_features": {
125
+ "action": {
126
+ "type": "ACTION",
127
+ "shape": [
128
+ 6
129
+ ]
130
+ }
131
+ },
132
+ "device": "cuda",
133
+ "use_amp": false,
134
+ "use_peft": false,
135
+ "push_to_hub": false,
136
+ "repo_id": null,
137
+ "private": null,
138
+ "tags": null,
139
+ "license": null,
140
+ "pretrained_path": "/home/david/work/isaaclab-experience/LeIsaac/outputs/xvla-leisaac-pick-orange.weakaug/checkpoints/last/pretrained_model",
141
+ "chunk_size": 32,
142
+ "n_action_steps": 8,
143
+ "dtype": "bfloat16",
144
+ "normalization_mapping": {
145
+ "STATE": "MEAN_STD",
146
+ "ACTION": "MIN_MAX",
147
+ "VISUAL": "IDENTITY"
148
+ },
149
+ "florence_config": {
150
+ "model_type": "florence2",
151
+ "bos_token_id": 0,
152
+ "eos_token_id": 2,
153
+ "ignore_index": -100,
154
+ "pad_token_id": 1,
155
+ "projection_dim": 1024,
156
+ "text_config": {
157
+ "vocab_size": 51289,
158
+ "activation_dropout": 0.1,
159
+ "activation_function": "gelu",
160
+ "add_bias_logits": false,
161
+ "add_final_layer_norm": false,
162
+ "attention_dropout": 0.1,
163
+ "bos_token_id": 0,
164
+ "classif_dropout": 0.1,
165
+ "classifier_dropout": 0.0,
166
+ "d_model": 1024,
167
+ "decoder_attention_heads": 16,
168
+ "decoder_ffn_dim": 4096,
169
+ "decoder_layerdrop": 0.0,
170
+ "decoder_layers": 12,
171
+ "decoder_start_token_id": 2,
172
+ "dropout": 0.1,
173
+ "early_stopping": true,
174
+ "encoder_attention_heads": 16,
175
+ "encoder_ffn_dim": 4096,
176
+ "encoder_layerdrop": 0.0,
177
+ "encoder_layers": 12,
178
+ "eos_token_id": 2,
179
+ "forced_eos_token_id": 2,
180
+ "forced_bos_token_id": 0,
181
+ "gradient_checkpointing": false,
182
+ "init_std": 0.02,
183
+ "is_encoder_decoder": true,
184
+ "label2id": {
185
+ "LABEL_0": 0,
186
+ "LABEL_1": 1,
187
+ "LABEL_2": 2
188
+ },
189
+ "max_position_embeddings": 4096,
190
+ "no_repeat_ngram_size": 3,
191
+ "normalize_before": false,
192
+ "num_hidden_layers": 12,
193
+ "pad_token_id": 1,
194
+ "scale_embedding": false,
195
+ "num_beams": 3
196
+ },
197
+ "vision_config": {
198
+ "model_type": "davit",
199
+ "drop_path_rate": 0.1,
200
+ "patch_size": [
201
+ 7,
202
+ 3,
203
+ 3,
204
+ 3
205
+ ],
206
+ "patch_stride": [
207
+ 4,
208
+ 2,
209
+ 2,
210
+ 2
211
+ ],
212
+ "patch_padding": [
213
+ 3,
214
+ 1,
215
+ 1,
216
+ 1
217
+ ],
218
+ "patch_prenorm": [
219
+ false,
220
+ true,
221
+ true,
222
+ true
223
+ ],
224
+ "enable_checkpoint": false,
225
+ "dim_embed": [
226
+ 256,
227
+ 512,
228
+ 1024,
229
+ 2048
230
+ ],
231
+ "num_heads": [
232
+ 8,
233
+ 16,
234
+ 32,
235
+ 64
236
+ ],
237
+ "num_groups": [
238
+ 8,
239
+ 16,
240
+ 32,
241
+ 64
242
+ ],
243
+ "depths": [
244
+ 1,
245
+ 1,
246
+ 9,
247
+ 1
248
+ ],
249
+ "window_size": 12,
250
+ "projection_dim": 1024,
251
+ "visual_temporal_embedding": {
252
+ "type": "COSINE",
253
+ "max_temporal_embeddings": 100
254
+ },
255
+ "image_pos_embed": {
256
+ "type": "learned_abs_2d",
257
+ "max_pos_embeddings": 50
258
+ },
259
+ "image_feature_source": [
260
+ "spatial_avg_pool",
261
+ "temporal_avg_pool"
262
+ ]
263
+ },
264
+ "vocab_size": 51289,
265
+ "torch_dtype": "float32",
266
+ "is_encoder_decoder": true
267
+ },
268
+ "tokenizer_name": "facebook/bart-large",
269
+ "tokenizer_max_length": 1024,
270
+ "tokenizer_padding_side": "right",
271
+ "pad_language_to": "max_length",
272
+ "hidden_size": 1024,
273
+ "depth": 24,
274
+ "num_heads": 16,
275
+ "mlp_ratio": 4.0,
276
+ "num_domains": 30,
277
+ "len_soft_prompts": 32,
278
+ "dim_time": 32,
279
+ "max_len_seq": 512,
280
+ "use_hetero_proj": false,
281
+ "action_mode": "so101_single",
282
+ "num_denoising_steps": 10,
283
+ "use_proprio": true,
284
+ "max_state_dim": 20,
285
+ "max_action_dim": 20,
286
+ "domain_feature_key": null,
287
+ "resize_imgs_with_padding": [
288
+ 224,
289
+ 224
290
+ ],
291
+ "num_image_views": 5,
292
+ "empty_cameras": 1,
293
+ "freeze_vision_encoder": true,
294
+ "freeze_language_encoder": true,
295
+ "train_policy_transformer": true,
296
+ "train_soft_prompts": true,
297
+ "optimizer_lr": 0.0001,
298
+ "optimizer_betas": [
299
+ 0.9,
300
+ 0.95
301
+ ],
302
+ "optimizer_eps": 1e-08,
303
+ "optimizer_weight_decay": 0.0001,
304
+ "optimizer_grad_clip_norm": 10.0,
305
+ "optimizer_soft_prompt_lr_scale": 1.0,
306
+ "optimizer_soft_prompt_warmup_lr_scale": null,
307
+ "scheduler_warmup_steps": 200,
308
+ "scheduler_decay_steps": 30000,
309
+ "scheduler_decay_lr": 2.5e-06
310
+ },
311
+ "reward_model": null,
312
+ "output_dir": "/home/david/work/isaaclab-experience/LeIsaac/outputs/xvla-leisaac-pick-orange.weakaug",
313
+ "job_name": "xvla",
314
+ "resume": true,
315
+ "seed": 1000,
316
+ "cudnn_deterministic": false,
317
+ "num_workers": 4,
318
+ "batch_size": 8,
319
+ "prefetch_factor": 4,
320
+ "persistent_workers": true,
321
+ "steps": 15000,
322
+ "eval_freq": 20000,
323
+ "log_freq": 200,
324
+ "tolerance_s": 0.0001,
325
+ "save_checkpoint": true,
326
+ "save_freq": 500,
327
+ "use_policy_training_preset": true,
328
+ "optimizer": {
329
+ "type": "xvla-adamw",
330
+ "lr": 0.0001,
331
+ "weight_decay": 0.0001,
332
+ "grad_clip_norm": 10.0,
333
+ "betas": [
334
+ 0.9,
335
+ 0.95
336
+ ],
337
+ "eps": 1e-08,
338
+ "soft_prompt_lr_scale": 1.0,
339
+ "soft_prompt_warmup_lr_scale": null
340
+ },
341
+ "scheduler": {
342
+ "type": "cosine_decay_with_warmup",
343
+ "num_warmup_steps": 200,
344
+ "num_decay_steps": 30000,
345
+ "peak_lr": 0.0001,
346
+ "decay_lr": 2.5e-06
347
+ },
348
+ "eval": {
349
+ "n_episodes": 50,
350
+ "batch_size": 22,
351
+ "use_async_envs": true
352
+ },
353
+ "wandb": {
354
+ "enable": false,
355
+ "disable_artifact": false,
356
+ "project": "lerobot",
357
+ "entity": null,
358
+ "notes": null,
359
+ "run_id": null,
360
+ "mode": null,
361
+ "add_tags": true
362
+ },
363
+ "peft": null,
364
+ "sample_weighting": null,
365
+ "rename_map": {
366
+ "observation.images.front": "observation.images.image",
367
+ "observation.images.wrist": "observation.images.image2"
368
+ },
369
+ "checkpoint_path": "/home/david/work/isaaclab-experience/LeIsaac/outputs/xvla-leisaac-pick-orange.weakaug/checkpoints/last"
370
+ }
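The scheduler block above combines 200 warmup steps with a 30000-step cosine decay, while training stops at 15000 steps. The sketch below evaluates a textbook linear-warmup plus cosine-decay schedule with those numbers; it is an approximation under stated assumptions, since lerobot's exact warmup formula and any rescaling of the schedule to the training length may differ. `lr_at` is a hypothetical helper.

```python
import math


def lr_at(step: int, peak_lr: float = 1e-4, decay_lr: float = 2.5e-6,
          warmup: int = 200, decay_steps: int = 30000) -> float:
    """Linear warmup to peak_lr, then cosine decay toward decay_lr."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = min(step, decay_steps) / decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return decay_lr + (peak_lr - decay_lr) * cosine


lr_10k = lr_at(10000)
lr_15k = lr_at(15000)
print(f"lr @10k = {lr_10k:.3e}, lr @15k = {lr_15k:.3e}")
```

Under this reading, the run ends halfway through the cosine curve, so the learning rate at the final checkpoint is still well above `decay_lr`.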
x-vla-pick-orange.jpg ADDED

Git LFS Details

  • SHA256: 0fe6efadce64d066800b8b1fa1ed408aee82626046f6211f9e0d7b30100a45f0
  • Pointer size: 131 Bytes
  • Size of remote file: 155 kB