Add files using upload-large-folder tool

Files changed (10)

.gitattributes +1 -0
README.md +182 -0
config.json +230 -0
model.safetensors +3 -0
policy_postprocessor.json +32 -0
policy_postprocessor_step_0_unnormalizer_processor.safetensors +3 -0
policy_preprocessor.json +113 -0
policy_preprocessor_step_7_normalizer_processor.safetensors +3 -0
train_config.json +370 -0
x-vla-pick-orange.jpg +3 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+x-vla-pick-orange.jpg filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,182 @@

+---
+license: apache-2.0
+library_name: lerobot
+pipeline_tag: robotics
+tags:
+  - xvla
+  - x-vla
+  - lerobot
+  - so101
+  - leisaac
+  - pick-orange
+  - isaac-sim
+  - rectified-flow
+  - florence2
+datasets:
+  - LightwheelAI/leisaac-pick-orange
+language:
+  - en
+base_model: lerobot/xvla-base
+---
+# X-VLA-PickOrange
+An [X-VLA](https://arxiv.org/abs/2510.10274) policy (Florence2 + Soft-Prompted Transformer + Rectified-Flow action head, 0.9B params), fine-tuned from [X-VLA-base](https://huggingface.co/lerobot/xvla-base) on the [LeIsaac SO-101 PickOrange](https://github.com/LightwheelAI/leisaac) task.
+![X-VLA-PickOrange — SO-101 in Isaac Sim](x-vla-pick-orange.jpg)
+**🔗 Project repos**:
+- [vitorcen/isaaclab-experience](https://github.com/vitorcen/isaaclab-experience): Isaac Lab + LeIsaac multi-policy comparison (parent project)
+- [vitorcen/LeIsaac-Training](https://github.com/vitorcen/LeIsaac-Training): LeIsaac fork (training scripts + design docs)
+## TL;DR
+- **Task**: `Pick up the orange and put it in the plate`. A single-arm SO-101 picks up 3 oranges one after another and places each in the plate.
+- **Dataset**: [`LightwheelAI/leisaac-pick-orange`](https://huggingface.co/datasets/LightwheelAI/leisaac-pick-orange), 60 teleoperated demonstration episodes (50 train / 10 val split).
+- **Architecture**: X-VLA, i.e. Florence2 vision-language encoder + Soft-Prompted Transformer + Rectified-Flow action head (10 denoising steps), with chunk_size=32 and n_obs_steps=2.
+- **Training**: batch=8 / lr=1e-4 / **15k steps** (10k from base + 5k resumed) / **weak image-aug (brightness ±5% only)** / GRIPPER_SCALE=5 / ~26 min on an RTX 4090.
+- **Eval**: Isaac Sim 5.1 + LeIsaac, 6 rounds × 60s. **7/18 oranges placed + 1/6 formal env successes**; the best single episode placed **all 3/3 in 31s** (32% of the time budget).
+- **⚠️ Critical inference setting**: `n_action_steps=32` (reuse the whole chunk, i.e. chunk_size).
+  With the default `n_action_steps=8`, this checkpoint scores **0/18 over 6 rounds, a catastrophic failure** (per-step re-planning produces plans that conflict with each other). See the "Critical inference caveat" section below.
+## Highlights
+- **First single-episode formal Isaac Sim env success of the session, finished in 31s**: ep1 = `[True, True, True] @ 31.0s`. The env itself reported `done=True`; none of the other baseline / retrain configurations in our 90+ experiment hyperparameter sweep ever reached env-side confirmation of task completion. ep1 placed all 3 oranges in 31.0s, well under the 90s wall cap.
+- **Exposes the key role of `n_action_steps`**: switching from the default 8 to 32 (= chunk_size, full chunk reuse) was the only reliable improvement in the session, roughly 3.5× over baseline.
+- **Weak image-aug was the only retrain with a positive aggregate effect**: out of 6 retrain experiments (velocity-reweight, L1 loss, default image-aug, weak image-aug, body-desc, L1+aug compound), only weak image-aug was net positive. lerobot's default ColorJitter + SharpnessJitter + RandomAffine over-regularizes on the 50-demo dataset (13% per-ep, -11.1 points vs baseline); keeping only brightness ±5% (max_num_transforms=1) beat the baseline by 5.6 points and, with 15k steps, gave the result above.
+## Training recipe
+```bash
+# 1. Stage 1: 10k steps starting from lerobot/xvla-base
+WEAK_IMAGE_AUG=1 \
+BATCH_SIZE=8 \
+MAX_STEPS=10000 \
+SAVE_FREQ=500 \
+OUTPUT_DIR=$LEISAAC/outputs/xvla-leisaac-pick-orange.weakaug \
+bash LeIsaac/scripts/finetune/xvla/train.sh
+# 2. Stage 2: resume from 10k to 15k (essential: no formal success had appeared yet at 10k)
+WEAK_IMAGE_AUG=1 \
+BATCH_SIZE=8 \
+MAX_STEPS=15000 \
+SAVE_FREQ=500 \
+RESUME=true \
+OUTPUT_DIR=$LEISAAC/outputs/xvla-leisaac-pick-orange.weakaug \
+bash LeIsaac/scripts/finetune/xvla/train.sh
+```
+`WEAK_IMAGE_AUG=1` expands inside [`train.sh`](https://github.com/vitorcen/LeIsaac-Training/blob/main/scripts/finetune/xvla/train.sh) to:
+```
+--dataset.image_transforms.enable=true
+--dataset.image_transforms.max_num_transforms=1
+--dataset.image_transforms.tfs={"brightness":{"weight":1.0,"type":"ColorJitter","kwargs":{"brightness":[0.95,1.05]}}}
+```
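+That is, at most 1 transform is sampled per batch, and only brightness ±5% is allowed (contrast / saturation / hue / SharpnessJitter / RandomAffine are disabled).
+For the detailed comparison, see the "Full retrain experiment summary" table below.
+## Inference
+### End-to-end server (compatible with the Isaac Sim ZMQ client)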
+```bash
+# Start the X-VLA inference server (ZMQ REQ/REP + msgpack)
+N_ACTION_STEPS=32 \
+PROMPT="Pick up the orange and put it in the plate" \
+CKPT=<this_repo_dir> \
+PORT=5558 \
+bash server/serve_xvla.sh --detach
+# Run the PickOrange eval from the Isaac Sim client
+POLICY_PORT=5558 \
+POLICY_TIMEOUT_MS=3000 \
+ACTION_HORIZON=1 \
+EVAL_ROUNDS=6 \
+EPISODE_LENGTH=60 \
+PROMPT="Pick up the orange and put it in the plate" \
+MAX_ROUND_WALL_S=90 \
+bash server/eval_pi05.sh
+```
+[Server implementation](https://github.com/vitorcen/isaaclab-experience/blob/main/server/xvla_leisaac/server.py). The eval script is shared with π0.5/SmolVLA/ACT/GR00T.
+### 🔴 Critical inference caveat
+| `n_action_steps` | Oranges placed (6 rounds) | per-ep | Notes |
+|---|---|---|---|
+| 8 (lerobot default) | **0/18** ❌ | 0% | Re-plans every step (chunk[0]→chunk[0]→...); successive plans conflict |
+| 16 | 4/18 | 22% | Partial chunk reuse |
+| **32 (= chunk_size)** | **7/18 + 1 success** ⭐ | **39%** | Full chunk reuse; each chunk is self-consistent |
+
+**X-VLA's RF action head generates a whole 32-step chunk in one pass, and the chunk must be played out in full in the env** for its planning to pay off. Re-planning at every step instead throws the chunk sequence out of alignment.
+## Evaluation
+6 rounds × 60s on Isaac Sim PickOrange (`LeIsaac-SO101-PickOrange-v0`), `max_round_wall_s=90`:
+
+| Episode | Oranges placed | Wall time | Notes |
+|---|---|---|---|
+| 1 | **3/3** ✅ | **31.0s** | **formal env success** ⭐⭐⭐ |
+| 2 | 2/3 | 90.0s | wall_cap |
+| 3 | 0/3 | 90.1s | wall_cap |
+| 4 | 2/3 | 90.0s | wall_cap |
+| 5 | 0/3 | 90.0s | wall_cap |
+| 6 | 0/3 | 90.0s | wall_cap |
+| **Total** | **7/18 (39%)** | — | **1/6 success rate** |
+### Full retrain experiment summary
+| Retrain config (5 ckpts × 6 rounds = 90 ep) | per-ep aggregate | vs baseline (points) |
+|---|---|---|
+| 🥇 **Weak image-aug (brightness ±5%)** | **30.0%** | **+5.6** ⭐ |
+| L1 loss (OFT-lite, [Fine-Tuning VLA 2502.19645](https://arxiv.org/abs/2502.19645)) | 27.8% | +3.4 |
+| Baseline (no retrain) | 24.4% | n/a |
+| L1 + weak aug compound | 15.6% | -8.8 (negative interference) |
+| Default image-aug (lerobot default strength) | 13.3% | -11.1 |
+| Velocity-reweight β=2.0 ([AttenA+ 2605.13548](https://arxiv.org/abs/2605.13548)) | ~11% | -13 |
+
+For details, see the parent project's HTML design doc [`vla_improvement_methods_checklist.html`](https://github.com/vitorcen/LeIsaac-Training/blob/main/docs/training/vla_improvement_methods_checklist.html) (includes the CSVs for the 90+ hyperparameter sweeps).
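+## Negative findings (do not retry)
+Rigorously falsified across the 90+ experiments (≥36 ep cumulative):
+- ❌ **TAE (Temporal Action Ensembling, [ALOHA 2304.13705](https://arxiv.org/abs/2304.13705))**: K∈{2,4,8} × m∈{0.1,0.3} all scored ≤1/9. X-VLA's RF head with 10-step denoising is already smooth on its own.
+- ❌ **EMA action smoothing, α∈[0.2, 0.7]**: the 5/9 at α=0.3 over 3 rounds was a single-episode outlier; the 12-round retest gave 2/18, so it is actually harmful.
+- ❌ **"Grasp" verb in the prompt**: 0/18, completely dead. Possibly "grasp" in the OXE datasets is associated with hand pose rather than with a robot reach trajectory.
+- ❌ **"all <plural>" prompts**: 3/18; they trigger multi-target ambiguity.
+- ❌ **Short prompts without the "Pick up" preamble**: 1/18; the model cannot ground them.
+- ❌ **"on/onto the plate" prepositions**: ≤2/18, far worse than "in the plate" (container semantics).
+- ❌ **Body-desc retrain (Path 2)**: with Florence2 frozen, a longer prompt is only a small token perturbation and does not change the action conditioning.
+- ❌ **Offline action-MSE eval**: does not predict closed-loop performance (falsified repeatedly). **Only real Isaac Sim runs count.**
+- ❌ **3-round closed-loop eval**: variance is ±15-30%. **Every decision needs ≥6 rounds (≥18 ep), and every comparison needs ≥12 rounds (≥36 ep).**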
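+## Limitations
+- **Small sample size**: the ~39% per-ep figure is a 6-round (18 ep) estimate with a confidence interval of about ±15%. The 15k checkpoint's `1/6 formal success + 3/3@31s` is a single-episode breakthrough; no 12-round retest has been run to confirm its stability. Replicate to see whether it holds.
+- **The dataset has only 50 demos**: retrains that change the loss / aug are generally too aggressive; expanding to 80-100 demos should break through the current 39% ceiling.
+- **The place subtask is multimodal**: the model occasionally hovers and jitters in mid-air after grasping (frequent at the intermediate 13k/14k checkpoints). DAgger or synthetic relabeling may be needed to fix the covariate shift.
+- **chunk_size=32 vs wall clock**: 1 chunk = 32 steps × 33ms ≈ 1s planning period. This is more reactive than ACT (chunk=100, 3.3s period) but slower than DP DDIM-32 (200ms period).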
+## Citations
+- **X-VLA**: Zhao et al., [_X-VLA: Soft-Prompted Transformer as Scalable Cross-Embodiment Vision-Language-Action Model_](https://arxiv.org/abs/2510.10274), 2025.
+- **OFT recipe (L1 loss baseline)**: [_Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success_](https://arxiv.org/abs/2502.19645), 2025.
+- **LeIsaac SO-101 PickOrange**: [LightwheelAI/leisaac](https://github.com/LightwheelAI/leisaac).
+- **lerobot**: [HuggingFace lerobot](https://github.com/huggingface/lerobot).
+## License
+Apache-2.0, consistent with lerobot / X-VLA-base.
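
The totals in the README's evaluation table, and the roughly ±15% uncertainty quoted under its limitations, can be rechecked with a few lines of arithmetic. The sketch below is an illustration rather than the authors' analysis. It assumes the 18 orange placements behave like independent Bernoulli trials (they are not strictly independent, since each episode contains three of them), and the helper name `binomial_summary` and the choice of z = 1.96 are mine, not the README's.

```python
import math

# Oranges placed per episode, copied from the README's evaluation table.
placed_per_episode = [3, 2, 0, 2, 0, 0]
ORANGES_PER_EPISODE = 3


def binomial_summary(successes: int, trials: int, z: float = 1.96):
    """Return (rate, standard error, normal-approximation half-width)."""
    rate = successes / trials
    std_err = math.sqrt(rate * (1.0 - rate) / trials)
    return rate, std_err, z * std_err


total_placed = sum(placed_per_episode)
total_oranges = ORANGES_PER_EPISODE * len(placed_per_episode)
rate, std_err, half_width = binomial_summary(total_placed, total_oranges)

print(f"placed {total_placed}/{total_oranges} = {rate:.1%}")
print(f"standard error = {std_err:.1%}, z=1.96 half-width = {half_width:.1%}")
```

Under these assumptions a single standard error is already above 11 points, which is in line with the README's advice to use at least 12 rounds before comparing configurations.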

config.json ADDED

	@@ -0,0 +1,230 @@

+{
+    "type": "xvla",
+    "n_obs_steps": 2,
+    "input_features": {
+        "observation.images.image": {
+            "type": "VISUAL",
+            "shape": [
+                3,
+                256,
+                256
+            ]
+        },
+        "observation.images.image2": {
+            "type": "VISUAL",
+            "shape": [
+                3,
+                256,
+                256
+            ]
+        },
+        "observation.state": {
+            "type": "STATE",
+            "shape": [
+                8
+            ]
+        },
+        "observation.images.image3": {
+            "type": "VISUAL",
+            "shape": [
+                3,
+                224,
+                224
+            ]
+        },
+        "observation.images.empty_camera_0": {
+            "type": "VISUAL",
+            "shape": [
+                3,
+                224,
+                224
+            ]
+        }
+    },
+    "output_features": {
+        "action": {
+            "type": "ACTION",
+            "shape": [
+                6
+            ]
+        }
+    },
+    "device": "cuda",
+    "use_amp": false,
+    "use_peft": false,
+    "push_to_hub": false,
+    "repo_id": null,
+    "private": null,
+    "tags": null,
+    "license": null,
+    "pretrained_path": "/home/david/work/isaaclab-experience/LeIsaac/outputs/xvla-leisaac-pick-orange.weakaug/checkpoints/last/pretrained_model",
+    "chunk_size": 32,
+    "n_action_steps": 8,
+    "dtype": "bfloat16",
+    "normalization_mapping": {
+        "STATE": "MEAN_STD",
+        "ACTION": "MIN_MAX",
+        "VISUAL": "IDENTITY"
+    },
+    "florence_config": {
+        "model_type": "florence2",
+        "bos_token_id": 0,
+        "eos_token_id": 2,
+        "ignore_index": -100,
+        "pad_token_id": 1,
+        "projection_dim": 1024,
+        "text_config": {
+            "vocab_size": 51289,
+            "activation_dropout": 0.1,
+            "activation_function": "gelu",
+            "add_bias_logits": false,
+            "add_final_layer_norm": false,
+            "attention_dropout": 0.1,
+            "bos_token_id": 0,
+            "classif_dropout": 0.1,
+            "classifier_dropout": 0.0,
+            "d_model": 1024,
+            "decoder_attention_heads": 16,
+            "decoder_ffn_dim": 4096,
+            "decoder_layerdrop": 0.0,
+            "decoder_layers": 12,
+            "decoder_start_token_id": 2,
+            "dropout": 0.1,
+            "early_stopping": true,
+            "encoder_attention_heads": 16,
+            "encoder_ffn_dim": 4096,
+            "encoder_layerdrop": 0.0,
+            "encoder_layers": 12,
+            "eos_token_id": 2,
+            "forced_eos_token_id": 2,
+            "forced_bos_token_id": 0,
+            "gradient_checkpointing": false,
+            "init_std": 0.02,
+            "is_encoder_decoder": true,
+            "label2id": {
+                "LABEL_0": 0,
+                "LABEL_1": 1,
+                "LABEL_2": 2
+            },
+            "max_position_embeddings": 4096,
+            "no_repeat_ngram_size": 3,
+            "normalize_before": false,
+            "num_hidden_layers": 12,
+            "pad_token_id": 1,
+            "scale_embedding": false,
+            "num_beams": 3
+        },
+        "vision_config": {
+            "model_type": "davit",
+            "drop_path_rate": 0.1,
+            "patch_size": [
+                7,
+                3,
+                3,
+                3
+            ],
+            "patch_stride": [
+                4,
+                2,
+                2,
+                2
+            ],
+            "patch_padding": [
+                3,
+                1,
+                1,
+                1
+            ],
+            "patch_prenorm": [
+                false,
+                true,
+                true,
+                true
+            ],
+            "enable_checkpoint": false,
+            "dim_embed": [
+                256,
+                512,
+                1024,
+                2048
+            ],
+            "num_heads": [
+                8,
+                16,
+                32,
+                64
+            ],
+            "num_groups": [
+                8,
+                16,
+                32,
+                64
+            ],
+            "depths": [
+                1,
+                1,
+                9,
+                1
+            ],
+            "window_size": 12,
+            "projection_dim": 1024,
+            "visual_temporal_embedding": {
+                "type": "COSINE",
+                "max_temporal_embeddings": 100
+            },
+            "image_pos_embed": {
+                "type": "learned_abs_2d",
+                "max_pos_embeddings": 50
+            },
+            "image_feature_source": [
+                "spatial_avg_pool",
+                "temporal_avg_pool"
+            ]
+        },
+        "vocab_size": 51289,
+        "torch_dtype": "float32",
+        "is_encoder_decoder": true
+    },
+    "tokenizer_name": "facebook/bart-large",
+    "tokenizer_max_length": 1024,
+    "tokenizer_padding_side": "right",
+    "pad_language_to": "max_length",
+    "hidden_size": 1024,
+    "depth": 24,
+    "num_heads": 16,
+    "mlp_ratio": 4.0,
+    "num_domains": 30,
+    "len_soft_prompts": 32,
+    "dim_time": 32,
+    "max_len_seq": 512,
+    "use_hetero_proj": false,
+    "action_mode": "so101_single",
+    "num_denoising_steps": 10,
+    "use_proprio": true,
+    "max_state_dim": 20,
+    "max_action_dim": 20,
+    "domain_feature_key": null,
+    "resize_imgs_with_padding": [
+        224,
+        224
+    ],
+    "num_image_views": 5,
+    "empty_cameras": 1,
+    "freeze_vision_encoder": true,
+    "freeze_language_encoder": true,
+    "train_policy_transformer": true,
+    "train_soft_prompts": true,
+    "optimizer_lr": 0.0001,
+    "optimizer_betas": [
+        0.9,
+        0.95
+    ],
+    "optimizer_eps": 1e-08,
+    "optimizer_weight_decay": 0.0001,
+    "optimizer_grad_clip_norm": 10.0,
+    "optimizer_soft_prompt_lr_scale": 1.0,
+    "optimizer_soft_prompt_warmup_lr_scale": null,
+    "scheduler_warmup_steps": 200,
+    "scheduler_decay_steps": 30000,
+    "scheduler_decay_lr": 2.5e-06
+}
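
Note that this `config.json` still carries `"n_action_steps": 8`, while the README reports that the checkpoint only performs well with `n_action_steps=32`. The README's server command sets the value through the `N_ACTION_STEPS` environment variable; whether other loaders override the file value is not shown in this diff. As a minimal sketch, assuming you load the config file directly, a local copy could be aligned with `chunk_size` as follows. The helper name `align_action_steps` and the throwaway demo file are hypothetical and exist only for illustration.

```python
import json
import tempfile
from pathlib import Path


def align_action_steps(config_path: Path) -> dict:
    """Set n_action_steps equal to chunk_size in a local config.json copy."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["n_action_steps"] = config["chunk_size"]
    config_path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config


# Demo on a throwaway file holding only the two relevant keys from config.json.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "config.json"
    demo_path.write_text(json.dumps({"chunk_size": 32, "n_action_steps": 8}))
    patched = align_action_steps(demo_path)
    reloaded = json.loads(demo_path.read_text(encoding="utf-8"))
    print(patched["n_action_steps"], reloaded["n_action_steps"])
```

Reading `chunk_size` from the file, instead of hard-coding 32, keeps the two values in step if the chunk length ever changes.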

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5f83abe2909f4110de3c14a852948af9a2c3d32813af00e6cb28c91e0ed9c6d4
+size 1759596986

policy_postprocessor.json ADDED

	@@ -0,0 +1,32 @@

+{
+  "name": "policy_postprocessor",
+  "steps": [
+    {
+      "registry_name": "unnormalizer_processor",
+      "config": {
+        "eps": 1e-08,
+        "features": {
+          "action": {
+            "type": "ACTION",
+            "shape": [
+              6
+            ]
+          }
+        },
+        "norm_map": {
+          "STATE": "MEAN_STD",
+          "ACTION": "MIN_MAX",
+          "VISUAL": "IDENTITY"
+        }
+      },
+      "state_file": "policy_postprocessor_step_0_unnormalizer_processor.safetensors"
+    },
+    {
+      "registry_name": "device_processor",
+      "config": {
+        "device": "cpu",
+        "float_dtype": null
+      }
+    }
+  ]
+}
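
The postprocessor above maps the policy's normalized 6-dim action back to robot units with a `MIN_MAX` unnormalizer. As a rough sketch of what such a step computes, assuming the common convention that `MIN_MAX` normalization maps each dimension to [-1, 1] (the exact lerobot formula, including how `eps` is applied, is not shown in this diff), the inverse looks like this. The function name and the min/max values are hypothetical; the real statistics live in the referenced `.safetensors` state file.

```python
def unnormalize_min_max(normalized, mins, maxs):
    """Map values from [-1, 1] back to [min, max], one entry per action dim."""
    return [
        (value + 1.0) / 2.0 * (hi - lo) + lo
        for value, lo, hi in zip(normalized, mins, maxs)
    ]


# Hypothetical per-dimension statistics for a 6-dim action (not the real ones).
action_min = [-100.0, -100.0, -100.0, -100.0, -100.0, 0.0]
action_max = [100.0, 100.0, 100.0, 100.0, 100.0, 100.0]

normalized_action = [-1.0, 0.0, 1.0, 0.5, -0.5, 0.0]
restored = unnormalize_min_max(normalized_action, action_min, action_max)
print(restored)
```

With these made-up ranges, -1 maps to the minimum, +1 to the maximum, and 0 to the midpoint of each dimension.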

policy_postprocessor_step_0_unnormalizer_processor.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5ac4af145fa293fb9282322bee7c87eb369ba8aca3e09dbf1db7600f46142fd5
+size 7552

policy_preprocessor.json ADDED

	@@ -0,0 +1,113 @@

+{
+  "name": "policy_preprocessor",
+  "steps": [
+    {
+      "registry_name": "rename_observations_processor",
+      "config": {
+        "rename_map": {
+          "observation.images.front": "observation.images.image",
+          "observation.images.wrist": "observation.images.image2"
+        }
+      }
+    },
+    {
+      "registry_name": "to_batch_processor",
+      "config": {}
+    },
+    {
+      "registry_name": "tokenizer_processor",
+      "config": {
+        "max_length": 50,
+        "task_key": "task",
+        "padding_side": "right",
+        "padding": "max_length",
+        "truncation": true,
+        "tokenizer_name": "facebook/bart-large"
+      }
+    },
+    {
+      "registry_name": "xvla_add_domain_id",
+      "config": {
+        "domain_id": 0
+      }
+    },
+    {
+      "registry_name": "xvla_image_to_float",
+      "config": {
+        "image_keys": null,
+        "validate_range": true
+      }
+    },
+    {
+      "registry_name": "xvla_imagenet_normalize",
+      "config": {
+        "image_keys": null
+      }
+    },
+    {
+      "registry_name": "device_processor",
+      "config": {
+        "device": "cuda",
+        "float_dtype": null
+      }
+    },
+    {
+      "registry_name": "normalizer_processor",
+      "config": {
+        "eps": 1e-08,
+        "features": {
+          "observation.images.image": {
+            "type": "VISUAL",
+            "shape": [
+              3,
+              256,
+              256
+            ]
+          },
+          "observation.images.image2": {
+            "type": "VISUAL",
+            "shape": [
+              3,
+              256,
+              256
+            ]
+          },
+          "observation.state": {
+            "type": "STATE",
+            "shape": [
+              8
+            ]
+          },
+          "observation.images.image3": {
+            "type": "VISUAL",
+            "shape": [
+              3,
+              224,
+              224
+            ]
+          },
+          "observation.images.empty_camera_0": {
+            "type": "VISUAL",
+            "shape": [
+              3,
+              224,
+              224
+            ]
+          },
+          "action": {
+            "type": "ACTION",
+            "shape": [
+              6
+            ]
+          }
+        },
+        "norm_map": {
+          "STATE": "MEAN_STD",
+          "ACTION": "MIN_MAX",
+          "VISUAL": "IDENTITY"
+        }
+      },
+      "state_file": "policy_preprocessor_step_7_normalizer_processor.safetensors"
+    }
+  ]
+}

policy_preprocessor_step_7_normalizer_processor.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5ac4af145fa293fb9282322bee7c87eb369ba8aca3e09dbf1db7600f46142fd5
+size 7552

train_config.json ADDED

	@@ -0,0 +1,370 @@

+{
+    "dataset": {
+        "repo_id": "leisaac/pick-orange",
+        "root": "/home/david/work/isaaclab-experience/LeIsaac/datasets/raw/leisaac-pick-orange",
+        "episodes": [
+            0,
+            1,
+            2,
+            3,
+            4,
+            5,
+            6,
+            7,
+            8,
+            9,
+            10,
+            11,
+            12,
+            13,
+            14,
+            15,
+            16,
+            17,
+            18,
+            19,
+            20,
+            21,
+            22,
+            23,
+            24,
+            25,
+            26,
+            27,
+            28,
+            29,
+            30,
+            31,
+            32,
+            33,
+            34,
+            35,
+            36,
+            37,
+            38,
+            39,
+            40,
+            41,
+            42,
+            43,
+            44,
+            45,
+            46,
+            47,
+            48,
+            49
+        ],
+        "image_transforms": {
+            "enable": true,
+            "max_num_transforms": 1,
+            "random_order": false,
+            "tfs": {
+                "brightness": {
+                    "weight": 1.0,
+                    "type": "ColorJitter",
+                    "kwargs": {
+                        "brightness": [
+                            0.95,
+                            1.05
+                        ]
+                    }
+                }
+            }
+        },
+        "revision": null,
+        "use_imagenet_stats": true,
+        "video_backend": "torchcodec",
+        "return_uint8": false,
+        "streaming": false
+    },
+    "env": null,
+    "policy": {
+        "type": "xvla",
+        "n_obs_steps": 2,
+        "input_features": {
+            "observation.images.image": {
+                "type": "VISUAL",
+                "shape": [
+                    3,
+                    256,
+                    256
+                ]
+            },
+            "observation.images.image2": {
+                "type": "VISUAL",
+                "shape": [
+                    3,
+                    256,
+                    256
+                ]
+            },
+            "observation.state": {
+                "type": "STATE",
+                "shape": [
+                    8
+                ]
+            },
+            "observation.images.image3": {
+                "type": "VISUAL",
+                "shape": [
+                    3,
+                    224,
+                    224
+                ]
+            },
+            "observation.images.empty_camera_0": {
+                "type": "VISUAL",
+                "shape": [
+                    3,
+                    224,
+                    224
+                ]
+            }
+        },
+        "output_features": {
+            "action": {
+                "type": "ACTION",
+                "shape": [
+                    6
+                ]
+            }
+        },
+        "device": "cuda",
+        "use_amp": false,
+        "use_peft": false,
+        "push_to_hub": false,
+        "repo_id": null,
+        "private": null,
+        "tags": null,
+        "license": null,
+        "pretrained_path": "/home/david/work/isaaclab-experience/LeIsaac/outputs/xvla-leisaac-pick-orange.weakaug/checkpoints/last/pretrained_model",
+        "chunk_size": 32,
+        "n_action_steps": 8,
+        "dtype": "bfloat16",
+        "normalization_mapping": {
+            "STATE": "MEAN_STD",
+            "ACTION": "MIN_MAX",
+            "VISUAL": "IDENTITY"
+        },
+        "florence_config": {
+            "model_type": "florence2",
+            "bos_token_id": 0,
+            "eos_token_id": 2,
+            "ignore_index": -100,
+            "pad_token_id": 1,
+            "projection_dim": 1024,
+            "text_config": {
+                "vocab_size": 51289,
+                "activation_dropout": 0.1,
+                "activation_function": "gelu",
+                "add_bias_logits": false,
+                "add_final_layer_norm": false,
+                "attention_dropout": 0.1,
+                "bos_token_id": 0,
+                "classif_dropout": 0.1,
+                "classifier_dropout": 0.0,
+                "d_model": 1024,
+                "decoder_attention_heads": 16,
+                "decoder_ffn_dim": 4096,
+                "decoder_layerdrop": 0.0,
+                "decoder_layers": 12,
+                "decoder_start_token_id": 2,
+                "dropout": 0.1,
+                "early_stopping": true,
+                "encoder_attention_heads": 16,
+                "encoder_ffn_dim": 4096,
+                "encoder_layerdrop": 0.0,
+                "encoder_layers": 12,
+                "eos_token_id": 2,
+                "forced_eos_token_id": 2,
+                "forced_bos_token_id": 0,
+                "gradient_checkpointing": false,
+                "init_std": 0.02,
+                "is_encoder_decoder": true,
+                "label2id": {
+                    "LABEL_0": 0,
+                    "LABEL_1": 1,
+                    "LABEL_2": 2
+                },
+                "max_position_embeddings": 4096,
+                "no_repeat_ngram_size": 3,
+                "normalize_before": false,
+                "num_hidden_layers": 12,
+                "pad_token_id": 1,
+                "scale_embedding": false,
+                "num_beams": 3
+            },
+            "vision_config": {
+                "model_type": "davit",
+                "drop_path_rate": 0.1,
+                "patch_size": [
+                    7,
+                    3,
+                    3,
+                    3
+                ],
+                "patch_stride": [
+                    4,
+                    2,
+                    2,
+                    2
+                ],
+                "patch_padding": [
+                    3,
+                    1,
+                    1,
+                    1
+                ],
+                "patch_prenorm": [
+                    false,
+                    true,
+                    true,
+                    true
+                ],
+                "enable_checkpoint": false,
+                "dim_embed": [
+                    256,
+                    512,
+                    1024,
+                    2048
+                ],
+                "num_heads": [
+                    8,
+                    16,
+                    32,
+                    64
+                ],
+                "num_groups": [
+                    8,
+                    16,
+                    32,
+                    64
+                ],
+                "depths": [
+                    1,
+                    1,
+                    9,
+                    1
+                ],
+                "window_size": 12,
+                "projection_dim": 1024,
+                "visual_temporal_embedding": {
+                    "type": "COSINE",
+                    "max_temporal_embeddings": 100
+                },
+                "image_pos_embed": {
+                    "type": "learned_abs_2d",
+                    "max_pos_embeddings": 50
+                },
+                "image_feature_source": [
+                    "spatial_avg_pool",
+                    "temporal_avg_pool"
+                ]
+            },
+            "vocab_size": 51289,
+            "torch_dtype": "float32",
+            "is_encoder_decoder": true
+        },
+        "tokenizer_name": "facebook/bart-large",
+        "tokenizer_max_length": 1024,
+        "tokenizer_padding_side": "right",
+        "pad_language_to": "max_length",
+        "hidden_size": 1024,
+        "depth": 24,
+        "num_heads": 16,
+        "mlp_ratio": 4.0,
+        "num_domains": 30,
+        "len_soft_prompts": 32,
+        "dim_time": 32,
+        "max_len_seq": 512,
+        "use_hetero_proj": false,
+        "action_mode": "so101_single",
+        "num_denoising_steps": 10,
+        "use_proprio": true,
+        "max_state_dim": 20,
+        "max_action_dim": 20,
+        "domain_feature_key": null,
+        "resize_imgs_with_padding": [
+            224,
+            224
+        ],
+        "num_image_views": 5,
+        "empty_cameras": 1,
+        "freeze_vision_encoder": true,
+        "freeze_language_encoder": true,
+        "train_policy_transformer": true,
+        "train_soft_prompts": true,
+        "optimizer_lr": 0.0001,
+        "optimizer_betas": [
+            0.9,
+            0.95
+        ],
+        "optimizer_eps": 1e-08,
+        "optimizer_weight_decay": 0.0001,
+        "optimizer_grad_clip_norm": 10.0,
+        "optimizer_soft_prompt_lr_scale": 1.0,
+        "optimizer_soft_prompt_warmup_lr_scale": null,
+        "scheduler_warmup_steps": 200,
+        "scheduler_decay_steps": 30000,
+        "scheduler_decay_lr": 2.5e-06
+    },
+    "reward_model": null,
+    "output_dir": "/home/david/work/isaaclab-experience/LeIsaac/outputs/xvla-leisaac-pick-orange.weakaug",
+    "job_name": "xvla",
+    "resume": true,
+    "seed": 1000,
+    "cudnn_deterministic": false,
+    "num_workers": 4,
+    "batch_size": 8,
+    "prefetch_factor": 4,
+    "persistent_workers": true,
+    "steps": 15000,
+    "eval_freq": 20000,
+    "log_freq": 200,
+    "tolerance_s": 0.0001,
+    "save_checkpoint": true,
+    "save_freq": 500,
+    "use_policy_training_preset": true,
+    "optimizer": {
+        "type": "xvla-adamw",
+        "lr": 0.0001,
+        "weight_decay": 0.0001,
+        "grad_clip_norm": 10.0,
+        "betas": [
+            0.9,
+            0.95
+        ],
+        "eps": 1e-08,
+        "soft_prompt_lr_scale": 1.0,
+        "soft_prompt_warmup_lr_scale": null
+    },
+    "scheduler": {
+        "type": "cosine_decay_with_warmup",
+        "num_warmup_steps": 200,
+        "num_decay_steps": 30000,
+        "peak_lr": 0.0001,
+        "decay_lr": 2.5e-06
+    },
+    "eval": {
+        "n_episodes": 50,
+        "batch_size": 22,
+        "use_async_envs": true
+    },
+    "wandb": {
+        "enable": false,
+        "disable_artifact": false,
+        "project": "lerobot",
+        "entity": null,
+        "notes": null,
+        "run_id": null,
+        "mode": null,
+        "add_tags": true
+    },
+    "peft": null,
+    "sample_weighting": null,
+    "rename_map": {
+        "observation.images.front": "observation.images.image",
+        "observation.images.wrist": "observation.images.image2"
+    },
+    "checkpoint_path": "/home/david/work/isaaclab-experience/LeIsaac/outputs/xvla-leisaac-pick-orange.weakaug/checkpoints/last"
+}
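
The `image_transforms` block in this file restricts augmentation to a single brightness jitter with range [0.95, 1.05]. As a plain-Python illustration of what that range means (this is not torchvision's `ColorJitter` implementation; the function `brightness_jitter` and the toy pixel list are hypothetical), each image is multiplied by one factor drawn uniformly from the range and then clamped to valid values:

```python
import random


def brightness_jitter(pixels, low=0.95, high=1.05, rng=random):
    """Scale every pixel by one random factor in [low, high], clamped to [0, 1]."""
    factor = rng.uniform(low, high)
    scaled = [min(1.0, max(0.0, value * factor)) for value in pixels]
    return scaled, factor


# Seed value borrowed from train_config.json, used here only for repeatability.
rng = random.Random(1000)
image = [0.0, 0.25, 0.5, 0.98, 1.0]  # toy float "image" with values in [0, 1]
jittered, factor = brightness_jitter(image, rng=rng)
print(f"factor={factor:.4f}", [round(v, 4) for v in jittered])
```

A ±5% factor changes a mid-gray pixel by at most 0.025 on a [0, 1] scale, which gives a sense of how mild this augmentation is.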

x-vla-pick-orange.jpg ADDED

Git LFS Details

SHA256: 0fe6efadce64d066800b8b1fa1ed408aee82626046f6211f9e0d7b30100a45f0
Pointer size: 131 Bytes
Size of remote file: 155 kB