Add files using upload-large-folder tool

Files changed (8)

README.md +111 -0
config.json +93 -0
model.safetensors +3 -0
policy_postprocessor.json +32 -0
policy_postprocessor_step_0_unnormalizer_processor.safetensors +3 -0
policy_preprocessor.json +79 -0
policy_preprocessor_step_5_normalizer_processor.safetensors +3 -0
train_config.json +234 -0

README.md ADDED

	@@ -0,0 +1,111 @@

+---
+license: apache-2.0
+library_name: lerobot
+pipeline_tag: robotics
+tags:
+  - smolvla
+  - lerobot
+  - so101
+  - leisaac
+  - pick-orange
+  - isaac-sim
+datasets:
+  - LightwheelAI/leisaac-pick-orange
+language:
+  - en
+base_model: lerobot/smolvla_base
+---
+# SmolVLA2-PickOrange
+A [SmolVLA](https://huggingface.co/lerobot/smolvla_base) policy fine-tuned without LoRA on the [LeIsaac SO-101 PickOrange](https://github.com/LightwheelAI/leisaac) task: 30k steps of self-run, full-parameter training starting from `lerobot/smolvla_base`.
+**🔗 Project repos**:
+- [vitorcen/isaaclab-experience](https://github.com/vitorcen/isaaclab-experience): parent project, a multi-policy comparison on Isaac Lab + LeIsaac that includes a 7-baseline benchmark
+- [vitorcen/LeIsaac-Training](https://github.com/vitorcen/LeIsaac-Training): LeIsaac fork (training scripts + design docs)
+> **Naming note**: the repo is named `SmolVLA2-PickOrange`, but `config.type=smolvla` (v1: SmolVLM2-500M-Video-Instruct backbone + Action Expert). LeRobot's `smolvla2` (with LoRA enabled) had not been merged to main at training time, so this is still v1; the "2" is a misnomer carried over from the name of the local output directory.
+## TL;DR
+- **Task**: `Pick up the orange and place it on the plate`. A single SO-101 arm picks up 3 oranges one after another and places them on the plate.
+- **Dataset**: [`LightwheelAI/leisaac-pick-orange`](https://huggingface.co/datasets/LightwheelAI/leisaac-pick-orange), 60 teleoperated demonstration episodes, 30 fps, dual camera at 480×640.
+- **Architecture**: SmolVLA v1 (450M parameters), SmolVLM2-500M-Video-Instruct backbone + Action Expert, `chunk_size=50`.
+- **Training**: full-parameter fine-tuning (no LoRA), batch=8 / lr=1e-4 / 30k steps / pyav video backend, ~14 h on an RTX 4090.
+- **Eval** (Isaac Sim 5.1, 3 rounds × 3 oranges = 9 oranges):
+  - **strict 1/3 rounds, 5/9 oranges** (partial credit via sticky `put_orange_to_plate`)
+  - See the benchmark section of `LeIsaac/README.md` in [`vitorcen/isaaclab-experience`](https://github.com/vitorcen/isaaclab-experience) for details.
+- **⚠️ Inference configuration**:
+  - `policy_action_horizon=50` (= `chunk_size`, so the full chunk is used as the receding window)
+  - On the LeRobot async server side: `--policy_checkpoint_path=wsagi/SmolVLA2-PickOrange`
+  - `step_hz=30`, matching the dataset
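+## Highlights
+- Full-parameter fine-tuning of SmolVLA on a small 60-episode dataset **learns the task partially**: 1 of 3 rounds succeeded unaided (3/3 oranges in 158 s), which beats the 0/3 of the third-party [`edge-inference/smolvla-so101-pick-orange`](https://huggingface.co/edge-inference/smolvla-so101-pick-orange).
+- Variance across rounds is large, however (round 2 = 0/3, round 3 = 2/3): **60 episodes × 30k steps still underfits**.
+- In the low-data regime, a large VLM-based policy underperforms a specialized visuomotor policy (ACT, 80M parameters), consistent with the low-data findings of the original SmolVLA paper.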
+## Training recipe
+| Item | Value |
+|---|---|
+| Dataset | `LightwheelAI/leisaac-pick-orange` (60 ep, dual-cam 480×640 RGB + 6 DOF state, 30 Hz) |
+| Policy | `smolvla` (LeRobot implementation) |
+| Backbone | `HuggingFaceTB/SmolVLM2-500M-Video-Instruct` + Action Expert |
+| `chunk_size` / `n_action_steps` | 50 / 50 |
+| Batch size | 8 (full-param, no LoRA) |
+| Optimizer | AdamW, lr=1e-4 |
+| Steps | 30000 (~14h on 4090) |
+| `video_backend` | `pyav` (torchcodec segfaults on long runs) |
+| Image augmentation | None |
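+| Train expert only | False (full-parameter) as intended; note that the `train_config.json` committed alongside this README records `train_expert_only: true` and `freeze_vision_encoder: true` |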
+> **🚨 Key fix: schema-free base**: before training, [`prepare_base.sh`](https://github.com/vitorcen/LeIsaac-Training/blob/main/scripts/training/smolvla/prepare_base.sh) must be used to strip the `input_features` / `empty_cameras` settings that ship with `lerobot/smolvla_base` (the default `camera1/2/3 @ 256×256` would contaminate the fine-tuning path). Otherwise the schemas are misaligned during training, and the forward pass either raises a KeyError or silently trains a broken model. See [`smolvla2_finetune_pick_orange.html`](https://github.com/vitorcen/LeIsaac-Training/blob/main/docs/finetune/smolvla2_finetune_pick_orange.html) for details.
+## Inference
+### Via the LeIsaac eval harness (recommended)
+```bash
+# 1. Start the LeRobot async policy server
+bash server/start_server.sh --lerobot-only
+# 2. Run the LeIsaac PickOrange eval
+DISPLAY=:0 python -u LeIsaac/scripts/evaluation/policy_inference.py \
+    --task=LeIsaac-SO101-PickOrange-v0 \
+    --eval_rounds=3 --episode_length_s=120 --step_hz=30 \
+    --policy_type=lerobot-smolvla \
+    --policy_host=127.0.0.1 --policy_port=8080 \
+    --policy_action_horizon=50 \
+    --policy_checkpoint_path=wsagi/SmolVLA2-PickOrange \
+    --policy_language_instruction='Pick up the orange and place it on the plate' \
+    --device=cuda --enable_cameras
+```
+### Directly with LeRobot
+```python
+from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
+policy = SmolVLAPolicy.from_pretrained("wsagi/SmolVLA2-PickOrange")
+# See the LeRobot documentation for how to build observations and query actions
+```
+## Evaluation details (Isaac Sim 5.1, 2026-05-18 snapshot)
+| Round | 🍊 placed | duration | mode | notes |
+|---|---|---|---|---|
+| 1 | 3/3 ✅ | 158.2 s | env-success | completed unaided |
+| 2 | 0/3 | 551.7 s | key-R skip | grasps miss; arm jitters |
+| 3 | 2/3 | 355.0 s | manual-hang | lerobot server interrupted; the count of 2 was observed in the viewport |
+**Round-by-round details, 1 Hz GPU samples, and the 7-baseline comparison** are in `results/benchmark/snapshots/` of [`vitorcen/isaaclab-experience`](https://github.com/vitorcen/isaaclab-experience).
+## License
+Apache-2.0 (inherited from `lerobot/smolvla_base` and LeIsaac).
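
The two headline numbers in the README (strict 1/3 rounds, 5/9 oranges) follow from the per-round table. Below is a minimal sketch of that arithmetic, assuming a round counts as a strict success only when all three oranges are placed. The `summarize` helper and the `placed_per_round` list are illustrative names, not part of the LeIsaac harness.

```python
ORANGES_PER_ROUND = 3  # each round presents three oranges


def summarize(placed_per_round):
    """Return (strict_successes, oranges_placed, oranges_total)."""
    # A round is a strict success only if every orange was placed.
    strict = sum(1 for placed in placed_per_round if placed == ORANGES_PER_ROUND)
    oranges_placed = sum(placed_per_round)
    oranges_total = ORANGES_PER_ROUND * len(placed_per_round)
    return strict, oranges_placed, oranges_total


placed_per_round = [3, 0, 2]  # rounds 1-3 from the evaluation table
strict, placed, total = summarize(placed_per_round)
print(f"strict {strict}/{len(placed_per_round)} rounds, {placed}/{total} oranges")
```

Note that the round 3 count was read off the viewport after the server interruption, so the orange total is only as reliable as that observation.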

config.json ADDED

	@@ -0,0 +1,93 @@

+{
+    "type": "smolvla",
+    "n_obs_steps": 1,
+    "input_features": {
+        "observation.state": {
+            "type": "STATE",
+            "shape": [
+                6
+            ]
+        },
+        "observation.images.front": {
+            "type": "VISUAL",
+            "shape": [
+                3,
+                480,
+                640
+            ]
+        },
+        "observation.images.wrist": {
+            "type": "VISUAL",
+            "shape": [
+                3,
+                480,
+                640
+            ]
+        }
+    },
+    "output_features": {
+        "action": {
+            "type": "ACTION",
+            "shape": [
+                6
+            ]
+        }
+    },
+    "device": "cuda",
+    "use_amp": false,
+    "use_peft": false,
+    "push_to_hub": false,
+    "repo_id": null,
+    "private": null,
+    "tags": null,
+    "license": null,
+    "pretrained_path": "/home/david/work/LeIsaac/outputs/.bases/smolvla_base_no_features",
+    "chunk_size": 50,
+    "n_action_steps": 50,
+    "normalization_mapping": {
+        "VISUAL": "IDENTITY",
+        "STATE": "MEAN_STD",
+        "ACTION": "MEAN_STD"
+    },
+    "max_state_dim": 32,
+    "max_action_dim": 32,
+    "resize_imgs_with_padding": [
+        512,
+        512
+    ],
+    "empty_cameras": 0,
+    "adapt_to_pi_aloha": false,
+    "use_delta_joint_actions_aloha": false,
+    "tokenizer_max_length": 48,
+    "num_steps": 10,
+    "use_cache": true,
+    "freeze_vision_encoder": true,
+    "train_expert_only": true,
+    "train_state_proj": true,
+    "optimizer_lr": 0.0001,
+    "optimizer_betas": [
+        0.9,
+        0.95
+    ],
+    "optimizer_eps": 1e-08,
+    "optimizer_weight_decay": 1e-10,
+    "optimizer_grad_clip_norm": 10.0,
+    "scheduler_warmup_steps": 1000,
+    "scheduler_decay_steps": 30000,
+    "scheduler_decay_lr": 2.5e-06,
+    "vlm_model_name": "HuggingFaceTB/SmolVLM2-500M-Video-Instruct",
+    "load_vlm_weights": true,
+    "add_image_special_tokens": false,
+    "attention_mode": "cross_attn",
+    "prefix_length": 0,
+    "pad_language_to": "max_length",
+    "num_expert_layers": 0,
+    "num_vlm_layers": 16,
+    "self_attn_every_n_layers": 2,
+    "expert_width_multiplier": 0.75,
+    "min_period": 0.004,
+    "max_period": 4.0,
+    "rtc_config": null,
+    "compile_model": false,
+    "compile_mode": "max-autotune"
+}
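
The config pairs 480×640 camera inputs with `resize_imgs_with_padding: [512, 512]`. As a rough illustration of what that implies for image geometry, the sketch below computes the resized size and the padding under the common aspect-preserving resize-then-pad scheme. This is an assumption about the behaviour: LeRobot's exact rounding and the side on which it pads are not confirmed here, and `resize_with_pad_shape` is a hypothetical helper.

```python
def resize_with_pad_shape(height, width, target_height, target_width):
    """Return ((new_h, new_w), (pad_h, pad_w)) for an aspect-preserving resize."""
    # Scale by the more constraining axis so the image fits inside the target.
    ratio = max(width / target_width, height / target_height)
    new_h = int(height / ratio)
    new_w = int(width / ratio)
    # Whatever is left over on each axis has to be filled with padding.
    return (new_h, new_w), (target_height - new_h, target_width - new_w)


resized, padding = resize_with_pad_shape(480, 640, 512, 512)
print("resized (h, w):", resized)
print("padding (h, w):", padding)
```

Under these assumptions the width fills the target exactly and only the height needs padding, so no part of the camera frame is cropped.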

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:5180ce422a593d2e65c668782aa23069e4bbf6151758e08ccb6d0e0a16a3c587
+size 906712520

policy_postprocessor.json ADDED

	@@ -0,0 +1,32 @@

+{
+  "name": "policy_postprocessor",
+  "steps": [
+    {
+      "registry_name": "unnormalizer_processor",
+      "config": {
+        "eps": 1e-08,
+        "features": {
+          "action": {
+            "type": "ACTION",
+            "shape": [
+              6
+            ]
+          }
+        },
+        "norm_map": {
+          "VISUAL": "IDENTITY",
+          "STATE": "MEAN_STD",
+          "ACTION": "MEAN_STD"
+        }
+      },
+      "state_file": "policy_postprocessor_step_0_unnormalizer_processor.safetensors"
+    },
+    {
+      "registry_name": "device_processor",
+      "config": {
+        "device": "cpu",
+        "float_dtype": null
+      }
+    }
+  ]
+}
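
The postprocessor's `unnormalizer_processor` maps `MEAN_STD`-normalized actions back to joint space using statistics stored in its state file. The sketch below shows one plausible form of that pair of transforms. The placement of `eps` is an assumption rather than LeRobot's confirmed formula, and the statistics are made-up values, not those in the `.safetensors` file.

```python
EPS = 1e-8  # same value as "eps" in the processor config


def normalize(x, mean, std, eps=EPS):
    """MEAN_STD normalization, element-wise over a flat vector."""
    return [(xi - m) / (s + eps) for xi, m, s in zip(x, mean, std)]


def unnormalize(y, mean, std, eps=EPS):
    """Inverse of normalize(): recover the original units."""
    return [yi * (s + eps) + m for yi, m, s in zip(y, mean, std)]


# Made-up statistics for a 6-dimensional action (illustration only).
mean = [0.0, 10.0, -5.0, 2.0, 0.5, 30.0]
std = [1.0, 20.0, 15.0, 4.0, 0.25, 12.0]
action = [0.3, -12.0, 8.0, 1.0, 0.75, 45.0]

roundtrip = unnormalize(normalize(action, mean, std), mean, std)
print(roundtrip)
```

The unnormalizer state file and the preprocessor's normalizer state file in this commit have the same SHA-256 and size, which is consistent with both steps loading the same dataset statistics.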

policy_postprocessor_step_0_unnormalizer_processor.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0b9214914398156f17ddf08a5082c20b24efb70a7ba7677e8bc3d3ff68a722db
+size 3756

policy_preprocessor.json ADDED

	@@ -0,0 +1,79 @@

+{
+  "name": "policy_preprocessor",
+  "steps": [
+    {
+      "registry_name": "rename_observations_processor",
+      "config": {
+        "rename_map": {}
+      }
+    },
+    {
+      "registry_name": "to_batch_processor",
+      "config": {}
+    },
+    {
+      "registry_name": "smolvla_new_line_processor",
+      "config": {}
+    },
+    {
+      "registry_name": "tokenizer_processor",
+      "config": {
+        "max_length": 48,
+        "task_key": "task",
+        "padding_side": "right",
+        "padding": "max_length",
+        "truncation": true,
+        "tokenizer_name": "HuggingFaceTB/SmolVLM2-500M-Video-Instruct"
+      }
+    },
+    {
+      "registry_name": "device_processor",
+      "config": {
+        "device": "cuda",
+        "float_dtype": null
+      }
+    },
+    {
+      "registry_name": "normalizer_processor",
+      "config": {
+        "eps": 1e-08,
+        "features": {
+          "observation.state": {
+            "type": "STATE",
+            "shape": [
+              6
+            ]
+          },
+          "observation.images.front": {
+            "type": "VISUAL",
+            "shape": [
+              3,
+              480,
+              640
+            ]
+          },
+          "observation.images.wrist": {
+            "type": "VISUAL",
+            "shape": [
+              3,
+              480,
+              640
+            ]
+          },
+          "action": {
+            "type": "ACTION",
+            "shape": [
+              6
+            ]
+          }
+        },
+        "norm_map": {
+          "VISUAL": "IDENTITY",
+          "STATE": "MEAN_STD",
+          "ACTION": "MEAN_STD"
+        }
+      },
+      "state_file": "policy_preprocessor_step_5_normalizer_processor.safetensors"
+    }
+  ]
+}
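
The preprocessor is an ordered list of steps, and the one step with saved state names its file after its position in that list. The sketch below walks a trimmed copy of the JSON above and checks that naming. The pattern `<pipeline>_step_<index>_<registry_name>.safetensors` is inferred from the two state files in this commit, not taken from LeRobot documentation, and `expected_state_file` is a hypothetical helper.

```python
import json

# Trimmed copy of policy_preprocessor.json: step order and state files only.
PIPELINE_JSON = """
{
  "name": "policy_preprocessor",
  "steps": [
    {"registry_name": "rename_observations_processor"},
    {"registry_name": "to_batch_processor"},
    {"registry_name": "smolvla_new_line_processor"},
    {"registry_name": "tokenizer_processor"},
    {"registry_name": "device_processor"},
    {"registry_name": "normalizer_processor",
     "state_file": "policy_preprocessor_step_5_normalizer_processor.safetensors"}
  ]
}
"""


def expected_state_file(pipeline_name, index, registry_name):
    # Naming pattern inferred from the files in this commit (an assumption).
    return f"{pipeline_name}_step_{index}_{registry_name}.safetensors"


pipeline = json.loads(PIPELINE_JSON)
mismatches = []
for index, step in enumerate(pipeline["steps"]):
    state_file = step.get("state_file")
    print(index, step["registry_name"], state_file or "-")
    if state_file is not None:
        expected = expected_state_file(pipeline["name"], index, step["registry_name"])
        if state_file != expected:
            mismatches.append((index, state_file, expected))

print("mismatches:", mismatches)
```

A check like this can catch a state file left behind under an old index if steps are ever added or reordered by hand.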

policy_preprocessor_step_5_normalizer_processor.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0b9214914398156f17ddf08a5082c20b24efb70a7ba7677e8bc3d3ff68a722db
+size 3756

train_config.json ADDED

	@@ -0,0 +1,234 @@

+{
+    "dataset": {
+        "repo_id": "LightwheelAI/leisaac-pick-orange",
+        "root": "/home/david/work/LeIsaac/datasets/raw/leisaac-pick-orange",
+        "episodes": null,
+        "image_transforms": {
+            "enable": false,
+            "max_num_transforms": 3,
+            "random_order": false,
+            "tfs": {
+                "brightness": {
+                    "weight": 1.0,
+                    "type": "ColorJitter",
+                    "kwargs": {
+                        "brightness": [
+                            0.8,
+                            1.2
+                        ]
+                    }
+                },
+                "contrast": {
+                    "weight": 1.0,
+                    "type": "ColorJitter",
+                    "kwargs": {
+                        "contrast": [
+                            0.8,
+                            1.2
+                        ]
+                    }
+                },
+                "saturation": {
+                    "weight": 1.0,
+                    "type": "ColorJitter",
+                    "kwargs": {
+                        "saturation": [
+                            0.5,
+                            1.5
+                        ]
+                    }
+                },
+                "hue": {
+                    "weight": 1.0,
+                    "type": "ColorJitter",
+                    "kwargs": {
+                        "hue": [
+                            -0.05,
+                            0.05
+                        ]
+                    }
+                },
+                "sharpness": {
+                    "weight": 1.0,
+                    "type": "SharpnessJitter",
+                    "kwargs": {
+                        "sharpness": [
+                            0.5,
+                            1.5
+                        ]
+                    }
+                },
+                "affine": {
+                    "weight": 1.0,
+                    "type": "RandomAffine",
+                    "kwargs": {
+                        "degrees": [
+                            -5.0,
+                            5.0
+                        ],
+                        "translate": [
+                            0.05,
+                            0.05
+                        ]
+                    }
+                }
+            }
+        },
+        "revision": null,
+        "use_imagenet_stats": true,
+        "video_backend": "pyav",
+        "return_uint8": false,
+        "streaming": false
+    },
+    "env": null,
+    "policy": {
+        "type": "smolvla",
+        "n_obs_steps": 1,
+        "input_features": {
+            "observation.state": {
+                "type": "STATE",
+                "shape": [
+                    6
+                ]
+            },
+            "observation.images.front": {
+                "type": "VISUAL",
+                "shape": [
+                    3,
+                    480,
+                    640
+                ]
+            },
+            "observation.images.wrist": {
+                "type": "VISUAL",
+                "shape": [
+                    3,
+                    480,
+                    640
+                ]
+            }
+        },
+        "output_features": {
+            "action": {
+                "type": "ACTION",
+                "shape": [
+                    6
+                ]
+            }
+        },
+        "device": "cuda",
+        "use_amp": false,
+        "use_peft": false,
+        "push_to_hub": false,
+        "repo_id": null,
+        "private": null,
+        "tags": null,
+        "license": null,
+        "pretrained_path": "/home/david/work/LeIsaac/outputs/.bases/smolvla_base_no_features",
+        "chunk_size": 50,
+        "n_action_steps": 50,
+        "normalization_mapping": {
+            "VISUAL": "IDENTITY",
+            "STATE": "MEAN_STD",
+            "ACTION": "MEAN_STD"
+        },
+        "max_state_dim": 32,
+        "max_action_dim": 32,
+        "resize_imgs_with_padding": [
+            512,
+            512
+        ],
+        "empty_cameras": 0,
+        "adapt_to_pi_aloha": false,
+        "use_delta_joint_actions_aloha": false,
+        "tokenizer_max_length": 48,
+        "num_steps": 10,
+        "use_cache": true,
+        "freeze_vision_encoder": true,
+        "train_expert_only": true,
+        "train_state_proj": true,
+        "optimizer_lr": 0.0001,
+        "optimizer_betas": [
+            0.9,
+            0.95
+        ],
+        "optimizer_eps": 1e-08,
+        "optimizer_weight_decay": 1e-10,
+        "optimizer_grad_clip_norm": 10.0,
+        "scheduler_warmup_steps": 1000,
+        "scheduler_decay_steps": 30000,
+        "scheduler_decay_lr": 2.5e-06,
+        "vlm_model_name": "HuggingFaceTB/SmolVLM2-500M-Video-Instruct",
+        "load_vlm_weights": true,
+        "add_image_special_tokens": false,
+        "attention_mode": "cross_attn",
+        "prefix_length": 0,
+        "pad_language_to": "max_length",
+        "num_expert_layers": 0,
+        "num_vlm_layers": 16,
+        "self_attn_every_n_layers": 2,
+        "expert_width_multiplier": 0.75,
+        "min_period": 0.004,
+        "max_period": 4.0,
+        "rtc_config": null,
+        "compile_model": false,
+        "compile_mode": "max-autotune"
+    },
+    "output_dir": "/home/david/work/LeIsaac/outputs/smolvla2-leisaac-pick-orange",
+    "job_name": "smolvla",
+    "resume": false,
+    "seed": 1000,
+    "cudnn_deterministic": false,
+    "num_workers": 2,
+    "batch_size": 8,
+    "prefetch_factor": 4,
+    "persistent_workers": true,
+    "steps": 30000,
+    "eval_freq": 20000,
+    "log_freq": 200,
+    "tolerance_s": 0.0001,
+    "save_checkpoint": true,
+    "save_freq": 5000,
+    "use_policy_training_preset": true,
+    "optimizer": {
+        "type": "adamw",
+        "lr": 0.0001,
+        "weight_decay": 1e-10,
+        "grad_clip_norm": 10.0,
+        "betas": [
+            0.9,
+            0.95
+        ],
+        "eps": 1e-08
+    },
+    "scheduler": {
+        "type": "cosine_decay_with_warmup",
+        "num_warmup_steps": 1000,
+        "num_decay_steps": 30000,
+        "peak_lr": 0.0001,
+        "decay_lr": 2.5e-06
+    },
+    "eval": {
+        "n_episodes": 50,
+        "batch_size": 22,
+        "use_async_envs": true
+    },
+    "wandb": {
+        "enable": false,
+        "disable_artifact": false,
+        "project": "lerobot",
+        "entity": null,
+        "notes": null,
+        "run_id": null,
+        "mode": null,
+        "add_tags": true
+    },
+    "peft": null,
+    "use_rabc": false,
+    "rabc_progress_path": null,
+    "rabc_kappa": 0.01,
+    "rabc_epsilon": 1e-06,
+    "rabc_head_mode": "sparse",
+    "rename_map": {},
+    "checkpoint_path": null
+}
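
The `scheduler` block describes a `cosine_decay_with_warmup` schedule: 1000 warmup steps, a peak of 1e-4, and a decay to 2.5e-6 over 30000 steps. The sketch below is a generic version of that shape, for intuition about the learning rate at a given step. Linear warmup from zero and a cosine measured from step 0 are assumptions, so LeRobot's implementation may differ in those details. `lr_at` is a hypothetical helper.

```python
import math

# Values copied from the "scheduler" block of train_config.json.
PEAK_LR = 1e-4
DECAY_LR = 2.5e-6
NUM_WARMUP_STEPS = 1000
NUM_DECAY_STEPS = 30000


def lr_at(step):
    """Learning rate at a training step under the assumed schedule shape."""
    if step < NUM_WARMUP_STEPS:
        # Assumed: linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / NUM_WARMUP_STEPS
    # Assumed: cosine progress is measured from step 0 and clamped at the end.
    progress = min(step, NUM_DECAY_STEPS) / NUM_DECAY_STEPS
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return DECAY_LR + (PEAK_LR - DECAY_LR) * cosine


for step in (0, 500, 1000, 15000, 30000):
    print(f"step {step:>5}: lr = {lr_at(step):.3e}")
```

Because `num_decay_steps` equals `steps`, the run ends as the schedule reaches its floor. At batch size 8, 30,000 steps correspond to 240,000 training samples.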