SmolVLA SO101 PickOrange

Fine-tuned SmolVLA policy for the SO101 robot arm performing an orange-picking task in LeIsaac (Isaac Sim).

Task

Pick three oranges from the table, place them on the plate, then return the arm to its rest state. Evaluated in the LeIsaac-SO101-PickOrange-v0 Isaac Sim environment.

Architecture

SmolVLA is a Vision-Language-Action model that combines a frozen SigLIP-derived vision encoder and a frozen SmolLM2 language backbone with a lightweight, trainable action expert head.

Inference pipeline

2 camera images (480x640)
  -> resize to 512x512 with padding
  -> patchify into 16x16 patches (1024 tokens per image, 2048 total)
  -> 12-layer ViT vision encoder (bf16)
  -> project to LM hidden space
  -> 16-layer SmolLM2 backbone + 16-layer Expert (interleaved cross-attention)
  -> decode 50 action tokens -> 50 x 6D joint positions
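The token counts in the pipeline above follow directly from the patch size and input resolution; a quick arithmetic sanity check (no model code involved):

```python
# Vision token budget for the SmolVLA front-end described above.
IMG_SIZE = 512    # input resolution after resize-with-padding
PATCH = 16        # ViT patch size
N_CAMERAS = 2     # two camera views per step

patches_per_side = IMG_SIZE // PATCH            # 512 / 16 = 32
tokens_per_image = patches_per_side ** 2        # 32 x 32 = 1024
total_vision_tokens = tokens_per_image * N_CAMERAS  # 2048 for both cameras

print(patches_per_side, tokens_per_image, total_vision_tokens)  # 32 1024 2048
```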

Vision Encoder (ViT)

Property           Value
Architecture       Vision Transformer (SigLIP-derived)
Hidden size        768
Layers             12
Attention heads    12
Patch size         16x16
Input resolution   512x512
Tokens per image   1024 (32x32 patches)
Precision          bfloat16
Status (training)  Frozen

Text/LM Backbone (SmolLM2)

Property           Value
Architecture       SmolLM2 (Llama-based)
Hidden size        960
Layers             16 (truncated from 32)
Attention heads    15
Intermediate size  2560
Vocab size         49,280

Action Expert Head

Property          Value
Layers            16 (matches truncated VLM)
Hidden size       720 (0.75x VLM hidden)
Attention mode    Cross-attention (interleaved with VLM layers)
Output            50 action chunks x 6D
Trainable params  100M
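Since the expert emits a chunk of 50 actions per inference call, a controller typically replays one action per control step and only re-queries the policy when the chunk runs out. A minimal illustrative sketch (the `ChunkedController` wrapper is hypothetical, not a LeRobot API):

```python
from collections import deque

class ChunkedController:
    """Replay a 50-step action chunk one control step at a time,
    re-querying the policy only when the chunk is exhausted."""

    def __init__(self, policy, horizon=50):
        self.policy = policy    # callable: obs -> list of `horizon` actions
        self.horizon = horizon
        self.queue = deque()

    def act(self, obs):
        if not self.queue:
            chunk = self.policy(obs)        # one forward pass per 50 steps
            assert len(chunk) == self.horizon
            self.queue.extend(chunk)
        return self.queue.popleft()

# Toy stand-in policy: 50 six-dim joint targets per call.
toy_policy = lambda obs: [[0.0] * 6 for _ in range(50)]
ctrl = ChunkedController(toy_policy)
first_action = ctrl.act(None)   # triggers one policy call, returns action 0
```

This matches the `--policy_action_horizon=50` flag used in the evaluation command below.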

Full Model Summary

Component              Params   Trainable   Precision
Vision encoder (ViT)   ~86M     Frozen      bf16
LM backbone (SmolLM2)  ~264M    Frozen      bf16
Action expert head     ~100M    Yes         bf16
Total                  ~450M    ~100M       bf16
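The totals in the table are simple sums of the (approximate) per-component counts; only the expert head is trained, about a fifth of the model:

```python
# Rough parameter budget from the summary table (figures are approximate).
params = {"vision_encoder": 86e6, "lm_backbone": 264e6, "action_expert": 100e6}
trainable = {"action_expert"}  # only the expert head is updated during training

total = sum(params.values())                                   # ~450M
n_trainable = sum(v for k, v in params.items() if k in trainable)
frac = n_trainable / total
print(f"total={total/1e6:.0f}M trainable={n_trainable/1e6:.0f}M ({frac:.0%})")
```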

Branches

Branch       Training                Batch size   Final loss
main         multi-rank, 30k steps   56           0.019
single-rank  single-rank, 30k steps  64           0.008

Training

Parameter       Value
Dataset         LightwheelAI/leisaac-pick-orange (sim-collected)
Episodes        60
Frames          36,293
Steps           30,000
Batch size      56 effective (main) / 64 (single-rank)
Learning rate   1e-4 peak
Optimizer       AdamW (betas=0.9/0.95, eps=1e-8, weight_decay=1e-10)
Scheduler       Cosine decay to 2.5e-6, 1,000 warmup steps
Grad clip norm  10
VLM layers      16 (truncated from 32)
Vision encoder  Frozen
Training mode   Expert-only (train_expert_only=true)
Framework       LeRobot v0.4.5
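The learning-rate schedule (1e-4 peak, 1,000 warmup steps, cosine decay to 2.5e-6 over 30k steps) can be sketched as a pure function of the step index. The linear warmup shape is an assumption here; LeRobot's exact implementation may differ slightly:

```python
import math

PEAK_LR, MIN_LR = 1e-4, 2.5e-6
WARMUP, TOTAL = 1_000, 30_000

def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR (sketch)."""
    if step < WARMUP:
        return PEAK_LR * (step + 1) / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)  # 0 -> 1 over the decay phase
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

# Peak at the end of warmup, floor at the final step.
print(lr_at(WARMUP), lr_at(TOTAL))
```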

Usage

Serve the policy

python -m lerobot.scripts.serve \
  --policy.type=smolvla \
  --policy.pretrained_path=edge-inference/smolvla-so101-pick-orange \
  --policy.vlm_model_name=HuggingFaceTB/SmolVLM2-500M-Video-Instruct \
  --port=8080

Evaluate in Isaac Sim (requires LeIsaac + Isaac Sim)

python scripts/evaluation/policy_inference.py \
  --task=LeIsaac-SO101-PickOrange-v0 \
  --policy_type=lerobot-smolvla \
  --policy_host=localhost \
  --policy_port=8080 \
  --policy_checkpoint_path=edge-inference/smolvla-so101-pick-orange \
  --policy_action_horizon=50 \
  --policy_language_instruction="Pick up the orange and place it on the plate" \
  --eval_rounds=10 \
  --device=cuda \
  --enable_cameras

Use the single-rank branch (lower loss)

from huggingface_hub import snapshot_download
snapshot_download(
    "edge-inference/smolvla-so101-pick-orange",
    revision="single-rank",
    local_dir="./checkpoint-single-rank"
)

Dataset

The training data was collected via teleoperation inside the LeIsaac simulation (Isaac Sim), so there is no visual domain gap between the training and evaluation environments.

Files

  • model.safetensors -- Model weights (1.2 GB)
  • config.json -- Policy architecture config
  • train_config.json -- Full training configuration (reproducible)
  • policy_preprocessor*.json/safetensors -- Input normalization (state mean/std)
  • policy_postprocessor*.json/safetensors -- Output denormalization (action mean/std)
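The pre/postprocessor files carry per-dimension mean/std statistics: states are z-scored on the way in, and predicted actions are mapped back to raw joint units on the way out. A minimal sketch of that transform; the actual file layout and key names inside the safetensors/JSON artifacts are not reproduced here:

```python
# Mean/std (de)normalization as implied by the pre/postprocessor files.
def normalize(x, mean, std):
    """z-score an input state vector with per-dimension statistics."""
    return [(xi - m) / s for xi, m, s in zip(x, mean, std)]

def denormalize(a, mean, std):
    """Map a normalized policy output back to raw joint-position units."""
    return [ai * s + m for ai, m, s in zip(a, mean, std)]

state = [0.5, -0.2, 0.1, 0.0, 0.3, 0.9]   # toy 6D joint state
mean, std = [0.1] * 6, [2.0] * 6          # placeholder statistics
z = normalize(state, mean, std)
restored = denormalize(z, mean, std)       # round-trips back to `state`
```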