# SmolVLA Arena – GR1 Microwave (ONNX + QNN)

Optimized inference models for SmolVLA on the GR1 Microwave manipulation task in IsaacLab Arena.
## Overview

This repository contains the SmolVLA (Vision-Language-Action) model split into three components and converted to ONNX and QNN formats for efficient edge deployment:
| Component | Description | ONNX I/O | Params |
|---|---|---|---|
| Vision Encoder | SigLIP ViT + PixelShuffle connector | [1,3,512,512] → [1,64,960] | ~98M |
| LLM Backbone | SmolLM2 embed + 16 transformer layers | img_emb + lang + state → KV cache | ~196M |
| Action Head | 16 expert layers (self-attn + cross-attn) | noise + KV cache → actions | ~100M |
## Evaluation Results
| Backend | Success Rate | Eval Time (50 ep) | Notes |
|---|---|---|---|
| PyTorch (bf16, CUDA) | 100% | ~840s | Baseline |
| ONNX Runtime (fp32, CUDA) | 100% | ~893s | Static shapes, opset 17 |
### Cosine Similarity (ONNX vs PyTorch)
| Component | Cosine Sim | Max Abs Diff | Mean Abs Diff |
|---|---|---|---|
| Vision Encoder | 1.000000 | 3.49e-04 | 6.26e-05 |
| LLM Backbone (KV-K) | 1.000000 | 3.28e-06 | 3.60e-07 |
| LLM Backbone (KV-V) | 1.000000 | 3.34e-06 | 3.59e-07 |
| Action Head | 1.000000 | 5.01e-06 | 7.54e-07 |
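The metrics in the table above can be reproduced with a small NumPy helper along these lines (a minimal sketch; `compare_outputs` is an illustrative name, not necessarily the function used in `scripts/validate_onnx.py`):

```python
import numpy as np

def compare_outputs(ref, test):
    """Cosine similarity plus max/mean absolute difference, as reported above."""
    ref_f = np.asarray(ref, dtype=np.float64).ravel()
    test_f = np.asarray(test, dtype=np.float64).ravel()
    cos = float(ref_f @ test_f / (np.linalg.norm(ref_f) * np.linalg.norm(test_f)))
    diff = np.abs(ref_f - test_f)
    return cos, float(diff.max()), float(diff.mean())

# Identical tensors give cosine similarity ~1.0 and zero difference
cos, max_abs, mean_abs = compare_outputs([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

In the actual validation, `ref` would be the PyTorch output and `test` the corresponding ONNX Runtime output for the same inputs.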
## Repository Structure

```
.
├── onnx/                        # ONNX models (opset 17, float32, static shapes)
│   ├── vision_encoder.onnx      # SigLIP ViT + connector
│   ├── llm_backbone.onnx        # SmolLM2 prefix encoder
│   └── action_head.onnx         # Expert denoising network
├── qnn/                         # QNN models (QAIRT SDK 2.43.0)
│   ├── vision_encoder.cpp/.bin  # QNN graph + weights
│   ├── llm_backbone.cpp/.bin
│   └── action_head.cpp/.bin
├── scripts/
│   ├── export_to_onnx.py        # PyTorch → ONNX conversion script
│   ├── validate_onnx.py         # ONNX vs PyTorch cosine similarity
│   ├── inference_onnx.py        # ONNX Runtime inference pipeline
│   ├── eval_onnx_isaaclab.py    # IsaacLab Arena evaluation with ONNX
│   └── eval_microwave.sh        # Shell script for quick eval
├── split_models/                # PyTorch model wrappers
│   ├── vision_encoder.py        # VisionEncoderWrapper
│   ├── llm_backbone.py          # LLMBackboneWrapper
│   ├── action_head.py           # ActionHeadWrapper
│   ├── inference_pipeline.py    # End-to-end PyTorch pipeline
│   └── validate_split.py        # Split model validation
└── README.md
```
## Quick Start

### Prerequisites

```bash
# Create conda environment
conda create -n smolvla python=3.11 -y
conda activate smolvla

# Install Isaac Sim + IsaacLab (for simulation eval)
pip install isaacsim==5.1.0 --extra-index-url https://pypi.nvidia.com
pip install isaacsim-extscache-physics==5.1.0 isaacsim-extscache-kit==5.1.0 \
    isaacsim-extscache-kit-sdk==5.1.0 --extra-index-url https://pypi.nvidia.com
pip install isaaclab==2.3.0 --extra-index-url https://pypi.nvidia.com

# Install IsaacLab Arena
pip install isaaclab-arena==0.1.1

# Install lerobot with SmolVLA
pip install "lerobot[smolvla]==0.4.4"

# Install ONNX Runtime
pip install onnxruntime-gpu==1.23.2 onnx==1.18.0

# Accept EULAs
export ACCEPT_EULA=Y PRIVACY_CONSENT=Y OMNI_KIT_ACCEPT_EULA=YES
```
### ONNX Inference (Standalone)

```python
import numpy as np
import onnxruntime as ort

# Load the 3 ONNX sessions
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
vision_sess = ort.InferenceSession('onnx/vision_encoder.onnx', providers=providers)
llm_sess = ort.InferenceSession('onnx/llm_backbone.onnx', providers=providers)
action_sess = ort.InferenceSession('onnx/action_head.onnx', providers=providers)

# Inputs
pixel_values = np.random.randn(1, 3, 512, 512).astype(np.float32)  # [-1, 1] normalized
lang_tokens = np.zeros((1, 48), dtype=np.int64)
lang_masks = np.ones((1, 48), dtype=np.int64)
state = np.random.randn(1, 54).astype(np.float32)

# Step 1: Vision encoding
[img_emb] = vision_sess.run(None, {'pixel_values': pixel_values})

# Step 2: LLM backbone (prefix encoding → KV cache)
kv_k, kv_v, pad_masks = llm_sess.run(None, {
    'img_emb': img_emb,
    'lang_tokens': lang_tokens,
    'lang_masks': lang_masks,
    'state': state,
})

# Step 3: Denoising loop (10 steps, flow matching)
x_t = np.random.randn(1, 50, 36).astype(np.float32)
dt = -0.1
for step in range(10):
    t = 1.0 + step * dt
    [v_t] = action_sess.run(None, {
        'x_t': x_t,
        'timestep': np.array([t], dtype=np.float32),
        'kv_cache_k': kv_k,
        'kv_cache_v': kv_v,
        'prefix_pad_masks': pad_masks,
    })
    x_t = x_t + dt * v_t

actions = x_t  # [1, 50, 36] → 50-step action trajectory
```
### IsaacLab Arena Evaluation

```bash
# PyTorch baseline
./scripts/eval_microwave.sh

# ONNX evaluation
python scripts/eval_onnx_isaaclab.py

# RTX GPU rendering (requires RTX 4080+)
./scripts/eval_microwave.sh --rtx_rendering
```
## Model Architecture

SmolVLA is a Vision-Language-Action model for robot manipulation:

```
Image [1,3,512,512]      Language tokens [1,48]      Robot state [1,54]
        │                         │                          │
   ┌────▼────┐                    │                          │
   │ SigLIP  │                    │                          │
   │   ViT   │                    │                          │
   └────┬────┘                    │                          │
        │ [1,1024,768]            │                          │
 ┌──────▼──────┐                  │                          │
 │ PixelShuffle│                  │                          │
 │    + MLP    │                  │                          │
 └──────┬──────┘                  │                          │
        │ [1,64,960]              │                          │
        └────────┬────────────────┴──────────────────────────┘
                 │
        ┌────────▼────────┐
        │     SmolLM2     │  (16 transformer layers)
        │     Backbone    │
        └────────┬────────┘
                 │  KV cache [16,1,113,5,64]
        ┌────────▼────────┐
        │   Expert Head   │  (16 expert layers × 10 denoising steps)
        │  Flow Matching  │
        └────────┬────────┘
                 │
          Actions [1,50,36]
```
Key dimensions:

- VLM hidden: 960, Expert hidden: 720 (0.75×)
- Attention: 15 Q-heads, 5 KV-heads, head_dim=64
- Prefix length: 113 (64 img + 48 lang + 1 state)
- Denoising: 10 steps, flow matching (time: 1.0 → 0.0)
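As a quick sanity check, the dimensions above are mutually consistent (a minimal sketch; the constant names are illustrative, not from the codebase):

```python
# Illustrative constants taken from the "Key dimensions" list above
N_LAYERS, BATCH = 16, 1
IMG_TOK, LANG_TOK, STATE_TOK = 64, 48, 1
Q_HEADS, KV_HEADS, HEAD_DIM = 15, 5, 64

prefix_len = IMG_TOK + LANG_TOK + STATE_TOK    # 64 + 48 + 1 = 113
vlm_hidden = Q_HEADS * HEAD_DIM                # 15 * 64 = 960
expert_hidden = int(vlm_hidden * 0.75)         # 960 * 0.75 = 720
kv_cache_shape = (N_LAYERS, BATCH, prefix_len, KV_HEADS, HEAD_DIM)

print(prefix_len, vlm_hidden, expert_hidden)   # 113 960 720
print(kv_cache_shape)                          # (16, 1, 113, 5, 64)
```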
## ONNX Conversion Details

### Key Challenges Solved

- **ScatterND bypass**: SigLIP `VisionEmbeddings` uses `bucketize` + `ScatterND` for position IDs. Bypassed by precomputing position IDs for the static 512×512 input.
- **CumSum on bool**: ONNX doesn't support `CumSum` on bool tensors. Monkey-patched `torch.cumsum` to auto-cast bool → int64 before export.
- **bfloat16 → float32**: ONNX opset 17 has limited bfloat16 support. All weights were converted to float32.
- **Flash Attention → eager**: Forced `_attn_implementation="eager"` to avoid SDPA/Flash Attention export failures.
- **Device mismatch**: The legacy TorchScript exporter has constant-folding issues on CUDA. Exported on CPU with `dynamo=False`.
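The `torch.cumsum` workaround can be sketched roughly as follows (an illustrative monkey-patch under the assumptions above, not the exact export code; eager PyTorch already accepts bool here, but the explicit cast keeps a bool `CumSum` out of the exported graph):

```python
import torch

_orig_cumsum = torch.cumsum

def _cumsum_bool_safe(input, dim, *args, **kwargs):
    # ONNX CumSum has no bool variant; cast to int64 before accumulating
    if isinstance(input, torch.Tensor) and input.dtype == torch.bool:
        input = input.to(torch.int64)
    return _orig_cumsum(input, dim, *args, **kwargs)

torch.cumsum = _cumsum_bool_safe  # apply before torch.onnx.export(...)

# A bool mask now goes through the int64 path: cumsum of [1, 1, 0, 1]
result = torch.cumsum(torch.tensor([True, True, False, True]), dim=0)
# result.tolist() == [1, 2, 2, 3]
```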
### Reproducing the Conversion

```bash
# From the original PyTorch model:
python scripts/export_to_onnx.py

# Validate cosine similarity:
python scripts/validate_onnx.py
```
## QNN Conversion

### Prerequisites

Install the Qualcomm AI Engine Direct SDK (QAIRT) v2.43.0+:

```bash
# Set up QAIRT SDK environment
export QNN_SDK_ROOT=/path/to/qairt/2.43.0.260128

# Create Python 3.10 environment (required by QAIRT)
conda create -n qairt python=3.10 -y
conda activate qairt
pip install numpy==1.26.0 onnx==1.16.1 pyyaml pandas protobuf

# Install libc++ and libunwind
conda install -c conda-forge libcxx libunwind -y
ln -sf $CONDA_PREFIX/lib/libunwind.so.8 $CONDA_PREFIX/lib/libunwind.so.1

# Set library paths
export PYTHONPATH="$QNN_SDK_ROOT/lib/python"
export LD_LIBRARY_PATH="$CONDA_PREFIX/lib:$QNN_SDK_ROOT/lib/x86_64-linux-clang:$LD_LIBRARY_PATH"
```
### Converting ONNX to QNN

```bash
# Convert each model
for model in vision_encoder llm_backbone action_head; do
  $QNN_SDK_ROOT/bin/x86_64-linux-clang/qnn-onnx-converter \
    -i onnx/${model}.onnx \
    -o qnn/${model}
done

# Generate model libraries (x86_64)
for model in vision_encoder llm_backbone action_head; do
  $QNN_SDK_ROOT/bin/x86_64-linux-clang/qnn-model-lib-generator \
    -c qnn/${model}.cpp \
    -b qnn/${model}.bin \
    -o qnn/ \
    -t x86_64-linux-clang
done
```
### Generating HTP Context Binary (on-device)

```bash
# Requires a Snapdragon target device or the HTP emulator
$QNN_SDK_ROOT/bin/x86_64-linux-clang/qnn-context-binary-generator \
  --model qnn/vision_encoder.so \
  --backend $QNN_SDK_ROOT/lib/x86_64-linux-clang/libQnnHtp.so \
  --output_dir qnn/
```
## GR1 Microwave Task

The GR1 Microwave task in IsaacLab Arena involves a humanoid robot (GR1 Pink) manipulating a microwave oven and a mustard bottle:

- Robot: Fourier GR1 humanoid (54 joint positions)
- Task: Open the microwave, place the bottle, close the microwave
- Action space: 36-dim joint velocity commands
- Observation: RGB camera (512×512) + joint positions (54-dim)
- Success criteria: Mustard bottle placed in the microwave
## Citation

```bibtex
@article{smolvla2025,
  title={SmolVLA: A Small Vision-Language-Action Model for Efficient Robot Learning},
  author={HuggingFace Team},
  year={2025},
  url={https://huggingface.co/papers/2506.01844}
}
```
## License

This model conversion follows the license of the original `nvidia/smolvla-arena-gr1-microwave` model.