
SmolVLA Arena – GR1 Microwave (ONNX + QNN)

Optimized inference models for SmolVLA on the GR1 Microwave manipulation task in IsaacLab Arena.

Overview

This repository contains the SmolVLA (Vision-Language-Action) model split into three components and converted to ONNX and QNN formats for efficient edge deployment:

| Component | Description | ONNX I/O | Params |
|---|---|---|---|
| Vision Encoder | SigLIP ViT + PixelShuffle connector | [1,3,512,512] → [1,64,960] | ~98M |
| LLM Backbone | SmolLM2 embed + 16 transformer layers | img_emb + lang + state → KV cache | ~196M |
| Action Head | 16 expert layers (self-attn + cross-attn) | noise + KV cache → actions | ~100M |

Evaluation Results

| Backend | Success Rate | Eval Time (50 ep) | Notes |
|---|---|---|---|
| PyTorch (bf16, CUDA) | 100% | ~840s | Baseline |
| ONNX Runtime (fp32, CUDA) | 100% | ~893s | Static shapes, opset 17 |

Cosine Similarity (ONNX vs PyTorch)

| Component | Cosine Sim | Max Abs Diff | Mean Abs Diff |
|---|---|---|---|
| Vision Encoder | 1.000000 | 3.49e-04 | 6.26e-05 |
| LLM Backbone (KV-K) | 1.000000 | 3.28e-06 | 3.60e-07 |
| LLM Backbone (KV-V) | 1.000000 | 3.34e-06 | 3.59e-07 |
| Action Head | 1.000000 | 5.01e-06 | 7.54e-07 |
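The per-component numbers above can be reproduced with a comparison along these lines (a minimal sketch of the metrics; the actual checks live in scripts/validate_onnx.py, whose internals are not shown here):

```python
import numpy as np

def compare_outputs(ref: np.ndarray, test: np.ndarray) -> dict:
    """Flatten both outputs and report cosine similarity plus absolute-difference stats."""
    a = ref.ravel().astype(np.float64)
    b = test.ravel().astype(np.float64)
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    diff = np.abs(a - b)
    return {"cosine": cos, "max_abs": float(diff.max()), "mean_abs": float(diff.mean())}

# Identical tensors give cosine similarity 1.0 and zero difference.
stats = compare_outputs(np.ones((2, 3)), np.ones((2, 3)))
```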

Repository Structure

.
├── onnx/                          # ONNX models (opset 17, float32, static shapes)
│   ├── vision_encoder.onnx        # SigLIP ViT + connector
│   ├── llm_backbone.onnx          # SmolLM2 prefix encoder
│   └── action_head.onnx           # Expert denoising network
├── qnn/                           # QNN models (QAIRT SDK 2.43.0)
│   ├── vision_encoder.cpp/.bin    # QNN graph + weights
│   ├── llm_backbone.cpp/.bin
│   └── action_head.cpp/.bin
├── scripts/
│   ├── export_to_onnx.py          # PyTorch → ONNX conversion script
│   ├── validate_onnx.py           # ONNX vs PyTorch cosine similarity
│   ├── inference_onnx.py          # ONNX Runtime inference pipeline
│   ├── eval_onnx_isaaclab.py      # IsaacLab Arena evaluation with ONNX
│   └── eval_microwave.sh          # Shell script for quick eval
├── split_models/                  # PyTorch model wrappers
│   ├── vision_encoder.py          # VisionEncoderWrapper
│   ├── llm_backbone.py            # LLMBackboneWrapper
│   ├── action_head.py             # ActionHeadWrapper
│   ├── inference_pipeline.py      # End-to-end PyTorch pipeline
│   └── validate_split.py          # Split model validation
└── README.md

Quick Start

Prerequisites

# Create conda environment
conda create -n smolvla python=3.11 -y
conda activate smolvla

# Install Isaac Sim + IsaacLab (for simulation eval)
pip install isaacsim==5.1.0 --extra-index-url https://pypi.nvidia.com
pip install isaacsim-extscache-physics==5.1.0 isaacsim-extscache-kit==5.1.0 \
    isaacsim-extscache-kit-sdk==5.1.0 --extra-index-url https://pypi.nvidia.com
pip install isaaclab==2.3.0 --extra-index-url https://pypi.nvidia.com

# Install IsaacLab Arena
pip install isaaclab-arena==0.1.1

# Install lerobot with SmolVLA
pip install "lerobot[smolvla]==0.4.4"

# Install ONNX Runtime
pip install onnxruntime-gpu==1.23.2 onnx==1.18.0

# Accept EULAs
export ACCEPT_EULA=Y PRIVACY_CONSENT=Y OMNI_KIT_ACCEPT_EULA=YES

ONNX Inference (Standalone)

import numpy as np
import onnxruntime as ort

# Load 3 ONNX sessions
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
vision_sess = ort.InferenceSession('onnx/vision_encoder.onnx', providers=providers)
llm_sess = ort.InferenceSession('onnx/llm_backbone.onnx', providers=providers)
action_sess = ort.InferenceSession('onnx/action_head.onnx', providers=providers)

# Inputs
pixel_values = np.random.randn(1, 3, 512, 512).astype(np.float32)  # stand-in; real inputs are normalized to [-1, 1]
lang_tokens = np.zeros((1, 48), dtype=np.int64)
lang_masks = np.ones((1, 48), dtype=np.int64)
state = np.random.randn(1, 54).astype(np.float32)

# Step 1: Vision encoding
[img_emb] = vision_sess.run(None, {'pixel_values': pixel_values})

# Step 2: LLM backbone (prefix encoding β†’ KV cache)
kv_k, kv_v, pad_masks = llm_sess.run(None, {
    'img_emb': img_emb,
    'lang_tokens': lang_tokens,
    'lang_masks': lang_masks,
    'state': state,
})

# Step 3: Denoising loop (10 steps, flow matching)
x_t = np.random.randn(1, 50, 36).astype(np.float32)
dt = -0.1
for step in range(10):
    t = 1.0 + step * dt
    [v_t] = action_sess.run(None, {
        'x_t': x_t,
        'timestep': np.array([t], dtype=np.float32),
        'kv_cache_k': kv_k,
        'kv_cache_v': kv_v,
        'prefix_pad_masks': pad_masks,
    })
    x_t = x_t + dt * v_t

actions = x_t  # [1, 50, 36] – 50-step action trajectory
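A common deployment pattern for chunked policies is to execute only a prefix of the 50-step trajectory before re-running the three-stage pipeline; the sketch below is illustrative, and the replan horizon of 10 is an assumption, not a value taken from this repo:

```python
import numpy as np

actions = np.random.randn(1, 50, 36).astype(np.float32)  # stand-in for a denoised chunk
replan_horizon = 10  # hypothetical: execute 10 steps, then re-run vision/LLM/action head

executed = []
for action in actions[0, :replan_horizon]:
    # In practice each 36-dim command would go to env.step(action).
    executed.append(action)
```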

IsaacLab Arena Evaluation

# PyTorch baseline
./scripts/eval_microwave.sh

# ONNX evaluation
python scripts/eval_onnx_isaaclab.py

# RTX GPU rendering (requires RTX 4080+)
./scripts/eval_microwave.sh --rtx_rendering

Model Architecture

SmolVLA is a Vision-Language-Action model for robot manipulation:

Image [1,3,512,512]     Language tokens [1,48]    Robot state [1,54]
        │                       │                        │
   ┌────▼────┐                  │                        │
   │ SigLIP  │                  │                        │
   │  ViT    │                  │                        │
   │ (ViT-L) │                  │                        │
   └────┬────┘                  │                        │
        │ [1,1024,768]          │                        │
   ┌────▼────────┐              │                        │
   │ PixelShuffle│              │                        │
   │ + MLP       │              │                        │
   └────┬────────┘              │                        │
        │ [1,64,960]            │                        │
        └────────┬──────────────┴────────────────────────┘
                 │
        ┌────────▼────────┐
        │   SmolLM2       │ (16 transformer layers)
        │   Backbone      │
        └────────┬────────┘
                 │ KV cache [16,1,113,5,64]
                 │
        ┌────────▼────────┐
        │   Expert Head   │ (16 expert layers × 10 denoising steps)
        │   Flow Matching │
        └────────┬────────┘
                 │
           Actions [1,50,36]

Key dimensions:

  • VLM hidden: 960, Expert hidden: 720 (0.75×)
  • Attention: 15 Q-heads, 5 KV-heads, head_dim=64
  • Prefix length: 113 (64 img + 48 lang + 1 state)
  • Denoising: 10 steps, flow matching (time: 1.0 → 0.0)
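These numbers fit together, and the fit is easy to verify: the prefix length is the sum of image, language, and state tokens, and the KV cache shape is layers × batch × prefix × KV-heads × head_dim:

```python
# Sanity checks on the dimensions listed above.
img_tokens, lang_tokens, state_tokens = 64, 48, 1
prefix_len = img_tokens + lang_tokens + state_tokens          # 113 prefix positions

num_layers, batch, kv_heads, head_dim = 16, 1, 5, 64
kv_cache_shape = (num_layers, batch, prefix_len, kv_heads, head_dim)  # [16,1,113,5,64]

q_heads = 15                     # grouped-query attention: 15 Q-heads share 5 KV-heads (3:1)
vlm_hidden = q_heads * head_dim  # 15 * 64 = 960, the VLM hidden size
```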

ONNX Conversion Details

Key Challenges Solved

  1. ScatterND bypass: SigLIP VisionEmbeddings uses bucketize + ScatterND for position IDs. Bypassed by precomputing position IDs for the static 512×512 input.

  2. CumSum on bool: ONNX doesn't support CumSum on bool tensors. Monkey-patched torch.cumsum to auto-cast bool → int64 before export.

  3. bfloat16 → float32: ONNX opset 17 has limited bfloat16 support. All weights converted to float32.

  4. Flash Attention → eager: Forced _attn_implementation="eager" to avoid SDPA/Flash Attention export failures.

  5. Device mismatch: The legacy TorchScript exporter has constant-folding issues on CUDA. Exported on CPU with dynamo=False.
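As an illustration of the first workaround: with a fixed input resolution, the position-ID computation collapses to a constant arange. The patch size of 16 below is inferred from the [1,1024,768] ViT output in the architecture diagram, not taken from the export script:

```python
import numpy as np

image_size = 512  # static export resolution
patch_size = 16   # assumed: (512 / 16)^2 = 1024 matches the ViT token count
num_patches = (image_size // patch_size) ** 2

# With static shapes, the bucketize + ScatterND logic reduces to a precomputed
# arange, which exports to ONNX without the unsupported ops.
position_ids = np.arange(num_patches, dtype=np.int64)[None, :]
```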

Reproducing the Conversion

# From the original PyTorch model:
python scripts/export_to_onnx.py

# Validate cosine similarity:
python scripts/validate_onnx.py

QNN Conversion

Prerequisites

Install the Qualcomm AI Engine Direct SDK (QAIRT) v2.43.0+:

# Set up QAIRT SDK environment
export QNN_SDK_ROOT=/path/to/qairt/2.43.0.260128

# Create Python 3.10 environment (required by QAIRT)
conda create -n qairt python=3.10 -y
conda activate qairt
pip install numpy==1.26.0 onnx==1.16.1 pyyaml pandas protobuf

# Install libc++ and libunwind
conda install -c conda-forge libcxx libunwind -y
ln -sf $CONDA_PREFIX/lib/libunwind.so.8 $CONDA_PREFIX/lib/libunwind.so.1

# Set library paths
export PYTHONPATH="$QNN_SDK_ROOT/lib/python"
export LD_LIBRARY_PATH="$CONDA_PREFIX/lib:$QNN_SDK_ROOT/lib/x86_64-linux-clang:$LD_LIBRARY_PATH"

Converting ONNX to QNN

# Convert each model
for model in vision_encoder llm_backbone action_head; do
    $QNN_SDK_ROOT/bin/x86_64-linux-clang/qnn-onnx-converter \
        -i onnx/${model}.onnx \
        -o qnn/${model}
done

# Generate model libraries (x86_64)
for model in vision_encoder llm_backbone action_head; do
    $QNN_SDK_ROOT/bin/x86_64-linux-clang/qnn-model-lib-generator \
        -c qnn/${model}.cpp \
        -b qnn/${model}.bin \
        -o qnn/ \
        -t x86_64-linux-clang
done

Generating HTP Context Binary (on-device)

# Requires Snapdragon target device or HTP emulator
$QNN_SDK_ROOT/bin/x86_64-linux-clang/qnn-context-binary-generator \
    --model qnn/vision_encoder.so \
    --backend $QNN_SDK_ROOT/lib/x86_64-linux-clang/libQnnHtp.so \
    --output_dir qnn/

GR1 Microwave Task

The GR1 Microwave task in IsaacLab Arena has a humanoid robot (GR1 Pink) opening a microwave oven, placing a mustard bottle inside, and closing it:

  • Robot: Fourier GR1 humanoid (54 joint positions)
  • Task: Open microwave, place bottle, close microwave
  • Action space: 36-dim joint velocity commands
  • Observation: RGB camera (512×512) + joint positions (54-dim)
  • Success criteria: Mustard bottle placed in microwave
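Putting the task spec together, an evaluation loop has roughly this shape. The environment class here is a stub for illustration only, not the actual IsaacLab Arena interface:

```python
import numpy as np

class StubEnv:
    """Illustrative stand-in for the IsaacLab Arena environment (not its real API)."""

    def reset(self):
        rgb = np.zeros((512, 512, 3), dtype=np.uint8)   # 512x512 RGB camera
        state = np.zeros(54, dtype=np.float32)          # 54 joint positions
        return rgb, state

    def step(self, action):
        assert action.shape == (36,)                    # 36-dim velocity command
        rgb, state = self.reset()
        return rgb, state, False                        # (obs, state, done)

env = StubEnv()
rgb, state = env.reset()
done, steps = False, 0
while not done and steps < 3:
    # A real rollout would feed (rgb, state) through the 3-stage ONNX pipeline here.
    rgb, state, done = env.step(np.zeros(36, dtype=np.float32))
    steps += 1
```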

Citation

@article{smolvla2025,
    title={SmolVLA: A Small Vision-Language-Action Model for Efficient Robot Learning},
    author={HuggingFace Team},
    year={2025},
    url={https://huggingface.co/papers/2506.01844}
}

License

This model conversion follows the license of the original nvidia/smolvla-arena-gr1-microwave model.
