
SmolVLA Arena – GR1 Microwave (ONNX + QNN)

Optimized inference models for SmolVLA on the GR1 Microwave manipulation task in IsaacLab Arena.

Overview

This repository contains the SmolVLA (Vision-Language-Action) model split into three components and converted to ONNX and QNN formats for efficient edge deployment:

| Component | Description | ONNX I/O | Params |
|---|---|---|---|
| Vision Encoder | SigLIP ViT + PixelShuffle connector | [1,3,512,512] → [1,64,960] | ~98M |
| LLM Backbone | SmolLM2 embed + 16 transformer layers | img_emb + lang + state → KV cache | ~196M |
| Action Head | 16 expert layers (self-attn + cross-attn) | noise + KV cache → actions | ~100M |

Evaluation Results

| Backend | Success Rate | Eval Time (50 ep) | Notes |
|---|---|---|---|
| PyTorch (bf16, CUDA) | 100% | ~840s | Baseline |
| ONNX Runtime (fp32, CUDA) | 100% | ~893s | Static shapes, opset 17 |

Cosine Similarity (ONNX vs PyTorch)

| Component | Cosine Sim | Max Abs Diff | Mean Abs Diff |
|---|---|---|---|
| Vision Encoder | 1.000000 | 3.49e-04 | 6.26e-05 |
| LLM Backbone (KV-K) | 1.000000 | 3.28e-06 | 3.60e-07 |
| LLM Backbone (KV-V) | 1.000000 | 3.34e-06 | 3.59e-07 |
| Action Head | 1.000000 | 5.01e-06 | 7.54e-07 |
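The per-component numbers above can be reproduced with a comparison along these lines (a minimal sketch of the metrics; the actual checks live in scripts/validate_onnx.py, whose internals are not shown here):

```python
import numpy as np

def compare_outputs(ref: np.ndarray, test: np.ndarray) -> dict:
    """Flatten both outputs and report cosine similarity plus absolute-difference stats."""
    a = ref.ravel().astype(np.float64)
    b = test.ravel().astype(np.float64)
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    diff = np.abs(a - b)
    return {"cosine": cos, "max_abs": float(diff.max()), "mean_abs": float(diff.mean())}

# Identical tensors give cosine similarity 1.0 and zero difference.
stats = compare_outputs(np.ones((2, 3)), np.ones((2, 3)))
```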

Repository Structure

.
├── onnx/                          # ONNX models (opset 17, float32, static shapes)
│   ├── vision_encoder.onnx        # SigLIP ViT + connector
│   ├── llm_backbone.onnx          # SmolLM2 prefix encoder
│   └── action_head.onnx           # Expert denoising network
├── qnn/                           # QNN models (QAIRT SDK 2.43.0)
│   ├── vision_encoder.cpp/.bin    # QNN graph + weights
│   ├── llm_backbone.cpp/.bin
│   └── action_head.cpp/.bin
├── scripts/
│   ├── export_to_onnx.py          # PyTorch → ONNX conversion script
│   ├── validate_onnx.py           # ONNX vs PyTorch cosine similarity
│   ├── inference_onnx.py          # ONNX Runtime inference pipeline
│   ├── eval_onnx_isaaclab.py      # IsaacLab Arena evaluation with ONNX
│   └── eval_microwave.sh          # Shell script for quick eval
├── split_models/                  # PyTorch model wrappers
│   ├── vision_encoder.py          # VisionEncoderWrapper
│   ├── llm_backbone.py            # LLMBackboneWrapper
│   ├── action_head.py             # ActionHeadWrapper
│   ├── inference_pipeline.py      # End-to-end PyTorch pipeline
│   └── validate_split.py          # Split model validation
└── README.md

Quick Start

Prerequisites

# Create conda environment
conda create -n smolvla python=3.11 -y
conda activate smolvla

# Install Isaac Sim + IsaacLab (for simulation eval)
pip install isaacsim==5.1.0 --extra-index-url https://pypi.nvidia.com
pip install isaacsim-extscache-physics==5.1.0 isaacsim-extscache-kit==5.1.0 \
    isaacsim-extscache-kit-sdk==5.1.0 --extra-index-url https://pypi.nvidia.com
pip install isaaclab==2.3.0 --extra-index-url https://pypi.nvidia.com

# Install IsaacLab Arena
pip install isaaclab-arena==0.1.1

# Install lerobot with SmolVLA
pip install "lerobot[smolvla]==0.4.4"

# Install ONNX Runtime
pip install onnxruntime-gpu==1.23.2 onnx==1.18.0

# Accept EULAs
export ACCEPT_EULA=Y PRIVACY_CONSENT=Y OMNI_KIT_ACCEPT_EULA=YES

ONNX Inference (Standalone)

import numpy as np
import onnxruntime as ort

# Load 3 ONNX sessions
providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']
vision_sess = ort.InferenceSession('onnx/vision_encoder.onnx', providers=providers)
llm_sess = ort.InferenceSession('onnx/llm_backbone.onnx', providers=providers)
action_sess = ort.InferenceSession('onnx/action_head.onnx', providers=providers)

# Inputs
pixel_values = np.random.randn(1, 3, 512, 512).astype(np.float32)  # stand-in; real inputs are normalized to [-1, 1]
lang_tokens = np.zeros((1, 48), dtype=np.int64)
lang_masks = np.ones((1, 48), dtype=np.int64)
state = np.random.randn(1, 54).astype(np.float32)

# Step 1: Vision encoding
[img_emb] = vision_sess.run(None, {'pixel_values': pixel_values})

# Step 2: LLM backbone (prefix encoding β†’ KV cache)
kv_k, kv_v, pad_masks = llm_sess.run(None, {
    'img_emb': img_emb,
    'lang_tokens': lang_tokens,
    'lang_masks': lang_masks,
    'state': state,
})

# Step 3: Denoising loop (10 steps, flow matching)
x_t = np.random.randn(1, 50, 36).astype(np.float32)
dt = -0.1
for step in range(10):
    t = 1.0 + step * dt
    [v_t] = action_sess.run(None, {
        'x_t': x_t,
        'timestep': np.array([t], dtype=np.float32),
        'kv_cache_k': kv_k,
        'kv_cache_v': kv_v,
        'prefix_pad_masks': pad_masks,
    })
    x_t = x_t + dt * v_t

actions = x_t  # [1, 50, 36] – 50-step action trajectory
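A common deployment pattern for chunked policies is to execute only a prefix of the 50-step trajectory before re-running the three-stage pipeline; the sketch below is illustrative, and the replan horizon of 10 is an assumption, not a value taken from this repo:

```python
import numpy as np

actions = np.random.randn(1, 50, 36).astype(np.float32)  # stand-in for a denoised chunk
replan_horizon = 10  # hypothetical: execute 10 steps, then re-run vision/LLM/action head

executed = []
for action in actions[0, :replan_horizon]:
    # In practice each 36-dim command would go to env.step(action).
    executed.append(action)
```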

IsaacLab Arena Evaluation

# PyTorch baseline
./scripts/eval_microwave.sh

# ONNX evaluation
python scripts/eval_onnx_isaaclab.py

# RTX GPU rendering (requires RTX 4080+)
./scripts/eval_microwave.sh --rtx_rendering

Model Architecture

SmolVLA is a Vision-Language-Action model for robot manipulation:

Image [1,3,512,512]     Language tokens [1,48]    Robot state [1,54]
        │                       │                        │
   ┌────▼────┐                  │                        │
   │ SigLIP  │                  │                        │
   │  ViT    │                  │                        │
   │ (ViT-L) │                  │                        │
   └────┬────┘                  │                        │
        │ [1,1024,768]          │                        │
   ┌────▼────────┐              │                        │
   │ PixelShuffle│              │                        │
   │ + MLP       │              │                        │
   └────┬────────┘              │                        │
        │ [1,64,960]            │                        │
        └────────┬──────────────┴────────────────────────┘
                 │
        ┌────────▼────────┐
        │   SmolLM2       │ (16 transformer layers)
        │   Backbone      │
        └────────┬────────┘
                 │ KV cache [16,1,113,5,64]
                 │
        ┌────────▼────────┐
        │   Expert Head   │ (16 expert layers × 10 denoising steps)
        │   Flow Matching │
        └────────┬────────┘
                 │
           Actions [1,50,36]

Key dimensions:

  • VLM hidden: 960, Expert hidden: 720 (0.75×)
  • Attention: 15 Q-heads, 5 KV-heads, head_dim=64
  • Prefix length: 113 (64 img + 48 lang + 1 state)
  • Denoising: 10 steps, flow matching (time: 1.0 → 0.0)
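These numbers fit together, and the fit is easy to verify: the prefix length is the sum of image, language, and state tokens, and the KV cache shape is layers × batch × prefix × KV-heads × head_dim:

```python
# Sanity checks on the dimensions listed above.
img_tokens, lang_tokens, state_tokens = 64, 48, 1
prefix_len = img_tokens + lang_tokens + state_tokens          # 113 prefix positions

num_layers, batch, kv_heads, head_dim = 16, 1, 5, 64
kv_cache_shape = (num_layers, batch, prefix_len, kv_heads, head_dim)  # [16,1,113,5,64]

q_heads = 15                     # grouped-query attention: 15 Q-heads share 5 KV-heads (3:1)
vlm_hidden = q_heads * head_dim  # 15 * 64 = 960, the VLM hidden size
```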

ONNX Conversion Details

Key Challenges Solved

  1. ScatterND bypass: SigLIP VisionEmbeddings uses bucketize + ScatterND for position IDs. Bypassed by precomputing position IDs for the static 512×512 input.

  2. CumSum on bool: ONNX doesn't support CumSum on bool tensors. Monkey-patched torch.cumsum to auto-cast bool → int64 before export.

  3. bfloat16 → float32: ONNX opset 17 has limited bfloat16 support. All weights converted to float32.

  4. Flash Attention → eager: Forced _attn_implementation="eager" to avoid SDPA/Flash Attention export failures.

  5. Device mismatch: The legacy TorchScript exporter has constant-folding issues on CUDA. Exported on CPU with dynamo=False.
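As an illustration of the first workaround: with a fixed input resolution, the position-ID computation collapses to a constant arange. The patch size of 16 below is inferred from the [1,1024,768] ViT output in the architecture diagram, not taken from the export script:

```python
import numpy as np

image_size = 512  # static export resolution
patch_size = 16   # assumed: (512 / 16)^2 = 1024 matches the ViT token count
num_patches = (image_size // patch_size) ** 2

# With static shapes, the bucketize + ScatterND logic reduces to a precomputed
# arange, which exports to ONNX without the unsupported ops.
position_ids = np.arange(num_patches, dtype=np.int64)[None, :]
```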

Reproducing the Conversion

# From the original PyTorch model:
python scripts/export_to_onnx.py

# Validate cosine similarity:
python scripts/validate_onnx.py

QNN Conversion

Prerequisites

Install the Qualcomm AI Engine Direct SDK (QAIRT) v2.43.0+:

# Set up QAIRT SDK environment
export QNN_SDK_ROOT=/path/to/qairt/2.43.0.260128

# Create Python 3.10 environment (required by QAIRT)
conda create -n qairt python=3.10 -y
conda activate qairt
pip install numpy==1.26.0 onnx==1.16.1 pyyaml pandas protobuf

# Install libc++ and libunwind
conda install -c conda-forge libcxx libunwind -y
ln -sf $CONDA_PREFIX/lib/libunwind.so.8 $CONDA_PREFIX/lib/libunwind.so.1

# Set library paths
export PYTHONPATH="$QNN_SDK_ROOT/lib/python"
export LD_LIBRARY_PATH="$CONDA_PREFIX/lib:$QNN_SDK_ROOT/lib/x86_64-linux-clang:$LD_LIBRARY_PATH"

Converting ONNX to QNN

# Convert each model
for model in vision_encoder llm_backbone action_head; do
    $QNN_SDK_ROOT/bin/x86_64-linux-clang/qnn-onnx-converter \
        -i onnx/${model}.onnx \
        -o qnn/${model}
done

# Generate model libraries (x86_64)
for model in vision_encoder llm_backbone action_head; do
    $QNN_SDK_ROOT/bin/x86_64-linux-clang/qnn-model-lib-generator \
        -c qnn/${model}.cpp \
        -b qnn/${model}.bin \
        -o qnn/ \
        -t x86_64-linux-clang
done

Generating HTP Context Binary (on-device)

# Requires Snapdragon target device or HTP emulator
$QNN_SDK_ROOT/bin/x86_64-linux-clang/qnn-context-binary-generator \
    --model qnn/vision_encoder.so \
    --backend $QNN_SDK_ROOT/lib/x86_64-linux-clang/libQnnHtp.so \
    --output_dir qnn/

GR1 Microwave Task

The GR1 Microwave task in IsaacLab Arena has a humanoid robot (GR1 Pink) opening a microwave oven, placing a mustard bottle inside, and closing it:

  • Robot: Fourier GR1 humanoid (54 joint positions)
  • Task: Open microwave, place bottle, close microwave
  • Action space: 36-dim joint velocity commands
  • Observation: RGB camera (512×512) + joint positions (54-dim)
  • Success criteria: Mustard bottle placed in microwave
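Putting the task spec together, an evaluation loop has roughly this shape. The environment class here is a stub for illustration only, not the actual IsaacLab Arena interface:

```python
import numpy as np

class StubEnv:
    """Illustrative stand-in for the IsaacLab Arena environment (not its real API)."""

    def reset(self):
        rgb = np.zeros((512, 512, 3), dtype=np.uint8)   # 512x512 RGB camera
        state = np.zeros(54, dtype=np.float32)          # 54 joint positions
        return rgb, state

    def step(self, action):
        assert action.shape == (36,)                    # 36-dim velocity command
        rgb, state = self.reset()
        return rgb, state, False                        # (obs, state, done)

env = StubEnv()
rgb, state = env.reset()
done, steps = False, 0
while not done and steps < 3:
    # A real rollout would feed (rgb, state) through the 3-stage ONNX pipeline here.
    rgb, state, done = env.step(np.zeros(36, dtype=np.float32))
    steps += 1
```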

Citation

@article{smolvla2025,
    title={SmolVLA: A Small Vision-Language-Action Model for Efficient Robot Learning},
    author={HuggingFace Team},
    year={2025},
    url={https://huggingface.co/papers/2506.01844}
}

License

This model conversion follows the license of the original nvidia/smolvla-arena-gr1-microwave model.
