SmolVLA LIBERO QNN HTP W16A16 (FP16)

QNN HTP models for SmolVLA (Small Vision-Language-Action model) on the Qualcomm Hexagon Tensor Processor.

Models

| Model | DLL Size | BIN Size | Description |
|---|---|---|---|
| libvision_encoder_htp | 188 MB | 188 MB | SigLIP vision encoder, input: [1,3,512,512] fp16 |
| libllm_backbone_htp | 675 MB | 673 MB | SmolLM2 language model backbone |
| libaction_head_v2_htp | 188 MB | 186 MB | Flow matching action head (diffusion denoiser) |

Architecture

SmolVLA uses a 3-stage pipeline:

  1. Vision Encoder (SigLIP): pixel_values [1,3,512,512] -> image_embeddings [1,64,960]
  2. LLM Backbone (SmolLM2): image_embs + language tokens + state -> KV cache [32,1,177,5,64]
  3. Action Head (DiT): 10-step flow matching denoising -> velocity [1,50,32]
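The data flow above can be sketched at the shape level. The three stage functions below are placeholder stand-ins for the HTP context binaries (not real QNN calls), and the state-vector width is an assumption; only the tensor shapes come from this card:

```python
import numpy as np

# Shape-level sketch of the 3-stage SmolVLA pipeline. Stage bodies are
# placeholders; only the shapes follow the card.

def vision_encoder(pixel_values):
    # SigLIP: pixel_values [1,3,512,512] -> image_embeddings [1,64,960]
    assert pixel_values.shape == (1, 3, 512, 512)
    return np.zeros((1, 64, 960), dtype=np.float16)

def llm_backbone(image_embs, lang_embs, state):
    # SmolLM2: image embs + language embs + state -> KV cache [32,1,177,5,64]
    return np.zeros((32, 1, 177, 5, 64), dtype=np.float16)

def action_head(kv_cache, noisy_actions, t):
    # DiT denoiser: one flow-matching step -> velocity [1,50,32]
    return np.zeros_like(noisy_actions)

pixels = np.zeros((1, 3, 512, 512), dtype=np.float16)
image_embs = vision_encoder(pixels)
kv_cache = llm_backbone(image_embs,
                        np.zeros((1, 48, 960), dtype=np.float16),  # language embeddings
                        np.zeros((1, 32), dtype=np.float16))       # state (width assumed)

# 10-step Euler integration of the predicted velocity field
actions = np.random.randn(1, 50, 32).astype(np.float16)
dt = 1.0 / 10
for step in range(10):
    velocity = action_head(kv_cache, actions, step * dt)
    actions = actions + dt * velocity
```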

Platform

  • Target: Windows ARM64 (Snapdragon X Elite / Qualcomm Orion)
  • Backend: QNN HTP (Hexagon Tensor Processor)
  • QNN SDK: v2.43.0.260128
  • Precision: FP16 (W16A16)

ONNX β†’ QNN Conversion Pipeline

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    ONNX β†’ QNN HTP W16A16 Conversion                     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

  Original SmolVLA ONNX
  (from HuggingFace)
         β”‚
         β–Ό
 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
 β”‚ [1] ONNX Preproc  β”‚  fold_and_eliminate_bool.py
 β”‚                   β”‚  β€’ Folds constant bool subgraphs (CumSum chains)
 β”‚                   β”‚  β€’ Eliminates And/Or β†’ Mul/Add+Clip (int32)
 │                   │  • Removes incompatible Cast→BOOL_8 ops
 β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚  Preprocessed ONNX
          β–Ό
 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
 β”‚ [2] QNN Converter β”‚  qnn-onnx-converter (patched: matmul_to_fc disabled)
 β”‚                   β”‚  --float_bitwidth 16
 β”‚                   β”‚  Output: model.cpp + model.bin
 β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚  QNN model sources
          β–Ό
 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
 β”‚ [3] Lib Generator β”‚  qnn-model-lib-generator
 β”‚                   β”‚  -t windows-aarch64
 β”‚                   β”‚  Output: object files (.o) in tmp_*/obj/
 β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚  .o files
          β–Ό
 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
 β”‚ [4] ARM64 Compile β”‚  clang-cl --target=aarch64-pc-windows-msvc
 β”‚                   β”‚  Compile: QnnModel.cpp, QnnWrapperUtils.cpp,
 β”‚                   β”‚           QnnModelPal.cpp, QnnNetworkModel.cpp
 β”‚     + Link        β”‚  lld-link /DLL /MACHINE:ARM64
 β”‚                   β”‚  Libs: msvcrt.lib, kernel32.lib, QnnHtp.lib
 β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚  libXXX_htp.dll + libXXX_htp.bin
          β–Ό
 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
 β”‚ [5] Context Gen   β”‚  qnn-context-binary-generator
 β”‚                   β”‚  --model libXXX_htp.dll
 β”‚                   β”‚  --backend QnnHtp.dll
 β”‚                   β”‚  Output: libXXX_htp_fp16.serialized.bin
 β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚  Pre-compiled HTP binary (no DSP JIT on first run)
          β–Ό
 β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
 β”‚ [6] Inference     β”‚  qnn-net-run
 β”‚                   β”‚  --retrieve_context libXXX.serialized.bin
 β”‚                   β”‚  --backend QnnHtp.dll
 β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
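Steps [2], [3], [5], and [6] can be driven from a small script. In the sketch below, only the flags shown in the diagram are confirmed; the remaining flags (`--input_network`, `-c`/`-b`, `--binary_file`) and the file paths are assumptions to check against your QNN SDK version's tool help:

```python
# Hypothetical driver assembling the conversion commands from the diagram.
# Flags not shown in the diagram are assumptions; verify with --help.
model = "libllm_backbone_htp"

convert = ["qnn-onnx-converter",                 # step [2]
           "--input_network", f"{model}.onnx",   # assumed flag
           "--float_bitwidth", "16",
           "--output_path", f"{model}.cpp"]      # assumed flag

libgen = ["qnn-model-lib-generator",             # step [3]
          "-c", f"{model}.cpp",                  # assumed flags
          "-b", f"{model}.bin",
          "-t", "windows-aarch64"]

ctxgen = ["qnn-context-binary-generator",        # step [5]
          "--model", f"{model}.dll",
          "--backend", "QnnHtp.dll",
          "--binary_file", f"{model}_fp16.serialized"]  # assumed flag

run = ["qnn-net-run",                            # step [6]
       "--retrieve_context", f"{model}_fp16.serialized.bin",
       "--backend", "QnnHtp.dll"]

# e.g. subprocess.run(convert, check=True), then each later stage in order
```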

LLM Backbone Special Handling

The LLM backbone requires CPU-side embedding lookup to avoid HTP batch-size issues:

  Token IDs (CPU)
       β”‚
       β–Ό
  embed_tokens.weight  ← loaded from ONNX initializer
  (CPU Gather lookup)
       β”‚
        ▼  lang_embeddings [1, 48, 960]
  LLM HTP Model ──→ KV cache output
  (embedding input, no Gather op inside QNN model)

Script: scripts/fix_llm_embed_on_cpu.py removes the Gather op and uses /embed_tokens/Gather_output_0 [1,48,960] as direct HTP input.
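The CPU-side lookup amounts to a NumPy fancy-indexing Gather. A minimal sketch, with random weights standing in for the ONNX initializer (the vocabulary size of 49152 is an assumption about SmolLM2; the [1, 48, 960] shape is from this card):

```python
import numpy as np

# CPU-side embedding lookup replacing the Gather op removed from the QNN model.
vocab_size, hidden = 49152, 960                      # vocab size assumed
embed_tokens_weight = np.random.randn(vocab_size, hidden).astype(np.float16)

token_ids = np.zeros((1, 48), dtype=np.int64)        # tokenized prompt (padded to 48)
lang_embeddings = embed_tokens_weight[token_ids]     # fancy indexing == Gather

# lang_embeddings [1, 48, 960] is then fed to the HTP model as a direct input
```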

Inference Speed

  • Platform: Snapdragon X Elite (X1E78100), Windows ARM64
  • Backend: QNN HTP v73 (Hexagon DSP)
  • Context binaries: pre-compiled (no DSP JIT per run)

| Stage | Model | Time |
|---|---|---|
| Vision Encoder | libvision_encoder_htp | ~0.7 s |
| LLM Backbone | libllm_backbone_htp | ~1.5 s |
| Action Head × 10 | libaction_head_v2_htp | ~6.0 s |
| Total per step | | ~8–10 s |
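The total follows from the per-stage numbers (ten action-head calls at roughly 0.6 s each account for the ~6.0 s row); the gap to the observed ~8–10 s is presumably per-call overhead:

```python
# Per-step latency from the table: one vision pass, one LLM pass,
# ten action-head denoising calls.
vision_s, llm_s, action_head_total_s = 0.7, 1.5, 6.0
total_s = vision_s + llm_s + action_head_total_s
print(f"~{total_s:.1f} s compute per step")  # compute-only lower bound
```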

Note: First run without context binary requires DSP JIT compilation (~30–60 s per model). Context binaries eliminate this overhead on subsequent runs.

Note on FP16 saturation: On some Snapdragon X Elite configurations, --float_bitwidth 16 may cause output saturation (values clamped to ±512). If this occurs, use the FP32 DLLs instead; the HTP backend auto-converts to FP16 internally with better numeric precision.
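A simple heuristic for spotting this failure mode is to flag an output tensor when a noticeable fraction of its values sit exactly at the ±512 clamp. This is a sketch of such a check, not part of the repo's scripts:

```python
import numpy as np

# Heuristic: a tensor "looks saturated" if more than `frac` of its values
# lie at the +/-512 clamp described above.
def looks_saturated(x, clamp=512.0, frac=0.01):
    x = np.asarray(x, dtype=np.float32)
    at_clamp = np.mean(np.abs(np.abs(x) - clamp) < 1e-3)
    return bool(at_clamp > frac)

healthy = np.random.randn(1, 50, 32).astype(np.float16)   # typical action output
clipped = np.clip(healthy * 1e4, -512, 512)               # simulated saturation
print(looks_saturated(healthy), looks_saturated(clipped))  # False True
```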

Usage

# See inference/infer_libero_episode_qnn_htp.py for full LIBERO inference
python infer_libero_episode_qnn_htp.py --precision fp16 --task_id 0 --trial_id 0

Requires qnn-net-run.exe from the QNN SDK with the HTP backend (QnnHtp.dll).

ONNX Preprocessing

The original SmolVLA ONNX models require preprocessing before QNN conversion:

  • Constant folding: Evaluates all-constant bool subgraphs (CumSum chains)
  • Bool elimination: Replaces And/Or with Mul/Add+Clip, converts comparisons to int32
  • Script: scripts/fold_and_eliminate_bool.py

Files

  • qnn_models/windows-aarch64-fp16/ - FP16 DLL + BIN files for HTP
  • qnn_models/windows-aarch64-fp32/ - FP32 DLL + BIN files for CPU fallback testing
  • onnx_models/ - Preprocessed ONNX models
  • scripts/ - Build and preprocessing scripts
  • inference/ - LIBERO inference scripts