# SmolVLA LIBERO QNN HTP W16A16 (FP16)
QNN HTP models for SmolVLA (Small Vision-Language-Action model) on Qualcomm Hexagon Tensor Processor.
## Models
| Model | DLL Size | BIN Size | Description |
|---|---|---|---|
| `libvision_encoder_htp` | 188 MB | 188 MB | SigLIP vision encoder, input: [1,3,512,512] fp16 |
| `libllm_backbone_htp` | 675 MB | 673 MB | SmolLM2 language model backbone |
| `libaction_head_v2_htp` | 188 MB | 186 MB | Flow matching action head (diffusion denoiser) |
## Architecture
SmolVLA uses a 3-stage pipeline:
- Vision Encoder (SigLIP): pixel_values [1,3,512,512] -> image_embeddings [1,64,960]
- LLM Backbone (SmolLM2): image_embs + language tokens + state -> KV cache [32,1,177,5,64]
- Action Head (DiT): 10-step flow matching denoising -> velocity [1,50,32]
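The 3-stage data flow can be sketched with NumPy stand-ins for the three HTP models, using the tensor shapes listed above. This is a minimal illustration, not the actual runner: the `run_*` functions are placeholders for the QNN calls, and the robot-state shape `(1, 32)` is an assumption.

```python
import numpy as np

# Stand-ins for the three QNN HTP model invocations (shapes from the table above).
def run_vision_encoder(pixel_values):                 # SigLIP stand-in
    assert pixel_values.shape == (1, 3, 512, 512)
    return np.zeros((1, 64, 960), np.float16)         # image_embeddings

def run_llm_backbone(image_embs, lang_embs, state):   # SmolLM2 stand-in
    return np.zeros((32, 1, 177, 5, 64), np.float16)  # KV cache

def run_action_head(noisy_actions, kv_cache, t):      # DiT stand-in
    return np.zeros_like(noisy_actions)               # velocity [1, 50, 32]

pixel_values = np.zeros((1, 3, 512, 512), np.float16)
lang_embs    = np.zeros((1, 48, 960), np.float16)     # from CPU embedding lookup
state        = np.zeros((1, 32), np.float16)          # hypothetical state shape

img_embs = run_vision_encoder(pixel_values)
kv_cache = run_llm_backbone(img_embs, lang_embs, state)

# 10-step flow matching: Euler-integrate the predicted velocity field.
actions, n_steps = np.random.randn(1, 50, 32).astype(np.float16), 10
for i in range(n_steps):
    v = run_action_head(actions, kv_cache, t=i / n_steps)
    actions = actions + v / n_steps

print(actions.shape)  # (1, 50, 32)
```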
## Platform
- Target: Windows ARM64 (Snapdragon X Elite / Qualcomm Orion)
- Backend: QNN HTP (Hexagon Tensor Processor)
- QNN SDK: v2.43.0.260128
- Precision: FP16 (W16A16)
## ONNX → QNN Conversion Pipeline

```
┌───────────────────────────────────────────────────────────────────────────┐
│                     ONNX → QNN HTP W16A16 Conversion                      │
└───────────────────────────────────────────────────────────────────────────┘

 Original SmolVLA ONNX
 (from HuggingFace)
          │
          ▼
 ┌───────────────────┐
 │ [1] ONNX Preproc  │  fold_and_eliminate_bool.py
 │                   │  • Folds constant bool subgraphs (CumSum chains)
 │                   │  • Eliminates And/Or → Mul/Add+Clip (int32)
 │                   │  • Removes incompatible Cast→BOOL_8 ops
 └────────┬──────────┘
          │  Preprocessed ONNX
          ▼
 ┌───────────────────┐
 │ [2] QNN Converter │  qnn-onnx-converter (patched: matmul_to_fc disabled)
 │                   │  --float_bitwidth 16
 │                   │  Output: model.cpp + model.bin
 └────────┬──────────┘
          │  QNN model sources
          ▼
 ┌───────────────────┐
 │ [3] Lib Generator │  qnn-model-lib-generator
 │                   │  -t windows-aarch64
 │                   │  Output: object files (.o) in tmp_*/obj/
 └────────┬──────────┘
          │  .o files
          ▼
 ┌───────────────────┐
 │ [4] ARM64 Compile │  clang-cl --target=aarch64-pc-windows-msvc
 │                   │  Compile: QnnModel.cpp, QnnWrapperUtils.cpp,
 │                   │           QnnModelPal.cpp, QnnNetworkModel.cpp
 │     + Link        │  lld-link /DLL /MACHINE:ARM64
 │                   │  Libs: msvcrt.lib, kernel32.lib, QnnHtp.lib
 └────────┬──────────┘
          │  libXXX_htp.dll + libXXX_htp.bin
          ▼
 ┌───────────────────┐
 │ [5] Context Gen   │  qnn-context-binary-generator
 │                   │  --model libXXX_htp.dll
 │                   │  --backend QnnHtp.dll
 │                   │  Output: libXXX_htp_fp16.serialized.bin
 └────────┬──────────┘
          │  Pre-compiled HTP binary (no DSP JIT on first run)
          ▼
 ┌───────────────────┐
 │ [6] Inference     │  qnn-net-run
 │                   │  --retrieve_context libXXX.serialized.bin
 │                   │  --backend QnnHtp.dll
 └───────────────────┘
```
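Steps [2]–[6] can be sketched as a shell driver. Flags shown in the diagram are used as-is; `--input_network` and `--binary_file`, as well as the file names (`model_preprocessed.onnx`, `libXXX_htp.*`), are assumed placeholders. The script is a dry run that only prints each command; swap `echo` for real execution.

```shell
#!/bin/sh
# Dry-run sketch of pipeline steps [2]-[6]; prints each command instead of running it.
run() { echo "+ $*"; }   # replace 'echo' with actual execution to run for real

run qnn-onnx-converter --input_network model_preprocessed.onnx \
      --float_bitwidth 16                                 # [2] ONNX -> model.cpp/.bin
run qnn-model-lib-generator -c model.cpp -b model.bin \
      -t windows-aarch64                                  # [3] object files
run clang-cl --target=aarch64-pc-windows-msvc -c QnnModel.cpp QnnWrapperUtils.cpp \
      QnnModelPal.cpp QnnNetworkModel.cpp                 # [4] compile wrappers
run lld-link /DLL /MACHINE:ARM64 msvcrt.lib kernel32.lib QnnHtp.lib \
      /OUT:libXXX_htp.dll                                 # [4] link DLL
run qnn-context-binary-generator --model libXXX_htp.dll \
      --backend QnnHtp.dll --binary_file libXXX_htp_fp16.serialized   # [5]
run qnn-net-run --retrieve_context libXXX_htp_fp16.serialized.bin \
      --backend QnnHtp.dll                                # [6] inference
```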
## LLM Backbone Special Handling
The LLM backbone requires CPU-side embedding lookup to avoid HTP batch-size issues:
```
 Token IDs (CPU)
       │
       ▼
 embed_tokens.weight   ← loaded from ONNX initializer
 (CPU Gather lookup)
       │  lang_embeddings [1, 48, 960]
       ▼
 LLM HTP Model ──→ KV cache output
 (embedding input, no Gather op inside QNN model)
```
Script: `scripts/fix_llm_embed_on_cpu.py` removes the Gather op and uses `/embed_tokens/Gather_output_0` [1,48,960] as a direct HTP input.
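The CPU-side lookup that replaces the in-graph Gather is just a NumPy index into the embedding matrix. A minimal sketch, assuming SmolLM2's 49152-token vocabulary (the hidden size 960 and prompt length 48 are from above); the zero weight matrix stands in for the real ONNX initializer.

```python
import numpy as np

# CPU-side embedding lookup replacing the in-graph Gather op.
vocab_size, hidden = 49152, 960            # vocab size assumed (SmolLM2)
embed_weight = np.zeros((vocab_size, hidden), np.float16)  # stand-in for the
                                                           # ONNX initializer
token_ids = np.zeros((1, 48), dtype=np.int64)   # padded language prompt
lang_embeddings = embed_weight[token_ids]       # NumPy "Gather" on axis 0

print(lang_embeddings.shape)  # (1, 48, 960)
```

The resulting `lang_embeddings` tensor is then fed to the HTP model in place of raw token IDs.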
## Inference Speed
- Platform: Snapdragon X Elite (X1E78100), Windows ARM64
- Backend: QNN HTP v73 (Hexagon DSP)
- Context binaries: pre-compiled (no DSP JIT per run)
| Stage | Model | Time |
|---|---|---|
| Vision Encoder | libvision_encoder_htp | ~0.7 s |
| LLM Backbone | libllm_backbone_htp | ~1.5 s |
| Action Head × 10 | libaction_head_v2_htp | ~6.0 s |
| Total per step | | ~8–10 s |
Note: The first run without a context binary requires DSP JIT compilation (~30–60 s per model). Context binaries eliminate this overhead on subsequent runs.
Note on FP16 saturation: On some Snapdragon X Elite configurations, `--float_bitwidth 16` may cause output saturation (values clamped to ±512). If this occurs, use FP32 DLLs instead; the HTP backend auto-converts to FP16 internally with better numeric precision.
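A quick way to detect the clamping described above is to measure what fraction of output values sit exactly at the ±512 boundary. This is a hypothetical helper, not part of the repo's scripts:

```python
import numpy as np

def saturation_fraction(out, clamp=512.0, tol=1e-3):
    """Fraction of elements pinned at the +/-clamp value (saturation symptom)."""
    out = out.astype(np.float32)
    return float(np.mean(np.isclose(np.abs(out), clamp, rtol=0, atol=tol)))

# Example: a healthy output vs. one clamped to +/-512.
healthy = np.random.randn(1, 50, 32).astype(np.float16)
clamped = np.clip(healthy * 1e4, -512, 512)
print(saturation_fraction(healthy))  # 0.0
print(saturation_fraction(clamped))  # close to 1.0 -> switch to FP32 DLLs
```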
## Usage

```bash
# See inference/infer_libero_episode_qnn_htp.py for full LIBERO inference
python infer_libero_episode_qnn_htp.py --precision fp16 --task_id 0 --trial_id 0
```
Requires `qnn-net-run.exe` from the QNN SDK with the HTP backend (`QnnHtp.dll`).
## ONNX Preprocessing
The original SmolVLA ONNX models require preprocessing before QNN conversion:
- Constant folding: Evaluates all-constant bool subgraphs (CumSum chains)
- Bool elimination: Replaces And/Or with Mul/Add+Clip, converts comparisons to int32
- Script: `scripts/fold_and_eliminate_bool.py`
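The And/Or rewrite relies on simple arithmetic identities over {0, 1} int32 masks. A minimal NumPy check of those identities (not the actual script, which rewrites the ONNX graph):

```python
import numpy as np

# For masks in {0, 1} (int32), the boolean ops can be replaced with
# HTP-friendly arithmetic, as described above:
#   a AND b  ->  a * b
#   a OR  b  ->  clip(a + b, 0, 1)
a = np.array([0, 0, 1, 1], dtype=np.int32)
b = np.array([0, 1, 0, 1], dtype=np.int32)

and_rewrite = a * b
or_rewrite  = np.clip(a + b, 0, 1)

assert np.array_equal(and_rewrite, (a.astype(bool) & b.astype(bool)).astype(np.int32))
assert np.array_equal(or_rewrite,  (a.astype(bool) | b.astype(bool)).astype(np.int32))
print(and_rewrite, or_rewrite)  # [0 0 0 1] [0 1 1 1]
```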
## Files
- `qnn_models/windows-aarch64-fp16/` - FP16 DLL + BIN files for HTP
- `qnn_models/windows-aarch64-fp32/` - FP32 DLL + BIN files for CPU fallback testing
- `onnx_models/` - Preprocessed ONNX models
- `scripts/` - Build and preprocessing scripts
- `inference/` - LIBERO inference scripts