# SmolVLA LIBERO QNN HTP W16A16 (FP16)
QNN HTP models for SmolVLA (Small Vision-Language-Action model) on Qualcomm Hexagon Tensor Processor.
## Models
| Model | DLL Size | BIN Size | Description |
|---|---|---|---|
| `libvision_encoder_htp` | 188 MB | 188 MB | SigLIP vision encoder, input: [1,3,512,512] fp16 |
| `libllm_backbone_htp` | 675 MB | 673 MB | SmolLM2 language model backbone |
| `libaction_head_v2_htp` | 188 MB | 186 MB | Flow matching action head (diffusion denoiser) |
## Architecture
SmolVLA uses a 3-stage pipeline:
- Vision Encoder (SigLIP): pixel_values [1,3,512,512] -> image_embeddings [1,64,960]
- LLM Backbone (SmolLM2): image_embs + language tokens + state -> KV cache [32,1,177,5,64]
- Action Head (DiT): 10-step flow matching denoising -> velocity [1,50,32]
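The 3-stage data flow can be sketched with NumPy stand-ins for the three HTP models, using the tensor shapes listed above. This is a minimal illustration, not the actual runner: the `run_*` functions are placeholders for the QNN calls, and the robot-state shape `(1, 32)` is an assumption.

```python
import numpy as np

# Stand-ins for the three QNN HTP model invocations (shapes from the table above).
def run_vision_encoder(pixel_values):                 # SigLIP stand-in
    assert pixel_values.shape == (1, 3, 512, 512)
    return np.zeros((1, 64, 960), np.float16)         # image_embeddings

def run_llm_backbone(image_embs, lang_embs, state):   # SmolLM2 stand-in
    return np.zeros((32, 1, 177, 5, 64), np.float16)  # KV cache

def run_action_head(noisy_actions, kv_cache, t):      # DiT stand-in
    return np.zeros_like(noisy_actions)               # velocity [1, 50, 32]

pixel_values = np.zeros((1, 3, 512, 512), np.float16)
lang_embs    = np.zeros((1, 48, 960), np.float16)     # from CPU embedding lookup
state        = np.zeros((1, 32), np.float16)          # hypothetical state shape

img_embs = run_vision_encoder(pixel_values)
kv_cache = run_llm_backbone(img_embs, lang_embs, state)

# 10-step flow matching: Euler-integrate the predicted velocity field.
actions, n_steps = np.random.randn(1, 50, 32).astype(np.float16), 10
for i in range(n_steps):
    v = run_action_head(actions, kv_cache, t=i / n_steps)
    actions = actions + v / n_steps

print(actions.shape)  # (1, 50, 32)
```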
## Platform
- Target: Windows ARM64 (Snapdragon X Elite / Qualcomm Orion)
- Backend: QNN HTP (Hexagon Tensor Processor)
- QNN SDK: v2.43.0.260128
- Precision: FP16 (W16A16)
## ONNX → QNN Conversion Pipeline

```
┌───────────────────────────────────────────────────────────────────────────┐
│                     ONNX → QNN HTP W16A16 Conversion                      │
└───────────────────────────────────────────────────────────────────────────┘

 Original SmolVLA ONNX
 (from HuggingFace)
          │
          ▼
 ┌───────────────────┐
 │ [1] ONNX Preproc  │  fold_and_eliminate_bool.py
 │                   │  • Folds constant bool subgraphs (CumSum chains)
 │                   │  • Eliminates And/Or → Mul/Add+Clip (int32)
 │                   │  • Removes incompatible Cast→BOOL_8 ops
 └────────┬──────────┘
          │  Preprocessed ONNX
          ▼
 ┌───────────────────┐
 │ [2] QNN Converter │  qnn-onnx-converter (patched: matmul_to_fc disabled)
 │                   │  --float_bitwidth 16
 │                   │  Output: model.cpp + model.bin
 └────────┬──────────┘
          │  QNN model sources
          ▼
 ┌───────────────────┐
 │ [3] Lib Generator │  qnn-model-lib-generator
 │                   │  -t windows-aarch64
 │                   │  Output: object files (.o) in tmp_*/obj/
 └────────┬──────────┘
          │  .o files
          ▼
 ┌───────────────────┐
 │ [4] ARM64 Compile │  clang-cl --target=aarch64-pc-windows-msvc
 │                   │  Compile: QnnModel.cpp, QnnWrapperUtils.cpp,
 │                   │           QnnModelPal.cpp, QnnNetworkModel.cpp
 │     + Link        │  lld-link /DLL /MACHINE:ARM64
 │                   │  Libs: msvcrt.lib, kernel32.lib, QnnHtp.lib
 └────────┬──────────┘
          │  libXXX_htp.dll + libXXX_htp.bin
          ▼
 ┌───────────────────┐
 │ [5] Context Gen   │  qnn-context-binary-generator
 │                   │  --model libXXX_htp.dll
 │                   │  --backend QnnHtp.dll
 │                   │  Output: libXXX_htp_fp16.serialized.bin
 └────────┬──────────┘
          │  Pre-compiled HTP binary (no DSP JIT on first run)
          ▼
 ┌───────────────────┐
 │ [6] Inference     │  qnn-net-run
 │                   │  --retrieve_context libXXX.serialized.bin
 │                   │  --backend QnnHtp.dll
 └───────────────────┘
```
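Steps [2]–[6] can be sketched as a shell driver. Flags shown in the diagram are used as-is; `--input_network` and `--binary_file`, as well as the file names (`model_preprocessed.onnx`, `libXXX_htp.*`), are assumed placeholders. The script is a dry run that only prints each command; swap `echo` for real execution.

```shell
#!/bin/sh
# Dry-run sketch of pipeline steps [2]-[6]; prints each command instead of running it.
run() { echo "+ $*"; }   # replace 'echo' with actual execution to run for real

run qnn-onnx-converter --input_network model_preprocessed.onnx \
      --float_bitwidth 16                                 # [2] ONNX -> model.cpp/.bin
run qnn-model-lib-generator -c model.cpp -b model.bin \
      -t windows-aarch64                                  # [3] object files
run clang-cl --target=aarch64-pc-windows-msvc -c QnnModel.cpp QnnWrapperUtils.cpp \
      QnnModelPal.cpp QnnNetworkModel.cpp                 # [4] compile wrappers
run lld-link /DLL /MACHINE:ARM64 msvcrt.lib kernel32.lib QnnHtp.lib \
      /OUT:libXXX_htp.dll                                 # [4] link DLL
run qnn-context-binary-generator --model libXXX_htp.dll \
      --backend QnnHtp.dll --binary_file libXXX_htp_fp16.serialized   # [5]
run qnn-net-run --retrieve_context libXXX_htp_fp16.serialized.bin \
      --backend QnnHtp.dll                                # [6] inference
```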
## LLM Backbone Special Handling
The LLM backbone requires CPU-side embedding lookup to avoid HTP batch-size issues:
```
 Token IDs (CPU)
       │
       ▼
 embed_tokens.weight   ← loaded from ONNX initializer
 (CPU Gather lookup)
       │  lang_embeddings [1, 48, 960]
       ▼
 LLM HTP Model ──→ KV cache output
 (embedding input, no Gather op inside QNN model)
```
Script: `scripts/fix_llm_embed_on_cpu.py` removes the Gather op and uses `/embed_tokens/Gather_output_0` [1,48,960] as a direct HTP input.
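The CPU-side lookup that replaces the in-graph Gather is just a NumPy index into the embedding matrix. A minimal sketch, assuming SmolLM2's 49152-token vocabulary (the hidden size 960 and prompt length 48 are from above); the zero weight matrix stands in for the real ONNX initializer.

```python
import numpy as np

# CPU-side embedding lookup replacing the in-graph Gather op.
vocab_size, hidden = 49152, 960            # vocab size assumed (SmolLM2)
embed_weight = np.zeros((vocab_size, hidden), np.float16)  # stand-in for the
                                                           # ONNX initializer
token_ids = np.zeros((1, 48), dtype=np.int64)   # padded language prompt
lang_embeddings = embed_weight[token_ids]       # NumPy "Gather" on axis 0

print(lang_embeddings.shape)  # (1, 48, 960)
```

The resulting `lang_embeddings` tensor is then fed to the HTP model in place of raw token IDs.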
## Inference Speed
- Platform: Snapdragon X Elite (X1E78100), Windows ARM64
- Backend: QNN HTP v73 (Hexagon DSP)
- Context binaries: pre-compiled (no DSP JIT per run)
| Stage | Model | Time |
|---|---|---|
| Vision Encoder | libvision_encoder_htp | ~0.7 s |
| LLM Backbone | libllm_backbone_htp | ~1.5 s |
| Action Head × 10 | libaction_head_v2_htp | ~6.0 s |
| Total per step | | ~8–10 s |
Note: The first run without a context binary requires DSP JIT compilation (~30–60 s per model). Context binaries eliminate this overhead on subsequent runs.
Note on FP16 saturation: On some Snapdragon X Elite configurations, `--float_bitwidth 16` may cause output saturation (values clamped to ±512). If this occurs, use FP32 DLLs instead; the HTP backend auto-converts to FP16 internally with better numeric precision.
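A quick way to detect the clamping described above is to measure what fraction of output values sit exactly at the ±512 boundary. This is a hypothetical helper, not part of the repo's scripts:

```python
import numpy as np

def saturation_fraction(out, clamp=512.0, tol=1e-3):
    """Fraction of elements pinned at the +/-clamp value (saturation symptom)."""
    out = out.astype(np.float32)
    return float(np.mean(np.isclose(np.abs(out), clamp, rtol=0, atol=tol)))

# Example: a healthy output vs. one clamped to +/-512.
healthy = np.random.randn(1, 50, 32).astype(np.float16)
clamped = np.clip(healthy * 1e4, -512, 512)
print(saturation_fraction(healthy))  # 0.0
print(saturation_fraction(clamped))  # close to 1.0 -> switch to FP32 DLLs
```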
## Usage

```bash
# See inference/infer_libero_episode_qnn_htp.py for full LIBERO inference
python infer_libero_episode_qnn_htp.py --precision fp16 --task_id 0 --trial_id 0
```
Requires `qnn-net-run.exe` from the QNN SDK with the HTP backend (`QnnHtp.dll`).
## ONNX Preprocessing
The original SmolVLA ONNX models require preprocessing before QNN conversion:
- Constant folding: Evaluates all-constant bool subgraphs (CumSum chains)
- Bool elimination: Replaces And/Or with Mul/Add+Clip, converts comparisons to int32
- Script: `scripts/fold_and_eliminate_bool.py`
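The And/Or rewrite relies on simple arithmetic identities over {0, 1} int32 masks. A minimal NumPy check of those identities (not the actual script, which rewrites the ONNX graph):

```python
import numpy as np

# For masks in {0, 1} (int32), the boolean ops can be replaced with
# HTP-friendly arithmetic, as described above:
#   a AND b  ->  a * b
#   a OR  b  ->  clip(a + b, 0, 1)
a = np.array([0, 0, 1, 1], dtype=np.int32)
b = np.array([0, 1, 0, 1], dtype=np.int32)

and_rewrite = a * b
or_rewrite  = np.clip(a + b, 0, 1)

assert np.array_equal(and_rewrite, (a.astype(bool) & b.astype(bool)).astype(np.int32))
assert np.array_equal(or_rewrite,  (a.astype(bool) | b.astype(bool)).astype(np.int32))
print(and_rewrite, or_rewrite)  # [0 0 0 1] [0 1 1 1]
```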
## Files
- `qnn_models/windows-aarch64-fp16/` - FP16 DLL + BIN files for HTP
- `qnn_models/windows-aarch64-fp32/` - FP32 DLL + BIN files for CPU fallback testing
- `onnx_models/` - Preprocessed ONNX models
- `scripts/` - Build and preprocessing scripts
- `inference/` - LIBERO inference scripts