# SmolVLA LIBERO – QNN HTP W8A16 (INT8 Weight Quantization)
A SmolVLA 3-model pipeline (Vision Encoder + LLM Backbone + Action Head) running on the Qualcomm HTP (Hexagon Tensor Processor) with INT8 weight quantization (W8A16). It achieves ~4.8s end-to-end (wall-clock) latency on Snapdragon X Elite, successfully completing LIBERO Task 3 at an effective control frequency of 0.84 Hz (n_action_steps=4); an n_action_steps=8 configuration reaches 1.70 Hz but does not complete the task.
## Hardware / Software Requirements

| Item | Spec |
|---|---|
| Device | Snapdragon X Elite (or X Plus) – Windows ARM64 |
| OS | Windows 11 ARM64 |
| Python | 3.10 x86-64 (via Prism emulation) – python.org |
| QNN SDK | QAIRT 2.43.0.260128 – Qualcomm AI Hub |
| Build Tools | VS 2022 Build Tools (ARM64 clang-cl + lld-link) |
Note: Python must be x86-64 (not ARM64) because LIBERO / robosuite require x86 compatibility.
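To sanity-check that the interpreter really is a 64-bit x86 build before installing LIBERO, a small check like the following can help (a sketch; the exact `platform.machine()` string depends on the OS and Python build, e.g. an x86-64 build on Windows typically reports `AMD64` while a native ARM64 build reports `ARM64`):

```python
import platform
import struct

# Interpreter architecture as reported by the OS.
arch = platform.machine()
# Pointer size: 64 on a 64-bit build, 32 on a 32-bit build.
bits = struct.calcsize("P") * 8

print(f"machine={arch}, {bits}-bit")
if arch.upper() == "ARM64":
    print("WARNING: native ARM64 Python - LIBERO/robosuite wheels may not install")
```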
## Installation

### 1. Python packages

```bash
pip install "numpy<2.0" torch torchvision transformers safetensors onnx imageio pillow
```

### 2. LIBERO benchmark

```bash
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO && pip install -e .
```

### 3. QNN SDK

Download and install QAIRT 2.43.0.260128, then set the environment variable:

```bash
export QNN_SDK=C:/Users/<user>/qualcomm/qairt/2.43.0.260128
```

### 4. Clone this repo & SmolVLA weights

```bash
git clone https://huggingface.co/xpuenabler/smolvla-libero-QNN-HTP-W8A16
cd smolvla-libero-QNN-HTP-W8A16

# Download SmolVLA weights (lerobot/smolvla_base)
python -c "
from huggingface_hub import snapshot_download
snapshot_download('lerobot/smolvla_base', local_dir='smolvla_weights')
"
```
## Model Conversion Pipeline

### Step 0: Export ONNX (if needed)

If you don't have the ONNX models, export them from LeRobot:

```bash
python scripts/export_onnx.py --weights smolvla_weights/
# Produces: onnx_models/vision_encoder.onnx
#           onnx_models/llm_backbone.onnx
#           onnx_models/action_head.onnx
```

### Step 1: ONNX Preprocessing – Bool Elimination

The QNN converter cannot handle chains of bool-typed ops. Apply the fixes:

```bash
# LLM backbone: fold bool constants + eliminate bool ops
python scripts/fold_and_eliminate_bool.py onnx_models/llm_backbone.onnx onnx_models/llm_backbone_v7.onnx

# Action head: eliminate bool + fix Expand op
python scripts/fold_and_eliminate_bool.py onnx_models/action_head.onnx onnx_models/action_head_v5.onnx
python scripts/fix_ah_expand_bool.py onnx_models/action_head_v5.onnx onnx_models/action_head_qnn_v6.onnx

# LLM backbone v7: remove embedding Gather (run on CPU side instead)
python scripts/fix_llm_embed_on_cpu.py onnx_models/llm_backbone_v7.onnx
# Output: onnx_models/llm_backbone_v7.onnx (in-place update)
```
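What the bool-elimination pass does, in schematic form (this operates on a toy node list rather than the real `onnx` protobuf API that `fold_and_eliminate_bool.py` uses): constant `Cast`-to-bool nodes are folded away, and remaining bool-typed ops are retyped to int32 so the QNN converter can handle them.

```python
# Toy graph: each node is (op_type, output_dtype). The real script walks
# onnx GraphProto nodes; this just illustrates the transformation.
BOOL, INT32 = "bool", "int32"

def eliminate_bool(nodes):
    out = []
    for op, dtype in nodes:
        if op == "Cast" and dtype == BOOL:
            continue                  # fold away bool casts entirely
        if dtype == BOOL:
            dtype = INT32             # retype remaining bool ops to int32
        out.append((op, dtype))
    return out

graph = [("Cast", BOOL), ("And", BOOL), ("Where", "float32")]
print(eliminate_bool(graph))  # [('And', 'int32'), ('Where', 'float32')]
```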
### Step 2a: Build FP32 QNN Models (HTP backend)

```bash
python scripts/build_all_models.py --precision fp32
# Output: qnn_models/windows-aarch64-fp32/lib*.dll
```

### Step 2b: Build INT8W QNN Models (W8A16)

```bash
python scripts/build_int8w_models.py
# Output: qnn_models/windows-aarch64-int8w/lib*.dll
```
Requires `qnn_convert_patched.py` at the repo root (disables the matmul_to_fc optimization):

```bash
cp $QNN_SDK/lib/python/qnn_convert.py qnn_convert_patched.py
# Apply patch: disable matmul_to_fc (see scripts/build_all_models.py header)
```
### Step 3: Create Unrolled Action Head (10 denoising steps baked in)

```bash
python scripts/create_ah_unrolled.py
# Input:  onnx_models/action_head_qnn_v6.onnx
# Output: onnx_models/action_head_unrolled10.onnx (393 MB)
#         qnn_context_cache/libaction_head_unrolled_htp.dll (387 MB)
# Build time: ~24 min (Graph Sequencing stage)
```

### Step 4: Generate Context Binaries (pre-compiled HTP graphs)

```bash
# FP32 models
python scripts/build_all_models.py --context-only
# INT8W models
python scripts/gen_int8w_context_binaries.py
# Output: qnn_context_cache/lib*_fp32.serialized.bin
#         qnn_context_cache/lib*_int8w.serialized.bin
```
Context binaries save ~0.5-1s per model call by skipping graph compilation.
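A loader can then prefer a serialized context when one exists and fall back to compiling from the DLL otherwise. A minimal sketch of that lookup (the function name and fallback path layout here are illustrative; the actual loading logic lives in the inference script):

```python
from pathlib import Path

def pick_model_artifact(name: str, cache_dir: str = "qnn_context_cache") -> str:
    """Prefer a pre-compiled context binary; fall back to the model DLL."""
    ctx = Path(cache_dir) / f"lib{name}_fp32.serialized.bin"
    if ctx.exists():
        return str(ctx)  # pre-serialized graph: skips compilation (~0.5-1s)
    return f"qnn_models/windows-aarch64-fp32/lib{name}.dll"

print(pick_model_artifact("vision_encoder"))
```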
## Running Inference

### Basic (FP32 on HTP)

```bash
python inference/infer_libero_episode_qnn_htp.py \
    --precision htp \
    --task-id 3 \
    --trial-id 3 \
    --n-action-steps 4 \
    --use-unrolled-ah \
    --output output/result.mp4
```

### INT8W on HTP

```bash
python inference/infer_libero_episode_qnn_htp.py \
    --precision int8w \
    --task-id 3 \
    --trial-id 3 \
    --n-action-steps 4 \
    --use-unrolled-ah \
    --output output/result_int8w.mp4
```
### Key Arguments

| Argument | Default | Description |
|---|---|---|
| `--precision` | `fp32` | `htp` (FP32 DLL on HTP), `int8w` (W8A16 on HTP), `fp32` (CPU) |
| `--task-id` | `3` | LIBERO task index (0-9) |
| `--trial-id` | `0` | Trial index (0-49) |
| `--n-action-steps` | `1` | Receding horizon control (1-16) |
| `--use-unrolled-ah` | off | Use 10-step unrolled Action Head (single QNN call) |
| `--max-steps` | `520` | Max episode steps |
## Inference Speed

### Per-model Latency (Snapdragon X Elite)
Model execution time measured individually (excludes subprocess spawn overhead ~390ms/inference).
| Model | CPU (FP32) | HTP (FP32 DLL) | Speedup |
|---|---|---|---|
| Vision Encoder | 3.2s | 0.7s | 4.6× |
| LLM Backbone | 21.0s | 1.5s | 14.0× |
| Action Head ×10 | 6.0s | 1.1s (unrolled) | 5.5× |
| Model total | ~30s | ~3.3s | ~9× |
| Wall-clock (E2E latency) | – | ~4.8s | – |

Wall-clock includes subprocess overhead (~130ms × 3 calls), temp file I/O, and NumPy transpositions on top of model execution time.
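The per-model numbers above come from timing individual calls rather than the whole pipeline. A minimal harness for that kind of measurement could look like the following (the model-invoking function here is a stand-in for the real QNN subprocess call):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    out = fn(*args, **kwargs)
    return out, time.perf_counter() - t0

# Stand-in for a QNN model invocation; replace with the real call.
def fake_model_call(x):
    return [v * 2 for v in x]

result, elapsed = timed(fake_model_call, [1, 2, 3])
print(f"call took {elapsed * 1000:.2f} ms")
```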
### End-to-End Inference (Task 3, Trial 3)
| Config | E2E Latency | n_action_steps | Control Freq | Result |
|---|---|---|---|---|
| HTP FP32, n=8, unrolled | 4,701ms | 8 | 1.70 Hz | FAILED |
| HTP FP32, n=4, unrolled | 4,762ms | 4 | 0.84 Hz | SUCCESS |
- E2E Latency: wall-clock time from observation capture to first action output
- Control Freq: effective action execution rate = n_action_steps / E2E_latency
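The control-frequency figures follow directly from that formula:

```python
def control_freq_hz(n_action_steps: int, e2e_latency_ms: float) -> float:
    """Effective action execution rate: actions emitted per second of wall-clock."""
    return n_action_steps / (e2e_latency_ms / 1000.0)

print(round(control_freq_hz(8, 4701), 2))  # n=8 config at 4,701ms
print(round(control_freq_hz(4, 4762), 2))  # n=4 config at 4,762ms
```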
### CPU vs HTP Output Similarity (cosine)
| Component | Cosine |
|---|---|
| Vision Encoder | 0.99999 |
| LLM kv_keys | 0.889 |
| LLM kv_values | 0.768 |
| Action Head (same KV) | 0.99999 |
| End-to-end velocity | 0.986 |
LLM similarity is lower due to FP16 internal accumulation across 32 transformer layers.
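The cosine figures compare CPU and HTP output tensors flattened to vectors; a sketch of that metric:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two tensors, flattened to 1-D vectors."""
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ref = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(ref, ref))                          # identical outputs
print(cosine_similarity(ref, ref + np.array([0.0, 0.1, -0.1])))  # slight drift
```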
## Architecture

```
Observation (image + state + language)
                  │
                  ▼
[Vision Encoder]  pixel_values[1,3,512,512] → image_embeddings[1,64,960]
                  │
                  ▼
[LLM Backbone]    image_embs×2 + lang_emb[1,48,960] + state[1,32]
                  → kv_keys/values[32,1,177,5,64] + prefix_pad_masks[1,177]
                  │
                  ▼
[Action Head]     10-step flow matching (Euler, t: 1.0→0.1, dt=-0.1)
(Unrolled)        noisy_actions[1,50,32] → denoised_actions[1,50,32]
                  │
                  ▼
Actions (7-DoF: eef_pos + axis_angle + gripper)
```
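The unrolled Action Head bakes this 10-step Euler integration into a single graph. Schematically, the sampler looks like the following (with a toy velocity field standing in for the real transformer Action Head):

```python
import numpy as np

def euler_denoise(noisy_actions, velocity_model, n_steps=10):
    """Flow-matching sampler: integrate from t=1.0 toward t=0 with dt=-0.1."""
    x, t, dt = noisy_actions.copy(), 1.0, -0.1
    for _ in range(n_steps):
        x = x + dt * velocity_model(x, t)  # Euler step
        t = t + dt                         # t: 1.0, 0.9, ..., down to 0.1
    return x

# Toy velocity field; the real one is the Action Head network.
toy_velocity = lambda x, t: x * t

actions = euler_denoise(np.random.randn(1, 50, 32), toy_velocity)
print(actions.shape)  # (1, 50, 32)
```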
## Key Implementation Notes

- **FP32 DLL on HTP (not FP16):** `--float_bitwidth 16` causes ±512 saturation. A FP32 DLL on HTP is auto-converted to FP16 internally with better precision.
- **CPU-side embedding:** LLM v7 removes the `Gather` op (HTP INT32 input bug). The embedding lookup is done in Python with `embed_tokens.weight`.
- **Bool elimination:** QNN strips bool chains. `fold_and_eliminate_bool.py` replaces them with int32/float ops.
- **Context binaries:** pre-serialized HTP graphs for ~0.5s faster per-model loading.
- **Tensor transpositions:** HTP auto-transposes some dims; see `run_vision_encoder` and `run_llm_backbone_htp` in the inference script.
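The CPU-side embedding workaround is a plain table lookup. With a stand-in weight matrix (the real one is `embed_tokens.weight` loaded from the checkpoint; vocab size here is made up, hidden size 960 as in SmolVLA), it amounts to:

```python
import numpy as np

# Stand-in embedding table: toy vocab of 100 tokens, hidden size 960.
embed_tokens_weight = np.random.randn(100, 960).astype(np.float32)

def embed_on_cpu(token_ids: np.ndarray) -> np.ndarray:
    """Replace the removed ONNX Gather: select weight rows by token id."""
    return embed_tokens_weight[token_ids]  # fancy indexing == Gather on axis 0

lang_tokens = np.array([[1, 5, 7]])        # [batch, seq]
lang_emb = embed_on_cpu(lang_tokens)
print(lang_emb.shape)  # (1, 3, 960)
```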
## File Structure

```
├── inference/
│   └── infer_libero_episode_qnn_htp.py    # Main LIBERO inference script
├── scripts/
│   ├── build_all_models.py                # FP32 QNN build
│   ├── build_int8w_models.py              # W8A16 QNN build
│   ├── create_ah_unrolled.py              # Unrolled 10-step Action Head
│   ├── gen_int8w_context_binaries.py      # Context binary generation
│   ├── fold_and_eliminate_bool.py         # ONNX bool preprocessing
│   ├── fix_ah_expand_bool.py              # Action Head bool fix
│   ├── fix_llm_embed_on_cpu.py            # LLM v7 embedding removal
│   └── fix_cpp_for_htp.py                 # C++ output HTP patch
├── qnn_models/
│   ├── windows-aarch64-fp32/              # FP32 DLLs (run with HTP backend)
│   └── windows-aarch64-int8w/             # W8A16 DLLs
├── qnn_context_cache/
│   ├── lib*_fp32.serialized.bin           # Pre-compiled HTP contexts
│   └── libaction_head_unrolled_htp.*      # Unrolled AH (10-step)
├── onnx_models/                           # Preprocessed ONNX models
├── policy_preprocessor_step_5_normalizer_processor.safetensors
└── policy_postprocessor_step_1_unnormalizer_processor.safetensors
```