# SmolVLA LIBERO – QNN HTP W8A16 (INT8 Weight Quantization)
A SmolVLA 3-model pipeline (Vision Encoder + LLM Backbone + Action Head) running on the Qualcomm HTP (Hexagon Tensor Processor) with INT8 weight quantization (W8A16). It achieves ~4.8s end-to-end (wall-clock) latency on Snapdragon X Elite, successfully completing LIBERO Task 3 at an effective control frequency of 0.84 Hz (n_action_steps=4); an n_action_steps=8 configuration reaches 1.70 Hz but does not complete the task.
## Hardware / Software Requirements

| Item | Spec |
|---|---|
| Device | Snapdragon X Elite (or X Plus) – Windows ARM64 |
| OS | Windows 11 ARM64 |
| Python | 3.10 x86-64 (via Prism emulation) – python.org |
| QNN SDK | QAIRT 2.43.0.260128 – Qualcomm AI Hub |
| Build Tools | VS 2022 Build Tools (ARM64 clang-cl + lld-link) |
Note: Python must be x86-64 (not ARM64) because LIBERO / robosuite require x86 compatibility.
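To sanity-check that the interpreter really is a 64-bit x86 build before installing LIBERO, a small check like the following can help (a sketch; the exact `platform.machine()` string depends on the OS and Python build, e.g. an x86-64 build on Windows typically reports `AMD64` while a native ARM64 build reports `ARM64`):

```python
import platform
import struct

# Interpreter architecture as reported by the OS.
arch = platform.machine()
# Pointer size: 64 on a 64-bit build, 32 on a 32-bit build.
bits = struct.calcsize("P") * 8

print(f"machine={arch}, {bits}-bit")
if arch.upper() == "ARM64":
    print("WARNING: native ARM64 Python - LIBERO/robosuite wheels may not install")
```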
## Installation

### 1. Python packages

```bash
pip install "numpy<2.0" torch torchvision transformers safetensors onnx imageio pillow
```

### 2. LIBERO benchmark

```bash
git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO && pip install -e .
```

### 3. QNN SDK

Download and install QAIRT 2.43.0.260128, then set the environment variable:

```bash
export QNN_SDK=C:/Users/<user>/qualcomm/qairt/2.43.0.260128
```

### 4. Clone this repo & SmolVLA weights

```bash
git clone https://huggingface.co/xpuenabler/smolvla-libero-QNN-HTP-W8A16
cd smolvla-libero-QNN-HTP-W8A16

# Download SmolVLA weights (lerobot/smolvla_base)
python -c "
from huggingface_hub import snapshot_download
snapshot_download('lerobot/smolvla_base', local_dir='smolvla_weights')
"
```
## Model Conversion Pipeline

### Step 0: Export ONNX (if needed)

If you don't have the ONNX models, export them from LeRobot:

```bash
python scripts/export_onnx.py --weights smolvla_weights/
# Produces: onnx_models/vision_encoder.onnx
#           onnx_models/llm_backbone.onnx
#           onnx_models/action_head.onnx
```

### Step 1: ONNX Preprocessing – Bool Elimination

The QNN converter cannot handle chains of bool-typed ops. Apply the fixes:

```bash
# LLM backbone: fold bool constants + eliminate bool ops
python scripts/fold_and_eliminate_bool.py onnx_models/llm_backbone.onnx onnx_models/llm_backbone_v7.onnx

# Action head: eliminate bool + fix Expand op
python scripts/fold_and_eliminate_bool.py onnx_models/action_head.onnx onnx_models/action_head_v5.onnx
python scripts/fix_ah_expand_bool.py onnx_models/action_head_v5.onnx onnx_models/action_head_qnn_v6.onnx

# LLM backbone v7: remove embedding Gather (run on CPU side instead)
python scripts/fix_llm_embed_on_cpu.py onnx_models/llm_backbone_v7.onnx
# Output: onnx_models/llm_backbone_v7.onnx (in-place update)
```
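What the bool-elimination pass does, in schematic form (this operates on a toy node list rather than the real `onnx` protobuf API that `fold_and_eliminate_bool.py` uses): constant `Cast`-to-bool nodes are folded away, and remaining bool-typed ops are retyped to int32 so the QNN converter can handle them.

```python
# Toy graph: each node is (op_type, output_dtype). The real script walks
# onnx GraphProto nodes; this just illustrates the transformation.
BOOL, INT32 = "bool", "int32"

def eliminate_bool(nodes):
    out = []
    for op, dtype in nodes:
        if op == "Cast" and dtype == BOOL:
            continue                  # fold away bool casts entirely
        if dtype == BOOL:
            dtype = INT32             # retype remaining bool ops to int32
        out.append((op, dtype))
    return out

graph = [("Cast", BOOL), ("And", BOOL), ("Where", "float32")]
print(eliminate_bool(graph))  # [('And', 'int32'), ('Where', 'float32')]
```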
### Step 2a: Build FP32 QNN Models (HTP backend)

```bash
python scripts/build_all_models.py --precision fp32
# Output: qnn_models/windows-aarch64-fp32/lib*.dll
```

### Step 2b: Build INT8W QNN Models (W8A16)

```bash
python scripts/build_int8w_models.py
# Output: qnn_models/windows-aarch64-int8w/lib*.dll
```
Requires `qnn_convert_patched.py` at the repo root (disables the matmul_to_fc optimization):

```bash
cp $QNN_SDK/lib/python/qnn_convert.py qnn_convert_patched.py
# Apply patch: disable matmul_to_fc (see scripts/build_all_models.py header)
```
### Step 3: Create Unrolled Action Head (10 denoising steps baked in)

```bash
python scripts/create_ah_unrolled.py
# Input:  onnx_models/action_head_qnn_v6.onnx
# Output: onnx_models/action_head_unrolled10.onnx (393 MB)
#         qnn_context_cache/libaction_head_unrolled_htp.dll (387 MB)
# Build time: ~24 min (Graph Sequencing stage)
```

### Step 4: Generate Context Binaries (pre-compiled HTP graphs)

```bash
# FP32 models
python scripts/build_all_models.py --context-only
# INT8W models
python scripts/gen_int8w_context_binaries.py
# Output: qnn_context_cache/lib*_fp32.serialized.bin
#         qnn_context_cache/lib*_int8w.serialized.bin
```
Context binaries save ~0.5-1s per model call by skipping graph compilation.
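A loader can then prefer a serialized context when one exists and fall back to compiling from the DLL otherwise. A minimal sketch of that lookup (the function name and fallback path layout here are illustrative; the actual loading logic lives in the inference script):

```python
from pathlib import Path

def pick_model_artifact(name: str, cache_dir: str = "qnn_context_cache") -> str:
    """Prefer a pre-compiled context binary; fall back to the model DLL."""
    ctx = Path(cache_dir) / f"lib{name}_fp32.serialized.bin"
    if ctx.exists():
        return str(ctx)  # pre-serialized graph: skips compilation (~0.5-1s)
    return f"qnn_models/windows-aarch64-fp32/lib{name}.dll"

print(pick_model_artifact("vision_encoder"))
```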
## Running Inference

### Basic (FP32 on HTP)

```bash
python inference/infer_libero_episode_qnn_htp.py \
    --precision htp \
    --task-id 3 \
    --trial-id 3 \
    --n-action-steps 4 \
    --use-unrolled-ah \
    --output output/result.mp4
```

### INT8W on HTP

```bash
python inference/infer_libero_episode_qnn_htp.py \
    --precision int8w \
    --task-id 3 \
    --trial-id 3 \
    --n-action-steps 4 \
    --use-unrolled-ah \
    --output output/result_int8w.mp4
```
### Key Arguments

| Argument | Default | Description |
|---|---|---|
| `--precision` | `fp32` | `htp` (FP32 DLL on HTP), `int8w` (W8A16 on HTP), `fp32` (CPU) |
| `--task-id` | `3` | LIBERO task index (0-9) |
| `--trial-id` | `0` | Trial index (0-49) |
| `--n-action-steps` | `1` | Receding horizon control (1-16) |
| `--use-unrolled-ah` | off | Use 10-step unrolled Action Head (single QNN call) |
| `--max-steps` | `520` | Max episode steps |
## Inference Speed

### Per-model Latency (Snapdragon X Elite)
Model execution time measured individually (excludes subprocess spawn overhead ~390ms/inference).
| Model | CPU (FP32) | HTP (FP32 DLL) | Speedup |
|---|---|---|---|
| Vision Encoder | 3.2s | 0.7s | 4.6× |
| LLM Backbone | 21.0s | 1.5s | 14.0× |
| Action Head ×10 | 6.0s | 1.1s (unrolled) | 5.5× |
| Model total | ~30s | ~3.3s | ~9× |
| Wall-clock (E2E latency) | – | ~4.8s | – |

Wall-clock includes subprocess overhead (~130ms × 3 calls), temp file I/O, and NumPy transpositions on top of model execution time.
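The per-model numbers above come from timing individual calls rather than the whole pipeline. A minimal harness for that kind of measurement could look like the following (the model-invoking function here is a stand-in for the real QNN subprocess call):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    t0 = time.perf_counter()
    out = fn(*args, **kwargs)
    return out, time.perf_counter() - t0

# Stand-in for a QNN model invocation; replace with the real call.
def fake_model_call(x):
    return [v * 2 for v in x]

result, elapsed = timed(fake_model_call, [1, 2, 3])
print(f"call took {elapsed * 1000:.2f} ms")
```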
### End-to-End Inference (Task 3, Trial 3)
| Config | E2E Latency | n_action_steps | Control Freq | Result |
|---|---|---|---|---|
| HTP FP32, n=8, unrolled | 4,701ms | 8 | 1.70 Hz | FAILED |
| HTP FP32, n=4, unrolled | 4,762ms | 4 | 0.84 Hz | SUCCESS |
- E2E Latency: wall-clock time from observation capture to first action output
- Control Freq: effective action execution rate = n_action_steps / E2E_latency
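The control-frequency figures follow directly from that formula:

```python
def control_freq_hz(n_action_steps: int, e2e_latency_ms: float) -> float:
    """Effective action execution rate: actions emitted per second of wall-clock."""
    return n_action_steps / (e2e_latency_ms / 1000.0)

print(round(control_freq_hz(8, 4701), 2))  # n=8 config at 4,701ms
print(round(control_freq_hz(4, 4762), 2))  # n=4 config at 4,762ms
```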
### CPU vs HTP Output Similarity (cosine)
| Component | Cosine |
|---|---|
| Vision Encoder | 0.99999 |
| LLM kv_keys | 0.889 |
| LLM kv_values | 0.768 |
| Action Head (same KV) | 0.99999 |
| End-to-end velocity | 0.986 |
LLM similarity is lower due to FP16 internal accumulation across 32 transformer layers.
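The cosine figures compare CPU and HTP output tensors flattened to vectors; a sketch of that metric:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two tensors, flattened to 1-D vectors."""
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ref = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(ref, ref))                          # identical outputs
print(cosine_similarity(ref, ref + np.array([0.0, 0.1, -0.1])))  # slight drift
```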
## Architecture

```
Observation (image + state + language)
                  │
                  ▼
[Vision Encoder]  pixel_values[1,3,512,512] → image_embeddings[1,64,960]
                  │
                  ▼
[LLM Backbone]    image_embs×2 + lang_emb[1,48,960] + state[1,32]
                  → kv_keys/values[32,1,177,5,64] + prefix_pad_masks[1,177]
                  │
                  ▼
[Action Head]     10-step flow matching (Euler, t: 1.0→0.1, dt=-0.1)
(Unrolled)        noisy_actions[1,50,32] → denoised_actions[1,50,32]
                  │
                  ▼
Actions (7-DoF: eef_pos + axis_angle + gripper)
```
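The unrolled Action Head bakes this 10-step Euler integration into a single graph. Schematically, the sampler looks like the following (with a toy velocity field standing in for the real transformer Action Head):

```python
import numpy as np

def euler_denoise(noisy_actions, velocity_model, n_steps=10):
    """Flow-matching sampler: integrate from t=1.0 toward t=0 with dt=-0.1."""
    x, t, dt = noisy_actions.copy(), 1.0, -0.1
    for _ in range(n_steps):
        x = x + dt * velocity_model(x, t)  # Euler step
        t = t + dt                         # t: 1.0, 0.9, ..., down to 0.1
    return x

# Toy velocity field; the real one is the Action Head network.
toy_velocity = lambda x, t: x * t

actions = euler_denoise(np.random.randn(1, 50, 32), toy_velocity)
print(actions.shape)  # (1, 50, 32)
```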
## Key Implementation Notes

- **FP32 DLL on HTP (not FP16):** `--float_bitwidth 16` causes ±512 saturation. A FP32 DLL on HTP is auto-converted to FP16 internally with better precision.
- **CPU-side embedding:** LLM v7 removes the `Gather` op (HTP INT32 input bug). The embedding lookup is done in Python with `embed_tokens.weight`.
- **Bool elimination:** QNN strips bool chains. `fold_and_eliminate_bool.py` replaces them with int32/float ops.
- **Context binaries:** pre-serialized HTP graphs for ~0.5s faster per-model loading.
- **Tensor transpositions:** HTP auto-transposes some dims; see `run_vision_encoder` and `run_llm_backbone_htp` in the inference script.
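The CPU-side embedding workaround is a plain table lookup. With a stand-in weight matrix (the real one is `embed_tokens.weight` loaded from the checkpoint; vocab size here is made up, hidden size 960 as in SmolVLA), it amounts to:

```python
import numpy as np

# Stand-in embedding table: toy vocab of 100 tokens, hidden size 960.
embed_tokens_weight = np.random.randn(100, 960).astype(np.float32)

def embed_on_cpu(token_ids: np.ndarray) -> np.ndarray:
    """Replace the removed ONNX Gather: select weight rows by token id."""
    return embed_tokens_weight[token_ids]  # fancy indexing == Gather on axis 0

lang_tokens = np.array([[1, 5, 7]])        # [batch, seq]
lang_emb = embed_on_cpu(lang_tokens)
print(lang_emb.shape)  # (1, 3, 960)
```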
## File Structure

```
├── inference/
│   └── infer_libero_episode_qnn_htp.py    # Main LIBERO inference script
├── scripts/
│   ├── build_all_models.py                # FP32 QNN build
│   ├── build_int8w_models.py              # W8A16 QNN build
│   ├── create_ah_unrolled.py              # Unrolled 10-step Action Head
│   ├── gen_int8w_context_binaries.py      # Context binary generation
│   ├── fold_and_eliminate_bool.py         # ONNX bool preprocessing
│   ├── fix_ah_expand_bool.py              # Action Head bool fix
│   ├── fix_llm_embed_on_cpu.py            # LLM v7 embedding removal
│   └── fix_cpp_for_htp.py                 # C++ output HTP patch
├── qnn_models/
│   ├── windows-aarch64-fp32/              # FP32 DLLs (run with HTP backend)
│   └── windows-aarch64-int8w/             # W8A16 DLLs
├── qnn_context_cache/
│   ├── lib*_fp32.serialized.bin           # Pre-compiled HTP contexts
│   └── libaction_head_unrolled_htp.*      # Unrolled AH (10-step)
├── onnx_models/                           # Preprocessed ONNX models
├── policy_preprocessor_step_5_normalizer_processor.safetensors
└── policy_postprocessor_step_1_unnormalizer_processor.safetensors
```