
SmolVLA LIBERO — QNN HTP W8A16 (INT8 Weight Quantization)

SmolVLA's 3-model pipeline (Vision Encoder + LLM Backbone + Action Head) running on the Qualcomm HTP (Hexagon Tensor Processor) with INT8 weight quantization (W8A16). Achieves ~4.8s E2E latency (wall-clock) on Snapdragon X Elite and successfully completes LIBERO Task 3 at an effective control frequency of 0.84 Hz (n_action_steps=4); n_action_steps=8 raises the rate to 1.70 Hz but fails the task.


Hardware / Software Requirements

| Item | Spec |
|---|---|
| Device | Snapdragon X Elite (or X Plus) — Windows ARM64 |
| OS | Windows 11 ARM64 |
| Python | 3.10 x86-64 (via Prism emulation) — python.org |
| QNN SDK | QAIRT 2.43.0.260128 — Qualcomm AI Hub |
| Build Tools | VS 2022 Build Tools (ARM64 clang-cl + lld-link) |

Note: Python must be x86-64 (not ARM64) because LIBERO / robosuite require x86 compatibility.
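A quick way to confirm the interpreter is the emulated x86-64 build (a small sketch; the printed values depend on your machine):

```python
import platform
import struct

# Under Prism emulation an x86-64 CPython reports the emulated architecture,
# e.g. "AMD64" on Windows; a native ARM64 build would report "ARM64".
arch = platform.machine()
bits = struct.calcsize("P") * 8  # pointer width: 64 for a 64-bit interpreter
print(arch, bits)
```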


Installation

1. Python packages

pip install "numpy<2.0" torch torchvision transformers safetensors onnx imageio pillow

2. LIBERO benchmark

git clone https://github.com/Lifelong-Robot-Learning/LIBERO.git
cd LIBERO && pip install -e .

3. QNN SDK

Download and install QAIRT 2.43.0.260128. Set environment:

export QNN_SDK=C:/Users/<user>/qualcomm/qairt/2.43.0.260128

4. Clone this repo & SmolVLA weights

git clone https://huggingface.co/xpuenabler/smolvla-libero-QNN-HTP-W8A16
cd smolvla-libero-QNN-HTP-W8A16

# Download SmolVLA weights (lerobot/smolvla_base)
python -c "
from huggingface_hub import snapshot_download
snapshot_download('lerobot/smolvla_base', local_dir='smolvla_weights')
"

Model Conversion Pipeline

Step 0: Export ONNX (if needed)

If you don't have the ONNX models, export from LeRobot:

python scripts/export_onnx.py --weights smolvla_weights/
# Produces: onnx_models/vision_encoder.onnx
#            onnx_models/llm_backbone.onnx
#            onnx_models/action_head.onnx

Step 1: ONNX Preprocessing β€” Bool Elimination

The QNN converter cannot handle chains of boolean ops in the ONNX graph. Apply the fixes below:

# LLM backbone: fold bool constants + eliminate bool ops
python scripts/fold_and_eliminate_bool.py onnx_models/llm_backbone.onnx onnx_models/llm_backbone_v7.onnx

# Action head: eliminate bool + fix Expand op
python scripts/fold_and_eliminate_bool.py onnx_models/action_head.onnx onnx_models/action_head_v5.onnx
python scripts/fix_ah_expand_bool.py onnx_models/action_head_v5.onnx onnx_models/action_head_qnn_v6.onnx

# LLM backbone v7: remove embedding Gather (run on CPU side instead)
python scripts/fix_llm_embed_on_cpu.py onnx_models/llm_backbone_v7.onnx
# Output: onnx_models/llm_backbone_v7.onnx (in-place update)
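Because fix_llm_embed_on_cpu.py strips the embedding Gather from the graph, the token-to-embedding lookup has to happen on the host before calling the backbone. A minimal NumPy sketch of that lookup (the weight here is a random stand-in for embed_tokens.weight, and the shapes are illustrative):

```python
import numpy as np

# Stand-in for embed_tokens.weight; the real matrix is [vocab_size, 960].
rng = np.random.default_rng(0)
embed_weight = rng.standard_normal((100, 960)).astype(np.float32)

def embed_on_cpu(token_ids, weight):
    """Replaces the removed ONNX Gather: a plain row lookup on the host."""
    return weight[token_ids]  # [batch, seq] -> [batch, seq, hidden]

ids = np.array([[1, 5, 7]])
embs = embed_on_cpu(ids, embed_weight)
print(embs.shape)  # (1, 3, 960)
```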

Step 2a: Build FP32 QNN Models (HTP backend)

python scripts/build_all_models.py --precision fp32
# Output: qnn_models/windows-aarch64-fp32/lib*.dll

Step 2b: Build INT8W QNN Models (W8A16)

python scripts/build_int8w_models.py
# Output: qnn_models/windows-aarch64-int8w/lib*.dll

Requires qnn_convert_patched.py at repo root (disables matmul_to_fc optimization):

cp $QNN_SDK/lib/python/qnn_convert.py qnn_convert_patched.py
# Apply patch: disable matmul_to_fc (see scripts/build_all_models.py header)

Step 3: Create Unrolled Action Head (10 denoising steps baked in)

python scripts/create_ah_unrolled.py
# Input:  onnx_models/action_head_qnn_v6.onnx
# Output: onnx_models/action_head_unrolled10.onnx (393 MB)
#         qnn_context_cache/libaction_head_unrolled_htp.dll (387 MB)
# Build time: ~24 min (Graph Sequencing stage)

Step 4: Generate Context Binaries (pre-compiled HTP graphs)

# FP32 models
python scripts/build_all_models.py --context-only

# INT8W models
python scripts/gen_int8w_context_binaries.py
# Output: qnn_context_cache/lib*_fp32.serialized.bin
#         qnn_context_cache/lib*_int8w.serialized.bin

Context binaries save ~0.5-1s per model call by skipping graph compilation.


Running Inference

Basic (FP32 on HTP)

python inference/infer_libero_episode_qnn_htp.py \
  --precision htp \
  --task-id 3 \
  --trial-id 3 \
  --n-action-steps 4 \
  --use-unrolled-ah \
  --output output/result.mp4

INT8W on HTP

python inference/infer_libero_episode_qnn_htp.py \
  --precision int8w \
  --task-id 3 \
  --trial-id 3 \
  --n-action-steps 4 \
  --use-unrolled-ah \
  --output output/result_int8w.mp4

Key Arguments

| Argument | Default | Description |
|---|---|---|
| --precision | fp32 | htp (FP32 DLL on HTP), int8w (W8A16 on HTP), fp32 (CPU) |
| --task-id | 3 | LIBERO task index (0-9) |
| --trial-id | 0 | Trial index (0-49) |
| --n-action-steps | 1 | Receding-horizon control (1-16) |
| --use-unrolled-ah | off | Use 10-step unrolled Action Head (single QNN call) |
| --max-steps | 520 | Max episode steps |

Inference Speed

Per-model Latency (Snapdragon X Elite)

Model execution time measured individually (excludes the ~390ms-per-inference subprocess spawn overhead).

| Model | CPU (FP32) | HTP (FP32 DLL) | Speedup |
|---|---|---|---|
| Vision Encoder | 3.2s | 0.7s | 4.6× |
| LLM Backbone | 21.0s | 1.5s | 14.0× |
| Action Head ×10 | 6.0s | 1.1s (unrolled) | 5.5× |
| Model total | ~30s | ~3.3s | ~9× |
| Wall-clock (E2E latency) | — | ~4.8s | — |

Wall-clock includes subprocess overhead (~130ms × 3 calls), temp file I/O, and NumPy transpositions on top of model execution time.

End-to-End Inference (Task 3, Trial 3)

| Config | E2E Latency | n_action_steps | Control Freq | Result |
|---|---|---|---|---|
| HTP FP32, n=8, unrolled | 4,701ms | 8 | 1.70 Hz | FAILED |
| HTP FP32, n=4, unrolled | 4,762ms | 4 | 0.84 Hz | SUCCESS |

  • E2E Latency: wall-clock time from observation capture to first action output
  • Control Freq: effective action execution rate = n_action_steps / E2E latency

CPU vs HTP Output Similarity (cosine)

| Component | Cosine |
|---|---|
| Vision Encoder | 0.99999 |
| LLM kv_keys | 0.889 |
| LLM kv_values | 0.768 |
| Action Head (same KV) | 0.99999 |
| End-to-end velocity | 0.986 |

LLM similarity is lower due to FP16 internal accumulation across 32 transformer layers.
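The similarity numbers above are a plain cosine over flattened tensors; a sketch of the metric, with random arrays standing in for the real CPU/HTP outputs:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two tensors, flattened; 1.0 means identical direction."""
    a = a.ravel().astype(np.float64)
    b = b.ravel().astype(np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
ref = rng.standard_normal((32, 1, 177, 5, 64))         # e.g. the kv_keys shape
perturbed = ref + 0.01 * rng.standard_normal(ref.shape)
print(cosine(ref, ref), cosine(ref, perturbed))
```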


Architecture

Observation (image + state + language)
        │
        ▼
[Vision Encoder]  pixel_values[1,3,512,512] → image_embeddings[1,64,960]
        │
        ▼
[LLM Backbone]    image_embs×2 + lang_emb[1,48,960] + state[1,32]
                  → kv_keys/values[32,1,177,5,64] + prefix_pad_masks[1,177]
        │
        ▼
[Action Head]     10-step flow matching (Euler, t:1.0→0.1, dt=-0.1)
  (Unrolled)      noisy_actions[1,50,32] → denoised_actions[1,50,32]
        │
        ▼
Actions (7-DoF: eef_pos + axis_angle + gripper)
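The unrolled head bakes those 10 Euler steps into a single graph. What the loop computes, sketched in NumPy (velocity_fn stands in for the denoiser network, which the real graph evaluates on HTP; the toy field here is only for illustration):

```python
import numpy as np

def euler_denoise(x, velocity_fn, t_start=1.0, dt=-0.1, n_steps=10):
    """Flow-matching sampler: x_{k+1} = x_k + dt * v(x_k, t_k), t: 1.0 -> 0.1."""
    t = t_start
    for _ in range(n_steps):
        x = x + dt * velocity_fn(x, t)
        t += dt
    return x

# Toy velocity field v(x, t) = x: each step scales x by (1 + dt) = 0.9.
x0 = np.ones((1, 50, 32), dtype=np.float32)  # noisy_actions shape
out = euler_denoise(x0, lambda x, t: x)
print(out.shape, round(float(out[0, 0, 0]), 4))  # (1, 50, 32) 0.3487
```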

Key Implementation Notes

  • FP32 DLL on HTP (not FP16): --float_bitwidth 16 causes Β±512 saturation. FP32 DLL + HTP auto-converts to FP16 internally with better precision.
  • CPU-side embedding: LLM v7 removes Gather op (HTP INT32 input bug). Embedding lookup done in Python with embed_tokens.weight.
  • Bool elimination: QNN strips bool chains. fold_and_eliminate_bool.py replaces with int32/float ops.
  • Context binaries: Pre-serialized HTP graphs for ~0.5s faster per-model loading.
  • Tensor transpositions: HTP auto-transposes some dims; see run_vision_encoder, run_llm_backbone_htp in inference script.

File Structure

├── inference/
│   └── infer_libero_episode_qnn_htp.py   # Main LIBERO inference script
├── scripts/
│   ├── build_all_models.py               # FP32 QNN build
│   ├── build_int8w_models.py             # W8A16 QNN build
│   ├── create_ah_unrolled.py             # Unrolled 10-step Action Head
│   ├── gen_int8w_context_binaries.py     # Context binary generation
│   ├── fold_and_eliminate_bool.py        # ONNX bool preprocessing
│   ├── fix_ah_expand_bool.py             # Action Head bool fix
│   ├── fix_llm_embed_on_cpu.py           # LLM v7 embedding removal
│   └── fix_cpp_for_htp.py                # C++ output HTP patch
├── qnn_models/
│   ├── windows-aarch64-fp32/             # FP32 DLLs (run with HTP backend)
│   └── windows-aarch64-int8w/            # W8A16 DLLs
├── qnn_context_cache/
│   ├── lib*_fp32.serialized.bin          # Pre-compiled HTP contexts
│   └── libaction_head_unrolled_htp.*     # Unrolled AH (10-step)
├── onnx_models/                          # Preprocessed ONNX models
├── policy_preprocessor_step_5_normalizer_processor.safetensors
└── policy_postprocessor_step_1_unnormalizer_processor.safetensors
