# Llama 3.2 1B Instruct - QNN HTP Z4 Quantized

Pre-compiled model binary for Qualcomm Hexagon HTP NPU inference using QNN GenAI Transformer backend.

## Model Details

| Property | Value |
|---|---|
| Base Model | meta-llama/Llama-3.2-1B-Instruct |
| Quantization | Z4 (Qualcomm 4-bit) |
| Format | QNN GenAI Transformer single binary |
| SDK Version | QAIRT v2.38.0.250901 |
| Target Hardware | Qualcomm Hexagon HTP v73+ NPU |
| Binary Size | ~1.6 GB |
| Tensors | 147 (Z4 for weights, F32 for norms) |

## Tested Hardware

- Qualcomm IQ-9075 EVK (QCS9075 SoC, Hexagon HTP v73)

## Usage

This binary is designed for use with the Qualcomm Genie runtime (`libGenie.so`) or the `genie-t2t-run` CLI.

### With genie-t2t-run

```sh
cd /path/to/model/
LD_LIBRARY_PATH=/path/to/qnn-libs:/usr/lib \
ADSP_LIBRARY_PATH="/usr/lib/dsp/cdsp;/usr/lib/dsp/cdsp1" \
genie-t2t-run -c genie_config.json -p "Hello, how are you?"
```
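For scripted use, the same invocation can be assembled from Python. This is a minimal sketch: the library paths are the placeholders from the command above, and actually launching the process (e.g. via `subprocess.run`) is left to the caller since `genie-t2t-run` only exists on a device with the QAIRT runtime installed.

```python
import os

def build_genie_cmd(model_dir, prompt,
                    qnn_libs="/path/to/qnn-libs",
                    dsp_paths="/usr/lib/dsp/cdsp;/usr/lib/dsp/cdsp1"):
    """Assemble the genie-t2t-run command, environment, and working dir.

    qnn_libs and dsp_paths are placeholders; point them at your QAIRT
    runtime libraries and the device's DSP library directories.
    """
    env = dict(os.environ)
    env["LD_LIBRARY_PATH"] = f"{qnn_libs}:/usr/lib"
    env["ADSP_LIBRARY_PATH"] = dsp_paths
    cmd = ["genie-t2t-run", "-c", "genie_config.json", "-p", prompt]
    # Run from the model directory so genie_config.json's relative
    # model-path and tokenizer entries resolve correctly:
    #   subprocess.run(cmd, env=env, cwd=model_dir, check=True)
    return cmd, env, model_dir

cmd, env, cwd = build_genie_cmd("/path/to/model", "Hello, how are you?")
```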

### genie_config.json

```json
{
  "dialog": {
    "backend": "QnnGenAiTransformer",
    "model-path": "llama3.2-1b-instruct-z4.bin",
    "tokenizer": "tokenizer.json"
  }
}
```
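A malformed config is the most common source of launch failures. The sanity check below only verifies the three `dialog` keys used in this card's config (Genie accepts additional options not covered here); it is an illustrative sketch, not part of the Genie API.

```python
import json

# The keys this model card's config relies on (assumption: Genie
# configs may carry more options than these).
REQUIRED = {"backend", "model-path", "tokenizer"}

def check_genie_config(text):
    """Parse a genie_config.json string and verify the dialog keys."""
    cfg = json.loads(text)
    dialog = cfg.get("dialog", {})
    missing = REQUIRED - dialog.keys()
    if missing:
        raise ValueError(f"genie_config.json missing keys: {sorted(missing)}")
    return dialog

sample = """{"dialog": {"backend": "QnnGenAiTransformer",
  "model-path": "llama3.2-1b-instruct-z4.bin",
  "tokenizer": "tokenizer.json"}}"""
dialog = check_genie_config(sample)
```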

## Compilation

Compiled with the QAIRT SDK v2.38.0 GenAI Transformer Composer:

- Source: meta-llama/Llama-3.2-1B-Instruct (Hugging Face)
- Quantization: Z4 (4-bit weights, F32 normalization layers)
- Compile time: ~18 seconds on x86_64
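The Z4 format's internals are Qualcomm-proprietary, but the general idea of 4-bit weight quantization (keeping normalization layers in F32, as the tensor breakdown above notes) can be illustrated with a generic symmetric per-block scheme. This sketch is not Z4 itself, just the standard quantize/dequantize round trip such formats are built on:

```python
def quantize_4bit(block):
    """Symmetric 4-bit quantization of one weight block.

    Generic illustration, NOT Qualcomm's Z4 format: floats are mapped
    to signed integers in [-8, 7] with one shared scale per block.
    """
    scale = max(abs(w) for w in block) / 7.0 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in block]
    return q, scale

def dequantize_4bit(q, scale):
    """Recover approximate float weights from 4-bit codes."""
    return [v * scale for v in q]

weights = [0.12, -0.07, 0.31, -0.29]
q, scale = quantize_4bit(weights)
restored = dequantize_4bit(q, scale)
```

Each weight costs 4 bits plus an amortized share of the scale, which is why the 1B-parameter model compresses to roughly 1.6 GB including the F32 norm tensors.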

## License

This model inherits the Llama 3.2 Community License.
