# Llama 3.2 1B Instruct - QNN HTP Z4 Quantized

Pre-compiled model binary for Qualcomm Hexagon HTP NPU inference using QNN GenAI Transformer backend.

## Model Details

| Property | Value |
|---|---|
| Base Model | meta-llama/Llama-3.2-1B-Instruct |
| Quantization | Z4 (Qualcomm 4-bit) |
| Format | QNN GenAI Transformer single binary |
| SDK Version | QAIRT v2.38.0.250901 |
| Target Hardware | Qualcomm Hexagon HTP v73+ NPU |
| Binary Size | ~1.6 GB |
| Tensors | 147 (Z4 for weights, F32 for norms) |

## Tested Hardware

- Qualcomm IQ-9075 EVK (QCS9075 SoC, Hexagon HTP v73)

## Usage

This binary is designed for use with the Qualcomm Genie runtime (`libGenie.so`) or the `genie-t2t-run` CLI.

### With genie-t2t-run

```sh
cd /path/to/model/
LD_LIBRARY_PATH=/path/to/qnn-libs:/usr/lib \
ADSP_LIBRARY_PATH="/usr/lib/dsp/cdsp;/usr/lib/dsp/cdsp1" \
genie-t2t-run -c genie_config.json -p "Hello, how are you?"
```
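For scripted use, the same invocation can be assembled from Python. This is a minimal sketch: the library paths are the placeholders from the command above, and actually launching the process (e.g. via `subprocess.run`) is left to the caller since `genie-t2t-run` only exists on a device with the QAIRT runtime installed.

```python
import os

def build_genie_cmd(model_dir, prompt,
                    qnn_libs="/path/to/qnn-libs",
                    dsp_paths="/usr/lib/dsp/cdsp;/usr/lib/dsp/cdsp1"):
    """Assemble the genie-t2t-run command, environment, and working dir.

    qnn_libs and dsp_paths are placeholders; point them at your QAIRT
    runtime libraries and the device's DSP library directories.
    """
    env = dict(os.environ)
    env["LD_LIBRARY_PATH"] = f"{qnn_libs}:/usr/lib"
    env["ADSP_LIBRARY_PATH"] = dsp_paths
    cmd = ["genie-t2t-run", "-c", "genie_config.json", "-p", prompt]
    # Run from the model directory so genie_config.json's relative
    # model-path and tokenizer entries resolve correctly:
    #   subprocess.run(cmd, env=env, cwd=model_dir, check=True)
    return cmd, env, model_dir

cmd, env, cwd = build_genie_cmd("/path/to/model", "Hello, how are you?")
```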

### genie_config.json

```json
{
  "dialog": {
    "backend": "QnnGenAiTransformer",
    "model-path": "llama3.2-1b-instruct-z4.bin",
    "tokenizer": "tokenizer.json"
  }
}
```
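A malformed config is the most common source of launch failures. The sanity check below only verifies the three `dialog` keys used in this card's config (Genie accepts additional options not covered here); it is an illustrative sketch, not part of the Genie API.

```python
import json

# The keys this model card's config relies on (assumption: Genie
# configs may carry more options than these).
REQUIRED = {"backend", "model-path", "tokenizer"}

def check_genie_config(text):
    """Parse a genie_config.json string and verify the dialog keys."""
    cfg = json.loads(text)
    dialog = cfg.get("dialog", {})
    missing = REQUIRED - dialog.keys()
    if missing:
        raise ValueError(f"genie_config.json missing keys: {sorted(missing)}")
    return dialog

sample = """{"dialog": {"backend": "QnnGenAiTransformer",
  "model-path": "llama3.2-1b-instruct-z4.bin",
  "tokenizer": "tokenizer.json"}}"""
dialog = check_genie_config(sample)
```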

## Compilation

Compiled with the QAIRT SDK v2.38.0 GenAI Transformer Composer:

- Source: meta-llama/Llama-3.2-1B-Instruct (Hugging Face)
- Quantization: Z4 (4-bit weights, F32 normalization layers)
- Compile time: ~18 seconds on x86_64
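The Z4 format's internals are Qualcomm-proprietary, but the general idea of 4-bit weight quantization (keeping normalization layers in F32, as the tensor breakdown above notes) can be illustrated with a generic symmetric per-block scheme. This sketch is not Z4 itself, just the standard quantize/dequantize round trip such formats are built on:

```python
def quantize_4bit(block):
    """Symmetric 4-bit quantization of one weight block.

    Generic illustration, NOT Qualcomm's Z4 format: floats are mapped
    to signed integers in [-8, 7] with one shared scale per block.
    """
    scale = max(abs(w) for w in block) / 7.0 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in block]
    return q, scale

def dequantize_4bit(q, scale):
    """Recover approximate float weights from 4-bit codes."""
    return [v * scale for v in q]

weights = [0.12, -0.07, 0.31, -0.29]
q, scale = quantize_4bit(weights)
restored = dequantize_4bit(q, scale)
```

Each weight costs 4 bits plus an amortized share of the scale, which is why the 1B-parameter model compresses to roughly 1.6 GB including the F32 norm tensors.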

## License

This model inherits the Llama 3.2 Community License.
