# Llama 3.2 1B Instruct - QNN HTP Z4 Quantized

Pre-compiled model binary for inference on the Qualcomm Hexagon HTP NPU via the QNN GenAI Transformer backend.
## Model Details
| Property | Value |
|---|---|
| Base Model | meta-llama/Llama-3.2-1B-Instruct |
| Quantization | Z4 (Qualcomm 4-bit) |
| Format | QNN GenAI Transformer single binary |
| SDK Version | QAIRT v2.38.0.250901 |
| Target Hardware | Qualcomm Hexagon HTP v73+ NPU |
| Binary Size | ~1.6 GB |
| Tensors | 147 (Z4 for weights, F32 for norms) |
## Tested Hardware
- Qualcomm IQ-9075 EVK (QCS9075 SoC, Hexagon HTP v73)
## Usage
This binary is designed for use with the Qualcomm Genie runtime (`libGenie.so`) or the `genie-t2t-run` CLI.
### With `genie-t2t-run`
```shell
cd /path/to/model/
LD_LIBRARY_PATH=/path/to/qnn-libs:/usr/lib \
ADSP_LIBRARY_PATH="/usr/lib/dsp/cdsp;/usr/lib/dsp/cdsp1" \
genie-t2t-run -c genie_config.json -p "Hello, how are you?"
```
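If you drive the CLI from a host-side script, the same invocation can be wrapped with Python's `subprocess`. This is a sketch, not part of the official tooling: the `qnn_libs` path is a placeholder mirroring the shell command above, and all paths must be adjusted for your device image.

```python
import os
import subprocess

def build_genie_invocation(prompt, qnn_libs="/path/to/qnn-libs"):
    """Return (cmd, env) for a genie-t2t-run call with HTP library paths set.

    Sketch only: paths mirror the shell example above and are placeholders.
    """
    env = dict(os.environ)
    env["LD_LIBRARY_PATH"] = f"{qnn_libs}:/usr/lib"
    # cDSP domains searched for the Hexagon skel libraries.
    env["ADSP_LIBRARY_PATH"] = "/usr/lib/dsp/cdsp;/usr/lib/dsp/cdsp1"
    cmd = ["genie-t2t-run", "-c", "genie_config.json", "-p", prompt]
    return cmd, env

def run_genie(model_dir, prompt):
    """Run genie-t2t-run from the model directory and capture its output."""
    cmd, env = build_genie_invocation(prompt)
    return subprocess.run(cmd, cwd=model_dir, env=env,
                          capture_output=True, text=True)
```

Separating command construction from execution makes it easy to log or dry-run the exact invocation before touching the device.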
### `genie_config.json`
```json
{
  "dialog": {
    "backend": "QnnGenAiTransformer",
    "model-path": "llama3.2-1b-instruct-z4.bin",
    "tokenizer": "tokenizer.json"
  }
}
```
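A malformed config is a common source of launch failures, so a quick host-side sanity check can help. The snippet below is a sketch based only on the keys in the example above, not an official Genie schema (the runtime accepts many more options).

```python
import json

# Required keys taken from the example config above; this is an assumption,
# not an exhaustive schema for the Genie runtime.
REQUIRED_DIALOG_KEYS = {"backend", "model-path", "tokenizer"}

def check_genie_config(text):
    """Return a list of problems found in the config JSON (empty if it looks OK)."""
    cfg = json.loads(text)
    dialog = cfg.get("dialog")
    if not isinstance(dialog, dict):
        return ["missing 'dialog' object"]
    problems = [f"missing 'dialog.{k}'"
                for k in sorted(REQUIRED_DIALOG_KEYS - dialog.keys())]
    if "backend" in dialog and dialog["backend"] != "QnnGenAiTransformer":
        problems.append("backend should be 'QnnGenAiTransformer' for this binary")
    return problems
```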
## Compilation

Compiled using the QAIRT SDK v2.38.0 GenAI Transformer Composer:

- Source: meta-llama/Llama-3.2-1B-Instruct (Hugging Face)
- Quantization: Z4 (4-bit weights, F32 normalization layers)
- Compile time: ~0.3 minutes on x86_64
## License
This model inherits the Llama 3.2 Community License.