# Qwen3.5-35B-A3B INT4 Quantized (HQQ)
INT4 weight-only quantized version of Qwen/Qwen3.5-35B-A3B for ExecuTorch CUDA export.
## Quantization Details
| Component | Method | Bits | Group Size |
|---|---|---|---|
| Expert weights (MoE) | HQQ scale-only | INT4 | 128 |
| Linear layers (attention, projections, lm_head) | INT4 tile-packed (tinygemm) | INT4 | 128 |
| Embeddings | Weight-only | INT8 | per-axis |
- Expert quantization: Uses HQQ (Half-Quadratic Quantization) scale-only optimization with iterative least-squares scale refinement for better accuracy than standard min/max symmetric quantization.
- Linear quantization: Uses `Int4WeightOnlyConfig` with the `tile_packed_to_4d` packing format and the HQQ qparams algorithm via torchao.
- Embedding quantization: Uses `IntxWeightOnlyConfig` with INT8 per-axis quantization.
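To make the "iterative least-squares scale refinement" concrete, here is a toy sketch of scale-only group quantization with alternating updates. This is an illustration of the idea only, not ExecuTorch's or torchao's implementation; real HQQ solves an lp-norm proximal objective, while this sketch simply alternates between the closed-form least-squares scale and re-rounding the INT4 codes.

```python
# Toy sketch: scale-only INT4 group quantization with iterative
# least-squares scale refinement. Illustrative only; NOT the actual
# HQQ solver used by torchao/ExecuTorch (real HQQ optimizes an
# lp-norm proximal objective over scales and zero-points).

def quantize_group(weights, n_iter=5):
    """Quantize one group of floats to INT4 codes (-8..7) with a shared scale."""
    # Start from a plain symmetric min/max scale.
    scale = max(abs(w) for w in weights) / 7.0 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    for _ in range(n_iter):
        # With the integer codes q fixed, the scale minimizing
        # sum((w - s*q)^2) is s = <w, q> / <q, q>.
        denom = sum(qi * qi for qi in q)
        if denom == 0:
            break
        scale = sum(w * qi for w, qi in zip(weights, q)) / denom
        # Re-round the codes under the refined scale.
        q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate weights from INT4 codes and the group scale."""
    return [qi * scale for qi in q]
```

In practice the refinement converges in a handful of iterations per group; the production path additionally packs the codes into the tinygemm tile layout, which this sketch omits.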
## File Format
- `model.safetensors` — Quantized weights. Tensor subclasses (`Int4TilePackedTo4dTensor`, `IntxUnpackedToInt8Tensor`) are flattened into plain inner tensors with `.__qdata` / `.__scale_and_zero` suffixes. Reconstruction metadata is stored in the safetensors header under `"quantization"`.
- `config.json` — Model architecture configuration.
- `tokenizer.json`, `tokenizer_config.json`, `merges.txt`, `vocab.json` — Tokenizer files for runtime.
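The safetensors header can be inspected without any ML framework, since the published file layout is just an 8-byte little-endian header length followed by a JSON header. The sketch below assumes the `"quantization"` metadata lives in the header's string-valued `__metadata__` dict (the usual place safetensors stores extra metadata); the exact key layout in this bundle may differ.

```python
# Minimal safetensors header reader, to inspect the flattened quantized
# tensors (".__qdata" / ".__scale_and_zero" entries) and the
# reconstruction metadata. Uses only the published safetensors file
# layout: 8 bytes little-endian u64 header length, then a JSON header.
# ASSUMPTION: the "quantization" metadata sits under "__metadata__".
import json
import struct

def read_safetensors_header(path):
    """Return the parsed JSON header of a .safetensors file."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))

def quantized_tensor_names(header):
    """List tensor entries that look like flattened quantized inner tensors."""
    return [
        name for name in header
        if name.endswith(".__qdata") or name.endswith(".__scale_and_zero")
    ]
```

For example, `read_safetensors_header("model.safetensors")` followed by `header.get("__metadata__", {}).get("quantization")` would surface the reconstruction metadata described above.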
## Prerequisites
- ExecuTorch installed from source (see building from source)
- safetensors (`pip install safetensors`)
- NVIDIA GPU with CUDA toolkit
## How to Use

### Eager Inference (Python)
```bash
# Download the quantized bundle
huggingface-cli download <repo-id> --local-dir qwen35_moe_int4_hqq

# Run inference
cd executorch/examples/models/qwen3_5_moe
python inference.py \
  --prequantized /path/to/qwen35_moe_int4_hqq \
  --prompt "The capital of France is" \
  --max-new-tokens 128
```
### Export to ExecuTorch (.pte)

```bash
cd executorch/examples/models/qwen3_5_moe
python export.py --prequantized /path/to/qwen35_moe_int4_hqq
```
### Build and Run (C++)

```bash
# Build ExecuTorch with CUDA support and the runner
make qwen3_5_moe-cuda

# Run inference
cmake-out/examples/models/qwen3_5_moe/qwen3_5_moe_runner \
  --model_path qwen35_moe_exports/model.pte \
  --data_path qwen35_moe_exports/aoti_cuda_blob.ptd \
  --tokenizer_path qwen35_moe_int4_hqq/tokenizer.json \
  --prompt "The meaning of life is" \
  --max_new_tokens 128
```
## How to Reproduce

```bash
cd executorch/examples/models/qwen3_5_moe
python quantize_and_save.py \
  --model-dir /path/to/Qwen3.5-35B-A3B \
  --qlinear 4w \
  --qembedding 8w \
  --qlinear-group-size 128 \
  --hqq \
  --output qwen35_moe_int4_hqq
```
Requires a CUDA-capable GPU and roughly 70 GB of host RAM to load the original bf16 model.
## Base Model
- Model: Qwen/Qwen3.5-35B-A3B
- Architecture: 40-layer hybrid transformer, 256 routed experts (top-8), GatedDeltaNet + full attention
- Parameters: 35B total, 3B active per token
- License: Apache 2.0