# Qwen3.5-35B-A3B INT4 Quantized (HQQ)
INT4 weight-only quantized version of Qwen/Qwen3.5-35B-A3B for ExecuTorch CUDA export.
## Quantization Details
| Component | Method | Bits | Group Size |
|---|---|---|---|
| Expert weights (MoE) | HQQ scale-only | INT4 | 128 |
| Linear layers (attention, projections, lm_head) | INT4 tile-packed (tinygemm) | INT4 | 128 |
| Embeddings | Weight-only | INT8 | per-axis |
- Expert quantization: Uses HQQ (Half-Quadratic Quantization) scale-only optimization with iterative least-squares scale refinement for better accuracy than standard min/max symmetric quantization.
- Linear quantization: Uses `Int4WeightOnlyConfig` with the `tile_packed_to_4d` packing format and the HQQ qparams algorithm via torchao.
- Embedding quantization: Uses `IntxWeightOnlyConfig` with INT8 per-axis quantization.
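To make the "iterative least-squares scale refinement" concrete, here is a toy sketch of scale-only group quantization with alternating updates. This is an illustration of the idea only, not ExecuTorch's or torchao's implementation; real HQQ solves an lp-norm proximal objective, while this sketch simply alternates between the closed-form least-squares scale and re-rounding the INT4 codes.

```python
# Toy sketch: scale-only INT4 group quantization with iterative
# least-squares scale refinement. Illustrative only; NOT the actual
# HQQ solver used by torchao/ExecuTorch (real HQQ optimizes an
# lp-norm proximal objective over scales and zero-points).

def quantize_group(weights, n_iter=5):
    """Quantize one group of floats to INT4 codes (-8..7) with a shared scale."""
    # Start from a plain symmetric min/max scale.
    scale = max(abs(w) for w in weights) / 7.0 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    for _ in range(n_iter):
        # With the integer codes q fixed, the scale minimizing
        # sum((w - s*q)^2) is s = <w, q> / <q, q>.
        denom = sum(qi * qi for qi in q)
        if denom == 0:
            break
        scale = sum(w * qi for w, qi in zip(weights, q)) / denom
        # Re-round the codes under the refined scale.
        q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate weights from INT4 codes and the group scale."""
    return [qi * scale for qi in q]
```

In practice the refinement converges in a handful of iterations per group; the production path additionally packs the codes into the tinygemm tile layout, which this sketch omits.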
## File Format
- `model.safetensors` — Quantized weights. Tensor subclasses (`Int4TilePackedTo4dTensor`, `IntxUnpackedToInt8Tensor`) are flattened into plain inner tensors with `.__qdata` / `.__scale_and_zero` suffixes. Reconstruction metadata is stored in the safetensors header under `"quantization"`.
- `config.json` — Model architecture configuration.
- `tokenizer.json`, `tokenizer_config.json`, `merges.txt`, `vocab.json` — Tokenizer files for runtime.
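The safetensors header can be inspected without any ML framework, since the published file layout is just an 8-byte little-endian header length followed by a JSON header. The sketch below assumes the `"quantization"` metadata lives in the header's string-valued `__metadata__` dict (the usual place safetensors stores extra metadata); the exact key layout in this bundle may differ.

```python
# Minimal safetensors header reader, to inspect the flattened quantized
# tensors (".__qdata" / ".__scale_and_zero" entries) and the
# reconstruction metadata. Uses only the published safetensors file
# layout: 8 bytes little-endian u64 header length, then a JSON header.
# ASSUMPTION: the "quantization" metadata sits under "__metadata__".
import json
import struct

def read_safetensors_header(path):
    """Return the parsed JSON header of a .safetensors file."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))

def quantized_tensor_names(header):
    """List tensor entries that look like flattened quantized inner tensors."""
    return [
        name for name in header
        if name.endswith(".__qdata") or name.endswith(".__scale_and_zero")
    ]
```

For example, `read_safetensors_header("model.safetensors")` followed by `header.get("__metadata__", {}).get("quantization")` would surface the reconstruction metadata described above.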
## Prerequisites
- ExecuTorch installed from source (see building from source)
- safetensors (`pip install safetensors`)
- NVIDIA GPU with CUDA toolkit
## How to Use

### Eager Inference (Python)
```bash
# Download the quantized bundle
huggingface-cli download <repo-id> --local-dir qwen35_moe_int4_hqq

# Run inference
cd executorch/examples/models/qwen3_5_moe
python inference.py \
  --prequantized /path/to/qwen35_moe_int4_hqq \
  --prompt "The capital of France is" \
  --max-new-tokens 128
```
### Export to ExecuTorch (.pte)

```bash
cd executorch/examples/models/qwen3_5_moe
python export.py --prequantized /path/to/qwen35_moe_int4_hqq
```
### Build and Run (C++)

```bash
# Build ExecuTorch with CUDA support and the runner
make qwen3_5_moe-cuda

# Run inference
cmake-out/examples/models/qwen3_5_moe/qwen3_5_moe_runner \
  --model_path qwen35_moe_exports/model.pte \
  --data_path qwen35_moe_exports/aoti_cuda_blob.ptd \
  --tokenizer_path qwen35_moe_int4_hqq/tokenizer.json \
  --prompt "The meaning of life is" \
  --max_new_tokens 128
```
## How to Reproduce

```bash
cd executorch/examples/models/qwen3_5_moe
python quantize_and_save.py \
  --model-dir /path/to/Qwen3.5-35B-A3B \
  --qlinear 4w \
  --qembedding 8w \
  --qlinear-group-size 128 \
  --hqq \
  --output qwen35_moe_int4_hqq
```
Requires a CUDA-capable GPU and roughly 70 GB of host RAM to load the original bf16 model.
## Base Model
- Model: Qwen/Qwen3.5-35B-A3B
- Architecture: 40-layer hybrid transformer, 256 routed experts (top-8), GatedDeltaNet + full attention
- Parameters: 35B total, 3B active per token
- License: Apache 2.0