Qwen3.5-35B-A3B INT4 Quantized (HQQ)

INT4 weight-only quantized version of Qwen/Qwen3.5-35B-A3B for ExecuTorch CUDA export.

Quantization Details

| Component | Method | Bits | Group Size |
|---|---|---|---|
| Expert weights (MoE) | HQQ scale-only | INT4 | 128 |
| Linear layers (attention, projections, lm_head) | INT4 tile-packed (tinygemm) | INT4 | 128 |
| Embeddings | Weight-only | INT8 | per-axis |
  • Expert quantization: Uses HQQ (Half-Quadratic Quantization) scale-only optimization with iterative least-squares scale refinement for better accuracy than standard min/max symmetric quantization.
  • Linear quantization: Uses Int4WeightOnlyConfig with tile_packed_to_4d packing format and HQQ qparams algorithm via torchao.
  • Embedding quantization: Uses IntxWeightOnlyConfig with INT8 per-axis quantization.
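
The HQQ scale-only idea can be illustrated numerically: start from a min/max symmetric scale, then alternate between rounding to INT4 and re-fitting each group's scale by least squares. This is a simplified sketch of the refinement loop, not the torchao implementation:

```python
import numpy as np

def hqq_scale_only_int4(w, group_size=128, iters=10):
    """Scale-only HQQ-style refinement for symmetric INT4 (q in [-8, 7])."""
    g = w.reshape(-1, group_size)
    # min/max symmetric initialization
    scale = np.abs(g).max(axis=1, keepdims=True) / 7.0 + 1e-12
    for _ in range(iters):
        q = np.clip(np.round(g / scale), -8, 7)
        # least-squares scale given q: argmin_s ||g - s*q||^2 = <g,q> / <q,q>
        num = (g * q).sum(axis=1, keepdims=True)
        den = (q * q).sum(axis=1, keepdims=True)
        scale = np.where(den > 0, num / np.maximum(den, 1e-12), scale)
    q = np.clip(np.round(g / scale), -8, 7)
    return q.astype(np.int8), scale
```

Because each step is an alternating minimization, the per-group reconstruction error is never worse than plain min/max symmetric quantization.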

File Format

  • model.safetensors — Quantized weights. Tensor subclasses (Int4TilePackedTo4dTensor, IntxUnpackedToInt8Tensor) are flattened into plain inner tensors with .__qdata / .__scale_and_zero suffixes. Reconstruction metadata is stored in the safetensors header under "quantization".
  • config.json — Model architecture configuration.
  • tokenizer.json, tokenizer_config.json, merges.txt, vocab.json — Tokenizer files for runtime.
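
The safetensors header layout makes the flattened tensors easy to inspect without loading weights: the file begins with an 8-byte little-endian header length followed by a JSON header, with free-form metadata under the `__metadata__` key. A minimal stdlib sketch (the exact schema of the `"quantization"` entry is specific to this export and not shown here):

```python
import json
import struct

def read_safetensors_header(path):
    """Read the JSON header of a safetensors file (8-byte LE length + JSON)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))

def group_flattened(header):
    """Group `name.__qdata` / `name.__scale_and_zero` entries by logical name."""
    groups = {}
    for key, entry in header.items():
        if key == "__metadata__" or ".__" not in key:
            continue
        base, suffix = key.rsplit(".__", 1)
        groups.setdefault(base, {})["__" + suffix] = entry
    return groups
```

For example, `read_safetensors_header("model.safetensors")["__metadata__"]["quantization"]` retrieves the reconstruction metadata, and `group_flattened` pairs each logical parameter with its quantized data and scale tensors.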

Prerequisites

A CUDA-capable GPU and a local checkout of the ExecuTorch repository. The inference, export, and runner commands below are run from executorch/examples/models/qwen3_5_moe.

How to Use

Eager Inference (Python)

# Download the quantized bundle
huggingface-cli download <repo-id> --local-dir qwen35_moe_int4_hqq

# Run inference
cd executorch/examples/models/qwen3_5_moe
python inference.py \
    --prequantized /path/to/qwen35_moe_int4_hqq \
    --prompt "The capital of France is" \
    --max-new-tokens 128

Export to ExecuTorch (.pte)

cd executorch/examples/models/qwen3_5_moe
python export.py --prequantized /path/to/qwen35_moe_int4_hqq

Build and Run (C++)

# Build ExecuTorch with CUDA support and the runner
make qwen3_5_moe-cuda

# Run inference
cmake-out/examples/models/qwen3_5_moe/qwen3_5_moe_runner \
    --model_path qwen35_moe_exports/model.pte \
    --data_path qwen35_moe_exports/aoti_cuda_blob.ptd \
    --tokenizer_path qwen35_moe_int4_hqq/tokenizer.json \
    --prompt "The meaning of life is" \
    --max_new_tokens 128

How to Reproduce

cd executorch/examples/models/qwen3_5_moe

python quantize_and_save.py \
    --model-dir /path/to/Qwen3.5-35B-A3B \
    --qlinear 4w \
    --qembedding 8w \
    --qlinear-group-size 128 \
    --hqq \
    --output qwen35_moe_int4_hqq

Requires CUDA and ~70GB RAM (for loading the original bf16 model).

Base Model

  • Model: Qwen/Qwen3.5-35B-A3B
  • Architecture: 40-layer hybrid transformer, 256 routed experts (top-8), GatedDeltaNet + full attention
  • Parameters: 35B total, 3B active per token
  • License: Apache 2.0
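
The 3B-active figure follows from the routing scheme: each token activates only the 8 experts its router selects out of 256, plus the shared layers. A minimal sketch of top-k expert selection with renormalized softmax weights (illustrative only, not the Qwen router):

```python
import numpy as np

def topk_route(router_logits, k=8):
    """Select top-k experts per token and renormalize their softmax weights."""
    idx = np.argpartition(router_logits, -k, axis=-1)[..., -k:]
    sel = np.take_along_axis(router_logits, idx, axis=-1)
    w = np.exp(sel - sel.max(axis=-1, keepdims=True))
    return idx, w / w.sum(axis=-1, keepdims=True)
```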