# Qwen3.5-35B-A3B-Freed0m EXL3 6.0bpw

EXL3 quantization (6.0 bpw) of unsloth/Qwen3.5-35B-A3B-Freed0m, optimized for the exllamav3 runtime on Ampere-or-newer GPUs (e.g. RTX 3090, RTX 4090).
## Quantization Details
- Method: EXL3 (trellis codebook + MCG encoding)
- Bits per weight: 6.0 bpw (layer), 6.0 bpw (head)
- Calibration: 128 rows x 2048 cols
- Scale output: always
- Codebook: MCG (multiplicative congruential generator, lookup-free)
- Version: exllamav3 0.0.26
## Quality
Evaluated on WikiText-2 (seqlen=2048, stride=512):
| Eval rows | Tokens scored | Perplexity | Wall time |
|---|---|---|---|
| 200 | 409,400 | 7.09 | 626 s |
| 8 | 16,376 | 8.26 | 48 s |
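The seqlen/stride evaluation above can be sketched generically: windows of `seqlen` tokens advance by `stride`, and each window scores only the tokens not already scored, so every token gets up to `seqlen` of left context. This is an illustration of the windowing scheme and of perplexity as exp(mean NLL), not the exact harness that produced these numbers.

```python
import math

def strided_windows(n_tokens, seqlen=2048, stride=512):
    """Yield (start, end, n_scored): each window re-reads up to `seqlen`
    tokens of context but scores only tokens past the previous window."""
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + seqlen, n_tokens)
        yield start, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break

def perplexity(nlls):
    """exp of the mean per-token negative log-likelihood."""
    return math.exp(sum(nlls) / len(nlls))

# Every token is scored exactly once across the windows:
scored = sum(n for _, _, n in strided_windows(16376))
print(scored)  # 16376, matching the 8-row token count above
```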
## Usage
This model requires exllamav3 to run. It is not compatible with standard transformers or vLLM.
```python
from exllamav3 import Config, Model

# Point at a local download of the repo
config = Config.from_directory("groxaxo/Qwen3.5-35B-A3B-Freed0m-EXL3-6.0bpw")
model = Model.from_config(config)
model.load()
# Use the exllamav3 API (Cache, Tokenizer, Generator) for inference
```
## Model Architecture
- Type: Qwen3.5 MoE (Mixture of Experts)
- Params: ~35B total, ~3B active per token (the "A3B" in the name)
- Total experts: 256 fused experts, 40 router layers
- Layers: 40
- Attention mix: Linear attention (75%) + Full attention (25%), with full attention every 4th layer
- Hidden size: 2048
- Head dim: 256
- Vision: 3D patch embedding (Conv3d) for video understanding
- Activation: SiLU
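The 75%/25% attention mix follows directly from "full attention every 4th layer" over 40 layers. A minimal sketch (the assumption here is that the full-attention layer is the last in each group of four; the card does not specify its position):

```python
def attention_pattern(n_layers=40, full_every=4):
    """Hybrid layout: 'full' on every 4th layer, 'linear' elsewhere.
    Placing the full layer last in each group of 4 is an assumption."""
    return ["full" if (i + 1) % full_every == 0 else "linear"
            for i in range(n_layers)]

pattern = attention_pattern()
print(pattern.count("linear"), pattern.count("full"))  # 30 10 -> 75% / 25%
```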
## File Sizes
| File | Size |
|---|---|
| model-00001-of-00004.safetensors | ~8.0 GB |
| model-00002-of-00004.safetensors | ~8.3 GB |
| model-00003-of-00004.safetensors | ~8.3 GB |
| model-00004-of-00004.safetensors | ~3.2 GB |
| Total | ~27.8 GB |
## Differences from Base Model
The base model (unsloth/Qwen3.5-35B-A3B-Freed0m) ships full-precision BF16 weights (~70 GB). This EXL3 variant replaces the raw weight matrices with trellis codebook indices (int16) plus scale vectors (suh/svh) and MCG multipliers, giving roughly 2.7:1 compression (16 bpw → 6 bpw) with minimal perplexity degradation.
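A back-of-envelope check of the compression claim, using only the figures on this card (~35B params, 16 bpw BF16 vs. 6 bpw EXL3); the shipped shards are slightly larger because embeddings and metadata add overhead:

```python
def weight_size_gb(n_params, bpw):
    """Approximate weight storage in GB at a given bits-per-weight."""
    return n_params * bpw / 8 / 1e9

bf16_gb = weight_size_gb(35e9, 16.0)  # BF16 baseline
exl3_gb = weight_size_gb(35e9, 6.0)   # this quant
print(round(bf16_gb), round(exl3_gb), round(bf16_gb / exl3_gb, 2))
# 70 GB vs ~26 GB, ratio 16/6 ~ 2.67:1
```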
## Hardware Requirements
- GPU: Ampere or newer (RTX 3090, 4090, A6000, etc.)
- VRAM: ~26 GB for the weights alone, plus KV cache; a single 24 GB GPU requires partial offloading, or split the model across two GPUs
- Runtime: exllamav3 >= 0.0.26