# Qwen3.5-35B-A3B-Freed0m EXL3 6.0bpw

EXL3 quantization (6.0 bpw) of unsloth/Qwen3.5-35B-A3B-Freed0m, optimized for the exllamav3 runtime on Ampere-or-newer GPUs (e.g. RTX 3090, RTX 4090).
## Quantization Details
- Method: EXL3 (trellis codebook + MCG encoding)
- Bits per weight: 6.0 bpw (layer), 6.0 bpw (head)
- Calibration: 128 rows x 2048 cols
- Scale output: always
- Codebook: MCG (multiplicative congruential generator, lookup-free)
- Version: exllamav3 0.0.26
## Quality
Evaluated on WikiText-2 (seqlen=2048, stride=512):
| Eval rows | Tokens scored | Perplexity | Wall time |
|---|---|---|---|
| 200 | 409,400 | 7.09 | 626 s |
| 8 | 16,376 | 8.26 | 48 s |
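The seqlen/stride evaluation above can be sketched generically: windows of `seqlen` tokens advance by `stride`, and each window scores only the tokens not already scored, so every token gets up to `seqlen` of left context. This is an illustration of the windowing scheme and of perplexity as exp(mean NLL), not the exact harness that produced these numbers.

```python
import math

def strided_windows(n_tokens, seqlen=2048, stride=512):
    """Yield (start, end, n_scored): each window re-reads up to `seqlen`
    tokens of context but scores only tokens past the previous window."""
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + seqlen, n_tokens)
        yield start, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break

def perplexity(nlls):
    """exp of the mean per-token negative log-likelihood."""
    return math.exp(sum(nlls) / len(nlls))

# Every token is scored exactly once across the windows:
scored = sum(n for _, _, n in strided_windows(16376))
print(scored)  # 16376, matching the 8-row token count above
```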
## Usage
This model requires exllamav3 to run. It is not compatible with standard transformers or vLLM.
```python
from exllamav3 import Config, Model

# Point at a local download of the repo
config = Config.from_directory("groxaxo/Qwen3.5-35B-A3B-Freed0m-EXL3-6.0bpw")
model = Model.from_config(config)
model.load()
# Use the exllamav3 API (Cache, Tokenizer, Generator) for inference
```
## Model Architecture
- Type: Qwen3.5 MoE (Mixture of Experts)
- Params: ~35B total, ~3B active per token (the "A3B" in the name)
- Total experts: 256 fused experts, 40 router layers
- Layers: 40
- Attention mix: Linear attention (75%) + Full attention (25%), with full attention every 4th layer
- Hidden size: 2048
- Head dim: 256
- Vision: 3D patch embedding (Conv3d) for video understanding
- Activation: SiLU
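The 75%/25% attention mix follows directly from "full attention every 4th layer" over 40 layers. A minimal sketch (the assumption here is that the full-attention layer is the last in each group of four; the card does not specify its position):

```python
def attention_pattern(n_layers=40, full_every=4):
    """Hybrid layout: 'full' on every 4th layer, 'linear' elsewhere.
    Placing the full layer last in each group of 4 is an assumption."""
    return ["full" if (i + 1) % full_every == 0 else "linear"
            for i in range(n_layers)]

pattern = attention_pattern()
print(pattern.count("linear"), pattern.count("full"))  # 30 10 -> 75% / 25%
```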
## File Sizes
| File | Size |
|---|---|
| model-00001-of-00004.safetensors | ~8.0 GB |
| model-00002-of-00004.safetensors | ~8.3 GB |
| model-00003-of-00004.safetensors | ~8.3 GB |
| model-00004-of-00004.safetensors | ~3.2 GB |
| Total | ~27.8 GB |
## Differences from Base Model
The base model (unsloth/Qwen3.5-35B-A3B-Freed0m) ships full-precision BF16 weights (~70 GB). This EXL3 variant replaces the raw weight matrices with trellis codebook indices (int16) plus scale vectors (suh/svh) and MCG multipliers, giving roughly 2.7:1 compression (16 bpw → 6 bpw) with minimal perplexity degradation.
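A back-of-envelope check of the compression claim, using only the figures on this card (~35B params, 16 bpw BF16 vs. 6 bpw EXL3); the shipped shards are slightly larger because embeddings and metadata add overhead:

```python
def weight_size_gb(n_params, bpw):
    """Approximate weight storage in GB at a given bits-per-weight."""
    return n_params * bpw / 8 / 1e9

bf16_gb = weight_size_gb(35e9, 16.0)  # BF16 baseline
exl3_gb = weight_size_gb(35e9, 6.0)   # this quant
print(round(bf16_gb), round(exl3_gb), round(bf16_gb / exl3_gb, 2))
# 70 GB vs ~26 GB, ratio 16/6 ~ 2.67:1
```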
## Hardware Requirements
- GPU: Ampere or newer (RTX 3090, 4090, A6000, etc.)
- VRAM: ~26 GB for the weights alone, plus KV cache; a single 24 GB GPU requires partial offloading, or split the model across two GPUs
- Runtime: exllamav3 >= 0.0.26