Qwen3.5-27B — EXL3 3.0bpw (h6)

ExLlamaV3 EXL3 quantization of Qwen/Qwen3.5-27B at 3.0 bits per weight with 6-bit lm_head.

| Property | Value |
|---|---|
| Original model | Qwen/Qwen3.5-27B |
| Quantization format | EXL3 (trellis-coded, QTIP-based) |
| Average bpw | 3.0 |
| Head bits (lm_head) | 6 |
| Model weights size | ~10.4 GB |
| Measured VRAM (loaded) | ~12.7 GB (H100, 8k context, Q8 KV cache) |
| Architecture | Hybrid GDN + Full Attention (64 layers: 48× linear + 16× full) |
| Context length | 262,144 native (up to 1M with YaRN) |
| Vision | Yes; unified vision-language (ViT encoder included) |
| Languages | 201 |

VRAM Requirements

Thanks to the hybrid Gated DeltaNet architecture, only 16 of 64 layers use full attention with KV cache. The remaining 48 layers use fixed-size recurrent state. This makes long-context inference dramatically cheaper.

| Context Length | Q8 KV Cache | Q4 KV Cache | Total VRAM (Q8) | Total VRAM (Q4) |
|---|---|---|---|---|
| 8,192 | ~0.25 GB | ~0.12 GB | ~11.2 GB | ~11.1 GB |
| 32,768 | ~1.0 GB | ~0.5 GB | ~11.9 GB | ~11.4 GB |
| 65,536 | ~2.0 GB | ~1.0 GB | ~12.9 GB | ~11.9 GB |
| 131,072 | ~4.0 GB | ~2.0 GB | ~14.9 GB | ~12.9 GB |
| 163,840 | ~4.9 GB | ~2.4 GB | ~15.8 GB | ~13.3 GB |
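The per-context KV-cache figures above can be reproduced from the full-attention geometry given later in this card (16 full-attention layers, 4 KV heads, head_dim 256), assuming the Q8 cache stores roughly one byte per element. A minimal sketch:

```python
# Estimate KV-cache size for the 16 full-attention layers.
# Geometry from the model card: 4 KV heads, head_dim 256.
# bytes_per_el: ~1.0 for Q8, ~0.5 for Q4, 2.0 for FP16.

def kv_cache_gib(context_len, bytes_per_el=1.0,
                 full_layers=16, kv_heads=4, head_dim=256):
    # 2x for the separate K and V tensors
    elements = 2 * full_layers * kv_heads * head_dim * context_len
    return elements * bytes_per_el / 2**30

print(f"{kv_cache_gib(8192):.2f} GiB")    # Q8 at 8k context -> 0.25 GiB
print(f"{kv_cache_gib(131072):.2f} GiB")  # Q8 at 128k context -> 4.00 GiB
```

The results match the table; a standard 64-layer full-attention model of the same shape would need 4x the cache at every context length.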

Fits on a 16 GB GPU (RTX 5080 / RTX 4060 Ti 16GB) with 160k+ context using the Q8 KV cache.

Measured: the model loads at ~12.7 GB VRAM with the default 8k cache at Q8 precision, leaving ~3.3 GB of headroom on a 16 GB card to grow the KV cache to 100k+ tokens.


Quick Start

Option 1: TabbyAPI (Recommended — OpenAI-compatible server)

# Clone TabbyAPI
git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI

# Edit config.yml
cat > config.yml << 'EOF'
network:
  host: 0.0.0.0
  port: 5000

backend: exllamav3

model:
  model_dir: /path/to/models
  model_name: Qwen3.5-27B-EXL3-3.0bpw-h6
  max_seq_len: 163840
  cache_size: 163840
  cache_mode: "8,8"
  chunk_size: 2048
  vision: true
EOF

# Start the server
python main.py

Then query:

curl http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-api-key>" \
  -d '{
    "model": "Qwen3.5-27B-EXL3-3.0bpw-h6",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 512,
    "temperature": 0.7,
    "min_p": 0.1
  }'
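The same request can be issued from Python using only the standard library. A minimal sketch (the URL, model name, and API key are placeholders; replace them with your TabbyAPI values):

```python
import json
import urllib.request

API_URL = "http://localhost:5000/v1/chat/completions"
API_KEY = "your-api-key"  # placeholder — use your TabbyAPI key

def build_chat_request(prompt, max_tokens=512, temperature=0.7, min_p=0.1):
    """Build an OpenAI-style chat completion request for the TabbyAPI server."""
    body = json.dumps({
        "model": "Qwen3.5-27B-EXL3-3.0bpw-h6",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "min_p": min_p,
    }).encode()
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {API_KEY}"},
    )

# With the server running:
# with urllib.request.urlopen(build_chat_request("Hello!")) as resp:
#     reply = json.loads(resp.read())["choices"][0]["message"]["content"]
```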

Option 2: ExLlamaV3 Python API (Direct)

from exllamav3 import Model, Tokenizer, Cache, Generator

model_dir = "/path/to/Qwen3.5-27B-EXL3-3.0bpw-h6"

model = Model(model_dir)
model.load()

tokenizer = Tokenizer(model)
cache = Cache(model, max_num_tokens=32768)
generator = Generator(model, tokenizer, cache)

output = generator.generate(
    prompt="<|im_start|>user\nExplain quantum computing briefly.<|im_end|>\n<|im_start|>assistant\n",
    max_new_tokens=512,
    temperature=0.7,
    min_p=0.1,
)
print(output)

Dependencies

Via TabbyAPI (easiest)

TabbyAPI's setup script installs everything automatically: ExLlamaV3, Triton, Flash Attention, and all CUDA dependencies. No manual pip installs are needed.

git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI
# Run the setup script — it creates a venv and installs all deps
./start.sh

Via ExLlamaV3 directly

If running without TabbyAPI:

pip install exllamav3

# Required for Qwen3.5 Gated DeltaNet layers
pip install triton

# Optional — slightly improves GDN layer performance
pip install causal-conv1d

Note: Qwen3.5's Flash Linear Attention requires Triton. The first inference may be slow while Triton JIT-compiles its kernels; subsequent runs are faster once the compiled kernels are cached.


Quantization Details

Quantized using ExLlamaV3 convert.py:

python convert.py \
  -i Qwen/Qwen3.5-27B \
  -o Qwen3.5-27B-EXL3-3.0bpw-h6 \
  -w /tmp/exl3-work \
  -b 3.0 \
  -hb 6

EXL3 uses trellis-coded quantization (a streamlined variant of QTIP) with Hadamard transforms, LDL decomposition, and Viterbi-optimal encoding to achieve significantly better quality-per-bit than naive round-to-nearest methods. At 3.0 bpw, EXL3 retains excellent perplexity and downstream task quality.
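For intuition about the baseline that EXL3 improves on, here is a toy round-to-nearest 3-bit quantizer. This is illustrative only, not EXL3's actual trellis encoder: it rounds each weight independently, whereas trellis coding shapes the error jointly across weights for lower distortion at the same bpw.

```python
import random

def rtn_quantize(weights, bits=3):
    """Naive symmetric round-to-nearest quantization (the baseline EXL3 beats)."""
    qmax = 2 ** (bits - 1) - 1                # 3 at 3 bits
    scale = max(abs(x) for x in weights) / qmax
    # Round each weight to the nearest level, independently of its neighbors
    q = [max(-qmax, min(qmax, round(x / scale))) for x in weights]
    return [c * scale for c in q]             # dequantized values

random.seed(0)
w = [random.gauss(0, 1) for _ in range(1000)]
deq = rtn_quantize(w)
mse = sum((a - b) ** 2 for a, b in zip(w, deq)) / len(w)
```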


About Qwen3.5-27B

Qwen3.5-27B is Alibaba's dense multimodal foundation model featuring a hybrid architecture combining Gated Delta Networks (linear attention) and standard full attention in a 3:1 ratio. This design enables efficient long-context inference with minimal KV cache memory growth.

Key Highlights

  • Unified Vision-Language: Early fusion training on multimodal tokens
  • Efficient Hybrid Architecture: Gated Delta Networks + full attention (3:1 ratio)
  • 262K native context (extensible to 1M via YaRN)
  • 201 languages supported
  • Competitive benchmarks: GPQA Diamond 85.5, SWE-bench Verified 72.4, LiveCodeBench v6 80.7
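Extending the context window with YaRN is typically configured through a rope_scaling override. A hedged sketch, following the pattern Qwen documents for its other models (the exact field values for this model are an assumption; check the original model card before use):

```python
# Hypothetical YaRN override: a factor of 4.0 takes the 262,144-token
# native window to ~1M tokens. Field names follow Qwen's usual
# rope_scaling convention; verify against the upstream config.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262144,
}
extended_context = int(262144 * rope_scaling["factor"])  # 1,048,576
```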

Architecture

64 layers: 16 × (3 × (Gated DeltaNet → FFN) → 1 × (Full Attention → FFN))

Gated DeltaNet (linear attention):
  - Q/K heads: 16, head_dim: 128
  - V heads: 48, head_dim: 128
  - Fixed-size recurrent state (no KV cache growth)

Full Attention (standard GQA):
  - Q heads: 24, KV heads: 4, head_dim: 256
  - RoPE dimension: 64
  - Standard KV cache (grows with context)

FFN: intermediate_size 17408, SiLU activation
Vision: 27-layer ViT, 1152 hidden, patch_size 16
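The 3:1 interleaving above can be sanity-checked in a few lines (a sketch; the layer-type names are illustrative, not the model's internal identifiers):

```python
# Build the 64-layer pattern: each of 16 blocks is three Gated DeltaNet
# layers followed by one full-attention layer (each paired with an FFN).
layers = (["gated_deltanet"] * 3 + ["full_attention"]) * 16

assert len(layers) == 64
assert layers.count("gated_deltanet") == 48   # fixed-size recurrent state
assert layers.count("full_attention") == 16   # KV cache grows with context
```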

Benchmark Results (Selected)

| Benchmark | Qwen3.5-27B | GPT-5-mini | Qwen3-235B-A22B |
|---|---|---|---|
| MMLU-Pro | 86.1 | 83.7 | 84.4 |
| GPQA Diamond | 85.5 | 82.8 | 81.1 |
| SWE-bench Verified | 72.4 | 72.0 | -- |
| LiveCodeBench v6 | 80.7 | 80.5 | 75.1 |
| IFEval | 95.0 | 93.9 | 87.8 |
| BFCL-V4 (tool use) | 68.5 | 55.5 | 54.8 |
| TAU2-Bench (agents) | 79.0 | 69.8 | 58.5 |

Vision Benchmarks (Selected)

| Benchmark | Qwen3.5-27B | GPT-5-mini | Claude Sonnet 4.5 |
|---|---|---|---|
| MMMU | 82.3 | 79.0 | 79.6 |
| MMMU-Pro | 75.0 | 67.3 | 68.4 |
| MathVision | 86.0 | 71.9 | 71.1 |
| VideoMME (w/ sub) | 87.0 | 83.5 | 81.1 |
| OmniDocBench1.5 | 88.9 | 77.0 | 85.8 |

Recommended Sampling Parameters

Official recommendations from Qwen:

| Mode | temperature | top_p | top_k | min_p | presence_penalty | repetition_penalty |
|---|---|---|---|---|---|---|
| Thinking: general tasks | 1.0 | 0.95 | 20 | 0.0 | 1.5 | 1.0 |
| Thinking: precise coding (e.g. WebDev) | 0.6 | 0.95 | 20 | 0.0 | 0.0 | 1.0 |
| Non-thinking: general tasks | 0.7 | 0.8 | 20 | 0.0 | 1.5 | 1.0 |
| Non-thinking: reasoning tasks | 1.0 | 0.95 | 20 | 0.0 | 1.5 | 1.0 |
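When driving the server programmatically, the presets above can be kept in a small lookup table. A sketch (the mode keys are my own naming; the values mirror the table):

```python
# Qwen's recommended sampler presets, keyed by mode.
SAMPLER_PRESETS = {
    "thinking_general":       {"temperature": 1.0, "top_p": 0.95, "top_k": 20,
                               "min_p": 0.0, "presence_penalty": 1.5},
    "thinking_coding":        {"temperature": 0.6, "top_p": 0.95, "top_k": 20,
                               "min_p": 0.0, "presence_penalty": 0.0},
    "non_thinking_general":   {"temperature": 0.7, "top_p": 0.8,  "top_k": 20,
                               "min_p": 0.0, "presence_penalty": 1.5},
    "non_thinking_reasoning": {"temperature": 1.0, "top_p": 0.95, "top_k": 20,
                               "min_p": 0.0, "presence_penalty": 1.5},
}

def sampler_args(mode):
    """Return a copy of the preset so callers can override fields safely."""
    return dict(SAMPLER_PRESETS[mode])
```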

Support for sampling parameters varies across inference frameworks. TabbyAPI supports all of the above.


Compatibility

| Backend | Supported |
|---|---|
| TabbyAPI | ✅ Official backend for ExLlamaV3 |
| ExLlamaV3 Python | ✅ Direct API |
| Open WebUI | ✅ Via TabbyAPI's OpenAI endpoint |
| SillyTavern | ✅ Via TabbyAPI |
| vLLM | ⏳ Not yet; the format is designed to be portable (retains original tensor names) |
| SGLang | ⏳ Not yet; same as above |
| Transformers | ⏳ Not yet; same as above |
| llama.cpp / Ollama | ❌ Use GGUF quants instead |

Note on framework portability: Unlike EXL2, the EXL3 format retains the original HuggingFace tensor naming structure, making it feasible to extend support to vLLM, SGLang, and HF Transformers in the future. As of March 2026, ExLlamaV3 + TabbyAPI is the only supported inference stack.


Credits

License

Apache 2.0 — same as the original Qwen3.5-27B model.
