Qwen3.5-27B — EXL3 3.0bpw (h6)

ExLlamaV3 EXL3 quantization of Qwen/Qwen3.5-27B at 3.0 bits per weight with 6-bit lm_head.

| Property | Value |
|---|---|
| Original model | Qwen/Qwen3.5-27B |
| Quantization format | EXL3 (trellis-coded, QTIP-based) |
| Average bpw | 3.0 |
| Head bits (lm_head) | 6 |
| Model weights size | ~10.4 GB |
| Measured VRAM (loaded) | ~12.7 GB (H100, 8k context, Q8 KV cache) |
| Architecture | Hybrid GDN + Full Attention (64 layers: 48× linear + 16× full) |
| Context length | 262,144 native (up to 1M with YaRN) |
| Vision | Yes; unified vision-language (ViT encoder included) |
| Languages | 201 |

VRAM Requirements

Thanks to the hybrid Gated DeltaNet architecture, only 16 of 64 layers use full attention with KV cache. The remaining 48 layers use fixed-size recurrent state. This makes long-context inference dramatically cheaper.

| Context Length | Q8 KV Cache | Q4 KV Cache | Total VRAM (Q8) | Total VRAM (Q4) |
|---|---|---|---|---|
| 8,192 | ~0.25 GB | ~0.12 GB | ~11.2 GB | ~11.1 GB |
| 32,768 | ~1.0 GB | ~0.5 GB | ~11.9 GB | ~11.4 GB |
| 65,536 | ~2.0 GB | ~1.0 GB | ~12.9 GB | ~11.9 GB |
| 131,072 | ~4.0 GB | ~2.0 GB | ~14.9 GB | ~12.9 GB |
| 163,840 | ~4.9 GB | ~2.4 GB | ~15.8 GB | ~13.3 GB |
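The per-context KV-cache figures above can be reproduced from the full-attention geometry given later in this card (16 full-attention layers, 4 KV heads, head_dim 256), assuming the Q8 cache stores roughly one byte per element. A minimal sketch:

```python
# Estimate KV-cache size for the 16 full-attention layers.
# Geometry from the model card: 4 KV heads, head_dim 256.
# bytes_per_el: ~1.0 for Q8, ~0.5 for Q4, 2.0 for FP16.

def kv_cache_gib(context_len, bytes_per_el=1.0,
                 full_layers=16, kv_heads=4, head_dim=256):
    # 2x for the separate K and V tensors
    elements = 2 * full_layers * kv_heads * head_dim * context_len
    return elements * bytes_per_el / 2**30

print(f"{kv_cache_gib(8192):.2f} GiB")    # Q8 at 8k context -> 0.25 GiB
print(f"{kv_cache_gib(131072):.2f} GiB")  # Q8 at 128k context -> 4.00 GiB
```

The results match the table; a standard 64-layer full-attention model of the same shape would need 4x the cache at every context length.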

Fits on a 16 GB GPU (RTX 5080 / RTX 4060 Ti 16GB) with 160k+ context using the Q8 KV cache.

Measured: the model loads at ~12.7 GB VRAM with the default 8k cache at Q8 precision, leaving ~3.3 GB of headroom on a 16 GB card to grow the KV cache to 100k+ tokens.


Quick Start

Option 1: TabbyAPI (Recommended — OpenAI-compatible server)

# Clone TabbyAPI
git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI

# Edit config.yml
cat > config.yml << 'EOF'
network:
  host: 0.0.0.0
  port: 5000

backend: exllamav3

model:
  model_dir: /path/to/models
  model_name: Qwen3.5-27B-EXL3-3.0bpw-h6
  max_seq_len: 163840
  cache_size: 163840
  cache_mode: "8,8"
  chunk_size: 2048
  vision: true
EOF

# Start the server
python main.py

Then query:

curl http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer <your-api-key>" \
  -d '{
    "model": "Qwen3.5-27B-EXL3-3.0bpw-h6",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 512,
    "temperature": 0.7,
    "min_p": 0.1
  }'
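The same request can be issued from Python using only the standard library. A minimal sketch (the URL, model name, and API key are placeholders; replace them with your TabbyAPI values):

```python
import json
import urllib.request

API_URL = "http://localhost:5000/v1/chat/completions"
API_KEY = "your-api-key"  # placeholder — use your TabbyAPI key

def build_chat_request(prompt, max_tokens=512, temperature=0.7, min_p=0.1):
    """Build an OpenAI-style chat completion request for the TabbyAPI server."""
    body = json.dumps({
        "model": "Qwen3.5-27B-EXL3-3.0bpw-h6",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "min_p": min_p,
    }).encode()
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {API_KEY}"},
    )

# With the server running:
# with urllib.request.urlopen(build_chat_request("Hello!")) as resp:
#     reply = json.loads(resp.read())["choices"][0]["message"]["content"]
```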

Option 2: ExLlamaV3 Python API (Direct)

from exllamav3 import Model, Tokenizer, Cache, Generator

model_dir = "/path/to/Qwen3.5-27B-EXL3-3.0bpw-h6"

model = Model(model_dir)
model.load()

tokenizer = Tokenizer(model)
cache = Cache(model, max_num_tokens=32768)
generator = Generator(model, tokenizer, cache)

output = generator.generate(
    prompt="<|im_start|>user\nExplain quantum computing briefly.<|im_end|>\n<|im_start|>assistant\n",
    max_new_tokens=512,
    temperature=0.7,
    min_p=0.1,
)
print(output)

Dependencies

Via TabbyAPI (easiest)

TabbyAPI's setup script installs everything automatically: ExLlamaV3, Triton, Flash Attention, and all CUDA dependencies. No manual pip installs are needed.

git clone https://github.com/theroyallab/tabbyAPI
cd tabbyAPI
# Run the setup script — it creates a venv and installs all deps
./start.sh

Via ExLlamaV3 directly

If running without TabbyAPI:

pip install exllamav3

# Required for Qwen3.5 Gated DeltaNet layers
pip install triton

# Optional — slightly improves GDN layer performance
pip install causal-conv1d

Note: Qwen3.5's Flash Linear Attention requires Triton. The first inference may be slow while Triton JIT-compiles its kernels; subsequent runs are faster once the compiled kernels are cached.


Quantization Details

Quantized using ExLlamaV3 convert.py:

python convert.py \
  -i Qwen/Qwen3.5-27B \
  -o Qwen3.5-27B-EXL3-3.0bpw-h6 \
  -w /tmp/exl3-work \
  -b 3.0 \
  -hb 6

EXL3 uses trellis-coded quantization (a streamlined variant of QTIP) with Hadamard transforms, LDL decomposition, and Viterbi-optimal encoding to achieve significantly better quality-per-bit than naive round-to-nearest methods. At 3.0 bpw, EXL3 retains excellent perplexity and downstream task quality.
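For intuition about the baseline that EXL3 improves on, here is a toy round-to-nearest 3-bit quantizer. This is illustrative only, not EXL3's actual trellis encoder: it rounds each weight independently, whereas trellis coding shapes the error jointly across weights for lower distortion at the same bpw.

```python
import random

def rtn_quantize(weights, bits=3):
    """Naive symmetric round-to-nearest quantization (the baseline EXL3 beats)."""
    qmax = 2 ** (bits - 1) - 1                # 3 at 3 bits
    scale = max(abs(x) for x in weights) / qmax
    # Round each weight to the nearest level, independently of its neighbors
    q = [max(-qmax, min(qmax, round(x / scale))) for x in weights]
    return [c * scale for c in q]             # dequantized values

random.seed(0)
w = [random.gauss(0, 1) for _ in range(1000)]
deq = rtn_quantize(w)
mse = sum((a - b) ** 2 for a, b in zip(w, deq)) / len(w)
```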


About Qwen3.5-27B

Qwen3.5-27B is Alibaba's dense multimodal foundation model featuring a hybrid architecture combining Gated Delta Networks (linear attention) and standard full attention in a 3:1 ratio. This design enables efficient long-context inference with minimal KV cache memory growth.

Key Highlights

  • Unified Vision-Language: Early fusion training on multimodal tokens
  • Efficient Hybrid Architecture: Gated Delta Networks + full attention (3:1 ratio)
  • 262K native context (extensible to 1M via YaRN)
  • 201 languages supported
  • Competitive benchmarks: GPQA Diamond 85.5, SWE-bench Verified 72.4, LiveCodeBench v6 80.7
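Extending the context window with YaRN is typically configured through a rope_scaling override. A hedged sketch, following the pattern Qwen documents for its other models (the exact field values for this model are an assumption; check the original model card before use):

```python
# Hypothetical YaRN override: a factor of 4.0 takes the 262,144-token
# native window to ~1M tokens. Field names follow Qwen's usual
# rope_scaling convention; verify against the upstream config.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262144,
}
extended_context = int(262144 * rope_scaling["factor"])  # 1,048,576
```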

Architecture

64 layers: 16 × (3 × (Gated DeltaNet → FFN) → 1 × (Full Attention → FFN))

Gated DeltaNet (linear attention):
  - Q/K heads: 16, head_dim: 128
  - V heads: 48, head_dim: 128
  - Fixed-size recurrent state (no KV cache growth)

Full Attention (standard GQA):
  - Q heads: 24, KV heads: 4, head_dim: 256
  - RoPE dimension: 64
  - Standard KV cache (grows with context)

FFN: intermediate_size 17408, SiLU activation
Vision: 27-layer ViT, 1152 hidden, patch_size 16
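The 3:1 interleaving above can be sanity-checked in a few lines (a sketch; the layer-type names are illustrative, not the model's internal identifiers):

```python
# Build the 64-layer pattern: each of 16 blocks is three Gated DeltaNet
# layers followed by one full-attention layer (each paired with an FFN).
layers = (["gated_deltanet"] * 3 + ["full_attention"]) * 16

assert len(layers) == 64
assert layers.count("gated_deltanet") == 48   # fixed-size recurrent state
assert layers.count("full_attention") == 16   # KV cache grows with context
```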

Benchmark Results (Selected)

| Benchmark | Qwen3.5-27B | GPT-5-mini | Qwen3-235B-A22B |
|---|---|---|---|
| MMLU-Pro | 86.1 | 83.7 | 84.4 |
| GPQA Diamond | 85.5 | 82.8 | 81.1 |
| SWE-bench Verified | 72.4 | 72.0 | -- |
| LiveCodeBench v6 | 80.7 | 80.5 | 75.1 |
| IFEval | 95.0 | 93.9 | 87.8 |
| BFCL-V4 (tool use) | 68.5 | 55.5 | 54.8 |
| TAU2-Bench (agents) | 79.0 | 69.8 | 58.5 |

Vision Benchmarks (Selected)

| Benchmark | Qwen3.5-27B | GPT-5-mini | Claude Sonnet 4.5 |
|---|---|---|---|
| MMMU | 82.3 | 79.0 | 79.6 |
| MMMU-Pro | 75.0 | 67.3 | 68.4 |
| MathVision | 86.0 | 71.9 | 71.1 |
| VideoMME (w/ sub) | 87.0 | 83.5 | 81.1 |
| OmniDocBench1.5 | 88.9 | 77.0 | 85.8 |

Recommended Sampling Parameters

Official recommendations from Qwen:

| Mode | temperature | top_p | top_k | min_p | presence_penalty | repetition_penalty |
|---|---|---|---|---|---|---|
| Thinking: general tasks | 1.0 | 0.95 | 20 | 0.0 | 1.5 | 1.0 |
| Thinking: precise coding (e.g. WebDev) | 0.6 | 0.95 | 20 | 0.0 | 0.0 | 1.0 |
| Non-thinking: general tasks | 0.7 | 0.8 | 20 | 0.0 | 1.5 | 1.0 |
| Non-thinking: reasoning tasks | 1.0 | 0.95 | 20 | 0.0 | 1.5 | 1.0 |
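When driving the server programmatically, the presets above can be kept in a small lookup table. A sketch (the mode keys are my own naming; the values mirror the table):

```python
# Qwen's recommended sampler presets, keyed by mode.
SAMPLER_PRESETS = {
    "thinking_general":       {"temperature": 1.0, "top_p": 0.95, "top_k": 20,
                               "min_p": 0.0, "presence_penalty": 1.5},
    "thinking_coding":        {"temperature": 0.6, "top_p": 0.95, "top_k": 20,
                               "min_p": 0.0, "presence_penalty": 0.0},
    "non_thinking_general":   {"temperature": 0.7, "top_p": 0.8,  "top_k": 20,
                               "min_p": 0.0, "presence_penalty": 1.5},
    "non_thinking_reasoning": {"temperature": 1.0, "top_p": 0.95, "top_k": 20,
                               "min_p": 0.0, "presence_penalty": 1.5},
}

def sampler_args(mode):
    """Return a copy of the preset so callers can override fields safely."""
    return dict(SAMPLER_PRESETS[mode])
```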

Support for sampling parameters varies across inference frameworks. TabbyAPI supports all of the above.


Compatibility

| Backend | Supported |
|---|---|
| TabbyAPI | ✅ Official backend for ExLlamaV3 |
| ExLlamaV3 Python | ✅ Direct API |
| Open WebUI | ✅ Via TabbyAPI's OpenAI endpoint |
| SillyTavern | ✅ Via TabbyAPI |
| vLLM | ⏳ Not yet; the format is designed to be portable (retains original tensor names) |
| SGLang | ⏳ Not yet; same as above |
| Transformers | ⏳ Not yet; same as above |
| llama.cpp / Ollama | ❌ Use GGUF quants instead |

Note on framework portability: Unlike EXL2, the EXL3 format retains the original HuggingFace tensor naming structure, making it feasible to extend support to vLLM, SGLang, and HF Transformers in the future. As of March 2026, ExLlamaV3 + TabbyAPI is the only supported inference stack.


Credits

License

Apache 2.0 — same as the original Qwen3.5-27B model.
