---
base_model_relation: quantized
---

# Qwen3.5-27B - TevunahAi Multi-Level GPTQ
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-27B |
| Architecture | HYBRID (Gated DeltaNet + Full Attention, VLM) |
| Parameters | 27B (dense) |
| Context Length | 262K (native), extensible to ~1M |
| Quantization | TevunahAi Multi-Level GPTQ + EoRA |
| Original Size | ~54 GB (BF16) |
| Quantized Size | 17.8 GB |
| Compression | 67.0% reduction |
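The compression figures in the table above can be sanity-checked with a few lines of Python (sizes taken from the table; the effective bits-per-parameter figure is a derived estimate, not a published number):

```python
# Sanity-check the compression ratio from the table above.
original_gb = 54.0   # BF16 checkpoint (~2 bytes/param x 27B params)
quantized_gb = 17.8  # Multi-Level GPTQ + EoRA checkpoint

reduction = (1 - quantized_gb / original_gb) * 100
bits_per_param = quantized_gb * 8 / 27  # GB -> Gbit, per Gparam

print(f"Reduction: {reduction:.1f}%")              # ~67.0%
print(f"Effective bits/param: {bits_per_param:.1f}")  # ~5.3
```

The ~5.3 effective bits per parameter reflects the mixed INT8/INT4 scheme plus the FP16 vision encoder and embeddings described below.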
## Model Description

Qwen3.5-27B is Alibaba Cloud's dense multimodal foundation model and the largest in the Qwen3.5 dense lineup. It uses a hybrid architecture that combines Gated Delta Networks (linear attention) with standard full attention in a 3:1 pattern across 64 decoder layers. The model is natively multimodal, trained from scratch on text, images, and video through early fusion, and it outperforms many 70B+ models from previous generations on reasoning and coding benchmarks.
This quantization preserves the full hybrid architecture with precision-aware layer treatment, applying EoRA rank-64 error correction across all quantizable layers for maximum quality retention.
## Architecture Details

Qwen3.5-27B uses a 64-layer text decoder arranged in repeating 4-layer blocks:

**Block Pattern (×16):** `[DeltaNet+MLP] × 3 → [FullAttn+MLP] × 1`
| Component | Specification |
|---|---|
| Total Decoder Layers | 64 |
| Gated DeltaNet Layers | 48 (linear attention, O(n) complexity) |
| Full Attention Layers | 16 (GQA: 24 Q heads, 4 KV heads, head_dim=256) |
| Dense MLP | SiLU activation, intermediate_size=17,408 |
| Hidden Size | 5,120 |
| Vocab Size | 248K |
| Vision Encoder | DeepStack ViT, 27 layers, patch_size=16, hidden_size=1,152 |
| Multimodal RoPE | Interleaved sections [11, 11, 10] |
| Multi-Token Prediction | 1 MTP layer |
| Languages | 201 languages and dialects |
**Why This Architecture Matters:**
- Gated DeltaNet provides constant memory complexity for long-range context — 262K native context with linear scaling instead of quadratic
- Full attention layers at every 4th position ensure precise information retrieval
- 27B dense model achieves GPQA Diamond 85.5% and SWE-bench Verified 72.4%, outperforming many 70B+ models from previous generations
- The SWE-bench score reflects resolution of real GitHub issues rather than an abstract metric, which translates directly into utility for agentic coding
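The 3:1 interleaving above can be sketched in a few lines of Python. The exact layer indices are an illustrative assumption (three DeltaNet layers followed by one full-attention layer per block); the real mapping is defined by the model's configuration:

```python
# Illustrative sketch of the 3:1 DeltaNet / full-attention interleaving.
# Assumes each 4-layer block is three DeltaNet layers followed by one
# full-attention layer; the authoritative mapping lives in the model config.
NUM_LAYERS = 64

layer_types = [
    "full_attention" if (i % 4 == 3) else "linear_attention"
    for i in range(NUM_LAYERS)
]

print(layer_types.count("linear_attention"))  # 48 DeltaNet layers
print(layer_types.count("full_attention"))    # 16 full-attention layers
```

Under this assumed layout, full attention appears at every 4th position, matching the 48/16 split in the table above.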
## Quantization Strategy

TevunahAi Multi-Level GPTQ with EoRA rank-64 applied to **all** quantizable layers. Because the 27B variant has no MoE routing, EoRA is affordable across the full model, providing comprehensive error correction.
| Component | Precision | Rationale |
|---|---|---|
| Full Attention (q/k/v/o_proj) — 16 layers | INT8 + EoRA rank-64 | Quality preservation for critical attention layers |
| Linear Attention (in_proj_qkv/in_proj_z/out_proj) — 48 layers | INT4 + EoRA rank-64 | DeltaNet projections, EoRA compensates for aggressive compression |
| Dense MLP (gate/up/down_proj) — 64 layers | INT4 + EoRA rank-64 | Standard MLP compression with error correction |
| Vision Encoder — 27 layers | FP16 (unquantized) | Full precision preserved for visual understanding |
| Embeddings & LM Head | FP16 | Preserved for accuracy |
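The idea behind the EoRA column can be shown with a toy example: approximate the quantization error `W - W_q` with a rank-r product `B @ A`, so that `W_q + B @ A` is closer to the original weight `W`. Real EoRA picks the factors in an activation-aware eigenspace; in this simplified sketch the error is constructed to be exactly rank-1, so the compensation is exact:

```python
# Toy illustration of low-rank error compensation (the idea behind EoRA).
# The error here is built to be exactly rank-1, so a rank-1 correction
# recovers the original weight; real quantization error is only
# approximately low-rank, and EoRA chooses the factors activation-aware.
u = [1.0, 2.0]            # column factor (2x1)
v = [0.5, -0.25, 0.125]   # row factor (1x3)

E = [[ui * vj for vj in v] for ui in u]       # rank-1 error, 2x3
W_q = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]      # "quantized" weight
W = [[W_q[i][j] + E[i][j] for j in range(3)] for i in range(2)]

# Compensated weight: W_q + u v^T recovers W exactly in this toy case
W_comp = [[W_q[i][j] + u[i] * v[j] for j in range(3)] for i in range(2)]
print(W_comp == W)  # True
```

A rank-64 correction stores only two thin matrices per layer, which is why it is cheap enough to apply across all 64 decoder layers.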
## Calibration

- 2,048 samples (8× the common default of 256)
- 4,096-token sequence length
- Extended calibration budget for better quality retention
## Performance Benchmarks

**Original Model (Qwen benchmarks):**
| Benchmark | Score |
|---|---|
| MMLU-Pro | 86.1% |
| GPQA Diamond | 85.5% |
| SWE-bench Verified | 72.4% |
| Terminal-Bench 2.0 | 41.6% |
| MMMU-Pro (vision) | 74.0% |
| MathVision | 85.8% |
| LongBench v2 | 60.7% |
| VideoMME (w/ subs) | 86.5% |
| MMMLU (multilingual) | 84.5% |
**Notable Comparisons:**

- Outperforms Gemma 3 27B by 18.6 points on MMLU-Pro (86.1 vs 67.5) and 43.1 points on GPQA Diamond (85.5 vs 42.4)
- SWE-bench Verified 72.4% — fixes 72% of real GitHub issues, competitive with models at much larger scale
- Surpasses many previous-generation 70B+ models across reasoning and coding benchmarks
**Expected Quantized Performance:**
- Reasoning tasks: 97-99% of baseline (EoRA recovery on attention layers)
- Code generation: 96-98% of baseline
- Long context: 96-99% of baseline (DeltaNet advantage + EoRA compensation)
- Vision tasks: 99-100% of baseline (encoder preserved at FP16)
- General chat: 98-99% of baseline
Formal post-quantization benchmarks are pending; inference quality has been verified manually.
## Usage

**GPTQModel (Recommended):**

```python
from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model = GPTQModel.load(
    "TevunahAi/Qwen3.5-27B-TevunahAi-GPTQ",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/Qwen3.5-27B-TevunahAi-GPTQ",
    trust_remote_code=True,
)

# Generate
prompt = "Explain the difference between linear and quadratic attention complexity."
output = model.generate(
    **tokenizer(prompt, return_tensors="pt").to("cuda"),
    max_new_tokens=256,
)
print(tokenizer.decode(output[0]))
```
**With Thinking Mode (default):**

```python
messages = [{"role": "user", "content": "Solve: What is the integral of x²·sin(x)?"}]
tokenized = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    tokenized,
    max_new_tokens=2048,
    do_sample=True,  # required for the sampling parameters below to take effect
    temperature=1.0,
    top_p=0.95,
    top_k=20,
    # presence_penalty is a vLLM/SGLang sampling parameter; with
    # transformers' generate(), use repetition_penalty instead.
    repetition_penalty=1.05,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
**Direct Response (disable thinking):**

```python
tokenized = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    enable_thinking=False,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    tokenized,
    max_new_tokens=128,
    do_sample=False,  # greedy decoding
    num_beams=1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
**Installation:**

```shell
pip install gptqmodel "transformers>=4.48"
```

**vLLM (Experimental):**

```shell
pip install -U "vllm>=0.12.0"

vllm serve TevunahAi/Qwen3.5-27B-TevunahAi-GPTQ \
  --max-num-seqs 8 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --trust-remote-code
```
## Memory Requirements

**Inference (quantized model):**

- Minimum: 20 GB VRAM (short context)
- Recommended: 24-32 GB VRAM
- For 262K context: 48 GB+ VRAM
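A back-of-the-envelope sketch of why long context needs the extra headroom: only the 16 full-attention layers accumulate a growing KV cache, while the DeltaNet layers carry a constant-size recurrent state. Assuming an FP16 KV cache and the head geometry from the architecture table:

```python
# Rough KV-cache estimate for the 262K native context.
# Assumes FP16 (2-byte) KV entries and that only the 16 full-attention
# layers cache K/V; DeltaNet layers keep a constant-size state instead.
layers = 16          # full-attention layers
kv_heads = 4         # GQA KV heads
head_dim = 256
seq_len = 262_144    # 262K tokens
bytes_per_elem = 2   # FP16

kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem  # K and V
print(f"KV cache at 262K tokens: {kv_bytes / 2**30:.0f} GiB")  # 16 GiB
```

Roughly 16 GiB of KV cache on top of the ~18 GB of weights explains the 48 GB+ recommendation; a standard 64-layer full-attention model would need about 4× as much cache.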
**Quantization (reproduction):**

- Hardware Used: Dual Xeon Max 9480 (128 GB HBM2e + 256 GB DDR5) + RTX 5000 Ada 32 GB
- Time: 208.3 minutes (~3.5 hours)
## Quantization Details
| Specification | Value |
|---|---|
| Method | GPTQ + EoRA rank-64 |
| Quantizer | GPTQModel |
| Calibration Samples | 2,048 (8× industry standard) |
| Sequence Length | 4,096 tokens |
| desc_act | True (activation ordering) |
| sym | True (symmetric quantization) |
| group_size | 128 |
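The shared settings in the table above map onto a GPTQModel quantization config roughly like the sketch below. This is a minimal illustration only: the multi-level INT8/INT4 split and the per-module EoRA rank-64 wiring are handled by TevunahAi's pipeline and are not captured by this base config.

```python
# Minimal sketch of the shared GPTQ settings from the table above.
# The INT8 attention layers and EoRA rank-64 correction are applied per
# module by the multi-level pipeline; only the common knobs are shown.
from gptqmodel import GPTQModel, QuantizeConfig

config = QuantizeConfig(
    bits=4,          # INT4 base precision (DeltaNet projections, MLPs)
    group_size=128,
    desc_act=True,   # activation ordering
    sym=True,        # symmetric quantization
)
```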
## Use Cases

**Ideal for:**
- Mathematical reasoning and graduate-level problem solving
- Agentic coding (72.4% SWE-bench — real GitHub issue resolution)
- Long-context analysis (262K-1M tokens, linear scaling)
- Multimodal understanding (text + images + video)
- Tool use and multi-step reasoning
- Multilingual deployment (201 languages)
- Single-GPU deployment at quantized precision
## Technical Specifications
| Specification | Value |
|---|---|
| Model Family | Qwen3.5 |
| Variant | 27B (Dense, Hybrid DeltaNet) |
| Total Parameters | 27B |
| Total Layers | 64 (text) + 27 (vision) |
| DeltaNet Layers | 48 |
| Full Attention Layers | 16 |
| Attention Heads | 24 Q, 4 KV (GQA) |
| Head Dimension | 256 |
| Hidden Size | 5,120 |
| Intermediate Size | 17,408 |
| Linear Attention | 16 key heads, 48 value heads, head_dim=128, conv_kernel=4 |
| Context Length | 262K (native), ~1M (extended) |
| Vocab Size | 248K |
| Supported Languages | 201 |
| Multimodal | Text, Image, Video (native) |
## License
Apache 2.0
## Citation

```bibtex
@software{qwen35_27b_gptq_2026,
  title  = {Qwen3.5-27B - TevunahAi Multi-Level GPTQ},
  author = {TevunahAi},
  year   = {2026},
  note   = {Multi-Level GPTQ with EoRA rank-64 for hybrid Gated DeltaNet + Attention architecture},
  url    = {https://huggingface.co/TevunahAi/Qwen3.5-27B-TevunahAi-GPTQ}
}

@misc{qwen35_2026,
  title  = {Qwen3.5 Technical Report},
  author = {Qwen Team, Alibaba Cloud},
  year   = {2026},
  url    = {https://qwenlm.github.io/blog/qwen3.5/}
}

@article{liu2024eora,
  title   = {EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation},
  author  = {Liu, Shih-Yang and Wang, Huck Yang and Cheng, Hong-Yi Michael and Khailany, Brucek and Molchanov, Pavlo},
  journal = {arXiv preprint arXiv:2410.21271},
  year    = {2024},
  url     = {https://arxiv.org/abs/2410.21271},
  note    = {NVIDIA Research}
}
```
## Acknowledgments

This quantization leverages the hybrid Gated DeltaNet + Full Attention architecture across 64 decoder layers, requiring precision-aware treatment of fundamentally different layer types (linear vs. quadratic attention). EoRA rank-64 error correction is applied across all quantizable layers to maximize quality retention at aggressive compression ratios.

Quantized by TevunahAi LLC