base_model_relation: quantized

Qwen3.5-27B - TevunahAi Multi-Level GPTQ

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-27B |
| Architecture | HYBRID (Gated DeltaNet + Full Attention, VLM) |
| Parameters | 27B (dense) |
| Context Length | 262K (native), extensible to ~1M |
| Quantization | TevunahAi Multi-Level GPTQ + EoRA |
| Original Size | ~54 GB (BF16) |
| Quantized Size | 17.8 GB |
| Compression | 67.0% reduction |
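As a sanity check, the size figures above are internally consistent: 27B parameters at BF16 (2 bytes each) is about 54 GB, and 17.8 GB is a ~67% reduction. A quick back-of-envelope in Python:

```python
# Back-of-envelope check of the size and compression figures above.
params = 27e9
bf16_gb = params * 2 / 1e9           # 2 bytes per BF16 parameter -> ~54 GB
quantized_gb = 17.8
reduction = 1 - quantized_gb / bf16_gb

print(f"BF16 size: {bf16_gb:.0f} GB")
print(f"Reduction: {reduction:.1%}")  # ~67.0%
```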

Model Description

Qwen3.5-27B is Alibaba Cloud's dense multimodal foundation model and the largest in the Qwen3.5 dense lineup. It uses a hybrid architecture that combines Gated Delta Networks (linear attention) with standard full attention in a 3:1 pattern across 64 decoder layers. The model is natively multimodal, trained from scratch on text, images, and video through early fusion, and it outperforms many 70B+ models from previous generations on reasoning and coding benchmarks.

This quantization preserves the full hybrid architecture with precision-aware layer treatment, applying EoRA rank-64 error correction across all quantizable layers for maximum quality retention.

Architecture Details

Qwen3.5-27B uses a 64-layer text decoder arranged in repeating 4-layer blocks:

Block Pattern (×16): [DeltaNet+MLP] × 3 → [FullAttn+MLP] × 1

| Component | Specification |
|---|---|
| Total Decoder Layers | 64 |
| Gated DeltaNet Layers | 48 (linear attention, O(n) complexity) |
| Full Attention Layers | 16 (GQA: 24 Q heads, 4 KV heads, head_dim=256) |
| Dense MLP | SiLU activation, intermediate_size=17,408 |
| Hidden Size | 5,120 |
| Vocab Size | 248K |
| Vision Encoder | DeepStack ViT, 27 layers, patch_size=16, hidden_size=1,152 |
| Multimodal RoPE | Interleaved sections [11, 11, 10] |
| Multi-Token Prediction | 1 MTP layer |
| Languages | 201 languages and dialects |

Why This Architecture Matters:

  • Gated DeltaNet keeps a constant-size recurrent state, so the 262K native context scales linearly in compute and memory rather than quadratically
  • Full attention layers at every 4th position ensure precise information retrieval
  • The 27B dense model achieves GPQA Diamond 85.5% and SWE-bench Verified 72.4%, outperforming many 70B+ models from previous generations
  • Fixing 72% of real GitHub issues on SWE-bench is not an abstract metric: it translates directly into utility for agentic coding
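The repeating 3:1 block pattern described above can be sketched directly. This hypothetical helper just assigns a layer type by position and confirms the 48/16 split across 64 layers:

```python
# Sketch of the repeating 4-layer block: 3 Gated DeltaNet layers followed by
# 1 full-attention layer, repeated 16 times across the 64-layer decoder.
def layer_type(layer_idx: int) -> str:
    # Positions 0, 1, 2 in each block use linear attention; position 3 is full attention.
    return "full_attention" if layer_idx % 4 == 3 else "gated_deltanet"

layers = [layer_type(i) for i in range(64)]
print(layers.count("gated_deltanet"), layers.count("full_attention"))  # 48 16
```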

Quantization Strategy

TevunahAi Multi-Level GPTQ with EoRA rank-64 applied to ALL quantizable layers. No MoE routing in the 27B variant means EoRA is affordable across the full model, providing comprehensive error correction.

| Component | Precision | Rationale |
|---|---|---|
| Full Attention (q/k/v/o_proj), 16 layers | INT8 + EoRA rank-64 | Quality preservation for critical attention layers |
| Linear Attention (in_proj_qkv/in_proj_z/out_proj), 48 layers | INT4 + EoRA rank-64 | DeltaNet projections; EoRA compensates for aggressive compression |
| Dense MLP (gate/up/down_proj), 64 layers | INT4 + EoRA rank-64 | Standard MLP compression with error correction |
| Vision Encoder, 27 layers | FP16 (unquantized) | Full precision preserved for visual understanding |
| Embeddings & LM Head | FP16 | Preserved for accuracy |
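One way to express the mixed-precision plan above is a pattern-to-bits map resolved per layer name. The regex patterns and helper here are illustrative only, not the exact GPTQModel configuration used for this release:

```python
import re

# Illustrative precision plan mirroring the table above (layer-name patterns are hypothetical).
PRECISION_PLAN = [
    (r"self_attn\.(q|k|v|o)_proj$", {"bits": 8, "eora_rank": 64}),            # full attention: INT8
    (r"linear_attn\.(in_proj_qkv|in_proj_z|out_proj)$", {"bits": 4, "eora_rank": 64}),  # DeltaNet: INT4
    (r"mlp\.(gate|up|down)_proj$", {"bits": 4, "eora_rank": 64}),             # dense MLP: INT4
]

def resolve(layer_name: str):
    """Return the quantization config for a layer, or None to keep it FP16."""
    for pattern, cfg in PRECISION_PLAN:
        if re.search(pattern, layer_name):
            return cfg
    return None  # vision encoder, embeddings, lm_head stay FP16

print(resolve("model.layers.3.self_attn.q_proj"))  # {'bits': 8, 'eora_rank': 64}
print(resolve("model.layers.0.mlp.gate_proj"))     # {'bits': 4, 'eora_rank': 64}
```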

Calibration

  • 2,048 samples (8× industry standard of 256)
  • 4,096 sequence length
  • Premium calibration for superior quality retention
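The calibration setup above amounts to 2,048 fixed-length sequences of 4,096 tokens. A minimal sketch of assembling such a set from pre-tokenized text streams (the tokenization step and data source are omitted; names here are illustrative):

```python
# Minimal sketch: slice pre-tokenized streams into fixed-length calibration windows
# (the real run used 2,048 samples at 4,096 tokens each).
def build_calibration(token_streams, n_samples=2048, seq_len=4096):
    samples = []
    for stream in token_streams:
        # Take non-overlapping full-length windows from each stream.
        for start in range(0, len(stream) - seq_len + 1, seq_len):
            samples.append(stream[start:start + seq_len])
            if len(samples) == n_samples:
                return samples
    return samples

# Toy usage: two fake token streams sliced into 4-token windows.
demo = build_calibration([list(range(10)), list(range(9))], n_samples=3, seq_len=4)
print([len(s) for s in demo])  # [4, 4, 4]
```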

Performance Benchmarks

Original Model (Qwen benchmarks):

| Benchmark | Score |
|---|---|
| MMLU-Pro | 86.1% |
| GPQA Diamond | 85.5% |
| SWE-bench Verified | 72.4% |
| Terminal-Bench 2.0 | 41.6% |
| MMMU-Pro (vision) | 74.0% |
| MathVision | 85.8% |
| LongBench v2 | 60.7% |
| VideoMME (w/ subs) | 86.5% |
| MMMLU (multilingual) | 84.5% |

Notable Comparisons:

  • Outperforms Gemma 3 27B by 18.6 points on MMLU-Pro (86.1 vs 67.5) and 43 points on GPQA Diamond (85.5 vs 42.4)
  • SWE-bench Verified 72.4% — fixes 72% of real GitHub issues, competitive with models at much larger scale
  • Surpasses many previous-generation 70B+ models across reasoning and coding benchmarks

Expected Quantized Performance:

  • Reasoning tasks: 97-99% of baseline (EoRA recovery on attention layers)
  • Code generation: 96-98% of baseline
  • Long context: 96-99% of baseline (DeltaNet advantage + EoRA compensation)
  • Vision tasks: 99-100% of baseline (encoder preserved at FP16)
  • General chat: 98-99% of baseline

Formal benchmarks pending — inference quality verified manually.

Usage

GPTQModel (Recommended):

```python
from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model = GPTQModel.load(
    "TevunahAi/Qwen3.5-27B-TevunahAi-GPTQ",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/Qwen3.5-27B-TevunahAi-GPTQ",
    trust_remote_code=True
)

# Generate
prompt = "Explain the difference between linear and quadratic attention complexity."
output = model.generate(
    **tokenizer(prompt, return_tensors='pt').to('cuda'),
    max_new_tokens=256
)
print(tokenizer.decode(output[0]))
```

With Thinking Mode (default):

```python
messages = [{"role": "user", "content": "Solve: What is the integral of x²·sin(x)?"}]

tokenized = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt"
).to('cuda')

outputs = model.generate(
    tokenized,
    max_new_tokens=2048,
    do_sample=True,           # required for the sampling parameters below to apply
    temperature=1.0,
    top_p=0.95,
    top_k=20,
    repetition_penalty=1.05   # transformers.generate has no presence_penalty argument
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Direct Response (disable thinking):

```python
tokenized = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    enable_thinking=False,
    add_generation_prompt=True,
    return_tensors="pt"
).to('cuda')

outputs = model.generate(
    tokenized,
    max_new_tokens=128,
    do_sample=False,
    num_beams=1
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Installation:

```shell
pip install gptqmodel "transformers>=4.48"
```

vLLM (Experimental):

```shell
pip install -U "vllm>=0.12.0"

vllm serve TevunahAi/Qwen3.5-27B-TevunahAi-GPTQ \
    --max-num-seqs 8 \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --trust-remote-code
```

Memory Requirements

Inference (quantized model):

  • Minimum: 20 GB VRAM (short context)
  • Recommended: 24-32 GB VRAM
  • For 262K context: 48 GB+ VRAM
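The 48 GB figure for full context is driven mostly by the KV cache of the 16 full-attention layers; the DeltaNet layers keep a constant-size state that does not grow with sequence length. A rough estimate from the GQA shape above (4 KV heads, head_dim 256), assuming an FP16 cache:

```python
# Rough KV-cache estimate for the 16 full-attention layers at 262K context.
# DeltaNet layers are excluded: their recurrent state does not grow with length.
full_attn_layers = 16
kv_heads = 4
head_dim = 256
context = 262_144
bytes_per_value = 2  # FP16

# 2x for keys and values
kv_bytes = full_attn_layers * 2 * kv_heads * head_dim * context * bytes_per_value
print(f"KV cache at 262K: {kv_bytes / 1e9:.1f} GB")  # ~17.2 GB
```

Added to the 17.8 GB of quantized weights plus activations, this is consistent with the 48 GB+ recommendation for full context.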

Quantization (reproduction):

  • Hardware Used: Dual Xeon Max 9480 (128GB HBM2e + 256GB DDR5) + RTX 5000 Ada 32GB
  • Time: 208.3 minutes (3.5 hours)

Quantization Details

| Specification | Value |
|---|---|
| Method | GPTQ + EoRA rank-64 |
| Quantizer | GPTQModel |
| Calibration Samples | 2,048 (8× industry standard) |
| Sequence Length | 4,096 tokens |
| desc_act | True (activation ordering) |
| sym | True (symmetric quantization) |
| group_size | 128 |
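The group_size=128 symmetric scheme in the table can be illustrated on a single weight group: every run of 128 weights shares one scale, and values map to signed 4-bit integers. This is a toy round-trip sketch, not the GPTQ solver itself (which additionally minimizes layer-wise error using Hessian information and, with desc_act, reorders columns by activation magnitude):

```python
# Toy symmetric 4-bit quantization of one 128-weight group (group_size=128).
import random

random.seed(0)
group = [random.gauss(0, 0.02) for _ in range(128)]

scale = max(abs(w) for w in group) / 7           # symmetric INT4 range: [-7, 7]
q = [max(-7, min(7, round(w / scale))) for w in group]
dequant = [qi * scale for qi in q]

# Round-trip error is bounded by half the quantization step.
max_err = max(abs(w - d) for w, d in zip(group, dequant))
print(f"scale={scale:.5f}, max round-trip error={max_err:.5f}")
```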

Use Cases

Ideal for:

  • Mathematical reasoning and graduate-level problem solving
  • Agentic coding (72.4% SWE-bench — real GitHub issue resolution)
  • Long-context analysis (262K-1M tokens, linear scaling)
  • Multimodal understanding (text + images + video)
  • Tool use and multi-step reasoning
  • Multilingual deployment (201 languages)
  • Single-GPU deployment at quantized precision

Technical Specifications

| Specification | Value |
|---|---|
| Model Family | Qwen3.5 |
| Variant | 27B (Dense, Hybrid DeltaNet) |
| Total Parameters | 27B |
| Total Layers | 64 (text) + 27 (vision) |
| DeltaNet Layers | 48 |
| Full Attention Layers | 16 |
| Attention Heads | 24 Q, 4 KV (GQA) |
| Head Dimension | 256 |
| Hidden Size | 5,120 |
| Intermediate Size | 17,408 |
| Linear Attention | 16 key heads, 48 value heads, head_dim=128, conv_kernel=4 |
| Context Length | 262K (native), ~1M (extended) |
| Vocab Size | 248K |
| Supported Languages | 201 |
| Multimodal | Text, Image, Video (native) |

License

Apache 2.0

Citation

```bibtex
@software{qwen35_27b_gptq_2026,
  title = {Qwen3.5-27B - TevunahAi Multi-Level GPTQ},
  author = {TevunahAi},
  year = {2026},
  note = {Multi-Level GPTQ with EoRA rank-64 for hybrid Gated DeltaNet + Attention architecture},
  url = {https://huggingface.co/TevunahAi/Qwen3.5-27B-TevunahAi-GPTQ}
}

@misc{qwen35_2026,
  title = {Qwen3.5 Technical Report},
  author = {Qwen Team, Alibaba Cloud},
  year = {2026},
  url = {https://qwenlm.github.io/blog/qwen3.5/}
}

@article{liu2024eora,
  title = {EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation},
  author = {Liu, Shih-Yang and Wang, Huck Yang and Cheng, Hong-Yi Michael and Khailany, Brucek and Molchanov, Pavlo},
  journal = {arXiv preprint arXiv:2410.21271},
  year = {2024},
  url = {https://arxiv.org/abs/2410.21271},
  note = {NVIDIA Research}
}
```

Acknowledgments

This quantization leverages the hybrid Gated DeltaNet + Full Attention architecture across 64 decoder layers, requiring precision-aware treatment of fundamentally different layer types (linear vs quadratic attention). EoRA rank-64 error correction is applied comprehensively across all quantizable layers to maximize quality retention at aggressive compression ratios.


Quantized by TevunahAi LLC

https://huggingface.co/TevunahAi · www.Tevunah.ai
