Swiss AI Apertus-8B-Instruct - TevunahAi Ultra-Hybrid GPTQ with EoRA

Model Details

| Property | Value |
|---|---|
| Base Model | swiss-ai/Apertus-8B-Instruct-2509 |
| Architecture | Apertus Dense Decoder-only Transformer |
| Parameters | 8B |
| Context Length | 65,536 tokens |
| Languages | 1,811 |
| Quantization | TevunahAi Ultra-Hybrid GPTQ + EoRA |
| Original Size | ~16 GB (BF16) |
| Quantized Size | ~6 GB |
| Compression | ~63% reduction |
| Active VRAM | ~9 GB (with inference overhead) |

Architecture Breakdown

Swiss AI Apertus-8B-Instruct is a fully open, transparent multilingual model developed by EPFL, ETH Zurich, and CSCS:

Layer Composition (32 total layers)

  • 32 Transformer Decoder Layers: Standard dense attention architecture
  • 32 Attention Heads: Full multi-head attention
  • Hidden Size: 4,096
  • Intermediate Size: 14,336
  • Vocab Size: 131,072 (massive multilingual vocabulary)
  • Novel xIELU Activation: New activation function developed for Apertus
  • RMSNorm + RoPE: Modern normalization and position encoding
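As a sanity check, the listed dimensions roughly reproduce the 8B parameter count. The sketch below assumes a gated (SwiGLU-style) MLP with three projection matrices and tied input/output embeddings; both are assumptions, since the exact Apertus layout is not stated here:

```python
# Back-of-the-envelope parameter count from the dimensions above.
# Assumes a gated MLP (3 projections) and tied embeddings -- both
# are assumptions; the exact Apertus layout may differ.
hidden, inter, vocab, layers = 4096, 14336, 131072, 32

attn = 4 * hidden * hidden   # Q, K, V, O projections
mlp = 3 * hidden * inter     # gate, up, down projections
embeddings = vocab * hidden  # tied input/output embedding

total = layers * (attn + mlp) + embeddings
print(f"~{total / 1e9:.1f}B parameters")  # → ~8.3B parameters
```

Landing near 8B suggests the stated dimensions are internally consistent.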

Why This Matters

  • 1,811 languages: Unprecedented multilingual support
  • 65K context window: Handle very long documents
  • 15 trillion training tokens: Extensive pretraining
  • SFT + QRPO alignment: Optimized for instruction following
  • Fully open: Apache 2.0 compatible, reproducible

Quantization Strategy

TevunahAi Ultra-Hybrid Mixed-Precision with EoRA Error Recovery

This quantization uses EoRA (Eigenspace Low-Rank Approximation), NVIDIA's technique for recovering quantization error through low-rank adapters computed during the quantization process.
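The core idea can be illustrated with a plain SVD of the quantization error (actual EoRA projects the error into an activation-weighted eigenspace, so this NumPy sketch is a simplification):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)

# Crude symmetric 4-bit quantization (illustration only).
scale = np.abs(W).max() / 7
W_q = np.clip(np.round(W / scale), -8, 7) * scale

# Quantization error and its best rank-r approximation via SVD.
E = W - W_q
r = 32
U, S, Vt = np.linalg.svd(E, full_matrices=False)
B = U[:, :r] * S[:r]  # (256, r) low-rank adapter factor
A = Vt[:r, :]         # (r, 256) low-rank adapter factor

# Adding the adapter B @ A to the quantized weight reduces the error.
err_plain = np.linalg.norm(W - W_q)
err_adapter = np.linalg.norm(W - (W_q + B @ A))
```

At inference the adapter runs as a cheap low-rank side path alongside the quantized matmul, which is why larger ranks (like the 2048 used at the boundaries) buy more error recovery at modest cost.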

| Component | Precision | EoRA Rank | Rationale |
|---|---|---|---|
| Layer 0 (all projections) | INT8 | 2048 | Maximum error correction at input; errors propagate through the entire model |
| Layer 31 (all projections) | INT8 | 2048 | Maximum error correction at output; directly affects token prediction |
| Attention Q/K/V/O (layers 1-30) | INT8 | 128 | Quality preservation for reasoning |
| MLP (layers 1-24) | INT4 | 128 | Maximum compression in middle layers |
| MLP (layers 25-30) | INT8 | 128 | Higher precision near output |
| Embeddings | BF16 | - | Preserved for token accuracy |
| LM Head | BF16 | - | Preserved for output quality |

Why EoRA-2048 Boundaries?

  • Layer 0: First layer errors compound through all 31 subsequent layers
  • Layer 31: Final layer directly determines next token prediction
  • Rank 2048: Maximum error correction capacity (~8-10% recovery vs ~5% for rank 128)
  • 224 layer-specific rules: Not a blanket quantization - each projection optimized individually
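The 224 rules follow directly from the table above: 32 layers times 7 projections. A hypothetical sketch of how such a rule set could be generated (module names like `q_proj` and the rule-dict shape are illustrative, not the exact GPTQModel configuration syntax):

```python
# Hypothetical sketch of the per-layer precision/rank rules described
# above. Module names and dict layout are illustrative; the real
# GPTQModel rule syntax may differ.
rules = {}
for layer in range(32):
    boundary = layer in (0, 31)
    rank = 2048 if boundary else 128
    # Attention projections: INT8 everywhere.
    for proj in ("q_proj", "k_proj", "v_proj", "o_proj"):
        rules[f"layers.{layer}.self_attn.{proj}"] = {"bits": 8, "eora_rank": rank}
    # MLP: INT4 in the middle (1-24), INT8 near the output (25-30)
    # and at the boundary layers.
    mlp_bits = 8 if (boundary or layer >= 25) else 4
    for proj in ("gate_proj", "up_proj", "down_proj"):
        rules[f"layers.{layer}.mlp.{proj}"] = {"bits": mlp_bits, "eora_rank": rank}

print(len(rules))  # 224 rules: 32 layers x 7 projections
```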

Calibration

  • 2,048 samples (8x industry standard of 256)
  • 4,096 sequence length
  • Diverse datasets: UltraChat, SlimOrca, Code-Feedback, Orca-Math
  • Premium calibration for superior quality retention

Performance Benchmarks

Quantized Model (lm-eval-harness, 0-shot)

| Task | Score | Metric | Stderr |
|---|---|---|---|
| HellaSwag | 79.28% | acc_norm | ±0.40% |
| Winogrande | 69.85% | acc | ±1.29% |
| ARC-Challenge | 58.11% | acc_norm | ±1.44% |
| TruthfulQA MC2 | 58.94% | acc | ±1.53% |

Quality Retention

| Benchmark | Baseline (FP16) | Quantized | Retention |
|---|---|---|---|
| HellaSwag | 59.8% (acc) | 60.30% (acc) | 100.8% |
| Winogrande | 70.6% | 69.85% | 98.9% |

The EoRA-2048 boundary strategy delivers near-lossless quantization on reasoning tasks.

Usage

GPTQModel (Recommended)

```python
from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model = GPTQModel.from_quantized(
    "TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are a helpful multilingual assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ",
    trust_remote_code=True,
)

# Use the same generation code as above
```

vLLM (Production)

```bash
pip install -U vllm

vllm serve TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ \
    --max-model-len 8192 \
    --trust-remote-code \
    --tensor-parallel-size 1
```
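Once the server is running it exposes vLLM's OpenAI-compatible API (by default on port 8000). A minimal request payload, assuming the default endpoint and the model name used above:

```python
import json

# Chat-completions payload for the vLLM OpenAI-compatible server.
# Default endpoint (assumption): http://localhost:8000/v1/chat/completions
payload = {
    "model": "TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ",
    "messages": [
        {"role": "system", "content": "You are a helpful multilingual assistant."},
        {"role": "user", "content": "Summarize this document in French."},
    ],
    "max_tokens": 256,
    "temperature": 0.7,
}
body = json.dumps(payload)

# Send with any HTTP client, e.g.:
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$body"
```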

Installation

```bash
pip install gptqmodel "transformers>=4.48"
```

Known Issues

  • Tokenizer regex warning: Can be safely ignored or fixed with fix_mistral_regex=True when loading tokenizer
  • xIELU CUDA kernel: Falls back to Python implementation if CUDA kernel not available (no performance impact for inference)

Memory Requirements

Inference (quantized model)

| Context | VRAM Required |
|---|---|
| Short (2K) | 6-8 GB |
| Medium (8K) | 8-10 GB |
| Long (32K) | 12-16 GB |
| Full (65K) | 24+ GB |
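The growth with context comes mostly from the KV cache. A rough upper-bound estimate, assuming a BF16 cache and full multi-head attention (if Apertus uses grouped-query attention or cache quantization, the real figures are proportionally smaller, which would explain the lower mid-range numbers in the table):

```python
# Upper-bound BF16 KV-cache estimate. Assumes full multi-head
# attention (KV width = hidden size), which is an assumption;
# grouped-query attention would shrink this proportionally.
num_layers, hidden_size, bytes_per_val = 32, 4096, 2
per_token = 2 * num_layers * hidden_size * bytes_per_val  # K and V

for ctx in (2_048, 8_192, 32_768, 65_536):
    gib = per_token * ctx / 2**30
    print(f"{ctx:>6} tokens: {gib:5.1f} GiB KV cache")
```

Add the ~6 GB of quantized weights on top of the cache to approximate total VRAM.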

Tested on an RTX 5000 Ada (32 GB); ~9 GB active VRAM during benchmarks.

Quantization (reproduction)

  • Minimum: 32GB VRAM + 64GB RAM
  • Used: RTX 5000 Ada (32GB) + Dual Xeon Max 9480 (128GB HBM2e + 256GB DDR5)

Quantization Details

| Specification | Value |
|---|---|
| Method | GPTQ + Ultra-Hybrid + EoRA |
| Quantizer | GPTQModel |
| EoRA Boundary Rank | 2048 (layers 0 & 31) |
| EoRA Standard Rank | 128 (layers 1-30) |
| Calibration Samples | 2,048 (8x industry standard) |
| Sequence Length | 4,096 tokens |
| Group Size | 128 |
| desc_act | False |
| sym | True (symmetric quantization) |
| Bits (default) | 4 |
| Layer Rules | 224 custom precision rules |
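The group-size-128 symmetric scheme can be sketched as follows (an illustrative NumPy implementation, not the packed GPTQ kernel format):

```python
import numpy as np

def quantize_sym_groupwise(w, bits=4, group_size=128):
    """Symmetric group-wise quantization of a 1-D weight row."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    w = w.reshape(-1, group_size)
    # One scale per group of 128 weights (sym=True: no zero point).
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scales), -qmax - 1, qmax)
    return q.astype(np.int8), scales

rng = np.random.default_rng(0)
row = rng.standard_normal(4096).astype(np.float32)
q, s = quantize_sym_groupwise(row)      # 32 groups of 128
dequant = (q * s).reshape(-1)           # reconstruction for inference
```

Smaller groups track local weight magnitudes more closely at the cost of storing more scales; 128 is the common GPTQ default.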

Use Cases

Ideal for:

  • 🌍 Multilingual applications (1,811 languages)
  • 📄 Long document analysis (65K context)
  • 💬 Instruction following (SFT + QRPO aligned)
  • 💻 Code and math reasoning
  • 🔧 Resource-constrained deployment (16GB → 6GB)
  • 🖥️ Consumer GPU inference (RTX 3080/4070 and up)

Technical Specifications

| Specification | Value |
|---|---|
| Model Family | Swiss AI Apertus |
| Variant | 8B-Instruct-2509 |
| Total Parameters | 8B |
| Total Layers | 32 |
| Hidden Size | 4,096 |
| Intermediate Size | 14,336 |
| Attention Heads | 32 |
| Activation | xIELU (novel) |
| Normalization | RMSNorm |
| Position Encoding | RoPE |
| Context Length | 65,536 |
| Vocab Size | 131,072 |
| Training Tokens | 15 trillion |
| Supported Languages | 1,811 |

Developers

Base Model by:

  • EPFL (Swiss Federal Institute of Technology Lausanne)
  • ETH Zurich (Swiss Federal Institute of Technology Zurich)
  • CSCS (Swiss National Supercomputing Centre)

Quantization by:

  • TevunahAi - Professional AI Model Quantization Service

Acknowledgments

  • NVIDIA for the EoRA (Eigenspace Low-Rank Approximation) technique used in this quantization
  • Swiss AI Initiative for developing the open Apertus model family
  • GPTQModel team for the excellent quantization framework

License

Check Swiss AI license terms for the base model.

Citation

```bibtex
@software{apertus_8b_gptq_2025,
  title = {Swiss AI Apertus-8B-Instruct - TevunahAi Ultra-Hybrid GPTQ with EoRA},
  author = {TevunahAi},
  year = {2025},
  note = {Ultra-Hybrid GPTQ with EoRA-2048 boundary layers for maximum quality retention},
  url = {https://huggingface.co/TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ}
}

@misc{swiss_ai_apertus_2025,
  title = {Apertus: A Fully Open Multilingual Language Model},
  author = {Swiss AI Initiative},
  year = {2025},
  url = {https://huggingface.co/swiss-ai/Apertus-8B-Instruct-2509}
}
```
