Swiss AI Apertus-8B-Instruct - TevunahAi Ultra-Hybrid GPTQ with EoRA

Model Details

| Property | Value |
|---|---|
| Base Model | swiss-ai/Apertus-8B-Instruct-2509 |
| Architecture | Apertus Dense Decoder-only Transformer |
| Parameters | 8B |
| Context Length | 65,536 tokens |
| Languages | 1,811 |
| Quantization | TevunahAi Ultra-Hybrid GPTQ + EoRA |
| Original Size | ~16 GB (BF16) |
| Quantized Size | ~6 GB |
| Compression | ~63% reduction |
| Active VRAM | ~9 GB (with inference overhead) |

Architecture Breakdown

Swiss AI Apertus-8B-Instruct is a fully open, transparent multilingual model developed by EPFL, ETH Zurich, and CSCS:

Layer Composition (32 total layers)

  • 32 Transformer Decoder Layers: Standard dense attention architecture
  • 32 Attention Heads: Full multi-head attention
  • Hidden Size: 4,096
  • Intermediate Size: 14,336
  • Vocab Size: 131,072 (massive multilingual vocabulary)
  • Novel xIELU Activation: New activation function developed for Apertus
  • RMSNorm + RoPE: Modern normalization and position encoding
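As a sanity check, the listed dimensions roughly reproduce the 8B parameter count. The sketch below assumes a gated (SwiGLU-style) MLP with three projection matrices and tied input/output embeddings; both are assumptions, since the exact Apertus layout is not stated here:

```python
# Back-of-the-envelope parameter count from the dimensions above.
# Assumes a gated MLP (3 projections) and tied embeddings -- both
# are assumptions; the exact Apertus layout may differ.
hidden, inter, vocab, layers = 4096, 14336, 131072, 32

attn = 4 * hidden * hidden   # Q, K, V, O projections
mlp = 3 * hidden * inter     # gate, up, down projections
embeddings = vocab * hidden  # tied input/output embedding

total = layers * (attn + mlp) + embeddings
print(f"~{total / 1e9:.1f}B parameters")  # → ~8.3B parameters
```

Landing near 8B suggests the stated dimensions are internally consistent.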

Why This Matters

  • 1,811 languages: Unprecedented multilingual support
  • 65K context window: Handle very long documents
  • 15 trillion training tokens: Extensive pretraining
  • SFT + QRPO alignment: Optimized for instruction following
  • Fully open: Apache 2.0 compatible, reproducible

Quantization Strategy

TevunahAi Ultra-Hybrid Mixed-Precision with EoRA Error Recovery

This quantization uses EoRA (Eigenspace Low-Rank Approximation), NVIDIA's technique for recovering quantization error through low-rank adapters computed during the quantization process.
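The core idea can be illustrated with a plain SVD of the quantization error (actual EoRA projects the error into an activation-weighted eigenspace, so this NumPy sketch is a simplification):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)

# Crude symmetric 4-bit quantization (illustration only).
scale = np.abs(W).max() / 7
W_q = np.clip(np.round(W / scale), -8, 7) * scale

# Quantization error and its best rank-r approximation via SVD.
E = W - W_q
r = 32
U, S, Vt = np.linalg.svd(E, full_matrices=False)
B = U[:, :r] * S[:r]  # (256, r) low-rank adapter factor
A = Vt[:r, :]         # (r, 256) low-rank adapter factor

# Adding the adapter B @ A to the quantized weight reduces the error.
err_plain = np.linalg.norm(W - W_q)
err_adapter = np.linalg.norm(W - (W_q + B @ A))
```

At inference the adapter runs as a cheap low-rank side path alongside the quantized matmul, which is why larger ranks (like the 2048 used at the boundaries) buy more error recovery at modest cost.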

| Component | Precision | EoRA Rank | Rationale |
|---|---|---|---|
| Layer 0 (all projections) | INT8 | 2048 | Maximum error correction at input; errors propagate through the entire model |
| Layer 31 (all projections) | INT8 | 2048 | Maximum error correction at output; directly affects token prediction |
| Attention Q/K/V/O (layers 1-30) | INT8 | 128 | Quality preservation for reasoning |
| MLP (layers 1-24) | INT4 | 128 | Maximum compression in middle layers |
| MLP (layers 25-30) | INT8 | 128 | Higher precision near output |
| Embeddings | BF16 | - | Preserved for token accuracy |
| LM Head | BF16 | - | Preserved for output quality |

Why EoRA-2048 Boundaries?

  • Layer 0: First layer errors compound through all 31 subsequent layers
  • Layer 31: Final layer directly determines next token prediction
  • Rank 2048: Maximum error correction capacity (~8-10% recovery vs ~5% for rank 128)
  • 224 layer-specific rules: Not a blanket quantization - each projection optimized individually
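The 224 rules follow directly from the table above: 32 layers times 7 projections. A hypothetical sketch of how such a rule set could be generated (module names like `q_proj` and the rule-dict shape are illustrative, not the exact GPTQModel configuration syntax):

```python
# Hypothetical sketch of the per-layer precision/rank rules described
# above. Module names and dict layout are illustrative; the real
# GPTQModel rule syntax may differ.
rules = {}
for layer in range(32):
    boundary = layer in (0, 31)
    rank = 2048 if boundary else 128
    # Attention projections: INT8 everywhere.
    for proj in ("q_proj", "k_proj", "v_proj", "o_proj"):
        rules[f"layers.{layer}.self_attn.{proj}"] = {"bits": 8, "eora_rank": rank}
    # MLP: INT4 in the middle (1-24), INT8 near the output (25-30)
    # and at the boundary layers.
    mlp_bits = 8 if (boundary or layer >= 25) else 4
    for proj in ("gate_proj", "up_proj", "down_proj"):
        rules[f"layers.{layer}.mlp.{proj}"] = {"bits": mlp_bits, "eora_rank": rank}

print(len(rules))  # 224 rules: 32 layers x 7 projections
```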

Calibration

  • 2,048 samples (8x industry standard of 256)
  • 4,096 sequence length
  • Diverse datasets: UltraChat, SlimOrca, Code-Feedback, Orca-Math
  • Premium calibration for superior quality retention

Performance Benchmarks

Quantized Model (lm-eval-harness, 0-shot)

| Task | Score | Metric | Stderr |
|---|---|---|---|
| HellaSwag | 79.28% | acc_norm | ±0.40% |
| Winogrande | 69.85% | acc | ±1.29% |
| ARC-Challenge | 58.11% | acc_norm | ±1.44% |
| TruthfulQA MC2 | 58.94% | acc | ±1.53% |

Quality Retention

| Benchmark | Baseline (FP16) | Quantized | Retention |
|---|---|---|---|
| HellaSwag | 59.8% (acc) | 60.30% (acc) | 100.8% |
| Winogrande | 70.6% | 69.85% | 98.9% |

The EoRA-2048 boundary strategy delivers near-lossless quantization on reasoning tasks.

Usage

GPTQModel (Recommended)

```python
from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model = GPTQModel.from_quantized(
    "TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are a helpful multilingual assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."},
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ",
    trust_remote_code=True,
)

# Use the same generation code as above
```

vLLM (Production)

```bash
pip install -U vllm

vllm serve TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ \
    --max-model-len 8192 \
    --trust-remote-code \
    --tensor-parallel-size 1
```
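Once the server is running it exposes vLLM's OpenAI-compatible API (by default on port 8000). A minimal request payload, assuming the default endpoint and the model name used above:

```python
import json

# Chat-completions payload for the vLLM OpenAI-compatible server.
# Default endpoint (assumption): http://localhost:8000/v1/chat/completions
payload = {
    "model": "TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ",
    "messages": [
        {"role": "system", "content": "You are a helpful multilingual assistant."},
        {"role": "user", "content": "Summarize this document in French."},
    ],
    "max_tokens": 256,
    "temperature": 0.7,
}
body = json.dumps(payload)

# Send with any HTTP client, e.g.:
#   curl http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" -d "$body"
```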

Installation

```bash
pip install gptqmodel "transformers>=4.48"
```

Known Issues

  • Tokenizer regex warning: Can be safely ignored or fixed with fix_mistral_regex=True when loading tokenizer
  • xIELU CUDA kernel: Falls back to Python implementation if CUDA kernel not available (no performance impact for inference)

Memory Requirements

Inference (quantized model)

| Context | VRAM Required |
|---|---|
| Short (2K) | 6-8 GB |
| Medium (8K) | 8-10 GB |
| Long (32K) | 12-16 GB |
| Full (65K) | 24+ GB |
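The growth with context comes mostly from the KV cache. A rough upper-bound estimate, assuming a BF16 cache and full multi-head attention (if Apertus uses grouped-query attention or cache quantization, the real figures are proportionally smaller, which would explain the lower mid-range numbers in the table):

```python
# Upper-bound BF16 KV-cache estimate. Assumes full multi-head
# attention (KV width = hidden size), which is an assumption;
# grouped-query attention would shrink this proportionally.
num_layers, hidden_size, bytes_per_val = 32, 4096, 2
per_token = 2 * num_layers * hidden_size * bytes_per_val  # K and V

for ctx in (2_048, 8_192, 32_768, 65_536):
    gib = per_token * ctx / 2**30
    print(f"{ctx:>6} tokens: {gib:5.1f} GiB KV cache")
```

Add the ~6 GB of quantized weights on top of the cache to approximate total VRAM.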

Tested on an RTX 5000 Ada (32 GB); ~9 GB active VRAM during benchmarks.

Quantization (reproduction)

  • Minimum: 32GB VRAM + 64GB RAM
  • Used: RTX 5000 Ada (32GB) + Dual Xeon Max 9480 (128GB HBM2e + 256GB DDR5)

Quantization Details

| Specification | Value |
|---|---|
| Method | GPTQ + Ultra-Hybrid + EoRA |
| Quantizer | GPTQModel |
| EoRA Boundary Rank | 2048 (layers 0 & 31) |
| EoRA Standard Rank | 128 (layers 1-30) |
| Calibration Samples | 2,048 (8x industry standard) |
| Sequence Length | 4,096 tokens |
| Group Size | 128 |
| desc_act | False |
| sym | True (symmetric quantization) |
| Bits (default) | 4 |
| Layer Rules | 224 custom precision rules |
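The group-size-128 symmetric scheme can be sketched as follows (an illustrative NumPy implementation, not the packed GPTQ kernel format):

```python
import numpy as np

def quantize_sym_groupwise(w, bits=4, group_size=128):
    """Symmetric group-wise quantization of a 1-D weight row."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    w = w.reshape(-1, group_size)
    # One scale per group of 128 weights (sym=True: no zero point).
    scales = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scales), -qmax - 1, qmax)
    return q.astype(np.int8), scales

rng = np.random.default_rng(0)
row = rng.standard_normal(4096).astype(np.float32)
q, s = quantize_sym_groupwise(row)      # 32 groups of 128
dequant = (q * s).reshape(-1)           # reconstruction for inference
```

Smaller groups track local weight magnitudes more closely at the cost of storing more scales; 128 is the common GPTQ default.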

Use Cases

Ideal for:

  • 🌍 Multilingual applications (1,811 languages)
  • 📄 Long document analysis (65K context)
  • 💬 Instruction following (SFT + QRPO aligned)
  • 💻 Code and math reasoning
  • 🔧 Resource-constrained deployment (16GB → 6GB)
  • 🖥️ Consumer GPU inference (RTX 3080/4070 and up)

Technical Specifications

| Specification | Value |
|---|---|
| Model Family | Swiss AI Apertus |
| Variant | 8B-Instruct-2509 |
| Total Parameters | 8B |
| Total Layers | 32 |
| Hidden Size | 4,096 |
| Intermediate Size | 14,336 |
| Attention Heads | 32 |
| Activation | xIELU (novel) |
| Normalization | RMSNorm |
| Position Encoding | RoPE |
| Context Length | 65,536 |
| Vocab Size | 131,072 |
| Training Tokens | 15 trillion |
| Supported Languages | 1,811 |

Developers

Base Model by:

  • EPFL (Swiss Federal Institute of Technology Lausanne)
  • ETH Zurich (Swiss Federal Institute of Technology Zurich)
  • CSCS (Swiss National Supercomputing Centre)

Quantization by:

  • TevunahAi - Professional AI Model Quantization Service

Acknowledgments

  • NVIDIA for the EoRA (Eigenspace Low-Rank Approximation) technique used in this quantization
  • Swiss AI Initiative for developing the open Apertus model family
  • GPTQModel team for the excellent quantization framework

License

Check Swiss AI license terms for the base model.

Citation

```bibtex
@software{apertus_8b_gptq_2025,
  title = {Swiss AI Apertus-8B-Instruct - TevunahAi Ultra-Hybrid GPTQ with EoRA},
  author = {TevunahAi},
  year = {2025},
  note = {Ultra-Hybrid GPTQ with EoRA-2048 boundary layers for maximum quality retention},
  url = {https://huggingface.co/TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ}
}

@misc{swiss_ai_apertus_2025,
  title = {Apertus: A Fully Open Multilingual Language Model},
  author = {Swiss AI Initiative},
  year = {2025},
  url = {https://huggingface.co/swiss-ai/Apertus-8B-Instruct-2509}
}
```
