TevunahAi Ultra-Hybrid quantization of Swiss AI's Apertus-8B-Instruct: mixed-precision GPTQ that yields a smaller footprint than FP8 while preserving high quality.
| Property | Value |
|---|---|
| Base Model | swiss-ai/Apertus-8B-Instruct-2509 |
| Architecture | Apertus Dense Decoder-only Transformer |
| Parameters | 8B |
| Context Length | 65,536 tokens |
| Languages | 1,811 |
| Quantization | TevunahAi Ultra-Hybrid GPTQ + EoRA |
| Original Size | ~16 GB (BF16) |
| Quantized Size | ~6 GB |
| Compression | ~63% reduction |
| Active VRAM | ~9 GB (with inference overhead) |
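As a sanity check, the size and compression figures above are consistent with simple arithmetic (assuming all 8B parameters are stored and taking the ~6 GB quantized size from the table):

```python
# Back-of-envelope check of the size figures above (8B parameters).
params = 8e9
bf16_bytes = params * 2                    # BF16 = 2 bytes/weight -> ~16 GB
quantized_bytes = 6e9                      # reported quantized size

reduction = 1 - quantized_bytes / bf16_bytes
avg_bits = quantized_bytes * 8 / params    # effective bits per parameter

print(f"reduction: {reduction:.1%}")       # 62.5%, i.e. the ~63% above
print(f"avg bits/param: {avg_bits:.1f}")   # ~6 bits: consistent with an
                                           # INT4/INT8 mix plus BF16 embeddings
```

The ~6 effective bits per parameter is exactly what you would expect from a scheme that keeps attention and boundary layers at INT8 while compressing mid-stack MLPs to INT4.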
Swiss AI Apertus-8B-Instruct is a fully open, transparent multilingual model developed by EPFL, ETH Zurich, and CSCS.
This quantization uses EoRA (Eigenspace Low-Rank Approximation), NVIDIA's training-free technique for compensating quantization error with low-rank adapters learned during the quantization process.
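The core idea can be sketched with toy matrices: quantize a weight matrix, then approximate the residual error with a rank-r adapter. (Real EoRA projects the residual into the eigenspace of calibration activations; the plain SVD below is a simplified stand-in for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "weight" and a coarse 4-bit-style quantized version of it.
W = rng.standard_normal((64, 64))
scale = np.abs(W).max() / 7
W_q = np.round(W / scale) * scale

# Quantization error that an EoRA-style adapter compensates.
R = W - W_q

# Rank-r approximation of the residual via SVD (plain SVD here;
# real EoRA uses an activation-eigenspace-weighted projection).
r = 16
U, S, Vt = np.linalg.svd(R, full_matrices=False)
B = U[:, :r] * S[:r]          # (64, r) adapter factor
A = Vt[:r]                    # (r, 64) adapter factor

# The corrected weight W_q + B @ A is closer to W than W_q alone.
err_before = np.linalg.norm(R)
err_after = np.linalg.norm(W - (W_q + B @ A))
assert err_after < err_before
print(err_before, err_after)
```

At inference the adapter adds only `2 * r * d` parameters per layer, which is why the large rank-2048 adapters are affordable at just the two boundary layers.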
| Component | Precision | EoRA Rank | Rationale |
|---|---|---|---|
| Layer 0 (all projections) | INT8 | 2048 | Maximum error correction at input - errors propagate through entire model |
| Layer 31 (all projections) | INT8 | 2048 | Maximum error correction at output - directly affects token prediction |
| Attention Q/K/V/O (layers 1-30) | INT8 | 128 | Quality preservation for reasoning |
| MLP (layers 1-24) | INT4 | 128 | Maximum compression in middle layers |
| MLP (layers 25-30) | INT8 | 128 | Higher precision near output |
| Embeddings | BF16 | - | Preserved for token accuracy |
| LM Head | BF16 | - | Preserved for output quality |
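The precision plan above reduces to a simple per-module rule. The sketch below shows the mapping logic only; the module names follow the usual Llama-style `self_attn.q_proj` / `mlp.gate_proj` convention (an assumption), and the actual quantizer's rule syntax will differ:

```python
def bits_for(layer: int, module: str) -> int:
    """Target bit-width for one projection, per the precision table above."""
    if layer in (0, 31):                 # boundary layers: full INT8
        return 8
    if module.startswith("self_attn"):   # attention stays INT8 (layers 1-30)
        return 8
    if module.startswith("mlp"):
        return 8 if layer >= 25 else 4   # INT8 near output, INT4 mid-stack
    raise ValueError(f"unknown module: {module}")

assert bits_for(0, "mlp.gate_proj") == 8     # boundary layer
assert bits_for(12, "mlp.gate_proj") == 4    # mid-stack MLP
assert bits_for(27, "mlp.up_proj") == 8      # near-output MLP
assert bits_for(15, "self_attn.q_proj") == 8 # attention
```

The 224 layer rules listed later are this mapping expanded to every individual projection (7 projections × 32 layers).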
| Task | Score | Metric | Stderr |
|---|---|---|---|
| HellaSwag | 79.28% | acc_norm | ±0.40% |
| Winogrande | 69.85% | acc | ±1.29% |
| ARC-Challenge | 58.11% | acc_norm | ±1.44% |
| TruthfulQA MC2 | 58.94% | acc | ±1.53% |
| Benchmark | Baseline (FP16) | Quantized | Retention |
|---|---|---|---|
| HellaSwag | 59.8% (acc) | 60.30% (acc) | 100.8% ✅ |
| Winogrande | 70.6% | 69.85% | 99.0% ✅ |
The EoRA-2048 boundary strategy delivers near-lossless quantization on reasoning tasks.
```python
from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model = GPTQModel.from_quantized(
    "TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are a helpful multilingual assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ",
    trust_remote_code=True,
)
# Use the same generation code as above.
```
```bash
pip install -U vllm

vllm serve TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ \
    --max-model-len 8192 \
    --trust-remote-code \
    --tensor-parallel-size 1
```
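Once the server is running, it exposes an OpenAI-compatible API (default port 8000). A minimal client sketch using only the standard library; the endpoint path and defaults follow vLLM's OpenAI-compatible server:

```python
import json

# Request for vLLM's OpenAI-compatible chat endpoint
# (default: http://localhost:8000/v1/chat/completions).
payload = {
    "model": "TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ",
    "messages": [
        {"role": "system", "content": "You are a helpful multilingual assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."},
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.9,
}
body = json.dumps(payload).encode()

# To actually send it, the server above must be running:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read().decode())
```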
```bash
pip install gptqmodel "transformers>=4.48"
```
Note: pass `fix_mistral_regex=True` when loading the tokenizer.

| Context | VRAM Required |
|---|---|
| Short (2K) | 6-8 GB |
| Medium (8K) | 8-10 GB |
| Long (32K) | 12-16 GB |
| Full (65K) | 24+ GB |
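The context-dependent growth above is driven by the KV cache. A rough formula, assuming a BF16 cache and full multi-head attention (32 KV heads × head dim 128, per the architecture table later in this card); this is a worst-case upper bound, since grouped-query attention or a quantized KV cache would shrink it proportionally:

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
layers, kv_heads, head_dim, dtype_bytes = 32, 32, 128, 2

def kv_cache_gib(tokens: int) -> float:
    """Worst-case KV cache size in GiB for a given context length."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens / 2**30

for ctx in (2_048, 8_192, 32_768, 65_536):
    print(f"{ctx:>6} tokens: {kv_cache_gib(ctx):5.1f} GiB of KV cache")
```

The table's smaller long-context figures suggest the deployed cache is cheaper than this worst case in practice; the formula explains why VRAM scales roughly linearly with context.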
Tested on: RTX 5000 Ada (32GB) - 9GB active VRAM during benchmarks
| Specification | Value |
|---|---|
| Method | GPTQ + Ultra-Hybrid + EoRA |
| Quantizer | GPTQModel |
| EoRA Boundary Rank | 2048 (layers 0 & 31) |
| EoRA Standard Rank | 128 (layers 1-30) |
| Calibration Samples | 2,048 (8× the common 256-sample default) |
| Sequence Length | 4,096 tokens |
| Group Size | 128 |
| desc_act | False |
| sym | True (symmetric quantization) |
| Bits (default) | 4 |
| Layer Rules | 224 custom precision rules |
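Group Size 128 with `sym=True` means each contiguous group of 128 weights shares one scale and a zero zero-point. A minimal sketch of symmetric per-group INT4 quantization (using a shortened group for readability):

```python
def quantize_group(weights: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Symmetric per-group quantization: one shared scale, zero-point = 0."""
    qmax = 2 ** (bits - 1) - 1                # 7 for INT4
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

group = [0.31, -0.12, 0.07, -0.28]            # one (shortened) weight group
q, scale = quantize_group(group)
restored = dequantize(q, scale)
print(q, [round(x, 3) for x in restored])
```

With symmetric quantization no zero-point needs to be stored per group, which is part of why `sym=True` keeps the packed format compact.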
Ideal for: multilingual applications (1,811 supported languages), long-context inference, and deployment on consumer GPUs with limited VRAM.
| Specification | Value |
|---|---|
| Model Family | Swiss AI Apertus |
| Variant | 8B-Instruct-2509 |
| Total Parameters | 8B |
| Total Layers | 32 |
| Hidden Size | 4,096 |
| Intermediate Size | 14,336 |
| Attention Heads | 32 |
| Activation | xIELU (novel) |
| Normalization | RMSNorm |
| Position Encoding | RoPE |
| Context Length | 65,536 |
| Vocab Size | 131,072 |
| Training Tokens | 15 trillion |
| Supported Languages | 1,811 |
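RoPE, the position encoding listed above, rotates consecutive (even, odd) dimension pairs of each query/key vector by angles that grow with position. A minimal sketch for one head-dim-128 vector; the 10,000 base frequency is the common default and an assumption here, not read from the model config:

```python
import math

def rope(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs of vec by position-dependent angles."""
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = pos / base ** (i / d)        # lower frequency for higher dims
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

v = [0.1] * 128                              # head_dim = 4096 / 32 heads
rotated = rope(v, pos=42)

# Rotations preserve the vector's norm.
def norm(u: list[float]) -> float:
    return math.sqrt(sum(x * x for x in u))

print(round(norm(v), 6), round(norm(rotated), 6))
```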
Base model by: EPFL, ETH Zurich, and CSCS (Swiss AI Initiative)
Quantization by: TevunahAi
Check Swiss AI license terms for the base model.
```bibtex
@software{apertus_8b_gptq_2025,
  title  = {Swiss AI Apertus-8B-Instruct - TevunahAi Ultra-Hybrid GPTQ with EoRA},
  author = {TevunahAi},
  year   = {2025},
  note   = {Ultra-Hybrid GPTQ with EoRA-2048 boundary layers for maximum quality retention},
  url    = {https://huggingface.co/TevunahAi/Apertus-8B-Instruct-TevunahAi-GPTQ}
}

@misc{swiss_ai_apertus_2025,
  title  = {Apertus: A Fully Open Multilingual Language Model},
  author = {Swiss AI Initiative},
  year   = {2025},
  url    = {https://huggingface.co/swiss-ai/Apertus-8B-Instruct-2509}
}
```