---
base_model: Zyphra/ZAYA1-8B
base_model_relation: quantized
language:
  - en
license: apache-2.0
tags:
  - bitsandbytes
  - quantized
  - 4-bit
  - 8-bit
  - nf4
  - zaya
  - mixture-of-experts
  - reasoning
pipeline_tag: text-generation
---

# ZAYA1-8B — bitsandbytes Quantizations

bitsandbytes quantizations of [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B).

> **Note:** ZAYA1-8B uses a custom sparse MoE architecture (`ZayaForCausalLM`) that is not yet supported by llama.cpp. GGUF files will be added once support lands ([issue #22776](https://github.com/ggml-org/llama.cpp/issues/22776)). In the meantime, these bitsandbytes quantizations provide a working alternative.

---

## Available Files

| Folder | Format | Bits | Size | Description |
|---|---|---|---|---|
| `NF4/` | NF4 | 4-bit | ~5.0 GB | Normal Float 4 — best 4-bit quality |
| `NF4-DQ/` | NF4 + DQ | ~4-bit | ~4.7 GB | NF4 + double quantization — slightly smaller |
| `INT8/` | INT8 | 8-bit | ~9.0 GB | Near-lossless |

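The size gap between `NF4/` and `NF4-DQ/` comes from how the per-block scaling constants are stored. The sketch below works through that arithmetic under assumptions that are not read from this repo: 64-weight blocks with one 32-bit absmax each, and double quantization storing each absmax in 8 bits plus one 32-bit constant per 256 blocks. It also pretends that all 8.4B parameters are quantized. Some tensors are typically left in BF16, so the real folders come out larger than these estimates. Under these assumptions the saving is about 0.4 GB, the same order as the ~0.3 GB gap in the table. The helper names `bits_per_param` and `size_gb` are illustrative only.

```python
# Back-of-the-envelope storage cost per weight. Assumptions (not read from
# this repo): 64-weight blocks, fp32 absmax, 256-block second-level groups.
TOTAL_PARAMS = 8.4e9  # total parameter count quoted in this card


def bits_per_param(weight_bits, block=64, double_quant=False, group=256):
    """Weight bits plus the amortized cost of the block scaling constants."""
    if double_quant:
        # 8-bit absmax per block + one fp32 constant per `group` blocks
        overhead = 8 / block + 32 / (block * group)
    else:
        # one fp32 absmax per block
        overhead = 32 / block
    return weight_bits + overhead


def size_gb(bits):
    """Size in GB if every parameter cost `bits` bits."""
    return TOTAL_PARAMS * bits / 8 / 1e9


nf4 = bits_per_param(4)
nf4_dq = bits_per_param(4, double_quant=True)
print(f"NF4:    {nf4:.3f} bits/param, ~{size_gb(nf4):.2f} GB")
print(f"NF4-DQ: {nf4_dq:.3f} bits/param, ~{size_gb(nf4_dq):.2f} GB")
print(f"saved:  ~{size_gb(nf4 - nf4_dq):.2f} GB")
```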
---

## About ZAYA1-8B

ZAYA1-8B is a small mixture-of-experts (MoE) language model with **760M active parameters** and **8.4B total parameters**, trained end-to-end by Zyphra. According to Zyphra, it sets a new standard of intelligence efficiency for its parameter count through a combination of novel architecture and innovations in pretraining and post-training.

ZAYA1-8B excels at detailed long-form reasoning, especially for mathematical and coding tasks. Due to its small total parameter count, it can also be deployed on-device for local LLM applications.

- **Technical report:** https://www.zyphra.com/zaya1-8b-technical-report
- **Blog post:** https://www.zyphra.com/post/zaya1-8b
- **Pretraining base:** [Zyphra/ZAYA1-reasoning-base](https://huggingface.co/Zyphra/ZAYA1-reasoning-base)

---

## Performance

[![Performance chart](https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/f5tbexK3BumixnJuBZxo_.png)](https://huggingface.co/Zyphra/ZAYA1-8B)

[![Scaling comparison](https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/W8bn6ZAocWKFuicjtjesv.png)](https://huggingface.co/Zyphra/ZAYA1-8B)

### In-class comparison

| Category | Benchmark | ZAYA1-8B (0.7B active / 8B total) | Qwen3-4B-Think | Qwen3.5-4B | Gemma-4-E4B-it |
|---|---|---|---|---|---|
| Math | AIME'26 | **89.1** | 77.5 | 84.5 | 50.3 |
| Math | HMMT Feb.'26 | **71.6** | 60.8 | 63.6 | 32.1 |
| Math | IMO-AnswerBench | **59.3** | 50.9 | 48.7 | 27.3 |
| Math | APEX-shortlist | **32.2** | 16.9 | -- | 6.1 |
| Code | LiveCodeBench-v6 | **65.8** | 54.2 | -- | 54.2 |
| Knowledge | GPQA-Diamond | **71.0** | 66.5 | 76.2 | 57.4 |
| Knowledge | MMLU-Pro | 74.2 | 74.3 | **79.1** | 70.2 |
| Instruction | IFEval | 85.58 | 86.8 | **89.8** | 88.50 |
| Instruction | IFBench | 52.56 | 52.9 | **59.2** | 42.67 |
| Style & chat | EQBench | 72.95 | 79.6 | 79.5 | **80.15** |
| Agentic | BFCL-v4 | 39.22 | **49.7** | 45.2 | 31.7 |

### Scaling comparison against larger models

| Model | Active | Total | AIME'26 | HMMT'26 | LCB-v6 | GPQA-D | MMLU-Pro |
|---|---|---|---|---|---|---|---|
| **ZAYA1-8B** | **0.7B** | **8B** | **89.1** | **71.6** | 63.8 | 71.0 | 74.2 |
| Arcee-Trinity-Mini | 3B | 26B | 59.6 | 36.9 | 33.3 | 46.8 | 70.6 |
| N3-Nano-30B | 3B | 30B | 90.1 | 75.5 | **64.6** | **75.1** | **78.9** |
| OLMo-3.1-32B-Think | 32B | 32B | 78.9 | 50.6 | 58.3 | 59.6 | 75.8 |
| Qwen3-Next-80B-A3B | 3B | 80B | 90.2 | 79.3 | 67.8 | 76.7 | 82.6 |
| Intellect-3 | 12B | 106B | 86.3 | 72.2 | 66.8 | 74.6 | 82.3 |
| Mistral-Small-4-119B | 6B | 119B | 86.4 | 70.6 | 57.9 | 77.2 | 81.6 |

*All numbers from the Zyphra evaluation harness. Models ordered by total parameter count.*

---

## Download

> Hugging Face's inference widget and one-click download are not available for this repo.  
> `ZayaForCausalLM` requires Zyphra's custom `transformers` fork, so use the commands below.

### Download a specific quantization

```bash
# NF4 (4-bit) — recommended
huggingface-cli download barozp/ZAYA1-8B-BNB --include "NF4/*" --local-dir ./ZAYA1-8B-NF4

# NF4 with double quantization
huggingface-cli download barozp/ZAYA1-8B-BNB --include "NF4-DQ/*" --local-dir ./ZAYA1-8B-NF4-DQ

# INT8 (8-bit)
huggingface-cli download barozp/ZAYA1-8B-BNB --include "INT8/*" --local-dir ./ZAYA1-8B-INT8
```

### Download everything

```bash
huggingface-cli download barozp/ZAYA1-8B-BNB --local-dir ./ZAYA1-8B-BNB
```
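
The same downloads can be scripted from Python. This is a minimal sketch, assuming the `huggingface_hub` package is installed. `quant_patterns` and `fetch_quant` are illustrative helper names, not part of any library.

```python
def quant_patterns(folder):
    """Glob patterns selecting one quantization folder of the repo."""
    return [f"{folder}/*"]


def fetch_quant(folder, repo_id="barozp/ZAYA1-8B-BNB"):
    """Download a single folder (e.g. "NF4") and return the local path."""
    # Imported lazily so quant_patterns() works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=quant_patterns(folder),
        local_dir=f"./ZAYA1-8B-{folder}",
    )
```

Calling `fetch_quant("NF4")` mirrors the first CLI command above. Files keep their repo-relative paths, so the weights end up under `./ZAYA1-8B-NF4/NF4/`.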

---

## Usage

Zyphra's custom `transformers` fork is required to load `ZayaForCausalLM`:

```bash
pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1"
pip install "bitsandbytes>=0.43.0" accelerate
```

### Load NF4 (4-bit)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("barozp/ZAYA1-8B-BNB", subfolder="NF4",
                                           trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("barozp/ZAYA1-8B-BNB", subfolder="NF4",
                                              quantization_config=bnb_config,
                                              device_map="auto",
                                              trust_remote_code=True)
```
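
The `NF4-DQ/` folder has no dedicated example here. The sketch below assumes it loads like `NF4/`, with only `bnb_4bit_use_double_quant=True` added (a standard `BitsAndBytesConfig` option). `nf4_dq_kwargs` and `load_nf4_dq` are illustrative names. The quantization settings stored in the checkpoint's own config may take precedence over what is passed here.

```python
def nf4_dq_kwargs():
    """BitsAndBytesConfig arguments: the NF4 settings plus double quantization."""
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_use_double_quant": True,  # the only change from the NF4 example
    }


def load_nf4_dq(repo_id="barozp/ZAYA1-8B-BNB", subfolder="NF4-DQ"):
    """Return (tokenizer, model); needs a bitsandbytes-supported GPU and Zyphra's fork."""
    # Heavy imports are kept local so nf4_dq_kwargs() can be inspected on its own.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16, **nf4_dq_kwargs()
    )
    tokenizer = AutoTokenizer.from_pretrained(
        repo_id, subfolder=subfolder, trust_remote_code=True
    )
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        subfolder=subfolder,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
    )
    return tokenizer, model
```

With this in place, `tokenizer, model = load_nf4_dq()` yields the same two objects the other loading examples produce.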

### Load INT8 (8-bit)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained("barozp/ZAYA1-8B-BNB", subfolder="INT8",
                                           trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("barozp/ZAYA1-8B-BNB", subfolder="INT8",
                                              quantization_config=bnb_config,
                                              device_map="auto",
                                              trust_remote_code=True)
```

### Inference

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user",   "content": "What is the sum of the first 100 prime numbers?"},
]

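# add_generation_prompt=True appends the assistant turn header so the model
# answers instead of continuing the user message; return_dict=True also
# returns the attention mask that generate() expects.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))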
```

---

## Quantization Details

- **Source:** [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B) (BF16 safetensors)
- **Method:** [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes)
- **Quantized by:** [barozp](https://huggingface.co/barozp)
- **GGUF status:** Pending llama.cpp support — [issue #22776](https://github.com/ggml-org/llama.cpp/issues/22776)

---

## Original Model Prerequisites

```bash
# vLLM (recommended for serving)
pip install "vllm @ git+https://github.com/Zyphra/vllm.git@zaya1"

# Transformers
pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1"
```

---

## License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) — same as the original model.