---
base_model: Zyphra/ZAYA1-8B
base_model_relation: quantized
language:
- en
license: apache-2.0
tags:
- bitsandbytes
- quantized
- 4-bit
- 8-bit
- nf4
- zaya
- mixture-of-experts
- reasoning
pipeline_tag: text-generation
---
# ZAYA1-8B — bitsandbytes Quantizations
bitsandbytes quantizations of [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B).
> **Note:** ZAYA1-8B uses a custom sparse MoE architecture (`ZayaForCausalLM`) that is not yet supported by llama.cpp. GGUF files will be added once support lands ([issue #22776](https://github.com/ggml-org/llama.cpp/issues/22776)). In the meantime, these bitsandbytes quantizations provide a working alternative.
---
## Available Files
| Folder | Format | Bits | Size | Description |
|---|---|---|---|---|
| `NF4/` | NF4 | 4-bit | ~5.0 GB | Normal Float 4 — best 4-bit quality |
| `NF4-DQ/` | NF4 + DQ | ~4-bit | ~4.7 GB | NF4 + double quantization — slightly smaller |
| `INT8/` | INT8 | 8-bit | ~9.0 GB | Near-lossless |
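
The sizes in the table can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an estimate, not a measurement: it assumes a 4-bit block size of 64 with 32-bit scaling constants, and the double-quantization settings described in the QLoRA paper (8-bit constants, second-level block size of 256). It also ignores any layers kept in higher precision, so it gives lower bounds rather than on-disk sizes.

```python
# Rough weight-storage estimate. All constants are assumptions, not measured values.
TOTAL_PARAMS = 8.4e9  # total parameter count quoted in this card


def bits_per_param(weight_bits, block=64, const_bits=32, dq_block=None, dq_const_bits=8):
    """Bits stored per weight, including per-block quantization constants."""
    if dq_block is None:
        # One const_bits scaling constant per block of weights.
        overhead = const_bits / block
    else:
        # Double quantization: each block constant is stored in dq_const_bits,
        # plus one const_bits constant per dq_block first-level constants.
        overhead = dq_const_bits / block + const_bits / (block * dq_block)
    return weight_bits + overhead


def gigabytes(bits):
    """Convert a bits-per-parameter figure into total gigabytes (10^9 bytes)."""
    return TOTAL_PARAMS * bits / 8 / 1e9


nf4 = gigabytes(bits_per_param(4))
nf4_dq = gigabytes(bits_per_param(4, dq_block=256))
int8 = gigabytes(8)  # per-row scale overhead ignored

print(f"NF4 ~{nf4:.2f} GB, NF4-DQ ~{nf4_dq:.2f} GB, INT8 ~{int8:.2f} GB")
print(f"Double quantization saves ~{nf4 - nf4_dq:.2f} GB")
```

Under these assumptions the estimates come out at roughly 4.7 GB, 4.3 GB, and 8.4 GB, with double quantization saving about 0.4 GB. That is in line with the ~0.3 GB gap between `NF4/` and `NF4-DQ/` listed above; the listed sizes are somewhat larger, plausibly because some tensors are stored unquantized.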
---
## About ZAYA1-8B
ZAYA1-8B is a small mixture-of-experts (MoE) language model with **760M active parameters** and **8.4B total parameters**, trained end-to-end by Zyphra. According to Zyphra, it sets a new standard of intelligence efficiency for its parameter count through a combination of novel architecture and innovations in pretraining and post-training.
ZAYA1-8B excels at detailed long-form reasoning, especially for mathematical and coding tasks. Due to its small total parameter count, it can also be deployed on-device for local LLM applications.
- **Technical report:** https://www.zyphra.com/zaya1-8b-technical-report
- **Blog post:** https://www.zyphra.com/post/zaya1-8b
- **Pretraining base:** [Zyphra/ZAYA1-reasoning-base](https://huggingface.co/Zyphra/ZAYA1-reasoning-base)
---
## Performance
[![Performance chart](https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/f5tbexK3BumixnJuBZxo_.png)](https://huggingface.co/Zyphra/ZAYA1-8B)
[![Scaling comparison](https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/W8bn6ZAocWKFuicjtjesv.png)](https://huggingface.co/Zyphra/ZAYA1-8B)
### In-class comparison
| Category | Benchmark | ZAYA1-8B (0.7B / 8B) | Qwen3-4B-Think | Qwen3.5-4B | Gemma-4-E4B-it |
|---|---|---|---|---|---|
| Math | AIME'26 | **89.1** | 77.5 | 84.5 | 50.3 |
| Math | HMMT Feb.'26 | **71.6** | 60.8 | 63.6 | 32.1 |
| Math | IMO-AnswerBench | **59.3** | 50.9 | 48.7 | 27.3 |
| Math | APEX-shortlist | **32.2** | 16.9 | -- | 6.1 |
| Code | LiveCodeBench-v6 | **65.8** | 54.2 | -- | 54.2 |
| Knowledge | GPQA-Diamond | **71.0** | 66.5 | 76.2 | 57.4 |
| Knowledge | MMLU-Pro | 74.2 | 74.3 | **79.1** | 70.2 |
| Instruction | IFEval | 85.58 | 86.8 | **89.8** | 88.50 |
| Instruction | IFBench | 52.56 | 52.9 | **59.2** | 42.67 |
| Style & chat | EQBench | 72.95 | 79.6 | 79.5 | **80.15** |
| Agentic | BFCL-v4 | 39.22 | **49.7** | 45.2 | 31.7 |
### Scaling comparison against larger models
| Model | Active | Total | AIME'26 | HMMT'26 | LCB-v6 | GPQA-D | MMLU-Pro |
|---|---|---|---|---|---|---|---|
| **ZAYA1-8B** | **0.7B** | **8B** | **89.1** | **71.6** | 63.8 | 71.0 | 74.2 |
| Arcee-Trinity-Mini | 3B | 26B | 59.6 | 36.9 | 33.3 | 46.8 | 70.6 |
| N3-Nano-30B | 3B | 30B | 90.1 | 75.5 | **64.6** | **75.1** | **78.9** |
| OLMo-3.1-32B-Think | 32B | 32B | 78.9 | 50.6 | 58.3 | 59.6 | 75.8 |
| Qwen3-Next-80B-A3B | 3B | 80B | 90.2 | 79.3 | 67.8 | 76.7 | 82.6 |
| Intellect-3 | 12B | 106B | 86.3 | 72.2 | 66.8 | 74.6 | 82.3 |
| Mistral-Small-4-119B | 6B | 119B | 86.4 | 70.6 | 57.9 | 77.2 | 81.6 |
*All numbers from the Zyphra evaluation harness. Models ordered by total parameter count.*
---
## Download
> Hugging Face's inference widget and one-click download are not available for this repo.
> `ZayaForCausalLM` requires Zyphra's custom `transformers` fork, so use the commands below.
### Download a specific quantization
```bash
# NF4 (4-bit) — recommended
huggingface-cli download barozp/ZAYA1-8B-BNB --include "NF4/*" --local-dir ./ZAYA1-8B-NF4
# NF4 with double quantization
huggingface-cli download barozp/ZAYA1-8B-BNB --include "NF4-DQ/*" --local-dir ./ZAYA1-8B-NF4-DQ
# INT8 (8-bit)
huggingface-cli download barozp/ZAYA1-8B-BNB --include "INT8/*" --local-dir ./ZAYA1-8B-INT8
```
### Download everything
```bash
huggingface-cli download barozp/ZAYA1-8B-BNB --local-dir ./ZAYA1-8B-BNB
```
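
The same downloads can be scripted from Python. The following is a minimal sketch using `snapshot_download` from `huggingface_hub`; the helper names `patterns_for` and `download` are hypothetical, introduced here only for illustration, and the local directory naming simply mirrors the CLI commands above.

```python
def patterns_for(variant):
    """Map a folder name from the Available Files table to an allow_patterns list."""
    known = {"NF4", "NF4-DQ", "INT8"}
    if variant not in known:
        raise ValueError(f"unknown variant {variant!r}; expected one of {sorted(known)}")
    return [f"{variant}/*"]


def download(variant, repo_id="barozp/ZAYA1-8B-BNB"):
    """Download one quantization folder and return the local snapshot path."""
    # Imported lazily so patterns_for() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=patterns_for(variant),  # only fetch files in this folder
        local_dir=f"./ZAYA1-8B-{variant}",
    )


# Example (requires network access):
# path = download("NF4")
```

As with the CLI, the files keep their repo-relative paths, so the weights end up in a subfolder named after the variant inside the local directory.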
---
## Usage
Zyphra's custom `transformers` fork is required to load `ZayaForCausalLM`:
```bash
pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1"
# Quote the version specifier so the shell does not treat ">" as a redirect
pip install "bitsandbytes>=0.43.0" accelerate
```
### Load NF4 (4-bit)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(
    "barozp/ZAYA1-8B-BNB", subfolder="NF4", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "barozp/ZAYA1-8B-BNB",
    subfolder="NF4",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```
### Load INT8 (8-bit)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(
    "barozp/ZAYA1-8B-BNB", subfolder="INT8", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "barozp/ZAYA1-8B-BNB",
    subfolder="INT8",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```
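
The `NF4-DQ/` folder has no loading example of its own. The sketch below shows one way to cover all three folders with a single loader. It assumes, without confirmation from the source, that `NF4-DQ/` loads like `NF4/` with `bnb_4bit_use_double_quant=True` added; `bnb_kwargs` and `load` are hypothetical helper names.

```python
def bnb_kwargs(variant):
    """Keyword arguments for BitsAndBytesConfig, keyed by repo folder name."""
    if variant == "INT8":
        return {"load_in_8bit": True}
    if variant in ("NF4", "NF4-DQ"):
        return {
            "load_in_4bit": True,
            "bnb_4bit_quant_type": "nf4",
            # Assumption: the NF4-DQ folder was produced with double quantization on.
            "bnb_4bit_use_double_quant": variant == "NF4-DQ",
        }
    raise ValueError(f"unknown variant {variant!r}")


def load(variant, repo_id="barozp/ZAYA1-8B-BNB"):
    """Return (tokenizer, model) for one quantization folder."""
    # Imported lazily: these need Zyphra's transformers fork, bitsandbytes and a GPU.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    kwargs = bnb_kwargs(variant)
    if kwargs.get("load_in_4bit"):
        kwargs["bnb_4bit_compute_dtype"] = torch.bfloat16
    config = BitsAndBytesConfig(**kwargs)

    tokenizer = AutoTokenizer.from_pretrained(
        repo_id, subfolder=variant, trust_remote_code=True
    )
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        subfolder=variant,
        quantization_config=config,
        device_map="auto",
        trust_remote_code=True,
    )
    return tokenizer, model


# Example (requires a GPU and network access):
# tokenizer, model = load("NF4-DQ")
```

Keeping the configuration in a plain dictionary makes the per-folder differences easy to inspect before any weights are downloaded.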
### Inference
```python
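messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the sum of the first 100 prime numbers?"},
]

# add_generation_prompt=True appends the assistant turn header so the model
# answers the user instead of continuing the user's message.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))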
```
---
## Quantization Details
- **Source:** [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B) (BF16 safetensors)
- **Method:** [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes)
- **Quantized by:** [barozp](https://huggingface.co/barozp)
- **GGUF status:** Pending llama.cpp support — [issue #22776](https://github.com/ggml-org/llama.cpp/issues/22776)
---
## Original Model Prerequisites
```bash
# vLLM (recommended for serving)
pip install "vllm @ git+https://github.com/Zyphra/vllm.git@zaya1"
# Transformers
pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1"
```
---
## License
[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) — same as the original model.