	---
	base_model: Zyphra/ZAYA1-8B
	base_model_relation: quantized
	language:
	- en
	license: apache-2.0
	tags:
	- bitsandbytes
	- quantized
	- 4-bit
	- 8-bit
	- nf4
	- zaya
	- mixture-of-experts
	- reasoning
	pipeline_tag: text-generation
	---

	# ZAYA1-8B — bitsandbytes Quantizations

	bitsandbytes quantizations of [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B).

	> Note: ZAYA1-8B uses a custom sparse MoE architecture (`ZayaForCausalLM`) that is not yet supported by llama.cpp. GGUF files will be added once support lands ([issue #22776](https://github.com/ggml-org/llama.cpp/issues/22776)). In the meantime, these bitsandbytes quantizations provide a working alternative.

	---

	## Available Files

	| Folder | Format | Bits | Size | Description |
	|---|---|---|---|---|
	| `NF4/` | NF4 | 4-bit | ~5.0 GB | Normal Float 4: best 4-bit quality |
	| `NF4-DQ/` | NF4 + DQ | ~4-bit | ~4.7 GB | NF4 + double quantization: slightly smaller |
	| `INT8/` | INT8 | 8-bit | ~9.0 GB | Near-lossless |
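
As a rough sanity check on the sizes above, the sketch below estimates a lower bound for each format from the 8.4B total parameter count quoted in this card. It assumes the bitsandbytes 4-bit default of one fp32 scale per 64-weight block (0.5 extra bits per parameter) and the roughly 0.127 extra bits per parameter usually cited for double quantization. It also assumes every parameter is quantized, which is likely not the case here (embeddings and norms are typically kept in higher precision), so the real folders come out somewhat larger.

```python
TOTAL_PARAMS = 8.4e9  # total parameter count quoted in this card


def approx_size_gb(bits_per_param: float, n_params: float = TOTAL_PARAMS) -> float:
    """Lower-bound size in GB (1 GB = 1e9 bytes) if every parameter were quantized."""
    return n_params * bits_per_param / 8 / 1e9


# Assumed per-parameter costs; not measured from the files in this repo.
estimates = {
    "NF4": approx_size_gb(4 + 32 / 64),   # 4-bit weights + one fp32 scale per 64 weights
    "NF4-DQ": approx_size_gb(4 + 0.127),  # scales themselves quantized to 8-bit
    "INT8": approx_size_gb(8),            # 8-bit weights, scale overhead ignored
}

for name, size in estimates.items():
    print(f"{name}: >= {size:.2f} GB")
```

Under these assumptions the bounds are about 4.73 GB (NF4), 4.33 GB (NF4-DQ), and 8.40 GB (INT8), each below the corresponding size listed in the table.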

	---

	## About ZAYA1-8B

	ZAYA1-8B is a small mixture-of-experts (MoE) language model with 760M active parameters and 8.4B total parameters, trained end-to-end by Zyphra. Zyphra attributes its results at this parameter count to a combination of a novel architecture and innovations in pretraining and post-training.

	ZAYA1-8B excels at detailed long-form reasoning, especially for mathematical and coding tasks. Due to its small total parameter count, it can also be deployed on-device for local LLM applications.

	- Technical report: https://www.zyphra.com/zaya1-8b-technical-report
	- Blog post: https://www.zyphra.com/post/zaya1-8b
	- Pretraining base: [Zyphra/ZAYA1-reasoning-base](https://huggingface.co/Zyphra/ZAYA1-reasoning-base)

	---

	## Performance

	[![Performance chart](https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/f5tbexK3BumixnJuBZxo_.png)](https://huggingface.co/Zyphra/ZAYA1-8B)

	[![Scaling comparison](https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/W8bn6ZAocWKFuicjtjesv.png)](https://huggingface.co/Zyphra/ZAYA1-8B)

	### In-class comparison

	| Category | Benchmark | ZAYA1-8B (0.7B active / 8B total) | Qwen3-4B-Think | Qwen3.5-4B | Gemma-4-E4B-it |
	|---|---|---|---|---|---|
	| Math | AIME'26 | 89.1 | 77.5 | 84.5 | 50.3 |
	| Math | HMMT Feb.'26 | 71.6 | 60.8 | 63.6 | 32.1 |
	| Math | IMO-AnswerBench | 59.3 | 50.9 | 48.7 | 27.3 |
	| Math | APEX-shortlist | 32.2 | 16.9 | -- | 6.1 |
	| Code | LiveCodeBench-v6 | 65.8 | 54.2 | -- | 54.2 |
	| Knowledge | GPQA-Diamond | 71.0 | 66.5 | 76.2 | 57.4 |
	| Knowledge | MMLU-Pro | 74.2 | 74.3 | 79.1 | 70.2 |
	| Instruction | IFEval | 85.58 | 86.8 | 89.8 | 88.50 |
	| Instruction | IFBench | 52.56 | 52.9 | 59.2 | 42.67 |
	| Style & chat | EQBench | 72.95 | 79.6 | 79.5 | 80.15 |
	| Agentic | BFCL-v4 | 39.22 | 49.7 | 45.2 | 31.7 |

	### Scaling comparison against larger models

	| Model | Active | Total | AIME'26 | HMMT'26 | LCB-v6 | GPQA-D | MMLU-Pro |
	|---|---|---|---|---|---|---|---|
	| ZAYA1-8B | 0.7B | 8B | 89.1 | 71.6 | 63.8 | 71.0 | 74.2 |
	| Arcee-Trinity-Mini | 3B | 26B | 59.6 | 36.9 | 33.3 | 46.8 | 70.6 |
	| N3-Nano-30B | 3B | 30B | 90.1 | 75.5 | 64.6 | 75.1 | 78.9 |
	| OLMo-3.1-32B-Think | 32B | 32B | 78.9 | 50.6 | 58.3 | 59.6 | 75.8 |
	| Qwen3-Next-80B-A3B | 3B | 80B | 90.2 | 79.3 | 67.8 | 76.7 | 82.6 |
	| Intellect-3 | 12B | 106B | 86.3 | 72.2 | 66.8 | 74.6 | 82.3 |
	| Mistral-Small-4-119B | 6B | 119B | 86.4 | 70.6 | 57.9 | 77.2 | 81.6 |

	All numbers are from the Zyphra evaluation harness. Models are ordered by total parameter count.

	---

	## Download

	> Hugging Face's inference widget and one-click download are not available for this repo.
	> `ZayaForCausalLM` requires Zyphra's custom `transformers` fork, so use the commands below.

	### Download a specific quantization

	```bash
	# NF4 (4-bit) — recommended
	huggingface-cli download barozp/ZAYA1-8B-BNB --include "NF4/*" --local-dir ./ZAYA1-8B-NF4

	# NF4 with double quantization
	huggingface-cli download barozp/ZAYA1-8B-BNB --include "NF4-DQ/*" --local-dir ./ZAYA1-8B-NF4-DQ

	# INT8 (8-bit)
	huggingface-cli download barozp/ZAYA1-8B-BNB --include "INT8/*" --local-dir ./ZAYA1-8B-INT8
	```

	### Download everything

	```bash
	huggingface-cli download barozp/ZAYA1-8B-BNB --local-dir ./ZAYA1-8B-BNB
	```
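
The same downloads can also be scripted from Python. The sketch below uses `snapshot_download` from `huggingface_hub` with `allow_patterns`, mirroring the `--include` and `--local-dir` options above. The helper names `download_args` and `download_variant` are hypothetical and exist only for this example.

```python
from pathlib import Path

REPO_ID = "barozp/ZAYA1-8B-BNB"
VARIANTS = ("NF4", "NF4-DQ", "INT8")  # folder names from the Available Files table


def download_args(variant: str, root: str = ".") -> dict:
    """Build snapshot_download keyword arguments for one quantization folder."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    return {
        "repo_id": REPO_ID,
        "allow_patterns": [f"{variant}/*"],  # same filter as --include "<variant>/*"
        "local_dir": str(Path(root) / f"ZAYA1-8B-{variant}"),
    }


def download_variant(variant: str, root: str = ".") -> str:
    """Download one folder and return the local path reported by huggingface_hub."""
    # Imported lazily so the module loads without the package; the call needs network access.
    from huggingface_hub import snapshot_download

    return snapshot_download(**download_args(variant, root))
```

Calling `download_variant("NF4")` should place the files under `./ZAYA1-8B-NF4/NF4/`, the same layout the CLI command produces.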

	---

	## Usage

	Zyphra's custom `transformers` fork is required to load `ZayaForCausalLM`:

	```bash
	pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1"
	pip install "bitsandbytes>=0.43.0" accelerate
	```

	### Load NF4 (4-bit)

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
	import torch

	bnb_config = BitsAndBytesConfig(
	    load_in_4bit=True,
	    bnb_4bit_quant_type="nf4",
	    bnb_4bit_compute_dtype=torch.bfloat16,
	)

	tokenizer = AutoTokenizer.from_pretrained(
	    "barozp/ZAYA1-8B-BNB", subfolder="NF4", trust_remote_code=True
	)
	model = AutoModelForCausalLM.from_pretrained(
	    "barozp/ZAYA1-8B-BNB",
	    subfolder="NF4",
	    quantization_config=bnb_config,
	    device_map="auto",
	    trust_remote_code=True,
	)
	```

	### Load INT8 (8-bit)

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

	# 8-bit (LLM.int8) weights; no 4-bit options are needed here
	bnb_config = BitsAndBytesConfig(load_in_8bit=True)

	tokenizer = AutoTokenizer.from_pretrained(
	    "barozp/ZAYA1-8B-BNB", subfolder="INT8", trust_remote_code=True
	)
	model = AutoModelForCausalLM.from_pretrained(
	    "barozp/ZAYA1-8B-BNB",
	    subfolder="INT8",
	    quantization_config=bnb_config,
	    device_map="auto",
	    trust_remote_code=True,
	)
	```
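
The `NF4-DQ/` folder has no loading snippet of its own. The sketch below assumes it loads the same way as the other two folders, with `bnb_4bit_use_double_quant=True` added to the 4-bit configuration. That assumption has not been checked against the files in this repo. `BNB_KWARGS` and `load_variant` are hypothetical names used only for illustration.

```python
REPO_ID = "barozp/ZAYA1-8B-BNB"

# BitsAndBytesConfig keyword arguments per folder; the compute dtype is added at load time.
BNB_KWARGS = {
    "NF4": {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"},
    "NF4-DQ": {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_use_double_quant": True,  # assumed setting for the NF4-DQ folder
    },
    "INT8": {"load_in_8bit": True},
}


def load_variant(variant: str):
    """Return (tokenizer, model) for one folder of the repo."""
    # Heavy imports are kept inside the function so the mapping above stays importable.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    kwargs = dict(BNB_KWARGS[variant])
    if kwargs.get("load_in_4bit"):
        kwargs["bnb_4bit_compute_dtype"] = torch.bfloat16
    bnb_config = BitsAndBytesConfig(**kwargs)

    tokenizer = AutoTokenizer.from_pretrained(
        REPO_ID, subfolder=variant, trust_remote_code=True
    )
    model = AutoModelForCausalLM.from_pretrained(
        REPO_ID,
        subfolder=variant,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
    )
    return tokenizer, model
```

With Zyphra's fork installed, `tokenizer, model = load_variant("NF4-DQ")` would then be used exactly like the NF4 and INT8 objects loaded above.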

	### Inference

	```python
	messages = [
	    {"role": "system", "content": "You are a helpful assistant."},
	    {"role": "user", "content": "What is the sum of the first 100 prime numbers?"},
	]

	# Render the chat template and append the assistant turn header so the model replies
	input_ids = tokenizer.apply_chat_template(
	    messages,
	    add_generation_prompt=True,
	    return_tensors="pt",
	).to(model.device)
	output = model.generate(input_ids, max_new_tokens=512)
	# Decode only the newly generated tokens, skipping the prompt
	print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
	```

	---

	## Quantization Details

	- Source: [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B) (BF16 safetensors)
	- Method: [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes)
	- Quantized by: [barozp](https://huggingface.co/barozp)
	- GGUF status: Pending llama.cpp support — [issue #22776](https://github.com/ggml-org/llama.cpp/issues/22776)

	---

	## Original Model Prerequisites

	```bash
	# vLLM (recommended for serving)
	pip install "vllm @ git+https://github.com/Zyphra/vllm.git@zaya1"

	# Transformers
	pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1"
	```

	---

	## License

	[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) — same as the original model.