| --- |
| base_model: Zyphra/ZAYA1-8B |
| base_model_relation: quantized |
| language: |
| - en |
| license: apache-2.0 |
| tags: |
| - bitsandbytes |
| - quantized |
| - 4-bit |
| - 8-bit |
| - nf4 |
| - zaya |
| - mixture-of-experts |
| - reasoning |
| pipeline_tag: text-generation |
| --- |
| |
| # ZAYA1-8B — bitsandbytes Quantizations |
|
|
| bitsandbytes quantizations of [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B). |
|
|
| > **Note:** ZAYA1-8B uses a custom sparse MoE architecture (`ZayaForCausalLM`) that is not yet supported by llama.cpp. GGUF files will be added once support lands ([issue #22776](https://github.com/ggml-org/llama.cpp/issues/22776)). In the meantime, these bitsandbytes quantizations provide a working alternative. |
|
|
| --- |
|
|
| ## Available Files |
|
|
| | Folder | Format | Bits | Size | Description | |
| |---|---|---|---|---| |
| | `NF4/` | NF4 | 4-bit | ~5.0 GB | 4-bit NormalFloat (NF4); best 4-bit quality | |
| | `NF4-DQ/` | NF4 + DQ | ~4-bit | ~4.7 GB | NF4 with double quantization of the scaling constants; slightly smaller | |
| | `INT8/` | INT8 | 8-bit | ~9.0 GB | Near-lossless | |
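
The size gap between `NF4/` and `NF4-DQ/` can be sanity-checked with a little arithmetic. The sketch below is an estimate, not a measurement. It assumes the usual bitsandbytes defaults (blocks of 64 weights with one FP32 scaling constant each; with double quantization, an 8-bit constant per block plus one FP32 constant per 256 blocks). It also assumes, for simplicity, that all 8.4B parameters are quantized. Some tensors usually stay in higher precision, so the real folders are larger than these figures; only the difference between the two variants is meaningful.

```python
TOTAL_PARAMS = 8.4e9  # total parameter count quoted in this card


def bits_per_param(block_size=64, double_quant=False, dq_group=256):
    """Approximate storage cost of one NF4 weight, including scaling constants."""
    if double_quant:
        # 8-bit constant per block, plus one FP32 constant per dq_group blocks
        overhead = 8 / block_size + 32 / (block_size * dq_group)
    else:
        # one FP32 constant per block
        overhead = 32 / block_size
    return 4 + overhead


def size_gb(bits):
    """Convert a per-parameter bit cost into gigabytes for the whole model."""
    return TOTAL_PARAMS * bits / 8 / 1e9


nf4 = bits_per_param()
nf4_dq = bits_per_param(double_quant=True)
print(f"NF4:    {nf4:.3f} bits/param, ~{size_gb(nf4):.2f} GB")
print(f"NF4-DQ: {nf4_dq:.3f} bits/param, ~{size_gb(nf4_dq):.2f} GB")
print(f"Saved by double quantization: ~{size_gb(nf4 - nf4_dq):.2f} GB")
```

Under these assumptions double quantization saves about 0.37 bits per weight, or roughly 0.4 GB for this model, which is in line with the ~0.3 GB difference listed in the table.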
|
|
| --- |
|
|
| ## About ZAYA1-8B |
|
|
| ZAYA1-8B is a small mixture-of-experts (MoE) language model with **760M active parameters** and **8.4B total parameters**, trained end-to-end by Zyphra. It sets a new standard of intelligence efficiency for its parameter count through a combination of novel architecture and innovations in pretraining and post-training. |
|
|
| ZAYA1-8B excels at detailed long-form reasoning, especially for mathematical and coding tasks. Due to its small total parameter count, it can also be deployed on-device for local LLM applications. |
|
|
| - **Technical report:** https://www.zyphra.com/zaya1-8b-technical-report |
| - **Blog post:** https://www.zyphra.com/post/zaya1-8b |
| - **Pretraining base:** [Zyphra/ZAYA1-reasoning-base](https://huggingface.co/Zyphra/ZAYA1-reasoning-base) |
|
|
| --- |
|
|
| ## Performance |
|
|
|
|
| ### In-class comparison |
|
|
| | Category | Benchmark | ZAYA1-8B (0.7B / 8B) | Qwen3-4B-Think | Qwen3.5-4B | Gemma-4-E4B-it | |
| |---|---|---|---|---|---| |
| | Math | AIME'26 | **89.1** | 77.5 | 84.5 | 50.3 | |
| | Math | HMMT Feb.'26 | **71.6** | 60.8 | 63.6 | 32.1 | |
| | Math | IMO-AnswerBench | **59.3** | 50.9 | 48.7 | 27.3 | |
| | Math | APEX-shortlist | **32.2** | 16.9 | -- | 6.1 | |
| | Code | LiveCodeBench-v6 | **65.8** | 54.2 | -- | 54.2 | |
| | Knowledge | GPQA-Diamond | **71.0** | 66.5 | 76.2 | 57.4 | |
| | Knowledge | MMLU-Pro | 74.2 | 74.3 | **79.1** | 70.2 | |
| | Instruction | IFEval | 85.58 | 86.8 | **89.8** | 88.50 | |
| | Instruction | IFBench | 52.56 | 52.9 | **59.2** | 42.67 | |
| | Style & chat | EQBench | 72.95 | 79.6 | 79.5 | **80.15** | |
| | Agentic | BFCL-v4 | 39.22 | **49.7** | 45.2 | 31.7 | |
|
|
| ### Scaling comparison against larger models |
|
|
| | Model | Active | Total | AIME'26 | HMMT'26 | LCB-v6 | GPQA-D | MMLU-Pro | |
| |---|---|---|---|---|---|---|---| |
| | **ZAYA1-8B** | **0.7B** | **8B** | **89.1** | **71.6** | 63.8 | 71.0 | 74.2 | |
| | Arcee-Trinity-Mini | 3B | 26B | 59.6 | 36.9 | 33.3 | 46.8 | 70.6 | |
| | N3-Nano-30B | 3B | 30B | 90.1 | 75.5 | **64.6** | **75.1** | **78.9** | |
| | OLMo-3.1-32B-Think | 32B | 32B | 78.9 | 50.6 | 58.3 | 59.6 | 75.8 | |
| | Qwen3-Next-80B-A3B | 3B | 80B | 90.2 | 79.3 | 67.8 | 76.7 | 82.6 | |
| | Intellect-3 | 12B | 106B | 86.3 | 72.2 | 66.8 | 74.6 | 82.3 | |
| | Mistral-Small-4-119B | 6B | 119B | 86.4 | 70.6 | 57.9 | 77.2 | 81.6 | |
|
|
| *All numbers from the Zyphra evaluation harness. Models ordered by total parameter count.* |
|
|
| --- |
|
|
| ## Download |
|
|
| > Hugging Face's inference widget and one-click download are not available for this repo. |
| > `ZayaForCausalLM` requires Zyphra's custom `transformers` fork — use the commands below. |
|
|
| ### Download a specific quantization |
|
|
| ```bash |
| # NF4 (4-bit) — recommended |
| huggingface-cli download barozp/ZAYA1-8B-BNB --include "NF4/*" --local-dir ./ZAYA1-8B-NF4 |
| |
| # NF4 with double quantization |
| huggingface-cli download barozp/ZAYA1-8B-BNB --include "NF4-DQ/*" --local-dir ./ZAYA1-8B-NF4-DQ |
| |
| # INT8 (8-bit) |
| huggingface-cli download barozp/ZAYA1-8B-BNB --include "INT8/*" --local-dir ./ZAYA1-8B-INT8 |
| ``` |
|
|
| ### Download everything |
|
|
| ```bash |
| huggingface-cli download barozp/ZAYA1-8B-BNB --local-dir ./ZAYA1-8B-BNB |
| ``` |
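
The same downloads can be scripted from Python. The following is a minimal sketch using `snapshot_download` from `huggingface_hub`; the helper names `patterns_for` and `download` are hypothetical and exist only for this example, and the local directory names simply mirror the CLI commands above.

```python
def patterns_for(variant=None):
    """Glob patterns selecting one quantization folder, or None for the whole repo."""
    return None if variant is None else [f"{variant}/*"]


def download(variant=None, repo_id="barozp/ZAYA1-8B-BNB"):
    """Download one folder ("NF4", "NF4-DQ", "INT8") or, with no argument, everything."""
    # Imported lazily so patterns_for() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    local_dir = f"./ZAYA1-8B-{variant or 'BNB'}"
    return snapshot_download(
        repo_id=repo_id,
        allow_patterns=patterns_for(variant),  # None means "no filter"
        local_dir=local_dir,
    )
```

For example, `download("NF4")` fetches only the `NF4/` folder and returns the local path, while `download()` fetches the whole repository.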
|
|
| --- |
|
|
| ## Usage |
|
|
| Zyphra's custom `transformers` fork is required to load `ZayaForCausalLM`: |
|
|
| ```bash |
| pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1" |
| pip install "bitsandbytes>=0.43.0" accelerate |
| ``` |
|
|
| ### Load NF4 (4-bit) |
|
|
| ```python |
| from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig |
| import torch |
| |
| bnb_config = BitsAndBytesConfig( |
|     load_in_4bit=True, |
|     bnb_4bit_quant_type="nf4", |
|     bnb_4bit_compute_dtype=torch.bfloat16, |
| ) |
| |
| tokenizer = AutoTokenizer.from_pretrained( |
|     "barozp/ZAYA1-8B-BNB", |
|     subfolder="NF4", |
|     trust_remote_code=True, |
| ) |
| model = AutoModelForCausalLM.from_pretrained( |
|     "barozp/ZAYA1-8B-BNB", |
|     subfolder="NF4", |
|     quantization_config=bnb_config, |
|     device_map="auto", |
|     trust_remote_code=True, |
| ) |
| ``` |
|
|
| ### Load INT8 (8-bit) |
|
|
| ```python |
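| from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig |
| |
| # 8-bit loading needs no quant-type or compute-dtype settings |
| bnb_config = BitsAndBytesConfig(load_in_8bit=True) |
| |
| tokenizer = AutoTokenizer.from_pretrained( |
|     "barozp/ZAYA1-8B-BNB", subfolder="INT8", trust_remote_code=True |
| ) |
| model = AutoModelForCausalLM.from_pretrained( |
|     "barozp/ZAYA1-8B-BNB", |
|     subfolder="INT8", |
|     quantization_config=bnb_config, |
|     device_map="auto", |
|     trust_remote_code=True, |
| ) |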
| ``` |
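
The table of available files also lists an `NF4-DQ/` folder, which the loading examples above do not cover. The sketch below wraps all three variants in one hypothetical helper, `load_variant`. It assumes the `NF4-DQ/` folder corresponds to `bnb_4bit_use_double_quant=True` in `BitsAndBytesConfig`; that mapping is inferred from the folder description, not confirmed by the repo. Because the checkpoints are already quantized, the settings stored with them may take precedence over the ones passed here.

```python
# Keyword arguments for transformers.BitsAndBytesConfig, keyed by repo subfolder.
# The "NF4-DQ" entry is an assumption about how that folder was produced.
BNB_KWARGS = {
    "NF4": {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"},
    "NF4-DQ": {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_use_double_quant": True,
    },
    "INT8": {"load_in_8bit": True},
}


def load_variant(subfolder, repo_id="barozp/ZAYA1-8B-BNB"):
    """Load tokenizer and model for one quantization folder (hypothetical helper)."""
    # Heavy imports stay local so BNB_KWARGS can be inspected without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    kwargs = dict(BNB_KWARGS[subfolder])
    if kwargs.get("load_in_4bit"):
        # Same compute dtype as the NF4 example above
        kwargs["bnb_4bit_compute_dtype"] = torch.bfloat16

    tokenizer = AutoTokenizer.from_pretrained(
        repo_id, subfolder=subfolder, trust_remote_code=True
    )
    model = AutoModelForCausalLM.from_pretrained(
        repo_id,
        subfolder=subfolder,
        quantization_config=BitsAndBytesConfig(**kwargs),
        device_map="auto",
        trust_remote_code=True,
    )
    return tokenizer, model
```

A call such as `tokenizer, model = load_variant("NF4-DQ")` then yields objects that can be used with the inference snippet that follows.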
|
|
| ### Inference |
|
|
| ```python |
| messages = [ |
|     {"role": "system", "content": "You are a helpful assistant."}, |
|     {"role": "user", "content": "What is the sum of the first 100 prime numbers?"}, |
| ] |
| |
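| # add_generation_prompt=True appends the assistant turn marker so the model starts a reply |
| input_ids = tokenizer.apply_chat_template( |
|     messages, add_generation_prompt=True, return_tensors="pt" |
| ).to(model.device) |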
| output = model.generate(input_ids, max_new_tokens=512) |
| print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)) |
| ``` |
|
|
| --- |
|
|
| ## Quantization Details |
|
|
| - **Source:** [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B) (BF16 safetensors) |
| - **Method:** [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) |
| - **Quantized by:** [barozp](https://huggingface.co/barozp) |
| - **GGUF status:** Pending llama.cpp support — [issue #22776](https://github.com/ggml-org/llama.cpp/issues/22776) |
|
|
| --- |
|
|
| ## Original Model Prerequisites |
|
|
| ```bash |
| # vLLM (recommended for serving) |
| pip install "vllm @ git+https://github.com/Zyphra/vllm.git@zaya1" |
| |
| # Transformers |
| pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1" |
| ``` |
|
|
| --- |
|
|
| ## License |
|
|
| [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) — same as the original model. |
|
|