---
base_model: Zyphra/ZAYA1-8B
base_model_relation: quantized
language:
- en
license: apache-2.0
tags:
- bitsandbytes
- quantized
- 4-bit
- 8-bit
- nf4
- zaya
- mixture-of-experts
- reasoning
pipeline_tag: text-generation
---

# ZAYA1-8B: bitsandbytes Quantizations

bitsandbytes quantizations of [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B).

> **Note:** ZAYA1-8B uses a custom sparse MoE architecture (`ZayaForCausalLM`) that is not yet supported by llama.cpp. GGUF files will be added once support lands ([issue #22776](https://github.com/ggml-org/llama.cpp/issues/22776)). In the meantime, these bitsandbytes quantizations provide a working alternative.

---

## Available Files

| Folder | Format | Bits | Size | Description |
|---|---|---|---|---|
| `NF4/` | NF4 | 4-bit | ~5.0 GB | Normal Float 4, best 4-bit quality |
| `NF4-DQ/` | NF4 + DQ | ~4-bit | ~4.7 GB | NF4 + double quantization, slightly smaller |
| `INT8/` | INT8 | 8-bit | ~9.0 GB | Near-lossless |

---

## About ZAYA1-8B

ZAYA1-8B is a small mixture-of-experts language model with **760M active parameters** and **8.4B total parameters**, trained end-to-end by Zyphra. It sets a new standard of intelligence efficiency for its parameter count through a combination of novel architecture and innovations in pretraining and post-training. ZAYA1-8B excels at detailed long-form reasoning, especially for mathematical and coding tasks. Due to its small total parameter count, it can also be deployed on-device for local LLM applications.
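
As a rough sanity check on the on-device claim and on the folder sizes listed under Available Files, the weight footprint can be estimated from the parameter count alone. The sketch below is a back-of-the-envelope calculation, not a measurement. The helper name `estimate_weight_gb` is hypothetical, the per-block scale overheads are assumptions (one 32-bit scale per 64-weight block for NF4, and roughly 0.127 bits per weight with double quantization), and it assumes all 8.4B parameters are quantized.

```python
def estimate_weight_gb(num_params: float, bits_per_param: float) -> float:
    """Return the approximate weight size in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 8.4e9  # total parameter count stated in this model card

# Assumed overhead of the quantization constants, in bits per parameter.
SCALE_BITS = 32 / 64                      # one fp32 scale per 64-weight block
SCALE_BITS_DQ = 8 / 64 + 32 / (64 * 256)  # 8-bit scales plus a second-level fp32 scale

estimates = {
    "BF16": estimate_weight_gb(TOTAL_PARAMS, 16),
    "INT8": estimate_weight_gb(TOTAL_PARAMS, 8),
    "NF4": estimate_weight_gb(TOTAL_PARAMS, 4 + SCALE_BITS),
    "NF4-DQ": estimate_weight_gb(TOTAL_PARAMS, 4 + SCALE_BITS_DQ),
}

for name, gb in estimates.items():
    print(f"{name:>6}: ~{gb:.1f} GB")
```

These estimates come out somewhat below the listed folder sizes, which would be consistent with some tensors (for example embeddings and norms) being stored at higher precision, though that is not confirmed here.
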

- **Technical report:** https://www.zyphra.com/zaya1-8b-technical-report
- **Blog post:** https://www.zyphra.com/post/zaya1-8b
- **Pretraining base:** [Zyphra/ZAYA1-reasoning-base](https://huggingface.co/Zyphra/ZAYA1-reasoning-base)

---

## Performance

[![Performance chart](https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/f5tbexK3BumixnJuBZxo_.png)](https://huggingface.co/Zyphra/ZAYA1-8B)

[![Scaling comparison](https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/W8bn6ZAocWKFuicjtjesv.png)](https://huggingface.co/Zyphra/ZAYA1-8B)

### In-class comparison

| Category | Benchmark | ZAYA1-8B (0.7B / 8B) | Qwen3-4B-Think | Qwen3.5-4B | Gemma-4-E4B-it |
|---|---|---|---|---|---|
| Math | AIME'26 | **89.1** | 77.5 | 84.5 | 50.3 |
| Math | HMMT Feb.'26 | **71.6** | 60.8 | 63.6 | 32.1 |
| Math | IMO-AnswerBench | **59.3** | 50.9 | 48.7 | 27.3 |
| Math | APEX-shortlist | **32.2** | 16.9 | -- | 6.1 |
| Code | LiveCodeBench-v6 | **65.8** | 54.2 | -- | 54.2 |
| Knowledge | GPQA-Diamond | **71.0** | 66.5 | 76.2 | 57.4 |
| Knowledge | MMLU-Pro | 74.2 | 74.3 | **79.1** | 70.2 |
| Instruction | IFEval | 85.58 | 86.8 | **89.8** | 88.50 |
| Instruction | IFBench | 52.56 | 52.9 | **59.2** | 42.67 |
| Style & chat | EQBench | 72.95 | 79.6 | 79.5 | **80.15** |
| Agentic | BFCL-v4 | 39.22 | **49.7** | 45.2 | 31.7 |

### Scaling comparison against larger models

| Model | Active | Total | AIME'26 | HMMT'26 | LCB-v6 | GPQA-D | MMLU-Pro |
|---|---|---|---|---|---|---|---|
| **ZAYA1-8B** | **0.7B** | **8B** | **89.1** | **71.6** | 63.8 | 71.0 | 74.2 |
| Arcee-Trinity-Mini | 3B | 26B | 59.6 | 36.9 | 33.3 | 46.8 | 70.6 |
| N3-Nano-30B | 3B | 30B | 90.1 | 75.5 | **64.6** | **75.1** | **78.9** |
| OLMo-3.1-32B-Think | 32B | 32B | 78.9 | 50.6 | 58.3 | 59.6 | 75.8 |
| Qwen3-Next-80B-A3B | 3B | 80B | 90.2 | 79.3 | 67.8 | 76.7 | 82.6 |
| Intellect-3 | 12B | 106B | 86.3 | 72.2 | 66.8 | 74.6 | 82.3 |
| Mistral-Small-4-119B | 6B | 119B | 86.4 | 70.6 | 57.9 | 77.2 | 81.6 |

*All numbers from the Zyphra evaluation harness. Models ordered by total parameter count.*

---

## Download

> Hugging Face's inference widget and one-click download are not available for this repo.
> `ZayaForCausalLM` requires Zyphra's custom `transformers` fork, so use the commands below.

### Download a specific quantization

```bash
# NF4 (4-bit), recommended
huggingface-cli download barozp/ZAYA1-8B-BNB --include "NF4/*" --local-dir ./ZAYA1-8B-NF4

# NF4 with double quantization
huggingface-cli download barozp/ZAYA1-8B-BNB --include "NF4-DQ/*" --local-dir ./ZAYA1-8B-NF4-DQ

# INT8 (8-bit)
huggingface-cli download barozp/ZAYA1-8B-BNB --include "INT8/*" --local-dir ./ZAYA1-8B-INT8
```

### Download everything

```bash
huggingface-cli download barozp/ZAYA1-8B-BNB --local-dir ./ZAYA1-8B-BNB
```

---

## Usage

Zyphra's custom `transformers` fork is required to load `ZayaForCausalLM`:

```bash
pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1"
# Quote the version specifier so the shell does not treat ">" as a redirect
pip install "bitsandbytes>=0.43.0" accelerate
```

### Load NF4 (4-bit)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(
    "barozp/ZAYA1-8B-BNB", subfolder="NF4", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "barozp/ZAYA1-8B-BNB",
    subfolder="NF4",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```

### Load INT8 (8-bit)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(
    "barozp/ZAYA1-8B-BNB", subfolder="INT8", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "barozp/ZAYA1-8B-BNB",
    subfolder="INT8",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```

### Inference

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the sum of the first 100 prime numbers?"},
]

# add_generation_prompt=True appends the assistant header so the model starts a reply;
# return_dict=True also returns the attention mask alongside input_ids.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
output = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

---

## Quantization Details

- **Source:** [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B) (BF16 safetensors)
- **Method:** [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes)
- **Quantized by:** [barozp](https://huggingface.co/barozp)
- **GGUF status:** Pending llama.cpp support, see [issue #22776](https://github.com/ggml-org/llama.cpp/issues/22776)

---

## Original Model Prerequisites

```bash
# vLLM (recommended for serving)
pip install "vllm @ git+https://github.com/Zyphra/vllm.git@zaya1"

# Transformers
pip install "transformers @ git+https://github.com/Zyphra/transformers.git@zaya1"
```

---

## License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0), same as the original model.