---
license: apache-2.0
library_name: mlx
base_model: Zyphra/ZAYA1-8B
base_model_relation: quantized
pipeline_tag: text-generation
tags:
  - zaya
  - mixture-of-experts
  - hybrid-attention
  - cca-attention
  - mlx
  - apple-silicon
  - reasoning
  - tool-use
  - quantized
  - mxfp4
  - jang
  - osaurus
quantization_config:
  family: mxfp4
  profile: MXFP4
  group_size: 32
  expert_layout: split_switch_mlp
---

<p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p>

# ZAYA1-8B-MXFP4

Quantized **Zyphra/ZAYA1-8B** for Apple Silicon runtimes.

| | |
|---|---|
| Source | [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B) |
| License | Apache-2.0, inherited from upstream |
| Format | MXFP4 |
| Bundle size | 5.48 GiB |
| Tensor keys | 1965 |
| Expert layout | Pre-stacked `zaya_block.experts.switch_mlp` |
| Runtime status | Generation coherence not independently verified for this quantized bundle (the coherence report did not pass); published as a format/runtime bundle pending downstream ZAYA runtime validation. |

## Important Runtime Note

ZAYA is not a stock `mlx_lm` architecture: it alternates CCA attention layers
with top-1 MoE layers. Use this bundle only with a ZAYA-aware MLX/JANG runtime
that implements the CCA attention state contract and the converted pre-stacked
expert layout.

## Architecture Summary

- 80 decoder layers: 40 CCA attention layers and 40 top-1 MoE layers
- Hidden size 2048, 16 query heads, 2 KV heads, head dim 128
- CCA state per attention layer: standard KV plus `conv_state [B,1280,2]`
  and `prev_hs [B,2048]`
- 16 routed experts per MoE layer, top-1 routing with MOD skip route
- Context length 131072, `rope_theta=5000000`
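The extra per-layer state listed above can be sketched as a plain-Python allocation; the field names and the `zeros` helper are illustrative stand-ins, not the real ZAYA runtime contract:

```python
B, HIDDEN = 1, 2048

def zeros(*shape):
    """Nested-list zero tensor, standing in for a real array type."""
    if not shape:
        return 0.0
    return [zeros(*shape[1:]) for _ in range(shape[0])]

def init_cca_layer_state(batch=B):
    """Allocate the extra per-attention-layer state a ZAYA-aware runtime
    carries alongside the standard KV cache (names illustrative)."""
    return {
        "conv_state": zeros(batch, 1280, 2),  # CCA conv state [B, 1280, 2]
        "prev_hs": zeros(batch, HIDDEN),      # previous hidden state [B, 2048]
    }

state = init_cca_layer_state()
```

A runtime that drops `conv_state` or `prev_hs` between decode steps will not reproduce ZAYA's CCA attention, which is why a stock KV-cache-only loader is insufficient.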

## Quantization

Linear layers use 4-bit affine quantization, embeddings use 8-bit affine, and router/CCA state tensors pass through unquantized.

Passthrough floor for the first release:

- `conv_qk.*`, `temp`, norms, residual scaling, router path, biases, and
  balancing biases are preserved as float tensors.
- Embeddings and `lm_head` use 8-bit affine in the prepared bundles.
- `jangtq_runtime.safetensors` is not applicable to MXFP4.
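The passthrough rules above can be expressed as a simple predicate over tensor keys; the marker strings come from the list above, but the matching logic itself is an illustrative sketch, not the converter's actual implementation:

```python
# Substring markers for tensors preserved as float (from the passthrough list).
PASSTHROUGH_MARKERS = ("conv_qk.", "temp", "norm", "residual", "router", "bias")

def quant_mode(key: str) -> str:
    """Return the quantization mode for a tensor key under this
    MXFP4 profile: float passthrough, 8-bit affine, or 4-bit affine."""
    if any(m in key for m in PASSTHROUGH_MARKERS):
        return "float"    # preserved as float tensors
    if "embed" in key or key.startswith("lm_head"):
        return "affine8"  # 8-bit affine embeddings / lm_head
    return "mxfp4"        # 4-bit affine linears, group size 32

mode = quant_mode("model.layers.0.zaya_block.experts.switch_mlp.up_proj.weight")
# expert MLP weights fall through to the 4-bit MXFP4 path
```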

`mxtq_bits` is unset for this profile:

```json
null
```

## Bundle Verification

- Safetensor headers scanned.
- Source tensor coverage checked.
- Converted bundles checked for `local_experts` removal.
- Converted expert tensors checked for pre-stacked `switch_mlp` layout.
- JANGTQ sidecars checked for the Swift runtime contract.
- Runtime coherence status recorded above.
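Several of the checks above can be run directly against the safetensors header, which is an 8-byte little-endian length followed by a JSON tensor index; the layout heuristics below are illustrative, not the exact verification script:

```python
import json
import struct

def read_safetensors_header(path: str) -> dict:
    """Read only the JSON header of a .safetensors file (first 8 bytes
    are a little-endian u64 header length). No tensor data is loaded."""
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(n))

def check_expert_layout(header: dict) -> bool:
    """A converted bundle should expose pre-stacked switch_mlp expert
    keys and no leftover per-expert local_experts keys."""
    keys = [k for k in header if k != "__metadata__"]
    has_stacked = any("experts.switch_mlp" in k for k in keys)
    has_local = any("local_experts" in k for k in keys)
    return has_stacked and not has_local
```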

## Runtime Smoke Tests

Before production use, run short deterministic prompts through the exact target
runtime:

- `What is 2+2? Answer with only the number.`
- `What is the capital of France? Answer with one word.`
- One chat-template prompt with thinking disabled.
- One chat-template prompt with thinking enabled and enough output budget for
  the final answer.
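The checklist above can be wrapped in a small harness; `generate` stands in for whatever prompt-to-text entry point the target ZAYA-aware runtime actually exposes (hypothetical signature):

```python
# Deterministic smoke prompts from the checklist above.
SMOKE_PROMPTS = [
    ("What is 2+2? Answer with only the number.", "4"),
    ("What is the capital of France? Answer with one word.", "Paris"),
]

def run_smoke_tests(generate, temperature=0.0):
    """Run short deterministic prompts and record pass/fail.
    `generate` is the runtime's prompt -> text entry point;
    temperature 0 keeps the runs deterministic."""
    results = []
    for prompt, expected in SMOKE_PROMPTS:
        out = generate(prompt, temperature=temperature).strip()
        results.append((prompt, expected.lower() in out.lower()))
    return results

# Stub runtime, for illustration only.
fake = lambda p, temperature=0.0: "4" if "2+2" in p else "Paris"
results = run_smoke_tests(fake)
```

The chat-template prompts (thinking on/off) need runtime-specific templating and an output budget, so they are left to the target runtime's own interface.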

The first public bundle release records bundle integrity and runtime contract
checks. Full generation quality depends on a ZAYA-aware runtime implementation.

## Summary

This bundle is Zyphra/ZAYA1-8B quantized for Apple Silicon MLX/JANG runtimes. Use it only with a runtime that correctly implements ZAYA's CCA attention state and MoE routing.

## Files

- `config.json` carries `weight_format=mxfp4` and
  `zaya_expert_layout=split_switch_mlp`.
- `jang_config.json` carries `cache_subtype=zaya_cca`.
- Tokenizer files and `chat_template.jinja` are preserved from the upstream
  source snapshot.
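A loader can fail fast on the declared fields above before touching any weights; the field names and values are taken from this card, while the helper itself is an illustrative sketch:

```python
import json

def check_bundle_config(cfg: dict, jang: dict) -> None:
    """Fail fast unless the bundle declares the MXFP4 format, the
    converted expert layout, and the ZAYA CCA cache subtype.
    `cfg` and `jang` are parsed config.json and jang_config.json."""
    assert cfg.get("weight_format") == "mxfp4", "not an MXFP4 bundle"
    assert cfg.get("zaya_expert_layout") == "split_switch_mlp", "wrong expert layout"
    assert jang.get("cache_subtype") == "zaya_cca", "missing ZAYA CCA cache subtype"

# usage, from inside the bundle directory:
# check_bundle_config(json.load(open("config.json")),
#                     json.load(open("jang_config.json")))
```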