---
license: apache-2.0
library_name: mlx
base_model: Zyphra/ZAYA1-8B
base_model_relation: quantized
pipeline_tag: text-generation
tags:
- zaya
- mixture-of-experts
- hybrid-attention
- cca-attention
- mlx
- apple-silicon
- reasoning
- tool-use
- quantized
- jang
- jangtq
- mxtq
- jangtq-prestack
- osaurus
quantization_config:
  family: jangtq
  profile: JANGTQ2
  group_size: 32
  expert_layout: split_switch_mlp
---

<p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p>

# ZAYA1-8B-JANGTQ2

Quantized **Zyphra/ZAYA1-8B** for Apple Silicon runtimes.

| | |
|---|---|
| Source | [Zyphra/ZAYA1-8B](https://huggingface.co/Zyphra/ZAYA1-8B) |
| License | Apache-2.0, inherited from upstream |
| Format | JANGTQ2 |
| Modality | text |
| Bundle size | 2.77 GiB |
| Tensor keys | 1965 |
| Expert layout | Pre-stacked `zaya_block.experts.switch_mlp` |
| Runtime status | Generation coherence not independently verified for this quantized bundle (no coherence report yet); published as a format/runtime bundle pending downstream ZAYA runtime validation. |

## Important Runtime Note

This bundle requires a ZAYA-aware JANGTQ runtime that implements CCA attention state plus pre-stacked `switch_mlp` TurboQuant experts.

ZAYA is not a stock `mlx_lm` architecture. It alternates CCA attention layers and top-1 MoE layers. Use this bundle only with a runtime that implements the ZAYA CCA state contract and the converted pre-stacked expert layout.

## Runtime Pin Required

Use a `vmlx-swift-lm` build that includes the ZAYA Swift runtime (`Libraries/MLXLLM/Models/Zaya.swift` + `MLXLMCommon/Cache/ZayaCCACache.swift` + `BatchEngine/BatchZayaCCACache.swift`). The first verified pin is commit `b9da180` or newer.
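
A quick way to check a local checkout against this pin is to confirm that commit `b9da180` is an ancestor of the checked-out revision and that the three ZAYA runtime files listed above are present. The sketch below is a minimal example; the checkout path is an assumption, while the commit and file paths are taken from this card.

```python
# Minimal pin check for a local vmlx-swift-lm checkout (checkout path is an assumption).
import subprocess
from pathlib import Path

REPO = Path("~/src/vmlx-swift-lm").expanduser()  # hypothetical checkout location
PIN = "b9da180"                                   # first verified pin from this card
RUNTIME_FILES = [
    "Libraries/MLXLLM/Models/Zaya.swift",
    "MLXLMCommon/Cache/ZayaCCACache.swift",
    "BatchEngine/BatchZayaCCACache.swift",
]

# The pin is satisfied when b9da180 is an ancestor of HEAD ("or newer").
is_ancestor = subprocess.run(
    ["git", "-C", str(REPO), "merge-base", "--is-ancestor", PIN, "HEAD"]
).returncode == 0
missing = [f for f in RUNTIME_FILES if not (REPO / f).exists()]

print(f"pin {PIN} reachable from HEAD: {is_ancestor}")
print(f"missing runtime files: {missing or 'none'}")
```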

## Architecture Summary

- 80 decoder layers: alternating CCA attention and top-1 MoE
- Hidden size 2048, 16 query heads, 2 KV heads, head dim ?
- CCA state per attention layer: standard KV plus `conv_state [B,1280,2]` and `prev_hs [B,2048]` (sketched after this list)
- 16 routed experts per MoE layer, top-1 routing with MOD skip route
- Context length 131072, `rope_theta=5000000`
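
To make the per-layer CCA state concrete, the sketch below allocates the extra cache tensors named above for one attention layer. It is a minimal illustration in MLX Python, not the Swift runtime's actual cache types; the class and field names are hypothetical, and the standard KV cache is left out.

```python
import mlx.core as mx

class ZayaCCALayerState:
    """Extra per-layer state for one CCA attention layer (names illustrative)."""

    def __init__(self, batch_size: int, hidden_size: int = 2048, conv_channels: int = 1280):
        # conv_state [B, 1280, 2]: rolling convolution state for the CCA path.
        self.conv_state = mx.zeros((batch_size, conv_channels, 2))
        # prev_hs [B, 2048]: previous hidden state carried across decode steps.
        self.prev_hs = mx.zeros((batch_size, hidden_size))
        # The standard KV cache for the 16 query / 2 KV heads is managed
        # separately by the runtime and is not sketched here.

# Alternating layers mean roughly half of the 80 decoder layers carry CCA state,
# so a batch-of-1 decode holds one state object per attention layer:
states = [ZayaCCALayerState(batch_size=1) for _ in range(40)]
print(states[0].conv_state.shape, states[0].prev_hs.shape)  # (1, 1280, 2) (1, 2048)
```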

## Quantization

2-bit MXTQ routed experts + 8-bit affine non-routed tensors.

Passthrough floor for the first release:

- `conv_qk.*`, `temp`, norms, residual scaling, the router path, biases, and balancing biases are preserved as float tensors.
- Embeddings and `lm_head` use 8-bit affine quantization in the prepared bundles.
- Text-only ZAYA1-8B has no vision_tower or LoRA tensors.
- A `jangtq_runtime.safetensors` sidecar is included.

`mxtq_bits`:

```json
{
  "routed_expert": 2,
  "attention": 8,
  "router": 16,
  "embed_tokens": 8,
  "lm_head": 8,
  "cca_conv": 16,
  "norms_residual": 16
}
```
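
For readers writing a converter or loader, the sketch below shows one way such a table could be mapped onto tensor keys. Only the category names and bit widths come from `mxtq_bits` above; the key substrings are hypothetical naming assumptions, not the actual conversion rules.

```python
# Illustrative per-tensor bit-width selection from the mxtq_bits table above.
MXTQ_BITS = {
    "routed_expert": 2,
    "attention": 8,
    "router": 16,
    "embed_tokens": 8,
    "lm_head": 8,
    "cca_conv": 16,
    "norms_residual": 16,
}

def bits_for_key(key: str) -> int:
    """Pick a bit width for a tensor key (key patterns are hypothetical)."""
    if "experts.switch_mlp" in key:
        return MXTQ_BITS["routed_expert"]
    if "conv_qk" in key:
        return MXTQ_BITS["cca_conv"]
    if "router" in key:
        return MXTQ_BITS["router"]
    if "embed_tokens" in key:
        return MXTQ_BITS["embed_tokens"]
    if "lm_head" in key:
        return MXTQ_BITS["lm_head"]
    if "norm" in key or "residual" in key:
        return MXTQ_BITS["norms_residual"]
    return MXTQ_BITS["attention"]

print(bits_for_key("model.layers.1.zaya_block.experts.switch_mlp.up_proj.weight"))  # -> 2
```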

## Bundle Verification

- Safetensor headers scanned (see the header-scan sketch after this list).
- Source tensor coverage checked.
- Converted bundles checked for `local_experts` removal.
- Converted expert tensors checked for pre-stacked `switch_mlp` layout.
- JANGTQ sidecars checked for the Swift runtime contract.
- Capabilities verified: family=zaya, supports_thinking=False, tool_parser=zaya_xml.
- Runtime coherence status recorded above.
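
The sketch below shows roughly how a header-only scan can confirm the tensor-key count, the removal of `local_experts` keys, and the presence of pre-stacked `switch_mlp` expert tensors. It is a minimal example, not the verification tool behind the checks above; the shard filenames are assumed to be whatever `*.safetensors` files ship in the bundle directory.

```python
# Header-only scan of safetensors shards: reads each JSON header without
# loading tensor data (safetensors files start with an 8-byte little-endian
# header length followed by the JSON header itself).
import json
import struct
from pathlib import Path

def read_header(path: Path) -> dict:
    with path.open("rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        return json.loads(f.read(header_len))

keys = []
for shard in sorted(Path(".").glob("*.safetensors")):
    keys += [k for k in read_header(shard) if k != "__metadata__"]

print("tensor keys:", len(keys))  # expected 1965 for this bundle
print("local_experts removed:", not any("local_experts" in k for k in keys))
print("pre-stacked switch_mlp present:",
      any("zaya_block.experts.switch_mlp" in k for k in keys))
```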

## Runtime Smoke Tests

Before production use, run short deterministic prompts through the exact target runtime:

- `What is 2+2? Answer with only the number.`
- `What is the capital of France? Answer with one word.`
- One chat-template prompt with thinking disabled.
- One chat-template prompt with thinking enabled and enough output budget for the final answer.

The first public bundle release records bundle integrity and runtime contract checks. Full generation quality depends on a ZAYA-aware runtime implementation.
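
If the bundle is served behind a local OpenAI-compatible chat endpoint, the first two prompts can be driven with a few lines of Python. This is a hedged sketch only: the endpoint URL, the served model id, and the existence of such an endpoint are assumptions about the deployment, and `temperature` 0 is used so runs are repeatable.

```python
# Hypothetical smoke test against a local OpenAI-compatible chat endpoint.
import json
import urllib.request

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # hypothetical local server
MODEL = "ZAYA1-8B-JANGTQ2"                               # hypothetical served model id

PROMPTS = {
    "What is 2+2? Answer with only the number.": "4",
    "What is the capital of France? Answer with one word.": "Paris",
}

for prompt, expected in PROMPTS.items():
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,   # deterministic decoding for repeatable checks
        "max_tokens": 8,
    }).encode()
    req = urllib.request.Request(ENDPOINT, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        answer = json.load(resp)["choices"][0]["message"]["content"].strip()
    print(f"{prompt!r} -> {answer!r} (expected {expected!r})")
```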

## Summary

This bundle is Zyphra/ZAYA1-8B quantized for Apple Silicon MLX/JANG runtimes. Use it only with a runtime that exactly implements ZAYA's CCA attention state and MoE routing.

## Files

- `config.json` carries `weight_format=mxtq`, `zaya_expert_layout=split_switch_mlp` (read back in the sketch below).
- `jang_config.json` carries `cache_subtype=zaya_cca`.
- Tokenizer files and chat template are preserved from the upstream source snapshot.
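
A quick post-download consistency check is to read these keys back from the bundle directory. A minimal sketch, assuming the files sit in the current directory:

```python
# Read back the bundle config keys listed above (files assumed to be in ".").
import json
from pathlib import Path

config = json.loads(Path("config.json").read_text())
jang_config = json.loads(Path("jang_config.json").read_text())

assert config.get("weight_format") == "mxtq"
assert config.get("zaya_expert_layout") == "split_switch_mlp"
assert jang_config.get("cache_subtype") == "zaya_cca"
print("bundle config keys match this card")
```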