
# ZAYA1-8B-JANGTQ4
Quantized Zyphra/ZAYA1-8B for Apple Silicon runtimes.
| Field | Value |
| --- | --- |
| Source | Zyphra/ZAYA1-8B |
| License | Apache-2.0, inherited from upstream |
| Format | JANGTQ4 |
| Bundle size | 4.65 GiB |
| Tensor keys | 1965 |
| Expert layout | Pre-stacked `zaya_block.experts.switch_mlp` |
| Runtime status | Generation coherence has **not** independently passed for the quantized runtime bundle (the coherence report did not pass). Published as a format/runtime bundle pending downstream ZAYA runtime validation. |
## Important Runtime Note
This bundle requires a ZAYA-aware JANGTQ runtime that implements CCA attention state plus pre-stacked `switch_mlp` TurboQuant experts.

ZAYA is not a stock mlx_lm architecture: it alternates CCA attention layers with top-1 MoE layers. Use this bundle only with a runtime that implements the ZAYA CCA state contract and the converted pre-stacked expert layout.
## Architecture Summary
- 80 decoder layers: 40 CCA attention layers and 40 top-1 MoE layers
- Hidden size 2048, 16 query heads, 2 KV heads, head dim 128
- CCA state per attention layer: standard KV plus `conv_state [B, 1280, 2]` and `prev_hs [B, 2048]` (see the state sketch after this list)
- 16 routed experts per MoE layer, top-1 routing with MOD skip route
- Context length 131072, `rope_theta=5000000`
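The per-layer state contract above can be pictured as a small cache object. The sketch below is a minimal illustration, not the actual runtime's types: the class and field names are hypothetical, and only the tensor shapes come from this card.

```python
from dataclasses import dataclass, field

import numpy as np

# Dimensions taken from the architecture summary above.
B = 1          # batch size (illustrative)
N_KV = 2       # KV heads
HEAD_DIM = 128
HIDDEN = 2048

@dataclass
class CCALayerState:
    """Hypothetical per-attention-layer cache for the ZAYA CCA contract."""
    # Standard KV cache, grown along the sequence axis as tokens arrive.
    k_cache: np.ndarray = field(
        default_factory=lambda: np.zeros((B, N_KV, 0, HEAD_DIM), dtype=np.float16))
    v_cache: np.ndarray = field(
        default_factory=lambda: np.zeros((B, N_KV, 0, HEAD_DIM), dtype=np.float16))
    # Extra CCA state: short convolution buffer and previous hidden state.
    conv_state: np.ndarray = field(
        default_factory=lambda: np.zeros((B, 1280, 2), dtype=np.float16))
    prev_hs: np.ndarray = field(
        default_factory=lambda: np.zeros((B, HIDDEN), dtype=np.float16))
```

A ZAYA-aware runtime would keep one such state per CCA attention layer (40 of them) alongside whatever state the MoE layers require.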
## Quantization
4-bit MXTQ routed experts plus 8-bit affine non-routed tensors.

- Passthrough floor for the first release: `conv_qk.*`, `temp`, norms, residual scaling, the router path, biases, and balancing biases are preserved as float tensors.
- Embeddings and `lm_head` use 8-bit affine in the prepared bundles.
- `jangtq_runtime.safetensors` is included.
Per-class bit widths (`mxtq_bits`):

```json
{
  "routed_expert": 4,
  "attention": 8,
  "router": 16,
  "embed_tokens": 8,
  "lm_head": 8,
  "cca_conv": 16,
  "norms_residual": 16
}
```
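To make the mapping concrete, here is a rough sketch of how a tool might bucket tensor keys into these classes. The substring patterns are assumptions based on the names mentioned in this card, not the converter's actual rules; the 16-bit classes correspond to the float passthrough described above.

```python
MXTQ_BITS = {
    "routed_expert": 4,
    "attention": 8,
    "router": 16,
    "embed_tokens": 8,
    "lm_head": 8,
    "cca_conv": 16,
    "norms_residual": 16,
}

def bit_class(key: str) -> str:
    """Bucket a tensor key into an mxtq_bits class.

    The patterns below are illustrative guesses from the layout names in
    this card, not the real converter logic.
    """
    if "experts.switch_mlp" in key:
        return "routed_expert"
    if "conv_qk" in key:
        return "cca_conv"
    if "router" in key:
        return "router"
    if "embed_tokens" in key:
        return "embed_tokens"
    if "lm_head" in key:
        return "lm_head"
    if "norm" in key or key.endswith(".bias"):
        return "norms_residual"
    return "attention"

# Hypothetical key path, built from the expert layout name in this card.
key = "model.layers.3.zaya_block.experts.switch_mlp.up_proj.weight"
print(bit_class(key), MXTQ_BITS[bit_class(key)])  # routed_expert 4
```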
## Bundle Verification
- Safetensor headers scanned (a header-scan sketch follows this list).
- Source tensor coverage checked.
- Converted bundles checked for `local_experts` removal.
- Converted expert tensors checked for the pre-stacked `switch_mlp` layout.
- JANGTQ sidecars checked for the Swift runtime contract.
- Runtime coherence status recorded above.
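The first three checks can be reproduced from the safetensors headers alone. The sketch below assumes the bundle directory contains `*.safetensors` shards; it parses only the header JSON (an 8-byte little-endian length followed by the header) and never reads tensor data.

```python
import json
import struct
from pathlib import Path

def read_safetensors_keys(path: Path) -> list[str]:
    """Return tensor keys by parsing only a .safetensors header."""
    with path.open("rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return [k for k in header if k != "__metadata__"]

def check_bundle(bundle_dir: str) -> None:
    keys: list[str] = []
    for shard in sorted(Path(bundle_dir).glob("*.safetensors")):
        keys.extend(read_safetensors_keys(shard))
    # Layout checks from the list above; the key substrings are assumptions.
    assert not any("local_experts" in k for k in keys), "local_experts not removed"
    assert any("experts.switch_mlp" in k for k in keys), "pre-stacked switch_mlp missing"
    print(f"{len(keys)} tensor keys scanned")  # this card reports 1965
```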
## Runtime Smoke Tests
Before production use, run short deterministic prompts through the exact target runtime:
- "What is 2+2? Answer with only the number."
- "What is the capital of France? Answer with one word."
- One chat-template prompt with thinking disabled.
- One chat-template prompt with thinking enabled and enough output budget for the final answer (a runtime-agnostic harness sketch follows).
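Because the ZAYA-aware runtime's API is outside the scope of this card, the harness below stays runtime-agnostic: `generate` is a placeholder for whatever entry point your runtime exposes, and greedy (temperature-0) decoding is assumed for determinism.

```python
from typing import Callable

# Deterministic smoke prompts from the list above, with expected substrings.
SMOKE_PROMPTS = [
    ("What is 2+2? Answer with only the number.", "4"),
    ("What is the capital of France? Answer with one word.", "Paris"),
]

def run_smoke_tests(generate: Callable[[str], str]) -> None:
    """`generate` is a placeholder for the target runtime's entry point."""
    for prompt, expected in SMOKE_PROMPTS:
        out = generate(prompt).strip()
        status = "ok" if expected.lower() in out.lower() else "FAIL"
        print(f"[{status}] {prompt!r} -> {out!r}")
```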
This first public bundle release records bundle-integrity and runtime-contract checks only; full generation quality depends on a ZAYA-aware runtime implementation.
## Summary

This bundle is a quantization of Zyphra/ZAYA1-8B for Apple Silicon MLX/JANG runtimes. Use it only with a runtime that correctly implements ZAYA's CCA attention state and MoE routing.
## Files
- `config.json` carries `weight_format=mxtq` and `zaya_expert_layout=split_switch_mlp`.
- `jang_config.json` carries `cache_subtype=zaya_cca`.
- Tokenizer files and `chat_template.jinja` are preserved from the upstream source snapshot (a config-check sketch follows).
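A quick way to confirm these fields before loading, as a minimal sketch: the file and field names come from the list above, but nothing else about the sidecar schema is assumed.

```python
import json
from pathlib import Path

def validate_bundle_config(bundle_dir: str) -> None:
    """Check only the sidecar fields this card documents."""
    root = Path(bundle_dir)
    cfg = json.loads((root / "config.json").read_text())
    jang = json.loads((root / "jang_config.json").read_text())
    assert cfg.get("weight_format") == "mxtq"
    assert cfg.get("zaya_expert_layout") == "split_switch_mlp"
    assert jang.get("cache_subtype") == "zaya_cca"
    assert (root / "chat_template.jinja").exists()
    print("bundle sidecar fields match this card")
```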