---
license: mit
license_name: deepseek-license
library_name: mlx
base_model: deepseek-ai/DeepSeek-V4-Flash
base_model_relation: quantized
pipeline_tag: text-generation
tags:
  - mlx
  - jang
  - jangtq
  - jangtq2
  - jangtq-prestack
  - mxtq
  - deepseek
  - deepseek-v4
  - deepseek-v4-flash
  - moe
  - mla
  - hash-layers
  - mtp
  - apple-silicon
  - osaurus
---

<p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p>

# DeepSeek-V4-Flash-JANGTQ2

**DeepSeek-V4-Flash at 79.6 GB on disk** (down from the 149 GB FP4+FP8
source): uniform **2-bit JANGTQ** quantization on the routed experts,
8-bit affine on the other quantized weights, and the MTP head preserved.

- **Source:** [deepseek-ai/DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash)
  (43 transformer layers + 1 MTP head, **256 routed experts top-6 + 1
  shared expert**, **3 hash layers**, MLA + mHC residuals, ~284 B total)
- **Quantization:** uniform **2-bit MXTQ** on routed-expert MLP +
  8-bit affine on attention (`wq_a/wq_b/wkv/wo_a/wo_b`) / shared
  expert / Compressor / Indexer / embed / lm_head / MTP. RMSNorms,
  router gate, mHC fn matrices, attn_sink, ape stay fp16/fp32
  passthrough.
- **Variant:** `std` (preserves MTP layer 43; one-token-per-forward
  until a JANG runtime ships the accept/reject speculative-decode loop).
  The companion `DeepSeek-V4-Flash-JANGTQ-K` variant drops MTP for a
  smaller bundle.
- **Routed-expert layout:** **pre-stacked** along axis 0 under
  `ffn.experts.switch_mlp.{gate_proj, up_proj, down_proj}` per the
  JANGTQ-PRESTACK STANDARD. Sidecar `jangtq_runtime.safetensors`
  (~24 KB) ships both `(in=2048, bits=2)` and `(in=4096, bits=2)`
  codebooks + sign-flip vectors for Swift runtimes.
- **Bundle size:** **~79.6 GB on-disk**
- **Runs on:** M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+

## Why top-6 + 2-bit holds

DSV4-Flash routes each token through **6 of 256 experts**, plus 1 always-on
shared expert and 3 hash layers, so the per-token output averages
codebook noise across 7+ pathways. That is far more forgiving than a
top-1 architecture, where every token rides a single expert's
quantization error. MiniMax (top-2) and Hy3-preview (top-8) both
ship coherent uniform JANGTQ2; DSV4's top-6 sits between them.
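
The averaging argument can be illustrated with a toy simulation. This is a minimal sketch, not a measurement of the model: it assumes each pathway contributes independent zero-mean Gaussian noise of equal size and that pathways are combined with equal weights, whereas real router weights are unequal and quantization errors may be correlated. The function name `combined_noise_std` and the trial count are illustrative.

```python
import random
import statistics


def combined_noise_std(num_pathways: int, trials: int = 20_000, seed: int = 0) -> float:
    """Empirical std of the mean of `num_pathways` independent unit-variance noises."""
    rng = random.Random(seed)
    samples = [
        sum(rng.gauss(0.0, 1.0) for _ in range(num_pathways)) / num_pathways
        for _ in range(trials)
    ]
    return statistics.pstdev(samples)


if __name__ == "__main__":
    # 1 = top-1 routing, 2 = top-2, 7 = top-6 plus one shared expert.
    for k in (1, 2, 7):
        # Under these assumptions the std shrinks roughly like 1 / sqrt(k).
        print(f"pathways={k}: noise std ~ {combined_noise_std(k):.3f}")
```

Under these idealized assumptions, seven pathways leave roughly 38% of the single-pathway noise; the real reduction depends on how correlated the per-expert errors are.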

## Loading (Python)

```bash
pip install jang-tools mlx-lm
```

```python
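from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

# Load the quantized bundle and its tokenizer.
model, tokenizer = load_jangtq_model("OsaurusAI/DeepSeek-V4-Flash-JANGTQ2")

# Render the chat template to a prompt string.
chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2? Answer briefly."}],
    tokenize=False,
    add_generation_prompt=True,
)

# Assumption: the loaded model works with mlx-lm's standard generate helper.
print(generate(model, tokenizer, prompt=chat, max_tokens=64))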
```

`load_jangtq_model` auto-registers `model_type=deepseek_v4` via
`jang_tools.dsv4` before building the MLX skeleton. The loader applies
the DSV4-specific MLA absorb + fp32 SDPA + mHC + Compressor + Indexer
patches automatically.

## Runtime support matrix

| Surface | Status |
|---|---|
| `jang-tools` Python (`load_jangtq_model`) | ✅ working |
| `vmlx-swift-lm` Swift | ✅ working — `DeepseekV4JANGTQ` family path |
| MTP speculative decode | preserved-disabled — weights present (variant=std); accept/reject loop not yet in any JANG runtime |

## Validated runtime contract

- 43 transformer layers + 1 MTP head materialize; 40 sparse-MoE layers
  hydrate routed experts via TurboQuantLinear (2-bit MXTQ).
- 33,792 MXTQ tensors / 522 affine / 706 passthrough.
- Capabilities: `family=deepseek_v4`, `reasoning_parser=deepseek_r1`,
  `tool_parser=dsml`, `think_in_template=True`, `cache_type=mla`.

## Reasoning + tools

- **Reasoning parser:** `deepseek_r1`
- **Tool parser:** `dsml` (DeepSeek Markup Language — distinct from
  `deepseek_tool_parser`; see `~/jang/research/DSV4-EVAL-NUANCES.md`)
- **Reasoning template:** `<|thinking_begin|>...<|thinking_end|>` blocks
  via `enable_thinking=True` (default off — pass-through chat mode).
  Greedy `T=0` with `enable_thinking=True` collapses into repetition on
  DSV4; use `T=0.6` for pass@1 like the original DeepSeek release.
- **Cache:** `mla` (Multi-head Latent Attention with kv_lora_rank=512)
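
For clients that post-process raw decoded text, the reasoning block can be separated from the final answer with plain string handling. The sketch below assumes the `<|thinking_begin|>` and `<|thinking_end|>` delimiters appear literally in the decoded output and that there is at most one reasoning block; `split_reasoning` is a hypothetical helper, not the `deepseek_r1` parser shipped by any runtime.

```python
THINK_BEGIN = "<|thinking_begin|>"
THINK_END = "<|thinking_end|>"


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no closing tag is found."""
    head, sep, tail = text.partition(THINK_END)
    if not sep:
        # No reasoning block, e.g. enable_thinking was left off.
        return "", text.strip()
    # The opening tag may be absent if the chat template already emitted it.
    reasoning = head.replace(THINK_BEGIN, "", 1).strip()
    return reasoning, tail.strip()


if __name__ == "__main__":
    raw = "<|thinking_begin|>2 + 2 is 4.<|thinking_end|>4"
    print(split_reasoning(raw))
```

Runtimes that already apply the `deepseek_r1` reasoning parser should not need this; it is only a fallback for raw text.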

## Credits

- **Quantization + MLX runtime:** Jinho Jang (eric@osaurus.ai)
- **Source model:** DeepSeek AI
- **License:** MIT, inherited from upstream