---
license: mit
license_name: deepseek-license
library_name: mlx
base_model: deepseek-ai/DeepSeek-V4-Flash
base_model_relation: quantized
pipeline_tag: text-generation
tags:
- mlx
- jang
- jangtq
- jangtq2
- jangtq-prestack
- mxtq
- deepseek
- deepseek-v4
- deepseek-v4-flash
- moe
- mla
- hash-layers
- mtp
- apple-silicon
- osaurus
---

# DeepSeek-V4-Flash-JANGTQ2
**DeepSeek-V4-Flash at 79.6 GB on disk** (down from the 149 GB FP4+FP8 source):
uniform **2-bit JANGTQ** quantization on the routed experts, 8-bit affine on
the other quantized weights, and a preserved MTP head.
- **Source:** [deepseek-ai/DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash)
(43 transformer layers + 1 MTP head, **256 routed experts top-6 + 1
shared expert**, **3 hash layers**, MLA + mHC residuals, ~284 B total)
- **Quantization:** uniform **2-bit MXTQ** on routed-expert MLP +
8-bit affine on attention (`wq_a/wq_b/wkv/wo_a/wo_b`) / shared
expert / Compressor / Indexer / embed / lm_head / MTP. RMSNorms,
router gate, mHC fn matrices, attn_sink, ape stay fp16/fp32
passthrough.
- **Variant:** `std` (preserves MTP layer 43; decoding stays at one token
per forward pass until a JANG runtime ships the accept/reject
speculative-decode loop). The companion `DeepSeek-V4-Flash-JANGTQ-K`
variant drops MTP for a smaller bundle.
- **Routed-expert layout:** **pre-stacked** along axis 0 under
`ffn.experts.switch_mlp.{gate_proj, up_proj, down_proj}` per the
JANGTQ-PRESTACK STANDARD. Sidecar `jangtq_runtime.safetensors`
(~24 KB) ships both `(in=2048, bits=2)` and `(in=4096, bits=2)`
codebooks + sign-flip vectors for Swift runtimes.
- **Bundle size:** **~79.6 GB on-disk**
- **Runs on:** M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+
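
As a rough sanity check on the sizes above, the average storage cost per parameter can be estimated from the bundle size and the ~284 B parameter count. The sketch below assumes 1 GB = 10^9 bytes and treats the parameter count as exact; the helper name `bits_per_param` is illustrative and not part of `jang-tools`.

```python
def bits_per_param(size_gb: float, params_billion: float) -> float:
    """Average bits stored per parameter (assumes 1 GB = 1e9 bytes)."""
    size_bits = size_gb * 1e9 * 8
    return size_bits / (params_billion * 1e9)

quantized = bits_per_param(79.6, 284)   # this bundle
source = bits_per_param(149.0, 284)     # FP4+FP8 source checkpoint
print(f"quantized: {quantized:.2f} bits/param")
print(f"source:    {source:.2f} bits/param")
print(f"size ratio: {79.6 / 149.0:.2f}")
```

Under these assumptions the bundle lands a little above 2 bits per parameter, which is what one would expect if the 2-bit routed experts dominate the parameter count and the 8-bit and passthrough tensors add a modest overhead.
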
## Why top-6 + 2-bit holds
DSV4-Flash routes each token through **6 of 256 experts**, plus 1 always-on
shared expert and 3 hash layers, so per-token output averages
codebook noise across 7+ pathways. That is far more forgiving than a
top-1 architecture, where every token carries a single expert's
quantization error. MiniMax (top-2) and Hy3-preview (top-8) both
ship coherent uniform JANGTQ2; DSV4 sits between them.
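
The averaging argument can be made concrete with a toy model. The sketch below assumes each active pathway contributes independent, zero-mean quantization noise with the same standard deviation, and that pathway outputs are combined with weights that sum to 1. Real router weights and error correlations will differ, and `combined_noise_std` is an illustrative helper, not part of any JANG runtime.

```python
import math

def combined_noise_std(weights, sigma=1.0):
    """Std of a weighted sum of independent noise terms, each with std `sigma`."""
    total = sum(weights)
    normalized = [w / total for w in weights]  # force weights to sum to 1
    return sigma * math.sqrt(sum(w * w for w in normalized))

top1 = combined_noise_std([1.0])        # one expert carries all the error
top6 = combined_noise_std([1.0] * 6)    # six equally weighted experts
print(f"top-1 noise std: {top1:.3f}")
print(f"top-6 noise std: {top6:.3f}")
```

With equal weights the noise falls as 1/sqrt(k), so six pathways carry roughly 0.41 of the single-expert noise under these assumptions.
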
## Loading (Python)
```bash
pip install jang-tools mlx-lm
```
```python
from jang_tools.load_jangtq import load_jangtq_model

# Load the quantized model and its tokenizer.
model, tokenizer = load_jangtq_model("OsaurusAI/DeepSeek-V4-Flash-JANGTQ2")

# Render a single-turn chat into a prompt string ready for generation.
chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2? Answer briefly."}],
    tokenize=False,
    add_generation_prompt=True,
)
```
`load_jangtq_model` auto-registers `model_type=deepseek_v4` via
`jang_tools.dsv4` before building the MLX skeleton. The loader applies
the DSV4-specific MLA absorb + fp32 SDPA + mHC + Compressor + Indexer
patches automatically.
## Runtime support matrix
| Surface | Status |
|---|---|
| `jang-tools` Python (`load_jangtq_model`) | ✅ working |
| `vmlx-swift-lm` Swift | ✅ working — `DeepseekV4JANGTQ` family path |
| MTP speculative decode | preserved-disabled — weights present (variant=std); accept/reject loop not yet in any JANG runtime |
## Validated runtime contract
- 43 transformer layers + 1 MTP head materialize; 40 sparse-MoE layers
hydrate routed experts via TurboQuantLinear (2-bit MXTQ).
- 33,792 MXTQ tensors / 522 affine / 706 passthrough.
- Capabilities: `family=deepseek_v4`, `reasoning_parser=deepseek_r1`,
`tool_parser=dsml`, `think_in_template=True`, `cache_type=mla`.
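
The MXTQ tensor count can be cross-checked against the architecture numbers quoted earlier. One reading that reproduces it exactly, offered as an assumption rather than a statement about how the converter counts tensors, is one tensor per routed expert per projection across all 44 layers (43 transformer layers plus the MTP head).

```python
ROUTED_EXPERTS = 256   # routed experts per layer (from the model card)
PROJECTIONS = 3        # gate_proj, up_proj, down_proj
LAYERS = 43 + 1        # transformer layers + MTP head (assumed to all carry experts)

expected_mxtq = ROUTED_EXPERTS * PROJECTIONS * LAYERS
reported = {"mxtq": 33_792, "affine": 522, "passthrough": 706}

print("expected MXTQ tensors:", expected_mxtq)
print("matches reported count:", expected_mxtq == reported["mxtq"])
print("total tensors reported:", sum(reported.values()))
```
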
## Reasoning + tools
- **Reasoning parser:** `deepseek_r1`
- **Tool parser:** `dsml` (DeepSeek Markup Language — distinct from
`deepseek_tool_parser`; see `~/jang/research/DSV4-EVAL-NUANCES.md`)
- **Reasoning template:** `<|thinking_begin|>...<|thinking_end|>` blocks
via `enable_thinking=True` (off by default, which gives pass-through chat
mode). With `enable_thinking=True`, greedy decoding (`T=0`) collapses into
repetition on DSV4; sample at `T=0.6` for pass@1, as in the original
DeepSeek release.
- **Cache:** `mla` (Multi-head Latent Attention with kv_lora_rank=512)
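
For clients that consume raw text rather than going through the `deepseek_r1` parser, the thinking block can be separated from the final answer with plain string handling. The sketch below is a hypothetical helper that assumes the delimiters appear literally in the decoded text as `<|thinking_begin|>` and `<|thinking_end|>`, and that there is at most one such block. It is not the runtime's parser.

```python
BEGIN = "<|thinking_begin|>"
END = "<|thinking_end|>"

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no complete block exists."""
    start = text.find(BEGIN)
    end = text.find(END, start + len(BEGIN)) if start != -1 else -1
    if start == -1 or end == -1:
        return "", text.strip()
    reasoning = text[start + len(BEGIN):end].strip()
    answer = (text[:start] + text[end + len(END):]).strip()
    return reasoning, answer

sample = "<|thinking_begin|>2 + 2 is basic arithmetic.<|thinking_end|>4"
reasoning, answer = split_reasoning(sample)
print(repr(reasoning))
print(repr(answer))
```

Text without a complete block is returned unchanged as the answer, so the same helper can be used whether or not `enable_thinking` is set.
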
## Credits
- **Quantization + MLX runtime:** Jinho Jang (eric@osaurus.ai)
- **Source model:** DeepSeek AI
- **License:** MIT, inherited from upstream