---
license: mit
license_name: deepseek-license
library_name: mlx
base_model: deepseek-ai/DeepSeek-V4-Flash
base_model_relation: quantized
pipeline_tag: text-generation
tags:
- mlx
- jang
- jangtq
- jangtq2
- jangtq-prestack
- mxtq
- deepseek
- deepseek-v4
- deepseek-v4-flash
- moe
- mla
- hash-layers
- mtp
- apple-silicon
- osaurus
---
<p align="center"><img src="osaurus-x-banner.png" width="100%" alt="OsaurusAI"/></p>
# DeepSeek-V4-Flash-JANGTQ2
**DeepSeek-V4-Flash at 79.6 GB on disk** (down from the 149 GB FP4+FP8 source):
uniform **2-bit JANGTQ** quantization on the routed experts, 8-bit affine on
the remaining quantized weights, and a preserved MTP head.
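
As a rough sanity check on these sizes, the arithmetic below derives the size ratio and the average storage cost per parameter. It assumes that "GB" means 10^9 bytes and that the "~284 B total" quoted for the source model in the list below is a parameter count; both are assumptions, and the variable names are illustrative only.

```python
# Figures quoted in this card; units are assumed to be decimal GB.
BUNDLE_GB = 79.6        # quantized bundle on disk
SOURCE_GB = 149.0       # FP4+FP8 source checkpoint
TOTAL_PARAMS_B = 284.0  # assumption: "~284 B total" means billions of parameters

# Fraction of the source size that the bundle occupies.
size_ratio = BUNDLE_GB / SOURCE_GB

# GB * 8 gives gigabits; dividing by billions of parameters gives bits per parameter.
bits_per_param = BUNDLE_GB * 8 / TOTAL_PARAMS_B

print(f"bundle is {size_ratio:.1%} of the source size")
print(f"average of about {bits_per_param:.2f} bits per parameter")
```

An average a little above 2 bits per parameter is what one would expect if most parameters sit in the 2-bit routed experts, with the 8-bit and passthrough tensors plus quantization metadata making up the rest.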
- **Source:** [deepseek-ai/DeepSeek-V4-Flash](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash)
(43 transformer layers + 1 MTP head, **256 routed experts top-6 + 1
shared expert**, **3 hash layers**, MLA + mHC residuals, ~284 B total)
- **Quantization:** uniform **2-bit MXTQ** on routed-expert MLP +
8-bit affine on attention (`wq_a/wq_b/wkv/wo_a/wo_b`) / shared
expert / Compressor / Indexer / embed / lm_head / MTP. RMSNorms,
router gate, mHC fn matrices, attn_sink, ape stay fp16/fp32
passthrough.
- **Variant:** `std` (preserves MTP layer 43; one-token-per-forward
until a JANG runtime ships the accept/reject speculative-decode loop).
The companion `DeepSeek-V4-Flash-JANGTQ-K` variant drops MTP for a
smaller bundle.
- **Routed-expert layout:** **pre-stacked** along axis 0 under
`ffn.experts.switch_mlp.{gate_proj, up_proj, down_proj}` per the
JANGTQ-PRESTACK STANDARD. Sidecar `jangtq_runtime.safetensors`
(~24 KB) ships both `(in=2048, bits=2)` and `(in=4096, bits=2)`
codebooks + sign-flip vectors for Swift runtimes.
- **Bundle size:** **~79.6 GB on-disk**
- **Runs on:** M4 Max 128 GB / M5 Max 128 GB / Mac Studio 192 GB+
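
The "pre-stacked along axis 0" layout in the list above can be illustrated with a toy example. This is a sketch of the general idea only, not the JANGTQ-PRESTACK format itself: the key names, the sizes, and the helper functions are hypothetical, and real bundles hold packed quantized tensors rather than Python lists.

```python
import random

random.seed(0)
NUM_EXPERTS, D_OUT, D_IN = 8, 3, 4  # toy sizes, not the model's dimensions


def rand_matrix(rows, cols):
    """Build a rows x cols matrix of Gaussian values as nested lists."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, x):
    """Multiply a nested-list matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]


# Unstacked layout: one entry per expert (hypothetical key names).
unstacked = {
    f"experts.{e}.up_proj": rand_matrix(D_OUT, D_IN) for e in range(NUM_EXPERTS)
}

# Pre-stacked layout: a single array indexed [expert][out][in],
# so the expert index is axis 0.
stacked_up_proj = [unstacked[f"experts.{e}.up_proj"] for e in range(NUM_EXPERTS)]


def apply_selected(stacked, expert_ids, x):
    """Run only the routed experts by indexing axis 0 of the stacked array."""
    return [matvec(stacked[e], x) for e in expert_ids]


x = [1.0, -0.5, 0.25, 2.0]
outputs = apply_selected(stacked_up_proj, [5, 1], x)
```

With the experts stacked, a runtime can select the routed experts with one index operation on a single tensor instead of looking up hundreds of separately named tensors per layer.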
## Why top-6 + 2-bit holds
DSV4-Flash routes through **6 of 256 experts per token** plus 1 always-on
shared expert and 3 hash layers — so per-token output averages
codebook noise across 7+ pathways. That's a much weaker quality
constraint than top-1 architectures (where every token rides a single
expert's quant error). MiniMax (top-2) and Hy3-preview (top-8) both
ship coherent uniform JANGTQ2; DSV4 sits between them.
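
The averaging argument can be made concrete with a back-of-the-envelope calculation. The sketch below assumes each routed expert contributes independent, zero-mean quantization error with the same variance and that routing weights are normalized to sum to 1; under those assumptions the combined error's standard deviation scales with the square root of the sum of squared weights. It ignores the shared expert and the hash layers, and the function name is illustrative.

```python
import math


def relative_noise(weights):
    """Std of the combined error relative to one expert's error std.

    Assumes independent, zero-mean, equal-variance errors per expert.
    Weights are normalized to sum to 1 before combining.
    """
    total = sum(weights)
    normalized = [w / total for w in weights]
    return math.sqrt(sum(w * w for w in normalized))


top1 = relative_noise([1.0])                # a single expert carries all the error
top6_equal = relative_noise([1.0] * 6)      # six equally weighted experts
top6_skewed = relative_noise([0.5, 0.2, 0.1, 0.1, 0.05, 0.05])  # made-up weights

print(f"top-1: {top1:.3f}, top-6 equal: {top6_equal:.3f}, top-6 skewed: {top6_skewed:.3f}")
```

Equal weighting over six experts gives a factor of 1/sqrt(6), while skewed routing weights recover less of the benefit, so the real gain depends on how evenly the router spreads its weights.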
## Loading (Python)
```bash
pip install jang-tools mlx-lm
```
```python
from jang_tools.load_jangtq import load_jangtq_model
model, tokenizer = load_jangtq_model("OsaurusAI/DeepSeek-V4-Flash-JANGTQ2")
# Build the prompt string with the model's chat template.
chat = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 2 + 2? Answer briefly."}],
    tokenize=False,
    add_generation_prompt=True,
)
```
`load_jangtq_model` auto-registers `model_type=deepseek_v4` via
`jang_tools.dsv4` before building the MLX skeleton. The loader applies
the DSV4-specific MLA absorb + fp32 SDPA + mHC + Compressor + Indexer
patches automatically.
## Runtime support matrix
| Surface | Status |
|---|---|
| `jang-tools` Python (`load_jangtq_model`) | ✅ working |
| `vmlx-swift-lm` Swift | ✅ working — `DeepseekV4JANGTQ` family path |
| MTP speculative decode | preserved-disabled — weights present (variant=std); accept/reject loop not yet in any JANG runtime |
## Validated runtime contract
- 43 transformer layers + 1 MTP head materialize; 40 sparse-MoE layers
hydrate routed experts via TurboQuantLinear (2-bit MXTQ).
- 33,792 MXTQ tensors / 522 affine / 706 passthrough.
- Capabilities: `family=deepseek_v4`, `reasoning_parser=deepseek_r1`,
`tool_parser=dsml`, `think_in_template=True`, `cache_type=mla`.
## Reasoning + tools
- **Reasoning parser:** `deepseek_r1`
- **Tool parser:** `dsml` (DeepSeek Markup Language — distinct from
`deepseek_tool_parser`; see `~/jang/research/DSV4-EVAL-NUANCES.md`)
- **Reasoning template:** `<|thinking_begin|>...<|thinking_end|>` blocks
via `enable_thinking=True` (default off — pass-through chat mode).
Greedy `T=0` with `enable_thinking=True` collapses into repetition on
DSV4; use `T=0.6` for pass@1 like the original DeepSeek release.
- **Cache:** `mla` (Multi-head Latent Attention with kv_lora_rank=512)
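
When thinking is enabled, client code usually wants the reasoning separated from the final answer. The sketch below is a minimal illustration that assumes the `<|thinking_begin|>` and `<|thinking_end|>` markers appear literally in the decoded text; `split_thinking` is a hypothetical helper, not the `deepseek_r1` parser shipped by any runtime, and it does not handle unterminated blocks.

```python
import re

# Marker strings taken from the reasoning template described above.
_THINK = re.compile(r"<\|thinking_begin\|>(.*?)<\|thinking_end\|>", re.DOTALL)


def split_thinking(text):
    """Return (reasoning, answer) from decoded model output.

    Reasoning is the concatenation of all thinking blocks; the answer is
    whatever remains once those blocks are removed.
    """
    reasoning = "\n".join(block.strip() for block in _THINK.findall(text))
    answer = _THINK.sub("", text).strip()
    return reasoning, answer


reasoning, answer = split_thinking(
    "<|thinking_begin|>2 + 2 = 4<|thinking_end|>\n4"
)
plain_reasoning, plain_answer = split_thinking("4")
```

Output without any thinking block comes back with an empty reasoning string, which matches the default pass-through chat mode.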
## Credits
- **Quantization + MLX runtime:** Jinho Jang (eric@osaurus.ai)
- **Source model:** DeepSeek AI
- **License:** MIT, inherited from upstream