# Ling 2.6 Flash — MXFP4 + CRACK

MXFP4 quantized | CRACK abliterated | Hybrid MLA + Linear-Attn MoE | EN + ZH | 63 GB
## What Is This?
This is Ling 2.6 Flash by inclusionAI — a 35B-parameter Mixture-of-Experts model with 256 routed experts (8 active per token) + 1 always-active shared expert, hybrid MLA + Lightning Linear-Attention architecture, native English + Chinese, 131K context.
It has been:
- MXFP4 quantized — uniform 4-bit affine, group_size=32 — 63 GB (see the sketch below)
- CRACK abliterated — permanent weight-level removal of safety refusal behavior
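For reference, here is a minimal sketch of the group-wise 4-bit affine scheme described above. This is illustrative only: mlx_lm handles quantization internally, and the actual MXFP4 kernels may differ in storage layout and rounding details.

```python
import numpy as np

def quantize_affine_4bit(w: np.ndarray, group_size: int = 32):
    """Group-wise 4-bit affine quantization: one scale and zero point
    per group of `group_size` weights. Illustrative sketch only; the
    real MXFP4 kernels differ in layout and rounding details."""
    groups = w.reshape(-1, group_size)           # assumes w.size % 32 == 0
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / 15.0               # 4 bits -> levels 0..15
    scale = np.where(scale == 0, 1.0, scale)     # guard flat groups
    q = np.clip(np.round((groups - w_min) / scale), 0, 15).astype(np.uint8)
    return q, scale, w_min

def dequantize(q, scale, w_min):
    return q.astype(np.float32) * scale + w_min

w = np.random.randn(4096).astype(np.float32)
q, s, m = quantize_affine_4bit(w)
err = np.abs(dequantize(q, s, m).reshape(-1) - w).max()
print(f"max abs reconstruction error: {err:.4f}")
```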
| | |
|---|---|
| Base model | inclusionAI/Ling-2.6-flash (35B total, 1 shared + 8 routed experts active per token) |
| Architecture | bailing_hybrid — Multi-Latent Attention (MLA) every 8th layer, Lightning Linear-Attn elsewhere |
| Quantization | MXFP4 (4-bit affine, group_size=32) — 63 GB |
| MMLU-200 | 78.5% (base 80.0%, −1.5pp) |
| HarmBench-320 | 97.8% comply (base 50.3%, +47.5pp) |
| Context | 131,072 tokens native |
| Languages | English + Chinese (bilingual refusal probing) |
| Speed | 30+ tok/s on an M4 Max 128 GB |
| Fits on | 96 GB+ Macs |
## MMLU-200 Results (thinking OFF)
| Model | Correct | Accuracy | No-match |
|---|---|---|---|
| MXFP4 Base | 160/200 | 80.00% | 6 |
| MXFP4 + CRACK | 157/200 | 78.50% | 10 |
| Δ | −3 | −1.5pp | +4 |
CRACK delta of −1.5pp is well inside the noise floor for a 200-question sample — capability essentially preserved.
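To make the noise-floor claim concrete, the 1σ sampling error of a 200-question accuracy estimate can be computed directly (a quick check, assuming a simple binomial model):

```python
import math

n, p = 200, 0.80                     # questions, base accuracy
se = math.sqrt(p * (1 - p) / n)      # binomial standard error of accuracy
print(f"1-sigma noise floor: ±{se * 100:.1f}pp")  # ≈ ±2.8pp vs the observed −1.5pp
```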
## HarmBench-320 Results
| Model | COMPLY | REFUSE | EMPTY |
|---|---|---|---|
| MXFP4 Base | 161 (50.3%) | 157 (49.1%) | 2 (0.6%) |
| MXFP4 + CRACK | 313 (97.8%) | 5 (1.6%) | 2 (0.6%) |
| Δ | +152 (+47.5pp) | −152 (−47.5pp) | 0 |
Directional refusal removal lifts HarmBench compliance from ~50% to ~98% with negligible MMLU regression.
## Ling 2.6 Flash CRACK Series
| Model | Format | Size | MMLU-200 | HarmBench-320 | Fits on |
|---|---|---|---|---|---|
| MXFP4 + CRACK (this model) | affine 4-bit g=32 | 63 GB | 78.5% | 97.8% | 96 GB Mac |
| JANGTQ2 + CRACK | TurboQuant 2-bit experts + 8-bit affine | 29 GB | 81.0% | 100.0% | 48 GB Mac |
The JANGTQ2 variant is smaller and scores higher on both benchmarks — quant noise on the 2-bit routed experts ends up helping rather than hurting.
## Usage
```python
from mlx_lm import load, generate

# Download and load the quantized, abliterated weights from the Hub
model, tokenizer = load("dealignai/Ling-2.6-flash-MXFP4-CRACK")

messages = [{"role": "user", "content": "Hello — what can you do?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

print(generate(model, tokenizer, prompt=prompt, max_tokens=400, verbose=True))
```
Requires `mlx_lm >= 0.20` with the `bailing_hybrid` model class.
## About This Model
Ling 2.6 Flash is the latency-tier sibling in the Ling 2.6 family, built for fast multilingual instruction following and tool use. The chat template includes a `<think>...</think>` reasoning block, but in practice this Flash variant is best treated as a non-reasoning instruct model: leave thinking OFF (the default) for benchmark-style work and short-form responses. For chain-of-thought reasoning, prefer the larger Ling 2.6 Plus / Ring / Pro tier.
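If you want to experiment with the reasoning block anyway, extra keyword arguments to `apply_chat_template` are forwarded to the chat template. Whether Ling's template reads an `enable_thinking` flag (as Qwen-style templates do) is an assumption here, so inspect `tokenizer.chat_template` first:

```python
# Hypothetical: assumes the template honors an `enable_thinking` kwarg,
# as Qwen-style templates do. Verify against tokenizer.chat_template.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # thinking is OFF by default on this Flash variant
)
```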
CRACK is a permanent weight-level abliteration that removes safety refusal from the always-active residual-stream writers. Multilingual (EN + ZH) refusal-direction extraction means the model complies on both English and Chinese prompts. The MoE routing and expert internals are untouched, and there is no vision tower to preserve: Ling 2.6 Flash is text-only.
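As a rough illustration of what weight-level directional removal means (a sketch of the general abliteration idea, not the exact CRACK procedure, whose direction extraction and layer selection are dealignai's own research): each matrix that writes into the residual stream has its component along a unit refusal direction r̂ projected out.

```python
import numpy as np

def ablate_direction(w_out: np.ndarray, r_hat: np.ndarray) -> np.ndarray:
    """Project a refusal direction out of a matrix that writes into the
    residual stream: W := (I - r r^T) W. Illustrative only; the CRACK
    pipeline's direction extraction and EN+ZH probing are not shown."""
    r_hat = r_hat / np.linalg.norm(r_hat)            # ensure unit norm
    return w_out - np.outer(r_hat, r_hat @ w_out)    # remove the r̂ component

d_model, d_in = 8, 4                                 # toy sizes for the demo
rng = np.random.default_rng(0)
W = rng.standard_normal((d_model, d_in))
r = rng.standard_normal(d_model)
W_ablated = ablate_direction(W, r)
print(np.allclose(r @ W_ablated, 0))                 # True: no output along r̂
```

Since only the residual-stream writers are edited in this scheme, the MoE router and expert weights stay untouched, consistent with the description above.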
## Support dealignai
All models are built from original research and published for free.
Support us on Ko-fi — check out the Ko-fi membership for early access and extras.
## Disclaimer
This model has had its safety refusal circuits removed. It will produce responses that would normally be refused, including technical content on security testing, dual-use research, and sensitive topics. You are responsible for how you use it.
The CRACK abliteration process does not add new capabilities — it only removes the model's learned refusal patterns. All knowledge, including the knowledge used to produce unsafe outputs, was already present in the base Ling model.