Qwen3.5-9B-abliterated-DFlash
Paper | GitHub | Blog | Base Draft
DFlash is a speculative decoding method that uses a lightweight block diffusion model to draft multiple tokens in parallel, achieving up to 4.4× speedup over autoregressive decoding. This is the drafter model, which must be paired with lukey03/Qwen3.5-9B-abliterated-MLX-4bit.
Why this model exists: The original z-lab/Qwen3.5-9B-DFlash draft was trained against the unmodified Qwen3.5-9B weights. Abliteration shifts the model's hidden-state distribution, which reduces draft acceptance rates. This variant is fine-tuned directly on activations from the abliterated model, restoring calibration and maximising accepted tokens per draft round.
Architecture
The drafter is a compact 5-layer Qwen3 transformer (32 attention heads, hidden size 4096) that operates in parallel over a block of 16 positions: one anchor token followed by 15 masked slots. At each decoding step it:
- Receives the concatenated hidden states of the target model at layers `[1, 8, 15, 22, 29]`
- Embeds a block `[last_token, <mask> × 15]` using the target model's shared embedding table
- Runs a single non-causal forward pass and proposes 15 draft tokens simultaneously
- The full target model verifies the block in one pass — accepted tokens are free; rejected tokens trigger a fallback to the target's own sample
This is lossless — the output distribution is identical to standard autoregressive sampling from the target model.
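The accept/reject step above can be sketched in a few lines. This is an illustrative greedy-decoding version written for this card, not the `dflash` library's API (the lossless guarantee for sampled decoding uses a rejection-sampling rule that is more involved):

```python
# Illustrative sketch of greedy block verification (not the dflash API).
# The draft proposes a block of tokens; the target scores the whole block in
# one forward pass. Tokens are accepted while they match the target's own
# choice; the first mismatch is replaced by the target's token and the rest
# of the draft is discarded.
def verify_block(draft_tokens, target_tokens):
    """draft_tokens: tokens proposed by the drafter.
    target_tokens: the target model's greedy token at each position, from
    the same single verification pass.
    Returns the tokens actually emitted this round."""
    emitted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted == wanted:
            emitted.append(drafted)  # free token: no extra target pass needed
        else:
            emitted.append(wanted)   # fall back to the target's own token
            break
    return emitted

# First three drafts match the target, the fourth does not:
print(verify_block([5, 9, 2, 7, 1], [5, 9, 2, 4, 1]))  # [5, 9, 2, 4]
```

Because every emitted token is either identical to the target's choice or taken directly from the target, the output never deviates from what the target model alone would produce.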
| Property | Value |
|---|---|
| Draft layers | 5 |
| Block size | 16 |
| Target layers tapped | 1, 8, 15, 22, 29 |
| Mask token id | 248070 |
| Parameters (draft only) | ~340M |
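The "target layers tapped" row means the drafter conditions on hidden states from five points in the target network, concatenated along the feature axis. A minimal NumPy sketch of that shape arithmetic, using the sizes from the table (the per-layer arrays here are zero-filled stand-ins, not real activations):

```python
import numpy as np

# Sizes from the table above: hidden size 4096, tapped layers [1, 8, 15, 22, 29].
target_layer_ids = (1, 8, 15, 22, 29)
hidden_size = 4096
seq_len = 3

# Stand-in for the target model's per-layer hidden states:
# one (seq_len, hidden_size) array per transformer layer.
all_hidden = {i: np.zeros((seq_len, hidden_size)) for i in range(30)}

# Concatenate the tapped layers along the feature axis to form the
# drafter's conditioning input.
draft_input = np.concatenate([all_hidden[i] for i in target_layer_ids], axis=-1)
print(draft_input.shape)  # (3, 20480) = (seq_len, 5 * hidden_size)
```

So the drafter sees a 5 × 4096 = 20480-dimensional feature per position, which is why it must be trained against the specific target whose activations it consumes.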
Quick Start
Installation
```shell
pip install "dflash[mlx] @ git+https://github.com/z-lab/dflash.git"
```
Apple Silicon / MLX
```python
import json
from pathlib import Path

import mlx.core as mx
from huggingface_hub import snapshot_download

from dflash.model_mlx import load, stream_generate, DFlashConfig, DFlashDraftModel

# ── target model ──────────────────────────────────────────────────────────────
model, tokenizer = load("lukey03/Qwen3.5-9B-abliterated-MLX-4bit")

# ── draft model ───────────────────────────────────────────────────────────────
def load_draft(repo_id: str) -> DFlashDraftModel:
    path = Path(snapshot_download(repo_id, allow_patterns=["*.safetensors", "*.json"]))
    cfg = json.loads((path / "config.json").read_text())
    config = DFlashConfig(
        hidden_size=cfg["hidden_size"],
        num_hidden_layers=cfg["num_hidden_layers"],
        num_attention_heads=cfg["num_attention_heads"],
        num_key_value_heads=cfg["num_key_value_heads"],
        head_dim=cfg["head_dim"],
        intermediate_size=cfg["intermediate_size"],
        vocab_size=cfg["vocab_size"],
        rms_norm_eps=cfg["rms_norm_eps"],
        rope_theta=cfg["rope_theta"],
        max_position_embeddings=cfg["max_position_embeddings"],
        block_size=cfg["block_size"],
        target_layer_ids=tuple(cfg["dflash_config"]["target_layer_ids"]),
        num_target_layers=cfg["num_target_layers"],
        mask_token_id=cfg["dflash_config"]["mask_token_id"],
    )
    weights = {k: v for f in path.glob("*.safetensors") for k, v in mx.load(str(f)).items()}
    m = DFlashDraftModel(config)
    m.load_weights(list(weights.items()))
    return m

draft = load_draft("guglxni/Qwen3.5-9B-abliterated-DFlash")

# ── generate ──────────────────────────────────────────────────────────────────
messages = [{"role": "user", "content": "Write a quicksort in Python."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

tps = 0.0
for r in stream_generate(model, draft, tokenizer, prompt,
                         block_size=16, max_tokens=2048, temperature=0.6):
    print(r.text, end="", flush=True)
    tps = r.generation_tps
print(f"\n\nThroughput: {tps:.1f} tok/s")
```
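Because the drafter is calibrated to one specific target, pairing it with a different model silently degrades (or breaks) acceptance. A quick sanity check on the two `config.json` files can catch obvious mismatches; `check_pairing` is a hypothetical helper written for this card, not part of the `dflash` API, and the numeric values below are dummies except hidden size 4096 and the tapped layer ids, which come from the table above:

```python
def check_pairing(draft_cfg: dict, target_cfg: dict) -> list[str]:
    """Return a list of mismatch descriptions (empty means compatible).
    Hypothetical helper: field names follow the config.json keys used in
    the loading snippet above."""
    problems = []
    # The drafter reuses the target's shared embedding table, so the
    # vocabularies must line up.
    if draft_cfg["vocab_size"] != target_cfg["vocab_size"]:
        problems.append("vocab_size differs")
    # The drafter consumes the target's hidden states directly.
    if draft_cfg["hidden_size"] != target_cfg["hidden_size"]:
        problems.append("hidden_size differs")
    # Every tapped layer must actually exist in the target.
    bad = [i for i in draft_cfg["dflash_config"]["target_layer_ids"]
           if i >= target_cfg["num_hidden_layers"]]
    if bad:
        problems.append(f"tapped layers {bad} out of range")
    return problems

# Dummy values for illustration (hidden size and layer ids from the table):
draft_cfg = {"vocab_size": 151936, "hidden_size": 4096,
             "dflash_config": {"target_layer_ids": [1, 8, 15, 22, 29]}}
target_cfg = {"vocab_size": 151936, "hidden_size": 4096, "num_hidden_layers": 32}
print(check_pairing(draft_cfg, target_cfg))  # []
```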
vLLM (CUDA)
```shell
vllm serve lukey03/Qwen3.5-9B-abliterated \
  --speculative-config '{"method": "dflash", "model": "guglxni/Qwen3.5-9B-abliterated-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768
```
SGLang (CUDA)
```shell
python -m sglang.launch_server \
  --model-path lukey03/Qwen3.5-9B-abliterated \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path guglxni/Qwen3.5-9B-abliterated-DFlash \
  --speculative-num-draft-tokens 16 \
  --tp-size 1 \
  --attention-backend fa3 \
  --mem-fraction-static 0.75 \
  --trust-remote-code
```
Performance (Apple M4, 16 GB, MLX)
Measured with run_dflash.py --benchmark against plain mlx_lm.generate on the same
4-bit quantised model. Throughput in tok/s, block size 16.
| Task | Autoregressive | DFlash | Speedup |
|---|---|---|---|
| Code generation | ~21 | ~28 | ~1.4× |
| Math / reasoning | ~19 | ~18 | ~1× |
| Chat / instruction | ~21 | ~7–10 | varies |
Speedup is highest for predictable content such as code, where the accept length averages 8+ tokens per round. As the table shows, for less predictable prompts (open-ended chat) low acceptance can make DFlash slower than plain autoregressive decoding, since each rejected block still costs a draft pass. Re-calibrating the draft to the specific abliterated model weights is what recovers acceptance rates across prompt types.
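Why acceptance length dominates the numbers above can be seen with a rough cost model. This is a simplification written for this card, not the DFlash paper's analysis: each round costs one target verification pass plus one draft pass (some fraction `draft_cost_ratio` of a target pass) and emits the accepted tokens plus the target's one fallback/bonus token.

```python
def estimated_speedup(avg_accepted: float, draft_cost_ratio: float) -> float:
    """Rough speculative-decoding cost model (a simplification, not the
    DFlash paper's analysis). Costs are in units of one target forward pass."""
    tokens_per_round = avg_accepted + 1   # accepted tokens + target's own token
    cost_per_round = 1 + draft_cost_ratio  # one target pass + one draft pass
    return tokens_per_round / cost_per_round

# With 8 accepted tokens/round (the code-generation figure above) and a
# hypothetical drafter costing ~1/10 of a target pass:
print(round(estimated_speedup(8, 0.1), 2))  # 8.18
# With nothing accepted, every round is pure overhead, i.e. slower than
# autoregressive decoding:
print(round(estimated_speedup(0, 0.1), 2))  # 0.91
```

Real throughput also depends on batching and memory bandwidth, so treat this as intuition for the trend in the table, not a prediction.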
Training Details
| Setting | Value |
|---|---|
| Base draft | z-lab/Qwen3.5-9B-DFlash |
| Target model | lukey03/Qwen3.5-9B-abliterated-MLX-4bit |
| Training objective | Block diffusion — predict 15 masked tokens given 1 anchor token + target hidden states |
| Training data | tatsu-lab/alpaca (200 sequences × 128 tokens) |
| Optimiser | Adam, lr=1e-4, cosine decay |
| Steps | 1,000 |
| Hardware | Apple M4 (16 GB unified memory), MLX |
| Framework | z-lab/dflash MLX backend |
The draft is initialised from z-lab's pre-trained weights and fine-tuned for 1 000 steps on hidden-state activations extracted from the abliterated target model. Only the 5 draft decoder layers, projection, and norm are updated — embeddings and the LM head remain shared with the target model at inference time.
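The block-diffusion objective above can be sketched as follows. This is an illustrative setup written for this card, not the z-lab training code; the mask token id 248070 and block size 16 come from the architecture table, and the real pipeline additionally extracts the target model's hidden states for the prefix as conditioning:

```python
MASK_TOKEN_ID = 248070  # from the model config (see architecture table)
BLOCK_SIZE = 16         # 1 anchor token + 15 masked positions

def make_training_block(sequence: list[int], start: int):
    """Build one drafter training example from a token sequence.
    Illustrative only: omits the target hidden-state extraction that
    conditions the drafter in the real pipeline."""
    block = sequence[start : start + BLOCK_SIZE]
    # Input block: the anchor (last committed token) followed by 15 masks.
    inputs = [block[0]] + [MASK_TOKEN_ID] * (BLOCK_SIZE - 1)
    # Labels: the 15 real tokens the drafter must predict at the masked slots.
    labels = block[1:]
    return inputs, labels

tokens = list(range(100, 140))  # dummy token ids
inputs, labels = make_training_block(tokens, start=4)
print(inputs[:3], labels[:3])  # [104, 248070, 248070] [105, 106, 107]
```

The loss is then cross-entropy on the 15 masked positions, which is why only the draft layers need gradients while embeddings and LM head stay frozen and shared.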
Relationship to z-lab Models
This model is a community fine-tune of z-lab/Qwen3.5-9B-DFlash.
The architecture, config, and dflash.py model code are identical; only the weights differ.
If you are running the original (non-abliterated) Qwen/Qwen3.5-9B, use z-lab's official draft instead.
Acknowledgements
All credit for the DFlash method, architecture, and training methodology goes to Jian Chen, Yesheng Liang, and Zhijian Liu at z-lab. This fine-tune adapts their work for the abliterated model variant on Apple Silicon.
Citation
```bibtex
@article{chen2026dflash,
  title   = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  journal = {arXiv preprint arXiv:2602.06036},
  year    = {2026}
}
```