Qwen3.5-9B-abliterated-DFlash

Paper | GitHub | Blog | Base Draft

DFlash is a speculative decoding method that uses a lightweight block diffusion model to draft multiple tokens in parallel, achieving up to 4.4× speedup over autoregressive decoding. This is the drafter model, which must be paired with lukey03/Qwen3.5-9B-abliterated-MLX-4bit.

Why this model exists: The original z-lab/Qwen3.5-9B-DFlash draft was trained against the unmodified Qwen3.5-9B weights. Abliteration shifts the model's hidden-state distribution, which reduces draft acceptance rates. This variant is fine-tuned directly on activations from the abliterated model, restoring calibration and maximising accepted tokens per draft round.

Architecture

The drafter is a compact 5-layer Qwen3 transformer (32 attention heads, hidden size 4096) that operates in parallel over a block of 16 positions (one anchor token plus 15 masked slots). At each decoding step it:

  1. Receives the concatenated hidden states of the target model at layers [1, 8, 15, 22, 29]
  2. Embeds a block [last_token, <mask> × 15] using the target model's shared embedding table
  3. Runs a single non-causal forward pass and proposes 15 draft tokens simultaneously
  4. Hands the block to the target model, which verifies it in one forward pass; accepted tokens are free, and the first rejected position falls back to the target's own sample

This is lossless — the output distribution is identical to standard autoregressive sampling from the target model.
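The accept/fallback rule in step 4 can be sketched in a few lines. This is the greedy-decoding variant only; under temperature sampling, speculative verification uses a rejection-sampling test that preserves the target distribution exactly, which is where the losslessness comes from. `verify_block` and its argument layout are hypothetical illustrations, not the dflash API:

```python
def verify_block(draft_tokens: list[int], target_argmax: list[int]) -> list[int]:
    """Greedy verification sketch (hypothetical helper, not the dflash API).

    `target_argmax` holds the target model's own prediction at each draft
    position, plus one extra prediction for the slot after a fully
    accepted block (the "bonus" token).
    """
    out = []
    for drafted, verified in zip(draft_tokens, target_argmax):
        if drafted == verified:
            out.append(drafted)       # draft agreed: this token is free
        else:
            out.append(verified)      # first mismatch: keep the target's token, stop
            return out
    out.append(target_argmax[len(draft_tokens)])  # whole block accepted: bonus token
    return out
```

A fully accepted block therefore commits 16 tokens (15 drafts plus the bonus) for a single target forward pass.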

| Property | Value |
|---|---|
| Draft layers | 5 |
| Block size | 16 |
| Target layers tapped | 1, 8, 15, 22, 29 |
| Mask token id | 248070 |
| Parameters (draft only) | ~340M |
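As a concrete illustration of step 2 above, a draft block is simply the last committed token followed by mask placeholders. A minimal sketch using the values from the table; `make_draft_block` is a hypothetical helper, not part of the dflash API:

```python
MASK_TOKEN_ID = 248070  # mask token id from the table above
BLOCK_SIZE = 16

def make_draft_block(last_token_id: int) -> list[int]:
    """Build the drafter's input block: one anchor token plus 15 masks."""
    return [last_token_id] + [MASK_TOKEN_ID] * (BLOCK_SIZE - 1)
```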

Quick Start

Installation

pip install "dflash[mlx] @ git+https://github.com/z-lab/dflash.git"

Apple Silicon / MLX

import json
from pathlib import Path
import mlx.core as mx
from huggingface_hub import snapshot_download
from dflash.model_mlx import load, stream_generate, DFlashConfig, DFlashDraftModel

# ── target model ──────────────────────────────────────────────────────────────
model, tokenizer = load("lukey03/Qwen3.5-9B-abliterated-MLX-4bit")

# ── draft model ───────────────────────────────────────────────────────────────
def load_draft(repo_id: str) -> DFlashDraftModel:
    path = Path(snapshot_download(repo_id, allow_patterns=["*.safetensors", "*.json"]))
    cfg  = json.loads((path / "config.json").read_text())
    config = DFlashConfig(
        hidden_size=cfg["hidden_size"],
        num_hidden_layers=cfg["num_hidden_layers"],
        num_attention_heads=cfg["num_attention_heads"],
        num_key_value_heads=cfg["num_key_value_heads"],
        head_dim=cfg["head_dim"],
        intermediate_size=cfg["intermediate_size"],
        vocab_size=cfg["vocab_size"],
        rms_norm_eps=cfg["rms_norm_eps"],
        rope_theta=cfg["rope_theta"],
        max_position_embeddings=cfg["max_position_embeddings"],
        block_size=cfg["block_size"],
        target_layer_ids=tuple(cfg["dflash_config"]["target_layer_ids"]),
        num_target_layers=cfg["num_target_layers"],
        mask_token_id=cfg["dflash_config"]["mask_token_id"],
    )
    weights = {k: v for f in path.glob("*.safetensors") for k, v in mx.load(str(f)).items()}
    m = DFlashDraftModel(config)
    m.load_weights(list(weights.items()))
    return m

draft = load_draft("guglxni/Qwen3.5-9B-abliterated-DFlash")

# ── generate ──────────────────────────────────────────────────────────────────
messages = [{"role": "user", "content": "Write a quicksort in Python."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)

tps = 0.0
for r in stream_generate(model, draft, tokenizer, prompt,
                          block_size=16, max_tokens=2048, temperature=0.6):
    print(r.text, end="", flush=True)
    tps = r.generation_tps

print(f"\n\nThroughput: {tps:.1f} tok/s")

vLLM (CUDA)

vllm serve lukey03/Qwen3.5-9B-abliterated \
  --speculative-config '{"method": "dflash", "model": "guglxni/Qwen3.5-9B-abliterated-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768

SGLang (CUDA)

python -m sglang.launch_server \
    --model-path lukey03/Qwen3.5-9B-abliterated \
    --speculative-algorithm DFLASH \
    --speculative-draft-model-path guglxni/Qwen3.5-9B-abliterated-DFlash \
    --speculative-num-draft-tokens 16 \
    --tp-size 1 \
    --attention-backend fa3 \
    --mem-fraction-static 0.75 \
    --trust-remote-code

Performance (Apple M4, 16 GB, MLX)

Measured with run_dflash.py --benchmark against plain mlx_lm.generate on the same 4-bit quantised model. Throughput in tok/s, block size 16.

| Task | Autoregressive | DFlash | Speedup |
|---|---|---|---|
| Code generation | ~21 | ~28 | ~1.4× |
| Math / reasoning | ~19 | ~18 | ~1× |
| Chat / instruction | ~21 | ~7–10 | varies |

Speedup is highest for predictable content such as code, where the mean acceptance length exceeds 8 tokens per round. For open-ended chat the acceptance rate drops and throughput can fall below the autoregressive baseline. Re-calibrating the draft against the abliterated model's weights recovers acceptance rates across all prompt types.
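A back-of-envelope model ties these numbers together: each round costs one target pass plus the draft's overhead, and commits roughly the mean acceptance length in tokens. This is an idealised upper bound that ignores the extra cost of verifying 16 positions in one pass, which is why the measured MLX speedups above are much lower; the 10% overhead figure below is an assumption for illustration only:

```python
def ideal_speedup(mean_accepted: float, draft_overhead: float) -> float:
    """Upper-bound speedup over autoregressive decoding: tokens committed
    per round divided by the round's cost in target-pass units."""
    return mean_accepted / (1.0 + draft_overhead)

# With 8 accepted tokens/round and an assumed 10% drafting overhead the
# bound is ~7.3x; the measured ~1.4x on an M4 indicates how expensive the
# batched verify pass is on bandwidth-limited hardware.
```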

Training Details

| | |
|---|---|
| Base draft | z-lab/Qwen3.5-9B-DFlash |
| Target model | lukey03/Qwen3.5-9B-abliterated-MLX-4bit |
| Training objective | Block diffusion: predict 15 masked tokens given 1 anchor token + target hidden states |
| Training data | tatsu-lab/alpaca (200 sequences × 128 tokens) |
| Optimiser | Adam, lr=1e-4, cosine decay |
| Steps | 1 000 |
| Hardware | Apple M4 (16 GB unified memory), MLX |
| Framework | z-lab/dflash MLX backend |

The draft is initialised from z-lab's pre-trained weights and fine-tuned for 1 000 steps on hidden-state activations extracted from the abliterated target model. Only the 5 draft decoder layers, projection, and norm are updated — embeddings and the LM head remain shared with the target model at inference time.
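The objective described above amounts to cross-entropy over only the masked positions of each block; the anchor token carries no loss. A minimal pure-Python sketch of the assumed objective on toy logits (the real training code lives in the dflash repo):

```python
import math

def masked_block_loss(logits: list[list[float]],
                      targets: list[int],
                      mask_positions: list[int]) -> float:
    """Mean cross-entropy over the masked positions only (a sketch of the
    assumed objective, not the repo's implementation)."""
    total = 0.0
    for pos in mask_positions:
        row = logits[pos]
        m = max(row)  # stabilise the log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[targets[pos]]
    return total / len(mask_positions)
```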

Relationship to z-lab Models

This model is a community fine-tune of z-lab/Qwen3.5-9B-DFlash. The architecture, config, and dflash.py model code are identical; only the weights differ. If you are running the original (non-abliterated) Qwen/Qwen3.5-9B, use z-lab's official draft instead.

Acknowledgements

All credit for the DFlash method, architecture, and training methodology goes to Jian Chen, Yesheng Liang, and Zhijian Liu at z-lab. This fine-tune adapts their work for the abliterated model variant on Apple Silicon.

Citation

@article{chen2026dflash,
  title   = {{DFlash: Block Diffusion for Flash Speculative Decoding}},
  author  = {Chen, Jian and Liang, Yesheng and Liu, Zhijian},
  journal = {arXiv preprint arXiv:2602.06036},
  year    = {2026}
}