# DFlash Qwen3.5-27B Uncensored — NVFP4
27B hybrid linear-attention model | NVFP4 quantized | Vision + Text | DFlash speculative decoding
## Performance (DGX Spark GB10)

| Configuration | Without DFlash | With DFlash | Speedup |
|---|---|---|---|
| Single-stream | 12.2 tok/s | 33.2 tok/s | 2.7x |
| 4 concurrent | 48.1 tok/s | 85.5 tok/s | 1.8x |
| 8 concurrent | 90.5 tok/s | 92.5 tok/s | 1.0x |
| Metric | Value |
|---|---|
| TTFT | 98-138 ms |
| Model Size | ~20 GB (NVFP4 + vision) |
| Memory Footprint | ~22 GB loaded |
## Quick Links

| Resource | Details |
|---|---|
| Get Started | Step-by-step quick start guide on DGX Spark |
| Docker Image | ghcr.io/aeon-7/vllm-dflash:latest |
| DFlash Drafter | z-lab/Qwen3.5-27B-DFlash |
| Base Model | Qwen/Qwen3.5-27B |
| DFlash Paper | arXiv 2602.06036 |
## Quick Start (DGX Spark)

### 1. Download the model

```bash
huggingface-cli download AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 \
  --local-dir ~/models/DFlash-Qwen3.5-27B-Uncensored-NVFP4
```
### 2. Create your environment file

```bash
# Auto-generate API key and create .env
cat > .env.dflash << 'EOF'
# Authentication
HF_TOKEN=hf_your_token_here
VLLM_API_KEY=$(openssl rand -hex 32)

# Model path (must be absolute: Docker Compose does not expand ~)
MODEL_HOST_PATH=~/models/DFlash-Qwen3.5-27B-Uncensored-NVFP4

# DFlash speculative decoding (auto-downloads drafter on first run)
DFLASH_DRAFTER=z-lab/Qwen3.5-27B-DFlash
DFLASH_NUM_SPEC_TOKENS=15

# DGX Spark optimal settings (64K context, 4 concurrent sequences)
MAX_MODEL_LEN=65536
MAX_NUM_SEQS=4
GPU_MEMORY_UTILIZATION=0.85
MAX_NUM_BATCHED_TOKENS=65536
EOF

# Generate a real API key and expand ~ to an absolute path
sed -i "s|\$(openssl rand -hex 32)|$(openssl rand -hex 32)|" .env.dflash
sed -i "s|^MODEL_HOST_PATH=~|MODEL_HOST_PATH=$HOME|" .env.dflash
echo "Your API key: $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)"
```
### 3. Save docker-compose.dflash.yml

```yaml
services:
  vllm-dflash:
    image: ghcr.io/aeon-7/vllm-dflash:latest
    container_name: vllm-dflash
    restart: unless-stopped
    network_mode: host
    ipc: host
    volumes:
      - ${MODEL_HOST_PATH}:/models/DFlash-Qwen3.5-27B-Uncensored-NVFP4
      - dflash-drafter-cache:/models/drafter-cache
    environment:
      - MODEL_PATH=/models/DFlash-Qwen3.5-27B-Uncensored-NVFP4
      - SERVED_MODEL_NAME=DFlash-Qwen3.5-27B-Uncensored
      - DFLASH_DRAFTER=${DFLASH_DRAFTER}
      - DFLASH_NUM_SPEC_TOKENS=${DFLASH_NUM_SPEC_TOKENS}
      - GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION}
      - MAX_MODEL_LEN=${MAX_MODEL_LEN}
      - MAX_NUM_SEQS=${MAX_NUM_SEQS}
      - MAX_NUM_BATCHED_TOKENS=${MAX_NUM_BATCHED_TOKENS}
      - NVIDIA_VISIBLE_DEVICES=all
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - HF_TOKEN=${HF_TOKEN}
      - VLLM_API_KEY=${VLLM_API_KEY}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  dflash-drafter-cache:
```
### 4. Launch

```bash
docker compose --env-file .env.dflash -f docker-compose.dflash.yml up -d

# Watch startup (~5 min for weight loading + CUDA graph compilation)
docker compose -f docker-compose.dflash.yml logs -f
```
### 5. Test

```bash
# Text generation
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)" \
  -d '{
    "model": "DFlash-Qwen3.5-27B-Uncensored",
    "messages": [{"role": "user", "content": "Explain quantum entanglement simply."}],
    "max_tokens": 200
  }'

# Vision (image understanding)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)" \
  -d '{
    "model": "DFlash-Qwen3.5-27B-Uncensored",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
      {"type": "text", "text": "What do you see?"}
    ]}],
    "max_tokens": 200
  }'
```
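The same endpoint can be scripted from Python with only the standard library. A minimal sketch, assuming the server from step 4 is up on localhost:8000 and that the API key from step 2 has been exported as `DFLASH_API_KEY` (both the variable name and the port are assumptions from this quick start, not part of the model):

```python
# Minimal stdlib client for the vLLM OpenAI-compatible chat endpoint.
import json
import os
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"  # assumed server address

def build_chat_request(prompt: str, max_tokens: int = 200) -> dict:
    """Build an OpenAI-style chat-completions payload for the served model."""
    return {
        "model": "DFlash-Qwen3.5-27B-Uncensored",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, api_key: str) -> str:
    """POST a prompt and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Only reach out to the server when a key is actually configured.
if os.environ.get("DFLASH_API_KEY"):
    print(chat("Explain quantum entanglement simply.", os.environ["DFLASH_API_KEY"]))
```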
## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `MODEL_HOST_PATH` | — | Host path to model weights |
| `DFLASH_DRAFTER` | `z-lab/Qwen3.5-27B-DFlash` | HF repo ID for drafter (auto-downloaded). Set `off` to disable. |
| `DFLASH_NUM_SPEC_TOKENS` | `15` | Tokens per draft step. 15 = fast single-stream, 5 = high concurrency. |
| `VLLM_API_KEY` | — | API key for LAN authentication |
| `HF_TOKEN` | — | HuggingFace token for gated models |
| `GPU_MEMORY_UTILIZATION` | `0.80` | GPU memory fraction |
| `MAX_MODEL_LEN` | `4096` | Max sequence length |
| `MAX_NUM_SEQS` | `8` | Max concurrent sequences |
## Performance (DGX Spark GB10)

### DFlash Speculative Decoding (Measured)
| Configuration | Short (200 tok) | Long (2000 tok) | Speedup |
|---|---|---|---|
| No speculation | 12.2 tok/s | 12.2 tok/s | 1.0x |
| DFlash (5 spec tokens) | 29.5 tok/s | 25.4 tok/s | 2.1-2.4x |
| DFlash (10 spec tokens) | 28.7 tok/s | 25.5 tok/s | 2.1-2.4x |
| DFlash (15 spec tokens) | 33.2 tok/s | 26.3 tok/s | 2.2-2.7x |
### Throughput Scaling
| Concurrent | Aggregate tok/s | Per-Request Latency |
|---|---|---|
| 1 | 33.2 | 6.0s |
| 2 | 47.9 | 7.7s |
| 4 | 85.5 | 8.3s |
| 8 | 92.5 | 12.9s |
### Baseline (No Speculation)
| Metric | Value |
|---|---|
| Decode Speed | 12.2 tok/s |
| TTFT | 98-138 ms |
| ITL (p50/p99) | 81 / 88 ms |
## What Makes This Model Special

### Why Dense Over MoE
Qwen3.5 comes in two flavors: the 122B-A10B MoE (256 experts, 10B active per token) and this 27B dense model (all parameters active on every token). The dense model has real advantages:
- Higher quality per FLOP — Every one of the 27B parameters contributes to every token. MoE models route to a sparse subset, which means some experts are undertrained and routing decisions introduce noise. Dense models don't have this problem.
- No routing overhead — MoE models spend compute on expert selection, load balancing, and all-to-all communication. Dense models just run the computation.
- Predictable latency — No variance from different experts being selected per token. Every forward pass costs the same.
- Simpler deployment — No expert parallelism concerns, no load imbalance, fits on a single GPU with NVFP4.
The tradeoff has always been speed: a 27B dense model moves all parameters through memory per token. On a memory-bandwidth-limited device like DGX Spark (273 GB/s), that meant 12 tok/s baseline. DFlash changes this entirely.
### Why DFlash Makes Dense Practical on DGX Spark
The fundamental bottleneck on DGX Spark is memory bandwidth. At 273 GB/s, loading 20 GB of NVFP4 weights per token limits you to ~12 tok/s. Every dense model hits this wall.
DFlash block-diffusion speculative decoding breaks through it:
- The 2B drafter proposes multiple tokens simultaneously — one diffusion forward pass generates an entire block of speculative tokens in parallel, not sequentially. This costs roughly the same as generating a single token.
- The 27B target verifies all proposed tokens in one forward pass — instead of paying the full memory bandwidth cost per token, you pay it once and produce 3-4 accepted tokens on average.
- Net effect: you amortize the bandwidth cost across multiple tokens per forward pass.
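The amortization argument above reduces to a few lines of arithmetic. This is a deliberate simplification that ignores drafter cost, attention, and KV-cache traffic; the 273 GB/s, 20 GB, and ~3.5 accepted-tokens-per-pass figures all come from this card:

```python
# Back-of-envelope model of decode speed on a bandwidth-bound device.
BANDWIDTH_GBPS = 273.0   # DGX Spark memory bandwidth (from this card)
WEIGHTS_GB = 20.0        # NVFP4 weights streamed per forward pass

# Plain decode: one full weight read per token.
baseline_toks = BANDWIDTH_GBPS / WEIGHTS_GB        # ~13.6 tok/s ceiling

# Speculative decode: one verify pass yields several accepted tokens.
accepted_per_pass = 3.5
spec_toks = baseline_toks * accepted_per_pass      # ~47.8 tok/s ceiling

print(f"baseline ceiling: {baseline_toks:.1f} tok/s (measured: 12.2)")
print(f"DFlash ceiling:   {spec_toks:.1f} tok/s (measured: 33.2)")
```

Both ceilings sit above the measured numbers, as expected: the gap is the overhead the simplification ignores.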
The result on DGX Spark:
| | Without DFlash | With DFlash |
|---|---|---|
| Single-stream | 12.2 tok/s | 33.2 tok/s |
| Effective bandwidth utilization | 1 token per pass | ~3.5 tokens per pass |
| Practical feel | Sluggish, noticeable delay | Responsive, fluid |
This makes the 27B dense model faster than the 122B MoE on a single DGX Spark while delivering the quality advantages of a dense architecture. DFlash turns the DGX Spark from "it can run a 27B model" into "it runs a 27B model well."
### Hybrid Architecture
Qwen3.5-27B uses a hybrid architecture mixing two attention types across 64 layers:
- Linear attention (GDN) — Gated Delta Network layers for efficient long-context processing with O(1) per-token state (48 layers)
- Full attention — Standard multi-head attention every 4th layer for global context capture (16 layers)
This gives near-linear scaling with sequence length while maintaining full-attention quality at key intervals.
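The interleaving pattern can be sketched in a few lines. The card specifies the 48/16 split; placing full attention at exactly every 4th layer is an assumption consistent with that split, not a statement from the config:

```python
# Sketch of the 64-layer hybrid layout: GDN linear attention everywhere,
# full attention at every 4th layer (assumed placement).
NUM_LAYERS = 64

layer_types = [
    "full_attention" if (i + 1) % 4 == 0 else "linear_attention_gdn"
    for i in range(NUM_LAYERS)
]

# Matches the split stated above: 16 full-attention, 48 GDN layers.
print(layer_types.count("full_attention"), layer_types.count("linear_attention_gdn"))
```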
### Vision + Text
The model includes a 27-layer ViT vision encoder (460M params, BF16) with a merger that projects visual features into the language model's hidden space. Supports image understanding alongside text generation.
### DFlash Block-Diffusion Speculative Decoding
z-lab/Qwen3.5-27B-DFlash is a 2B block-diffusion drafter that generates all speculative tokens simultaneously in a single diffusion step (not sequentially like standard speculative decoding). The 27B target model then verifies in one pass, achieving 2-5x speedup with zero quality loss.
Key difference from standard spec decode: drafting cost is ~constant regardless of token count (one diffusion forward pass), so the tradeoff is purely about verification overhead vs acceptance rate.
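That tradeoff can be made concrete with a standard chained-acceptance model: draft token k is kept only if all k-1 before it were, so with per-token acceptance probability p the expected yield per verify pass is the geometric sum below plus the one token the target emits regardless. The value of p is an assumed knob for illustration, not a number measured on this model:

```python
# Expected tokens per verify pass under chained acceptance, given that
# drafting cost is ~constant in the number of draft tokens.
def tokens_per_pass(n_spec: int, p: float) -> float:
    """n_spec draft tokens, per-token acceptance p; +1 for the target's own token."""
    expected_accepted = sum(p ** k for k in range(1, n_spec + 1))
    return expected_accepted + 1.0

# Diminishing returns as n grows, mirroring the measured 5/10/15 table above.
for n in (5, 10, 15):
    print(n, round(tokens_per_pass(n, p=0.75), 2))
```

Because drafting is nearly free, longer drafts only stop paying off once the extra verification work outweighs the shrinking tail of accepted tokens.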
### AWQ_FULL Quantization

This model uses the most thorough NVFP4 quantization pipeline available:
- AWQ_FULL — Exhaustive grid search with `alpha_step=0.1` across 10 scaling factors per layer, plus a second `awq_clip` pass that optimizes clipping ratios
- Full NVFP4 quantization — All attention projections (Q/K/V/O) and all MLP layers (gate/up/down) quantized to FP4. Excludes: vision tower, embeddings, norms, and lm_head
- Pre-quantization scales — Channel-wise BF16 factors that redistribute weight magnitudes before quantization
## Model Details
| Property | Value |
|---|---|
| Architecture | Qwen3.5 (Hybrid, 27B parameters) |
| Layers | 64 (48 GDN + 16 full-attention) |
| Hidden Size | 5120 |
| Attention Heads | 24 (4 KV heads), head_dim=256 |
| Vision Encoder | 27-layer ViT, 460M params (BF16) |
| Max Context | 131,072 tokens |
| Vocabulary | 248,320 tokens |
| Quantization | NVFP4 AWQ_FULL (ModelOpt 0.43.0) |
| Model Size | ~20 GB (quantized + vision) |
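The ~20 GB figure can be sanity-checked from the sizes in this table. Which parameters sit inside the 27B count and the exact FP4/BF16 split are my assumptions; only the headline numbers (27B, 248,320 vocab, 5120 hidden, 460M ViT) come from the card:

```python
# Rough size accounting for the quantized checkpoint.
GB = 1e9

vocab, hidden = 248_320, 5_120
embed_params = vocab * hidden        # input embeddings (BF16, unquantized)
lm_head_params = vocab * hidden      # output head (BF16, unquantized)
vision_params = 460e6                # ViT encoder (BF16)

total_params = 27e9
fp4_params = total_params - embed_params - lm_head_params

fp4_bytes = fp4_params * 0.5         # 4 bits per weight
scale_bytes = fp4_params / 16        # one FP8 block scale per 16 elements
bf16_bytes = (embed_params + lm_head_params + vision_params) * 2

total_gb = (fp4_bytes + scale_bytes + bf16_bytes) / GB
print(f"estimated size: {total_gb:.1f} GB")  # lands near the ~20 GB listed
```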
### NVFP4 Weight Format

Each quantized layer stores:

- `weight` (uint8) — packed FP4 E2M1 pairs (16-element blocks)
- `weight_scale` (float8_e4m3fn) — per-block scale (1 per 16 elements)
- `weight_scale_2` (float32) — per-tensor global scale
- `pre_quant_scale` (bfloat16) — AWQ per-channel pre-scaling factors
- `input_scale` (float32) — static activation scale from calibration
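A sketch of decoding this layout back to floats. The E2M1 value table is the standard FP4 set {0, 0.5, 1, 1.5, 2, 3, 4, 6} with a sign bit; the low-nibble-first packing order is an assumption about this particular format:

```python
# Dequantize packed NVFP4 weights: unpack nibbles, then apply per-block
# and global scales.
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_LUT = FP4_E2M1 + [-v for v in FP4_E2M1]  # bit 3 = sign

def dequantize(packed: list[int], block_scales: list[float],
               global_scale: float) -> list[float]:
    """Unpack uint8 FP4 pairs, then scale each 16-element block."""
    values = []
    for byte in packed:
        values.append(FP4_LUT[byte & 0x0F])  # low nibble first (assumed order)
        values.append(FP4_LUT[byte >> 4])
    return [v * block_scales[i // 16] * global_scale
            for i, v in enumerate(values)]

# 8 packed bytes -> one 16-element block under a single block scale.
packed = [0x21] * 8                          # each byte encodes 0.5 then 1.0
print(dequantize(packed, block_scales=[0.5], global_scale=2.0))
```

In the real checkpoint `block_scales` would come from `weight_scale` (FP8) and `global_scale` from `weight_scale_2`, with `pre_quant_scale` folded into the activations rather than the weights.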
## Optimization Stack
| Optimization | Status |
|---|---|
| torch.compile (inductor) | Active |
| CUDA graphs (FULL + PIECEWISE) | Active |
| FlashInfer CUTLASS FP4 GEMM | Autotuned for GB10 |
| Flash Attention v2 | Active |
| Triton/FLA GDN prefill kernel | Active |
| FP8 KV cache | Active (BF16 when DFlash enabled) |
| Chunked prefill | Active |
| Prefix caching | Active |
| Act-quant fusion | Active |
## Alternative Deployment Methods

### vLLM (Manual)

```bash
vllm serve AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 \
  --quantization modelopt \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --kv-cache-dtype auto \
  --gpu-memory-utilization 0.80 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 8 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code
```
Note: DFlash uses non-causal attention, which requires `--kv-cache-dtype auto` (BF16). FP8 KV cache is incompatible with DFlash.
### SGLang

```bash
python -m sglang.launch_server \
  --model-path AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/Qwen3.5-27B-DFlash \
  --speculative-num-draft-tokens 16 \
  --tp-size 1 \
  --attention-backend fa3 \
  --mem-fraction-static 0.75 \
  --mamba-scheduler-strategy extra_buffer \
  --trust-remote-code
```
## Credits
- Base model by Qwen Team
- DFlash speculative decoding by z-lab (paper)
- Abliteration using llm-abliteration
- NVFP4 quantization with NVIDIA ModelOpt
- Release by AEON-7
## Legal Disclaimer
THIS MODEL IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. This model has had safety alignment removed. Users are responsible for ensuring ethical and legal use.