DFlash Qwen3.5-27B Uncensored — NVFP4

27B hybrid linear-attention model | NVFP4 quantized | Vision + Text | DFlash speculative decoding

Performance (DGX Spark GB10)

|  | Without DFlash | With DFlash | Speedup |
|---|---|---|---|
| Single-stream | 12.2 tok/s | 33.2 tok/s | 2.7x |
| 4 concurrent | 48.1 tok/s | 85.5 tok/s | 1.8x |
| 8 concurrent | 90.5 tok/s | 92.5 tok/s | 1.0x |

| Metric | Value |
|---|---|
| TTFT | 98-138 ms |
| Model Size | ~20 GB (NVFP4 + vision) |
| Memory Footprint | ~22 GB loaded |

Quick Start (DGX Spark)

1. Download the model

huggingface-cli download AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 \
  --local-dir ~/models/DFlash-Qwen3.5-27B-Uncensored-NVFP4

2. Create your environment file

# Create .env.dflash (the API key placeholder is filled in by sed below)
cat > .env.dflash << 'EOF'
# Authentication
HF_TOKEN=hf_your_token_here
VLLM_API_KEY=$(openssl rand -hex 32)

# Model path
MODEL_HOST_PATH=~/models/DFlash-Qwen3.5-27B-Uncensored-NVFP4

# DFlash speculative decoding (auto-downloads drafter on first run)
DFLASH_DRAFTER=z-lab/Qwen3.5-27B-DFlash
DFLASH_NUM_SPEC_TOKENS=15

# DGX Spark optimal settings (64K context, 4 concurrent sequences)
MAX_MODEL_LEN=65536
MAX_NUM_SEQS=4
GPU_MEMORY_UTILIZATION=0.85
MAX_NUM_BATCHED_TOKENS=65536
EOF

# Generate a real API key and inject it
sed -i "s|\$(openssl rand -hex 32)|$(openssl rand -hex 32)|" .env.dflash
echo "Your API key: $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)"

3. Save docker-compose.dflash.yml

services:
  vllm-dflash:
    image: ghcr.io/aeon-7/vllm-dflash:latest
    container_name: vllm-dflash
    restart: unless-stopped
    network_mode: host
    ipc: host
    volumes:
      - ${MODEL_HOST_PATH}:/models/DFlash-Qwen3.5-27B-Uncensored-NVFP4
      - dflash-drafter-cache:/models/drafter-cache
    environment:
      - MODEL_PATH=/models/DFlash-Qwen3.5-27B-Uncensored-NVFP4
      - SERVED_MODEL_NAME=DFlash-Qwen3.5-27B-Uncensored
      - DFLASH_DRAFTER=${DFLASH_DRAFTER}
      - DFLASH_NUM_SPEC_TOKENS=${DFLASH_NUM_SPEC_TOKENS}
      - GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION}
      - MAX_MODEL_LEN=${MAX_MODEL_LEN}
      - MAX_NUM_SEQS=${MAX_NUM_SEQS}
      - MAX_NUM_BATCHED_TOKENS=${MAX_NUM_BATCHED_TOKENS}
      - NVIDIA_VISIBLE_DEVICES=all
      - TORCH_MATMUL_PRECISION=high
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
      - HF_TOKEN=${HF_TOKEN}
      - VLLM_API_KEY=${VLLM_API_KEY}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  dflash-drafter-cache:

4. Launch

docker compose --env-file .env.dflash -f docker-compose.dflash.yml up -d

# Watch startup (~5 min for weight loading + CUDA graph compilation)
docker compose -f docker-compose.dflash.yml logs -f

5. Test

# Text generation
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)" \
  -d '{
    "model": "DFlash-Qwen3.5-27B-Uncensored",
    "messages": [{"role": "user", "content": "Explain quantum entanglement simply."}],
    "max_tokens": 200
  }'

# Vision (image understanding)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $(grep VLLM_API_KEY .env.dflash | cut -d= -f2)" \
  -d '{
    "model": "DFlash-Qwen3.5-27B-Uncensored",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Cat03.jpg/1200px-Cat03.jpg"}},
      {"type": "text", "text": "What do you see?"}
    ]}],
    "max_tokens": 200
  }'
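The same endpoint works from any OpenAI-compatible client. A minimal Python sketch using only the standard library (host, port, and model name mirror the compose setup above; `build_chat_payload` and `chat` are helpers invented here, not part of the server):

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "DFlash-Qwen3.5-27B-Uncensored"

def build_chat_payload(prompt, image_url=None, max_tokens=200):
    """Build an OpenAI-style chat payload; attaches an image part when given."""
    if image_url is None:
        content = prompt
    else:
        content = [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": prompt},
        ]
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }

def chat(prompt, api_key, image_url=None):
    """POST a chat request and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_chat_payload(prompt, image_url)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```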

Environment Variables

| Variable | Default | Description |
|---|---|---|
| MODEL_HOST_PATH | | Host path to model weights |
| DFLASH_DRAFTER | z-lab/Qwen3.5-27B-DFlash | HF repo ID for drafter (auto-downloaded). Set `off` to disable. |
| DFLASH_NUM_SPEC_TOKENS | 15 | Tokens per draft step. 15 = fast single-stream, 5 = high concurrency. |
| VLLM_API_KEY | | API key for LAN authentication |
| HF_TOKEN | | HuggingFace token for gated models |
| GPU_MEMORY_UTILIZATION | 0.80 | GPU memory fraction |
| MAX_MODEL_LEN | 4096 | Max sequence length |
| MAX_NUM_SEQS | 8 | Max concurrent sequences |

Performance (DGX Spark GB10)

DFlash Speculative Decoding (Measured)

| Configuration | Short (200 tok) | Long (2000 tok) | Speedup |
|---|---|---|---|
| No speculation | 12.2 tok/s | 12.2 tok/s | 1.0x |
| DFlash (5 spec tokens) | 29.5 tok/s | 25.4 tok/s | 2.1-2.4x |
| DFlash (10 spec tokens) | 28.7 tok/s | 25.5 tok/s | 2.1-2.4x |
| DFlash (15 spec tokens) | 33.2 tok/s | 26.3 tok/s | 2.2-2.7x |

Throughput Scaling

| Concurrent | Aggregate tok/s | Per-Request Latency |
|---|---|---|
| 1 | 33.2 | 6.0 s |
| 2 | 47.9 | 7.7 s |
| 4 | 85.5 | 8.3 s |
| 8 | 92.5 | 12.9 s |

Baseline (No Speculation)

| Metric | Value |
|---|---|
| Decode Speed | 12.2 tok/s |
| TTFT | 98-138 ms |
| ITL (p50/p99) | 81 / 88 ms |

What Makes This Model Special

Why Dense Over MoE

Qwen3.5 comes in two flavors: the 122B-A10B MoE (256 experts, 10B active per token) and this 27B dense model (all parameters active on every token). The dense model has real advantages:

  • Higher quality per FLOP — Every one of the 27B parameters contributes to every token. MoE models route to a sparse subset, which means some experts are undertrained and routing decisions introduce noise. Dense models don't have this problem.
  • No routing overhead — MoE models spend compute on expert selection, load balancing, and all-to-all communication. Dense models just run the computation.
  • Predictable latency — No variance from different experts being selected per token. Every forward pass costs the same.
  • Simpler deployment — No expert parallelism concerns, no load imbalance, fits on a single GPU with NVFP4.

The tradeoff has always been speed: a 27B dense model moves all parameters through memory per token. On a memory-bandwidth-limited device like DGX Spark (273 GB/s), that meant 12 tok/s baseline. DFlash changes this entirely.

Why DFlash Makes Dense Practical on DGX Spark

The fundamental bottleneck on DGX Spark is memory bandwidth. At 273 GB/s, loading 20 GB of NVFP4 weights per token limits you to ~12 tok/s. Every dense model hits this wall.

DFlash block-diffusion speculative decoding breaks through it:

  1. The 2B drafter proposes multiple tokens simultaneously — one diffusion forward pass generates an entire block of speculative tokens in parallel, not sequentially. This costs roughly the same as generating a single token.
  2. The 27B target verifies all proposed tokens in one forward pass — instead of paying the full memory bandwidth cost per token, you pay it once and produce 3-4 accepted tokens on average.
  3. Net effect: you amortize the bandwidth cost across multiple tokens per forward pass.
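The amortization argument checks out with back-of-envelope arithmetic. The 273 GB/s bandwidth and ~20 GB weight size come from this card; the ~3.5 accepted tokens per pass is an average implied by the measured numbers, and the estimate ignores KV-cache traffic and drafter overhead:

```python
BANDWIDTH_GBS = 273.0   # DGX Spark GB10 memory bandwidth
WEIGHTS_GB = 20.0       # NVFP4 weights streamed once per forward pass

# Each decode step must read all weights, so passes/sec is bandwidth-bound.
passes_per_sec = BANDWIDTH_GBS / WEIGHTS_GB          # ~13.7 passes/s

baseline_toks = passes_per_sec * 1.0                 # 1 token/pass ceiling
dflash_toks = passes_per_sec * 3.5                   # ~3.5 accepted tokens/pass

print(f"baseline ceiling: {baseline_toks:.1f} tok/s (measured 12.2)")
print(f"DFlash ceiling:   {dflash_toks:.1f} tok/s (measured 33.2)")
```

Both measured numbers sit below their ceilings because verification, drafting, and KV-cache reads all cost extra bandwidth, but the gap between the two rows is exactly the amortization effect.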

The result on DGX Spark:

|  | Without DFlash | With DFlash |
|---|---|---|
| Single-stream | 12.2 tok/s | 33.2 tok/s |
| Effective bandwidth utilization | 1 token per pass | ~3.5 tokens per pass |
| Practical feel | Sluggish, noticeable delay | Responsive, fluid |

This makes the 27B dense model faster than the 122B MoE on a single DGX Spark while delivering the quality advantages of a dense architecture. DFlash turns the DGX Spark from "it can run a 27B model" into "it runs a 27B model well."

Hybrid Architecture

Qwen3.5-27B uses a hybrid architecture mixing two attention types across 64 layers:

  • Linear attention (GDN) — Gated Delta Network layers for efficient long-context processing with O(1) per-token state (48 layers)
  • Full attention — Standard multi-head attention every 4th layer for global context capture (16 layers)

This gives near-linear scaling with sequence length while maintaining full-attention quality at key intervals.
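The stated layer mix can be sketched in a few lines (the exact positions of the full-attention layers are an assumption; the card only says "every 4th layer"):

```python
# Hybrid layer schedule: every 4th layer is full attention, the rest are GDN.
NUM_LAYERS = 64

layers = ["full" if (i + 1) % 4 == 0 else "gdn" for i in range(NUM_LAYERS)]

assert layers.count("gdn") == 48    # linear-attention (GDN) layers
assert layers.count("full") == 16   # full-attention layers
print(layers[:8])  # ['gdn', 'gdn', 'gdn', 'full', 'gdn', 'gdn', 'gdn', 'full']
```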

Vision + Text

The model includes a 27-layer ViT vision encoder (460M params, BF16) with a merger that projects visual features into the language model's hidden space. Supports image understanding alongside text generation.

DFlash Block-Diffusion Speculative Decoding

z-lab/Qwen3.5-27B-DFlash is a 2B block-diffusion drafter that generates all speculative tokens simultaneously in a single diffusion step (not sequentially like standard speculative decoding). The 27B target model then verifies in one pass, achieving 2-5x speedup with zero quality loss.

Key difference from standard spec decode: drafting cost is ~constant regardless of token count (one diffusion forward pass), so the tradeoff is purely about verification overhead vs acceptance rate.
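To see why acceptance rate, not drafting cost, drives the tradeoff, consider the standard speculative-decoding expectation: under the simplifying assumption of an i.i.d. per-token acceptance probability p and k drafted tokens, the expected tokens emitted per verification pass is (1 - p^(k+1)) / (1 - p), where the extra token comes from the target's own sample at the first rejection. A sketch, with p chosen arbitrarily for illustration:

```python
def expected_tokens_per_pass(p: float, k: int) -> float:
    """Expected tokens emitted per verify pass: acceptance prob p, k drafts."""
    return (1 - p ** (k + 1)) / (1 - p)

# Diminishing returns: once p^k is small, extra draft tokens barely help,
# consistent with the measured 29.5 -> 33.2 tok/s plateau from 5 to 15 tokens.
for k in (5, 10, 15):
    print(k, round(expected_tokens_per_pass(0.8, k), 2))
```

Because DFlash's drafting cost is roughly constant in k, a larger block mostly adds verification length rather than drafting latency, which is why 15 speculative tokens is affordable here.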

AWQ_FULL Quantization

This model uses the most thorough NVFP4 quantization pipeline available:

  1. AWQ_FULL — Exhaustive grid search with alpha_step=0.1 across 10 scaling factors per layer, plus a second awq_clip pass that optimizes clipping ratios
  2. Full NVFP4 Quantization — All attention projections (Q/K/V/O) and all MLP layers (gate/up/down) quantized to FP4. Excludes: vision tower, embeddings, norms, and lm_head
  3. Pre-quantization scales — Channel-wise BF16 factors that redistribute weight magnitudes before quantization
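A toy version of the scale search in step 1, with a plain round-to-nearest 4-bit quantizer standing in for NVFP4 (the real pipeline quantizes in 16-element blocks and also runs the `awq_clip` pass; everything below is an illustrative sketch):

```python
import numpy as np

def fake_quant(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric per-tensor round-to-nearest quantizer (NVFP4 stand-in)."""
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(w).max() / qmax
    return np.clip(np.round(w / s), -qmax, qmax) * s

def awq_scale_search(w: np.ndarray, x: np.ndarray, alpha_step: float = 0.1):
    """Grid-search AWQ per-channel scales s_j = |x_j|^a / |w_j|^(1-a).

    w: (out, in) weight, x: (n, in) calibration activations.
    Scaling weights up where activations are large protects salient channels;
    the inverse scale folds into the preceding op, so the math is unchanged.
    """
    ref = x @ w.T
    act_mag = np.abs(x).mean(axis=0) + 1e-8
    w_mag = np.abs(w).mean(axis=0) + 1e-8
    best = (None, np.inf)
    for a in np.arange(0.0, 1.0 + 1e-9, alpha_step):
        s = act_mag ** a / w_mag ** (1 - a)
        err = np.linalg.norm(x / s @ fake_quant(w * s).T - ref)
        if err < best[1]:
            best = (a, err)
    return best  # (best alpha, output error under quantization)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128))
x = rng.normal(size=(32, 128)) * rng.uniform(0.1, 10, size=128)  # uneven channels
alpha, err = awq_scale_search(w, x)
```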

Model Details

| Property | Value |
|---|---|
| Architecture | Qwen3.5 (Hybrid, 27B parameters) |
| Layers | 64 (48 GDN + 16 full-attention) |
| Hidden Size | 5120 |
| Attention Heads | 24 (4 KV heads), head_dim=256 |
| Vision Encoder | 27-layer ViT, 460M params (BF16) |
| Max Context | 131,072 tokens |
| Vocabulary | 248,320 tokens |
| Quantization | NVFP4 AWQ_FULL (ModelOpt 0.43.0) |
| Model Size | ~20 GB (quantized + vision) |

NVFP4 Weight Format

Each quantized layer stores:

  • weight (uint8) — packed FP4 E2M1 pairs (16-element blocks)
  • weight_scale (float8_e4m3fn) — per-block scale (1 per 16 elements)
  • weight_scale_2 (float32) — per-tensor global scale
  • pre_quant_scale (bfloat16) — AWQ per-channel pre-scaling factors
  • input_scale (float32) — static activation scale from calibration
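The storage layout above can be decoded with a short numpy sketch. The low-nibble-first packing order is an assumption, and the FP8/FP32 scales are modeled as plain floats:

```python
import numpy as np

# FP4 E2M1 magnitude table: bit 3 is the sign, bits 2..0 index these values.
E2M1_MAG = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_LUT = np.concatenate([E2M1_MAG, -E2M1_MAG])  # codes 0..15

def unpack_fp4(packed: np.ndarray) -> np.ndarray:
    """Split each uint8 into two FP4 codes (low nibble first -- an assumption)."""
    lo = packed & 0x0F
    hi = packed >> 4
    return E2M1_LUT[np.stack([lo, hi], axis=-1).reshape(packed.shape[0], -1)]

def dequantize(packed, weight_scale, weight_scale_2):
    """w ~= fp4_value * per-16-element-block scale * per-tensor global scale."""
    vals = unpack_fp4(packed)                     # (rows, cols)
    blocks = vals.reshape(vals.shape[0], -1, 16)  # group into 16-element blocks
    return (blocks * weight_scale[..., None] * weight_scale_2).reshape(vals.shape)

packed = np.zeros((1, 8), dtype=np.uint8)  # 8 bytes -> one 16-element block
packed[0, 0] = 0x21                        # codes 1 (0.5) and 2 (1.0)
scale = np.array([[2.0]])                  # one per-block scale
w = dequantize(packed, scale, 0.5)         # global scale 0.5
```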

Optimization Stack

| Optimization | Status |
|---|---|
| torch.compile (inductor) | Active |
| CUDA graphs (FULL + PIECEWISE) | Active |
| FlashInfer CUTLASS FP4 GEMM | Autotuned for GB10 |
| Flash Attention v2 | Active |
| Triton/FLA GDN prefill kernel | Active |
| FP8 KV cache | Active (BF16 when DFlash enabled) |
| Chunked prefill | Active |
| Prefix caching | Active |
| Act-quant fusion | Active |

Alternative Deployment Methods

vLLM (Manual)

vllm serve AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 \
  --quantization modelopt \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-27B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --kv-cache-dtype auto \
  --gpu-memory-utilization 0.80 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 8 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code

Note: DFlash uses non-causal attention, which requires --kv-cache-dtype auto (BF16). FP8 KV cache is incompatible with DFlash.

SGLang

python -m sglang.launch_server \
  --model-path AEON-7/DFlash-Qwen3.5-27B-Uncensored-NVFP4 \
  --speculative-algorithm DFLASH \
  --speculative-draft-model-path z-lab/Qwen3.5-27B-DFlash \
  --speculative-num-draft-tokens 16 \
  --tp-size 1 \
  --attention-backend fa3 \
  --mem-fraction-static 0.75 \
  --mamba-scheduler-strategy extra_buffer \
  --trust-remote-code

Credits

Legal Disclaimer

THIS MODEL IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. This model has had safety alignment removed. Users are responsible for ensuring ethical and legal use.
