Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated-int4-AutoRound

INT4 AutoRound quantization of huihui-ai/Huihui-Qwen3.5-35B-A3B-Claude-4.6-Opus-abliterated, optimized for NVIDIA DGX Spark (GB10 SM121) with Marlin INT4 kernel acceleration.

Model Details

| Item | Value |
|------|-------|
| Architecture | Hybrid MoE (35B total, 3B active) + GDN (Mamba) + attention |
| Base model | Qwen/Qwen3.5-35B-A3B |
| Fine-tuned by | huihui-ai (Claude 4.6 Opus distillation + abliteration) |
| Quantized by | YuYu1015 |
| Model size | ~23 GB (vs. ~70 GB for the BF16 original) |
| Context length | Up to 65,536 tokens (limited by the BF16 KV cache on 128 GB) |
| Thinking mode | Supported (`enable_thinking: true/false`) |
| Tool calling | Supported (`qwen3_coder` parser) |
| MTP | Built-in Multi-Token Prediction (MTP) weights included |

Quantization Details

| Item | Value |
|------|-------|
| Method | Intel AutoRound v0.12.2 |
| Bits | 4 |
| Group size | 128 |
| Format | `auto_round` (GPTQ-compatible) |
| Iterations | 200 |
| Calibration samples | 512 |
| Calibration sequence length | 2048 |
| Hardware | NVIDIA DGX Spark (GB10, 128 GB unified memory) |
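The ~23 GB figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes a GPTQ-style layout (one FP16 scale and one packed 4-bit zero-point per 128-weight group); it covers only the quantized tensors, so the BF16-preserved layers listed below account for the remaining gap up to ~23 GB:

```python
# Rough memory footprint of 4-bit, group-size-128 quantization (sketch;
# assumes GPTQ-style FP16 scale + 4-bit zero-point per group).
total_params = 35e9  # total (not active) parameter count

# Per weight: 4 data bits, plus per-group overhead of one FP16 scale
# (16 bits) and one 4-bit zero-point, amortized over 128 weights.
bits_per_weight = 4 + (16 + 4) / 128

int4_gb = total_params * bits_per_weight / 8 / 1e9  # bytes -> GB
bf16_gb = total_params * 16 / 8 / 1e9

print(f"INT4 (quantized tensors only): ~{int4_gb:.1f} GB")
print(f"BF16 original:                 ~{bf16_gb:.0f} GB")
```

The remaining ~5 GB comes from the layers kept in higher precision (embeddings, output head, shared experts, GDN layers).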

Layers Preserved in BF16

The following layers are excluded from AutoRound INT4 quantization to preserve model quality (most are kept in BF16; `mtp.*` is instead copied as RTN INT4):

| Layer | Reason |
|-------|--------|
| `lm_head` | Output head; sensitive to quantization noise |
| `embed_tokens` | Input embeddings (auto-excluded by shape) |
| `shared_expert.*` | Shared-expert weights; process every token |
| `shared_expert_gate` | Shared-expert routing gate |
| `mlp.gate` | MoE routing gate (auto-excluded) |
| `linear_attn.*` | GDN/DeltaNet layers; may output zeros if quantized |
| `model.visual.*` | Vision encoder (auto-excluded by shape) |
| `mtp.*` | Multi-Token Prediction weights (copied as RTN INT4) |
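Exclusion lists like this are typically expressed as a per-layer override map. The sketch below is purely illustrative (the pattern names mirror the table, but the exact AutoRound configuration syntax used for this checkpoint is not shown here):

```python
# Hypothetical per-layer override map: name patterns that should stay in
# higher precision. Simple substring matching for illustration only; the
# real AutoRound layer-config syntax may differ.
fp_layer_config = {
    "lm_head":            {"bits": 16},  # quantization-noise-sensitive output head
    "shared_expert":      {"bits": 16},  # runs on every token
    "shared_expert_gate": {"bits": 16},
    "mlp.gate":           {"bits": 16},  # MoE router
    "linear_attn":        {"bits": 16},  # GDN/DeltaNet layers
}

def keep_bf16(module_name: str) -> bool:
    """Return True if a module name matches any preserved pattern."""
    return any(pat in module_name for pat in fp_layer_config)

print(keep_bf16("model.layers.0.linear_attn.in_proj"))   # matches linear_attn
print(keep_bf16("model.layers.0.mlp.experts.3.up_proj")) # routed expert: quantized
```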

Performance

Tested on a single NVIDIA DGX Spark (GB10, 128GB LPDDR5X, SM121):

| Configuration | Decode speed | Notes |
|---------------|--------------|-------|
| INT4 + DFlash-16 (daily conversation) | 40-60 tok/s | With z-lab/Qwen3.5-35B-A3B-DFlash |

Speculative Decoding

This model supports two speculative decoding methods:

DFlash (requires separate drafter model):

```shell
--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.5-35B-A3B-DFlash", "num_speculative_tokens": 16}'
```

Note: The DFlash drafter was trained on the original Qwen3.5-35B-A3B. Acceptance rate on the abliterated variant may be lower than on the original model.

MTP (uses built-in weights, no extra model needed):

```shell
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```
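Since `--speculative-config` takes a JSON string, shell quoting mistakes are easy to make. Building the string programmatically sidesteps them; a minimal sketch using only the fields shown above:

```python
import json

# DFlash config: names the external drafter model explicitly.
# json.dumps guarantees valid JSON regardless of shell quoting.
dflash_cfg = json.dumps({
    "method": "dflash",
    "model": "z-lab/Qwen3.5-35B-A3B-DFlash",
    "num_speculative_tokens": 16,
})

# MTP config: uses the model's built-in weights, so no drafter is named.
mtp_cfg = json.dumps({"method": "mtp", "num_speculative_tokens": 1})

print(dflash_cfg)
print(mtp_cfg)
```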

Serving with vLLM

```shell
vllm serve /path/to/model \
    --quantization moe_wna16 \
    --served-model-name qwen3.5-35b \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --kv-cache-dtype auto \
    --gpu-memory-utilization 0.90 \
    --max-model-len 65536 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code \
    --language-model-only
```
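Once the server is up, thinking mode can be toggled per request through vLLM's OpenAI-compatible `/v1/chat/completions` endpoint via `chat_template_kwargs`. A sketch of the request payload (built with the stdlib only; sending it is left to your HTTP client of choice):

```python
import json

# Payload for vLLM's OpenAI-compatible chat endpoint. The model name
# matches --served-model-name above; enable_thinking toggles thinking
# mode per request (vLLM forwards chat_template_kwargs to the template).
payload = json.dumps({
    "model": "qwen3.5-35b",
    "messages": [{"role": "user", "content": "Explain MoE routing briefly."}],
    "max_tokens": 512,
    "chat_template_kwargs": {"enable_thinking": False},
})
print(payload)
# POST to http://localhost:8000/v1/chat/completions with
# Content-Type: application/json and this payload as the body.
```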

DGX Spark (SM121) Compatibility Notes

  • Use `--quantization moe_wna16` to get the Marlin INT4 kernel (SM121 is supported via SM120 binary compatibility)
  • FP8 KV cache is not compatible with the GDN non-causal attention layers; use `--kv-cache-dtype auto`
  • NVFP4 is not supported on SM121 (the `cvt.e2m1x2` instruction is missing)
  • Runtime FP8 (`--quantization fp8`) is incompatible with DFlash (the drafter inherits the FP8 config and crashes)
  • `--language-model-only` skips vision-encoder profiling for text-only inference
  • `--performance-mode interactivity` enables latency-optimized CUDA graphs and kernels
  • On UMA, clear the page cache before starting: `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`
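The notes above can be rolled into a small launch helper. A sketch (the model path is a placeholder, and the flag set simply condenses the serve command shown earlier):

```shell
# Write a launch helper that clears the UMA page cache and then starts
# vLLM with the SM121-friendly flags from this document.
cat > launch_dgx_spark.sh <<'EOF'
#!/bin/sh
set -e
# Free cached pages so the ~23 GB of weights load into unified memory cleanly
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
exec vllm serve /path/to/model \
    --quantization moe_wna16 \
    --served-model-name qwen3.5-35b \
    --kv-cache-dtype auto \
    --max-model-len 65536 \
    --language-model-only "$@"
EOF
chmod +x launch_dgx_spark.sh
echo "wrote launch_dgx_spark.sh"
```

Extra flags (speculative decoding, tool calling) can be appended at invocation time thanks to the trailing `"$@"`.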

Safety Warning

This model has safety filtering removed (abliterated) and may generate inappropriate content. Users are solely responsible for all consequences arising from its use. Please ensure usage complies with local laws and ethical standards.

Credits


