Huihui-gemma-4-E2B-it-abliterated-NVFP4


NVFP4 quantization of huihui-ai/Huihui-gemma-4-E2B-it-abliterated, quantized with NVIDIA ModelOpt using the NVFP4_MLP_ONLY strategy (only MLP layers are quantized; attention is preserved in higher precision).

Model Details

| Item | Value |
|---|---|
| Architecture | Dense, Per-Layer Embeddings (PLE), ~2.3B effective parameters |
| Base model | google/gemma-4-E2B-it |
| Fine-tuned by | huihui-ai (abliteration) |
| Quantized by | YuYu1015 |
| Model size | ~7.4 GB (NVFP4) |
| Context length | Up to 128,000 tokens |
| Multimodal | Vision + audio supported |

Quantization Details

| Item | Value |
|---|---|
| Method | NVIDIA ModelOpt v0.42.0 |
| Scheme | NVFP4 (E2M1 + FP8 per-group scaling, group size 16) |
| Strategy | NVFP4_MLP_ONLY (only MLP/FFN layers quantized; all attention layers preserved) |
| Calibration dataset | abisee/cnn_dailymail |
| Calibration samples | 512 |
| Hardware | NVIDIA DGX Spark (GB10, 128 GB unified memory) |
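As a rough illustration of the scheme above (a simplified sketch, not ModelOpt's actual implementation): 4-bit E2M1 can only represent the signed magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}, and each group of 16 weights shares one scale. The FP8 (E4M3) scale is approximated by a plain float here for clarity.

```python
# Hedged sketch of NVFP4-style per-group quantization (NOT ModelOpt code).
# Real NVFP4 stores the per-group scale in FP8 E4M3; a float stands in here.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive E2M1 magnitudes

def quantize_group(group):
    """Quantize one group of 16 floats to nearest signed E2M1 value * scale."""
    amax = max(abs(x) for x in group) or 1.0
    scale = amax / 6.0  # map the group's max magnitude onto E2M1's max (6.0)
    deq = []
    for x in group:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        deq.append(mag * scale * (1.0 if x >= 0 else -1.0))
    return deq, scale

weights = [0.02 * i - 0.15 for i in range(16)]  # one toy group of 16 weights
deq, scale = quantize_group(weights)
err = max(abs(a - b) for a, b in zip(weights, deq))
# Storage: 4 bits/element + one 8-bit scale per 16 elements = 4.5 bits/weight.
print(f"scale={scale:.4f} max_abs_err={err:.4f}")
```

The 4.5 bits/weight figure explains why only part of the ~7.4 GB footprint shrinks: the attention, embedding, and encoder weights listed below stay in higher precision.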

Layers Preserved in Higher Precision

| Layer | Reason |
|---|---|
| self_attn.* (all layers) | Attention preserved for accuracy (MLP_ONLY strategy) |
| lm_head | Output head |
| vision_tower.* | Vision encoder |
| audio_tower.* | Audio encoder |
| multi_modal_projector.* | Multimodal projection |
| embed_tokens | Input embeddings |
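The exclusion list above amounts to glob patterns over module names. A minimal, hypothetical helper (the function name and pattern spellings are illustrative, not ModelOpt's API):

```python
# Hypothetical MLP_ONLY layer filter mirroring the exclusion table above.
from fnmatch import fnmatch

EXCLUDE_PATTERNS = [
    "*self_attn*",            # attention kept in higher precision
    "lm_head*",               # output head
    "vision_tower*",          # vision encoder
    "audio_tower*",           # audio encoder
    "multi_modal_projector*", # multimodal projection
    "*embed_tokens*",         # input embeddings
]

def should_quantize(module_name: str) -> bool:
    """True if the module is an MLP/FFN weight eligible for NVFP4."""
    return not any(fnmatch(module_name, pat) for pat in EXCLUDE_PATTERNS)

print(should_quantize("model.layers.0.mlp.gate_proj"))     # True
print(should_quantize("model.layers.0.self_attn.q_proj"))  # False
```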

Serving with vLLM

```bash
vllm serve /path/to/model \
    --quantization modelopt \
    --served-model-name gemma-4-e2b \
    --trust-remote-code \
    --gpu-memory-utilization 0.90 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --language-model-only
```
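Once serving, the model answers on vLLM's OpenAI-compatible endpoint. A minimal client sketch, assuming vLLM's default host and port (localhost:8000); the request is only constructed here, since sending it requires the running server:

```python
# Build a chat-completions request for the vLLM server started above.
# Assumes the default endpoint http://localhost:8000 (adjust as needed).
import json
import urllib.request

payload = {
    "model": "gemma-4-e2b",  # must match --served-model-name
    "messages": [{"role": "user", "content": "Summarize NVFP4 in one sentence."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # uncomment with the server running
print(json.loads(req.data)["model"])
```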

DGX Spark (SM121) Compatibility Notes

- NVFP4 on SM121 falls back to W4A16 (the native W4A4 path is unavailable; the `cvt.e2m1x2` instruction is missing)
- Use `--quantization modelopt` (not `compressed-tensors`)
- `--language-model-only` skips vision/audio encoder profiling for text-only inference
- On UMA systems, clear the page cache before starting: `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`

Safety Warning

This model has safety filtering removed (abliterated) and may generate inappropriate content. Users are solely responsible for all consequences arising from its use.

Credits


