Huihui-gemma-4-E2B-it-abliterated-NVFP4


NVFP4 quantization of huihui-ai/Huihui-gemma-4-E2B-it-abliterated, quantized with NVIDIA ModelOpt using the NVFP4_MLP_ONLY strategy (only MLP layers are quantized; attention is preserved in higher precision).

Model Details

| Item | Value |
|---|---|
| Architecture | Dense, Per-Layer Embeddings (PLE), ~2.3B effective parameters |
| Base model | google/gemma-4-E2B-it |
| Fine-tuned by | huihui-ai (abliteration) |
| Quantized by | YuYu1015 |
| Model size | ~7.4 GB (NVFP4) |
| Context length | Up to 128,000 tokens |
| Multimodal | Vision + audio supported |

Quantization Details

| Item | Value |
|---|---|
| Method | NVIDIA ModelOpt v0.42.0 |
| Scheme | NVFP4 (E2M1 + FP8 per-group scaling, group size 16) |
| Strategy | NVFP4_MLP_ONLY (only MLP/FFN layers quantized; all attention layers preserved) |
| Calibration dataset | abisee/cnn_dailymail |
| Calibration samples | 512 |
| Hardware | NVIDIA DGX Spark (GB10, 128 GB unified memory) |
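As a rough illustration of the scheme above (a simplified sketch, not ModelOpt's actual implementation): 4-bit E2M1 can only represent the signed magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}, and each group of 16 weights shares one scale. The FP8 (E4M3) scale is approximated by a plain float here for clarity.

```python
# Hedged sketch of NVFP4-style per-group quantization (NOT ModelOpt code).
# Real NVFP4 stores the per-group scale in FP8 E4M3; a float stands in here.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive E2M1 magnitudes

def quantize_group(group):
    """Quantize one group of 16 floats to nearest signed E2M1 value * scale."""
    amax = max(abs(x) for x in group) or 1.0
    scale = amax / 6.0  # map the group's max magnitude onto E2M1's max (6.0)
    deq = []
    for x in group:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        deq.append(mag * scale * (1.0 if x >= 0 else -1.0))
    return deq, scale

weights = [0.02 * i - 0.15 for i in range(16)]  # one toy group of 16 weights
deq, scale = quantize_group(weights)
err = max(abs(a - b) for a, b in zip(weights, deq))
# Storage: 4 bits/element + one 8-bit scale per 16 elements = 4.5 bits/weight.
print(f"scale={scale:.4f} max_abs_err={err:.4f}")
```

The 4.5 bits/weight figure explains why only part of the ~7.4 GB footprint shrinks: the attention, embedding, and encoder weights listed below stay in higher precision.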

Layers Preserved in Higher Precision

| Layer | Reason |
|---|---|
| self_attn.* (all layers) | Attention preserved for accuracy (MLP_ONLY strategy) |
| lm_head | Output head |
| vision_tower.* | Vision encoder |
| audio_tower.* | Audio encoder |
| multi_modal_projector.* | Multimodal projection |
| embed_tokens | Input embeddings |
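The exclusion list above amounts to glob patterns over module names. A minimal, hypothetical helper (the function name and pattern spellings are illustrative, not ModelOpt's API):

```python
# Hypothetical MLP_ONLY layer filter mirroring the exclusion table above.
from fnmatch import fnmatch

EXCLUDE_PATTERNS = [
    "*self_attn*",            # attention kept in higher precision
    "lm_head*",               # output head
    "vision_tower*",          # vision encoder
    "audio_tower*",           # audio encoder
    "multi_modal_projector*", # multimodal projection
    "*embed_tokens*",         # input embeddings
]

def should_quantize(module_name: str) -> bool:
    """True if the module is an MLP/FFN weight eligible for NVFP4."""
    return not any(fnmatch(module_name, pat) for pat in EXCLUDE_PATTERNS)

print(should_quantize("model.layers.0.mlp.gate_proj"))     # True
print(should_quantize("model.layers.0.self_attn.q_proj"))  # False
```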

Serving with vLLM

```bash
vllm serve /path/to/model \
    --quantization modelopt \
    --served-model-name gemma-4-e2b \
    --trust-remote-code \
    --gpu-memory-utilization 0.90 \
    --max-model-len 32768 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --language-model-only
```
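Once serving, the model answers on vLLM's OpenAI-compatible endpoint. A minimal client sketch, assuming vLLM's default host and port (localhost:8000); the request is only constructed here, since sending it requires the running server:

```python
# Build a chat-completions request for the vLLM server started above.
# Assumes the default endpoint http://localhost:8000 (adjust as needed).
import json
import urllib.request

payload = {
    "model": "gemma-4-e2b",  # must match --served-model-name
    "messages": [{"role": "user", "content": "Summarize NVFP4 in one sentence."}],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # uncomment with the server running
print(json.loads(req.data)["model"])
```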

DGX Spark (SM121) Compatibility Notes

- NVFP4 on SM121 falls back to W4A16 (the native W4A4 path is unavailable; the `cvt.e2m1x2` instruction is missing)
- Use `--quantization modelopt` (not `compressed-tensors`)
- `--language-model-only` skips vision/audio encoder profiling for text-only inference
- On UMA systems, clear the page cache before starting: `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`

Safety Warning

This model has safety filtering removed (abliterated) and may generate inappropriate content. Users are solely responsible for all consequences arising from its use.

Credits


