# Huihui-gemma-4-31B-it-abliterated-v2-NVFP4
This repository contains an NVFP4-compressed version of huihui-ai/Huihui-gemma-4-31B-it-abliterated-v2.
The goal of this build is to produce an NVFP4 checkpoint that actually fits on a single 32 GB RTX 5090 while preserving the multimodal pipeline and the abliteration intervention. The existing NVIDIA reference release keeps every `self_attn` layer in BF16, which pushes the on-disk weights past 32 GB and makes single-GPU serving impossible. This build quantizes `self_attn` as well, so only the vision tower, embeddings, and `lm_head` remain in BF16.
## What Changed In v2 (vs v1)
The source model (huihui-ai/...-abliterated-v2) differs from v1 in one key way: the first 5 text layers (layers 0–4) are no longer abliterated. They retain the original google/gemma-4-31B-it weights. Layers 5–59 are still abliterated but with a refusal direction recomputed excluding those early layers.
Per the source model card, this produces lower perplexity (better quality) while maintaining the same level of refusal removal:
| Model | PPL | Gap vs base |
|---|---|---|
| google/gemma-4-31B-it (base) | 14874.75 | — |
| v1 abliterated | 13335.55 | -1539 |
| v2 abliterated | 13161.29 | -1713 (174 lower than v1) |
In local side-by-side testing of the NVFP4 quantized versions, v2 showed tighter instruction following (stricter JSON output, more efficient reasoning steps within the same token budget) while retaining identical abliteration effectiveness and decode throughput (~69 tok/s).
## Source And References
- Source model: huihui-ai/Huihui-gemma-4-31B-it-abliterated-v2
- Upstream base model: google/gemma-4-31B-it
- NVIDIA reference release (conservative ignore list): nvidia/Gemma-4-31B-IT-NVFP4
- RedHatAI reference release: RedHatAI/gemma-4-31B-it-NVFP4
- Quantization library: vllm-project/llm-compressor (main branch, `examples/multimodal_vision/gemma4_example.py` as the starting template)
- vLLM build used for validation: `vllm/vllm-openai:gemma4-cu130`
## Supported Modalities
This checkpoint matches the input/output capabilities of the upstream google/gemma-4-31B-it exactly — no modality was added or removed by quantization.
| Modality | Status | Notes |
|---|---|---|
| Text → Text | ✅ Supported | Verified locally across English, Chinese, reasoning, code, long-form, and JSON-constrained outputs |
| Image → Text | ✅ Supported | Vision tower (27-layer SigLIP-style, ~550M params) preserved in BF16 via the `re:.*vision.*` ignore pattern. Verified locally with multiple real images at species-level accuracy |
| Video → Text | ✅ Supported | Uses the same vision tower, with `Gemma4VideoProcessor` sampling 32 frames at 70 soft-tokens per frame. Verified locally by posting a real 5-second MP4 through the vLLM `chat/completions` endpoint with a `video_url` content part |
| Audio → Text | ❌ Not supported | The 31B dense checkpoint has no audio encoder: `config.json` has `audio_config: null` and there are zero audio-related tensors in the safetensors. Per the Gemma 4 model card, audio ASR/translation is available only on the E2B and E4B variants. This is a property of the upstream weights, not the quantization |
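The image and video rows above were verified by posting content parts to the served endpoint. The following is a minimal sketch of how such requests can be assembled for a vLLM OpenAI-compatible server; the base URL is an assumption (vLLM's default port), and the model name matches the `--served-model-name` used in the compose file in this card.

```python
# Sketch: building multimodal chat requests for a vLLM OpenAI-compatible
# endpoint. Nothing is sent here; POST the body to f"{BASE_URL}/chat/completions".
import base64
import json

BASE_URL = "http://localhost:8000/v1"    # assumption: vLLM default port
MODEL = "huihui-gemma4-31b-nvfp4"        # --served-model-name from the compose file

def image_message(prompt: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """One user turn containing text plus an inline base64-encoded image."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

def video_message(prompt: str, video_bytes: bytes, mime: str = "video/mp4") -> dict:
    """One user turn containing text plus an inline base64-encoded video
    (video_url is a vLLM extension of the OpenAI content-part schema)."""
    b64 = base64.b64encode(video_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "video_url", "video_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

body = {"model": MODEL, "messages": [image_message("What animal is this?", b"\xff\xd8fake")]}
print(json.dumps(body)[:80])
```

Audio content parts are intentionally absent: per the table above, the 31B checkpoint has no audio encoder.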
## A Note On The `any-to-any` Tag Seen On Some Related Repos

The huihui source repository carries an `any-to-any` tag among its Hub tags. That tag does not reflect the actual capabilities of the 31B weights: the upstream google/gemma-4-31B-it is tagged `image-text-to-text`, and its model card explicitly restricts audio support to the E2B/E4B variants. The `Gemma4Processor` in `processor_config.json` does include a `Gemma4AudioFeatureExtractor` section (because the `Gemma4ForConditionalGeneration` architecture class is a superset capable of hosting audio), but the 31B checkpoint ships without audio encoder weights. This release intentionally omits the `any-to-any` tag to avoid propagating that mislabel.
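The claim above can be checked directly from the files a checkpoint ships, without loading any weights. The sketch below uses simplified stand-ins for `config.json` and the safetensors tensor index; the helper name is ours, not a library API.

```python
# Sketch: inferring supported input modalities from config fields and tensor
# names, as shipped in a checkpoint. Simplified illustration only.
def detect_modalities(config: dict, tensor_names: list[str]) -> set[str]:
    """Infer input modalities from config sections plus matching tensors."""
    modalities = {"text"}
    if config.get("vision_config") and any("vision" in n for n in tensor_names):
        modalities.add("image")   # the video path reuses the same vision tower
    if config.get("audio_config") and any("audio" in n for n in tensor_names):
        modalities.add("audio")
    return modalities

# What this 31B checkpoint looks like: vision_config present, audio_config
# null, and no audio_* tensors anywhere in the safetensors index.
config = {"vision_config": {"num_hidden_layers": 27}, "audio_config": None}
tensors = [
    "model.vision_tower.blocks.0.attn.qkv.weight",
    "model.language_model.layers.0.mlp.gate_proj.weight",
]
print(sorted(detect_modalities(config, tensors)))  # ['image', 'text']
```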
## What Was Quantized
- Every `Linear` layer in the text stack, including `self_attn.{q,k,v,o}_proj` and all `mlp.*_proj`, was quantized to NVFP4 with `llm-compressor`.
- The vision tower, all `embed_*` tensors, the `lm_head`, and any `audio_*` modules remain in BF16, so the multimodal pipeline and vocabulary pathways are preserved.
- Calibration used `mit-han-lab/pile-val-backup`: 128 samples at 1024 tokens per sample, batch size 1.
- Quantization recipe (from `recipe.yaml`):

```yaml
QuantizationModifier:
  targets: [Linear]
  ignore: ['re:.*vision.*', 're:.*audio.*', lm_head, 're:.*embed.*']
  scheme: NVFP4
```
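To see how that ignore list maps onto concrete module names: compressed-tensors treats entries prefixed with `re:` as regexes and everything else as exact names. The matcher below is a simplified re-implementation for illustration, not the library's actual code.

```python
# Sketch: which modules the recipe leaves in BF16. Simplified stand-in for
# the compressed-tensors ignore-matching logic.
import re

IGNORE = ["re:.*vision.*", "re:.*audio.*", "lm_head", "re:.*embed.*"]

def is_ignored(module_name: str, ignore: list[str] = IGNORE) -> bool:
    for entry in ignore:
        if entry.startswith("re:"):
            if re.fullmatch(entry[3:], module_name):
                return True
        elif module_name == entry:
            return True
    return False

for name in [
    "model.language_model.layers.10.self_attn.q_proj",  # quantized in this build
    "model.language_model.layers.10.mlp.gate_proj",     # quantized
    "model.vision_tower.blocks.3.attn.qkv",             # ignored (BF16)
    "model.language_model.embed_tokens",                # ignored (BF16)
    "lm_head",                                          # ignored (BF16)
]:
    print(f"{name}: {'BF16' if is_ignored(name) else 'NVFP4'}")
```

The NVIDIA reference release effectively adds a `re:.*self_attn.*`-style entry to this list, which is exactly the difference summarized in the next table.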
## Ignore List Compared To The NVIDIA Reference

| Component | NVIDIA Gemma-4-31B-IT-NVFP4 | This build |
|---|---|---|
| MLP Linear layers | NVFP4 | NVFP4 |
| `self_attn.*` Linear layers | BF16 (excluded) | NVFP4 |
| Vision tower | BF16 (excluded) | BF16 (excluded) |
| `embed_*` | BF16 (excluded) | BF16 (excluded) |
| `lm_head` | BF16 (excluded) | BF16 (excluded) |
| On-disk size | ~32.6 GB | ~19.5 GB |
| Single-GPU 32 GB fit | No | Yes |
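The size gap follows from NVFP4's storage cost: 4-bit FP4 values with one FP8 scale per 16-element block (plus a negligible per-tensor scale), i.e. about 4.5 bits per quantized weight versus 16 bits for BF16. The parameter split below is an illustrative assumption, not an exact count from the checkpoint.

```python
# Back-of-the-envelope storage math for the table above.
def nvfp4_bytes_per_param(block_size: int = 16) -> float:
    """4-bit values + one 8-bit scale per block, converted to bytes."""
    return (4 + 8 / block_size) / 8

def checkpoint_gb(quantized_params: float, bf16_params: float) -> float:
    """Rough on-disk size: NVFP4 weights plus BF16 (2 bytes/param) remainder."""
    bytes_total = quantized_params * nvfp4_bytes_per_param() + bf16_params * 2.0
    return bytes_total / 1e9

print(f"{nvfp4_bytes_per_param():.4f} bytes/param")  # 0.5625
# e.g. ~27B quantized + ~2B kept in BF16 (assumed split):
print(f"~{checkpoint_gb(27e9, 2e9):.1f} GB")         # ~19.2 GB, near the ~19.5 GB observed
```

The same arithmetic with all `self_attn` weights kept in BF16 lands well past 32 GB, which is why the NVIDIA reference cannot be served on a single 32 GB card.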
## Inference

As tested locally, this model works with the `vllm/vllm-openai:gemma4-cu130` image on an RTX 5090 (32 GB). Representative launch via docker compose:

```yaml
services:
  vllm:
    image: vllm/vllm-openai:gemma4-cu130
    network_mode: host
    ipc: host
    devices:
      - nvidia.com/gpu=all
    volumes:
      - ./Huihui-gemma-4-31B-it-abliterated-v2-NVFP4:/models/huihui-gemma4-nvfp4:ro
    environment:
      - VLLM_NVFP4_GEMM_BACKEND=marlin
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    command:
      - /models/huihui-gemma4-nvfp4
      - --served-model-name=huihui-gemma4-31b-nvfp4
      - --max-model-len=102400
      - --gpu-memory-utilization=0.95
      - --kv-cache-dtype=fp8
      - --trust-remote-code
      - --enable-prefix-caching
      - --enable-auto-tool-choice
      - --tool-call-parser=gemma4
```
Observed on RTX 5090 at `max-model-len=102400`, `gpu-memory-utilization=0.95`:

- ~29.7 GB VRAM occupied (weights + KV cache + CUDA runtime)
- ~69 tokens/second single-stream decode throughput
- Available KV cache: 8.55 GiB; maximum concurrency for 102,400 tokens: 1.15x
- Engine init (profile + KV cache + warmup): ~51 seconds
- Tool calling enabled with the native `gemma4` parser
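The 1.15x concurrency figure is vLLM's ratio of available KV cache to the cache one full-length sequence needs. From the numbers reported above we can back out the implied per-token KV footprint at fp8:

```python
# Sketch: relating vLLM's reported KV cache size, max model length, and
# max concurrency. All inputs are the figures observed above.
GIB = 2**30
kv_cache_bytes = 8.55 * GIB
max_model_len = 102_400
concurrency = 1.15

# concurrency = kv_cache_bytes / (max_model_len * bytes_per_token), rearranged:
bytes_per_token = kv_cache_bytes / (max_model_len * concurrency)
print(f"~{bytes_per_token / 1024:.0f} KiB of KV cache per token")  # ~76 KiB
```

At a shorter `--max-model-len`, the same 8.55 GiB pool supports proportionally more concurrent sequences.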
## Validation
A local quality sweep was run against the served OpenAI-compatible endpoint. All tests passed.
### Text (7 / 7)
- English coherence with sentence-count constraint
- Chinese generation with first-sentence geographic-location constraint
- Multi-step arithmetic (two trains converging, correct reasoning chain)
- Strict JSON output: parsed cleanly by `json.loads`; output was bare JSON without markdown fencing
- 400-word long-form narrative generation: 21 / 21 unique sentences, no degeneration
- Python code generation with assert-based test cases
- Abliteration smoke test: the refusal-vector intervention from the source model is preserved through the NVFP4 round-trip
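The strict-JSON criterion above can be expressed as a small check: the reply must parse with `json.loads` and must not arrive wrapped in markdown fencing. The helper name is ours, written as a sketch of the test, not the exact harness code.

```python
# Sketch: the "bare JSON" pass/fail criterion from the text sweep.
import json

def is_bare_json(reply: str) -> bool:
    """True only if the reply is valid JSON with no markdown code fence."""
    text = reply.strip()
    if text.startswith("```"):      # fenced output fails the strict test
        return False
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(is_bare_json('{"name": "test", "ok": true}'))    # True  (v2-style output)
print(is_bare_json('```json\n{"name": "test"}\n```'))  # False (v1-style output)
```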
### Image (4 / 4)

Posted real JPEGs via the `image_url` content part:
| Subject | Model output | Verdict |
|---|---|---|
| Black Labrador puppy | "The image shows a black dog." | ✅ |
| Walruses on a beach | "Several walruses are lounging and resting on a sandy beach. Overcast and cool, hazy grey sky." | ✅ species-level |
| Pug wrapped in blanket | "Dog (Pug), Blanket, Bed/Bedding" | ✅ breed-level |
| Grey-tone cat portrait | "grays... whites/off-whites... soft pinks/browns on the nose" | ✅ |
### Video (1 / 1)

Posted a real 5-second MP4 (~2.85 MB) via the `video_url` content part. `prompt_tokens = 2436`, consistent with 32 frames × 70 soft-tokens per frame plus the text wrapper.
Model output: "A white truck and several cars are parked along a road bordering a lush, green park. The scene features tall trees, a grassy area with a path, and a park bench."
Spatial and object-level description are correct. End-to-end latency was ~2 seconds.
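The token accounting behind that `prompt_tokens` figure is straightforward arithmetic:

```python
# Video prompt token accounting: 32 sampled frames at 70 soft-tokens each,
# with the remainder attributable to the chat template and text prompt.
frames, tokens_per_frame = 32, 70
prompt_tokens = 2436

video_tokens = frames * tokens_per_frame
text_wrapper = prompt_tokens - video_tokens
print(video_tokens, text_wrapper)  # 2240 196
```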
## v1 vs v2 Comparison (NVFP4)
Both versions were tested with the same quality battery on the same hardware:
| Test | v1 | v2 | Notes |
|---|---|---|---|
| English coherence | ✅ 31 tok/s | ✅ 36 tok/s | v2 slightly more articulate |
| Chinese | ✅ 69 tok/s | ✅ 69 tok/s | Equivalent quality |
| Math reasoning | Correct chain, truncated before final answer | Correct chain, reached "1h 21min" conclusion | v2 more token-efficient |
| JSON output | Pretty-printed with fencing | Bare single-line JSON | v2 stricter instruction following |
| Long-form story | 27/27 unique sentences | 21/21 unique sentences | Both zero degeneration |
| Code | Correct | Correct | Near-identical |
| Abliteration | Zero refusal | Zero refusal | Both fully preserved |
## Formal Benchmarks
Intentionally omitted for now. They will be added later once a reproducible evaluation harness run is available.
## Notes

- Architecture: `Gemma4ForConditionalGeneration`
- Model type: `gemma4`
- Pipeline tag: `image-text-to-text`
- Quantization format: `nvfp4-pack-quantized` (compressed-tensors)
- KV cache precision at inference time (recommended): `fp8`
- Requires `transformers >= 5.5.0` to load `gemma4` configurations
- The multimodal processing files (`processor_config.json`, `chat_template.jinja`) are preserved, so vision input still works; text-only usage is also unchanged