Huihui-gemma-4-31B-it-abliterated-v2-NVFP4

This repository contains an NVFP4-compressed version of huihui-ai/Huihui-gemma-4-31B-it-abliterated-v2.

The goal of this build is to produce an NVFP4 checkpoint that actually fits on a single 32 GB RTX 5090 while preserving the multimodal pipeline and the abliteration intervention. The existing NVIDIA reference release keeps every self_attn layer in BF16, which pushes the on-disk weights past 32 GB and makes single-GPU serving impossible. This build quantizes self_attn as well, so only the vision tower, embeddings, and lm_head remain in BF16.

What Changed In v2 (vs v1)

The source model (huihui-ai/...-abliterated-v2) differs from v1 in one key way: the first 5 text layers (layers 0–4) are no longer abliterated. They retain the original google/gemma-4-31B-it weights. Layers 5–59 are still abliterated but with a refusal direction recomputed excluding those early layers.

Per the source model card, this produces lower perplexity (better quality) while maintaining the same level of refusal removal:

| Model | PPL | Gap vs base |
| --- | --- | --- |
| google/gemma-4-31B-it (base) | 14874.75 | baseline |
| v1 abliterated | 13335.55 | -1539 |
| v2 abliterated | 13161.29 | -1713 (174 below v1) |
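
As a quick sanity check, the gap column can be re-derived from the reported perplexities (the PPL values themselves are quoted from the source model card):

```python
# Perplexities quoted from the source model card.
base_ppl = 14874.75
v1_ppl = 13335.55
v2_ppl = 13161.29

# Gap vs base: negative means lower PPL than the base model.
v1_gap = round(v1_ppl - base_ppl, 2)  # -1539.2
v2_gap = round(v2_ppl - base_ppl, 2)  # -1713.46

# v2 lands about 174 PPL below v1.
delta = round(v1_ppl - v2_ppl, 2)     # 174.26
print(v1_gap, v2_gap, delta)
```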

In local side-by-side testing of the NVFP4 quantized versions, v2 showed tighter instruction following (stricter JSON output, more efficient reasoning steps within the same token budget) while retaining identical abliteration effectiveness and decode throughput (~69 tok/s).

Source And References

Supported Modalities

This checkpoint matches the input/output capabilities of the upstream google/gemma-4-31B-it exactly — no modality was added or removed by quantization.

| Modality | Status | Notes |
| --- | --- | --- |
| Text → Text | ✅ Supported | Verified locally across English, Chinese, reasoning, code, long-form, and JSON-constrained outputs |
| Image → Text | ✅ Supported | Vision tower (27-layer SigLIP-style, ~550M params) preserved in BF16 via the `re:.*vision.*` ignore pattern; verified locally with multiple real images at species-level accuracy |
| Video → Text | ✅ Supported | Uses the same vision tower with the Gemma4VideoProcessor sampling 32 frames at 70 soft-tokens per frame; verified locally by posting a real 5-second MP4 through the vLLM chat/completions endpoint with a `video_url` content part |
| Audio → Text | ❌ Not supported | The 31B dense checkpoint has no audio encoder: `config.json` has `audio_config: null` and there are zero audio-related tensors in the safetensors. Per the Gemma 4 model card, audio ASR/translation is available only on the E2B and E4B variants. This is a property of the upstream weights, not the quantization |

A Note On The any-to-any Tag Seen On Some Related Repos

The huihui source repository carries an any-to-any tag among its Hub tags. That tag does not reflect the actual capabilities of the 31B weights — the upstream google/gemma-4-31B-it is tagged as image-text-to-text and its model card explicitly restricts audio support to the E2B/E4B variants. The Gemma4Processor in processor_config.json does include a Gemma4AudioFeatureExtractor section (because the Gemma4ForConditionalGeneration architecture class is a superset capable of hosting audio), but the 31B checkpoint ships without audio encoder weights. This release intentionally does not carry an any-to-any tag, to avoid propagating that mislabel.
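
This is easy to verify against a local download: a checkpoint with audio support would carry an `audio_config` section and audio-prefixed tensors. A minimal sketch (`has_audio_stack` is a hypothetical helper; in practice feed it the parsed `config.json` and the tensor names from `model.safetensors.index.json`):

```python
def has_audio_stack(config: dict, tensor_names: list) -> bool:
    """True if the checkpoint carries an audio encoder, judged by the
    config section and the tensor names in the safetensors index."""
    if config.get("audio_config") is not None:
        return True
    return any("audio" in name for name in tensor_names)

# Values as described above for the 31B checkpoint:
cfg = {"audio_config": None}
names = [
    "language_model.model.layers.0.self_attn.q_proj.weight",
    "vision_tower.encoder.layers.0.mlp.fc1.weight",
]
print(has_audio_stack(cfg, names))  # False: no audio encoder in these weights
```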

What Was Quantized

  • Every Linear layer in the text stack — including self_attn.{q,k,v,o}_proj and all mlp.*_proj — was quantized to NVFP4 with llm-compressor.
  • The vision tower, all embed_* tensors, the lm_head, and any audio_* modules remain in BF16 so the multimodal pipeline and vocabulary pathways are preserved.
  • Calibration used mit-han-lab/pile-val-backup, 128 samples at 1024 tokens per sample, batch size 1.
  • Quantization recipe (from recipe.yaml):

    ```yaml
    QuantizationModifier:
      targets: [Linear]
      ignore: ['re:.*vision.*', 're:.*audio.*', 'lm_head', 're:.*embed.*']
      scheme: NVFP4
    ```

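The ignore list mixes regex entries (prefixed `re:`) with exact module names. A small sketch of how such patterns resolve against module names (my reading of the compressed-tensors convention; `is_ignored` is an illustrative helper, not part of llm-compressor):

```python
import re

# Entries prefixed "re:" act as regexes against the module name;
# bare entries match the name exactly.
IGNORE = ["re:.*vision.*", "re:.*audio.*", "lm_head", "re:.*embed.*"]

def is_ignored(module_name: str, ignore=IGNORE) -> bool:
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif module_name == entry:
            return True
    return False

# Unlike the NVIDIA reference, self_attn projections are NOT ignored here,
# so they get quantized to NVFP4:
print(is_ignored("model.layers.3.self_attn.q_proj"))        # False
print(is_ignored("vision_tower.encoder.layers.0.mlp.fc1"))  # True
print(is_ignored("lm_head"))                                # True
```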
Ignore List Compared To The NVIDIA Reference

| Component | NVIDIA Gemma-4-31B-IT-NVFP4 | This build |
| --- | --- | --- |
| MLP Linear layers | NVFP4 | NVFP4 |
| self_attn.* Linear layers | BF16 (excluded) | NVFP4 |
| Vision tower | BF16 (excluded) | BF16 (excluded) |
| embed_* | BF16 (excluded) | BF16 (excluded) |
| lm_head | BF16 (excluded) | BF16 (excluded) |
| On-disk size | ~32.6 GB | ~19.5 GB |
| Single-GPU 32 GB fit | No | Yes |
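
A back-of-envelope size check, using an assumed parameter split (roughly 28B quantized text params and ~3B kept in BF16; these splits are estimates, not measured values):

```python
GiB = 1024**3

# Assumed split for the ~31B checkpoint (estimates, not measured):
quantized_params = 28e9   # text-stack Linear weights, NVFP4
bf16_params = 3e9         # vision tower + embeddings + lm_head, BF16

# NVFP4 stores 4-bit packed weights plus per-block FP8 scales,
# so the effective cost is roughly 4.5 bits per parameter.
nvfp4_bytes = quantized_params * 4.5 / 8
bf16_bytes = bf16_params * 2

total = (nvfp4_bytes + bf16_bytes) / GiB
print(round(total, 1))  # lands in the same ~20 GB ballpark as the reported size
```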

Inference

As tested locally, this model works with the vllm/vllm-openai:gemma4-cu130 image on an RTX 5090 (32 GB). Representative launch via docker compose:

```yaml
services:
  vllm:
    image: vllm/vllm-openai:gemma4-cu130
    network_mode: host
    ipc: host
    devices:
      - nvidia.com/gpu=all
    volumes:
      - ./Huihui-gemma-4-31B-it-abliterated-v2-NVFP4:/models/huihui-gemma4-nvfp4:ro
    environment:
      - VLLM_NVFP4_GEMM_BACKEND=marlin
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    command:
      - /models/huihui-gemma4-nvfp4
      - --served-model-name=huihui-gemma4-31b-nvfp4
      - --max-model-len=102400
      - --gpu-memory-utilization=0.95
      - --kv-cache-dtype=fp8
      - --trust-remote-code
      - --enable-prefix-caching
      - --enable-auto-tool-choice
      - --tool-call-parser=gemma4
```
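
Once the container is up, requests go to the standard OpenAI-compatible endpoint. A minimal image request body as a sketch (the base64 data URL is a placeholder; the model name matches --served-model-name above, and vLLM's default port of 8000 is assumed):

```python
import json

# Minimal multimodal request; an http(s) image URL also works in place
# of the base64 data URL placeholder.
payload = {
    "model": "huihui-gemma4-31b-nvfp4",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,..."}},
        ],
    }],
    "max_tokens": 128,
}

body = json.dumps(payload)
# POST body to http://localhost:8000/v1/chat/completions
# with Content-Type: application/json.
```

A video request has the same shape, with a `video_url` content part in place of `image_url`.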

Observed on RTX 5090 at max-model-len=102400, gpu-memory-utilization=0.95:

  • ~29.7 GB VRAM occupied (weights + KV cache + CUDA runtime)
  • ~69 tokens/second single-stream decode throughput
  • Available KV cache: 8.55 GiB, maximum concurrency for 102,400 tokens: 1.15x
  • Engine init (profile + KV cache + warmup): ~51 seconds
  • Tool calling enabled with native gemma4 parser
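
The concurrency figure implies a per-token KV cost, which can be backed out from the numbers above (a rough derivation, assuming vLLM computes concurrency as cache capacity divided by one full-length request):

```python
GiB = 1024**3

kv_cache_bytes = 8.55 * GiB   # "Available KV cache" reported by vLLM
max_model_len = 102_400
concurrency = 1.15            # reported maximum concurrency at that length

# Implied per-token KV cost (fp8 K and V across all layers):
tokens_cacheable = max_model_len * concurrency
kib_per_token = kv_cache_bytes / tokens_cacheable / 1024
print(round(kib_per_token, 1))  # roughly 76 KiB of KV cache per token
```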

Validation

A local quality sweep was run against the served OpenAI-compatible endpoint. All tests passed.

Text (7 / 7)

  • English coherence with sentence-count constraint
  • Chinese generation with first-sentence geographic-location constraint
  • Multi-step arithmetic (two trains converging, correct reasoning chain)
  • Strict JSON output — parsed cleanly by json.loads, output was bare JSON without markdown fencing
  • 400-word long-form narrative generation — 21 / 21 unique sentences, no degeneration
  • Python code generation with assert-based test cases
  • Abliteration smoke test: the refusal-vector intervention from the source model is preserved through the NVFP4 round-trip
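
The strict-JSON check reduces to two conditions: the reply parses with json.loads and carries no markdown fencing. A minimal sketch (`is_bare_json` is an illustrative helper, not part of the actual test harness):

```python
import json

def is_bare_json(text: str) -> bool:
    """True when a model reply is raw JSON with no markdown code fence."""
    stripped = text.strip()
    if stripped.startswith("```"):
        return False
    try:
        json.loads(stripped)
    except json.JSONDecodeError:
        return False
    return True

print(is_bare_json('{"name": "widget", "qty": 3}'))      # True
print(is_bare_json('```json\n{"name": "widget"}\n```'))  # False
```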

Image (4 / 4)

Posted real JPEGs via the image_url content part:

| Subject | Model output | Verdict |
| --- | --- | --- |
| Black Labrador puppy | "The image shows a black dog." | ✅ |
| Walruses on a beach | "Several walruses are lounging and resting on a sandy beach. Overcast and cool, hazy grey sky." | ✅ species-level |
| Pug wrapped in blanket | "Dog (Pug), Blanket, Bed/Bedding" | ✅ breed-level |
| Grey-tone cat portrait | "grays... whites/off-whites... soft pinks/browns on the nose" | ✅ |

Video (1 / 1)

Posted a real 5-second MP4 (~2.85 MB) via the video_url content part. prompt_tokens = 2436, consistent with 32 frames × 70 soft-tokens per frame plus the text wrapper.

Model output: "A white truck and several cars are parked along a road bordering a lush, green park. The scene features tall trees, a grassy area with a path, and a park bench."

Spatial and object-level description are correct. End-to-end latency was ~2 seconds.
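
The reported prompt_tokens figure lines up with the frame sampling described above:

```python
frames = 32
soft_tokens_per_frame = 70
prompt_tokens = 2436  # reported by the endpoint for the 5-second MP4

video_tokens = frames * soft_tokens_per_frame   # 2240
text_wrapper_tokens = prompt_tokens - video_tokens
print(video_tokens, text_wrapper_tokens)  # 2240 196 (chat template + prompt text)
```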

v1 vs v2 Comparison (NVFP4)

Both versions were tested with the same quality battery on the same hardware:

| Test | v1 | v2 | Notes |
| --- | --- | --- | --- |
| English coherence | ✅ 31 tok/s | ✅ 36 tok/s | v2 slightly more articulate |
| Chinese | ✅ 69 tok/s | ✅ 69 tok/s | Equivalent quality |
| Math reasoning | Correct chain, truncated before final answer | Correct chain, reached "1h 21min" conclusion | v2 more token-efficient |
| JSON output | Pretty-printed with fencing | Bare single-line JSON | v2 stricter instruction following |
| Long-form story | 27/27 unique sentences | 21/21 unique sentences | Both zero degeneration |
| Code | Correct | Correct | Near-identical |
| Abliteration | Zero refusal | Zero refusal | Both fully preserved |

Formal Benchmarks

Intentionally omitted for now. They will be added later once a reproducible evaluation harness run is available.

Notes

  • Architecture: Gemma4ForConditionalGeneration
  • Model type: gemma4
  • Pipeline tag: image-text-to-text
  • Quantization format: nvfp4-pack-quantized (compressed-tensors)
  • KV cache precision at inference time (recommended): fp8
  • Requires transformers >= 5.5.0 to load gemma4 configurations
  • The multimodal processing files (processor_config.json, chat_template.jinja) are preserved so vision input still works; text-only usage also works unchanged