# Huihui-gemma-4-31B-it-abliterated-v2-NVFP4
This repository contains an NVFP4-compressed version of huihui-ai/Huihui-gemma-4-31B-it-abliterated-v2.
The goal of this build is to produce an NVFP4 checkpoint that actually fits on a single 32 GB RTX 5090 while preserving the multimodal pipeline and the abliteration intervention. The existing NVIDIA reference release keeps every `self_attn` layer in BF16, which pushes the on-disk weights past 32 GB and makes single-GPU serving impossible. This build quantizes `self_attn` as well, so only the vision tower, embeddings, and `lm_head` remain in BF16.
## What Changed In v2 (vs v1)
The source model (huihui-ai/...-abliterated-v2) differs from v1 in one key way: the first 5 text layers (layers 0–4) are no longer abliterated. They retain the original google/gemma-4-31B-it weights. Layers 5–59 are still abliterated but with a refusal direction recomputed excluding those early layers.
Per the source model card, this produces lower perplexity (better quality) while maintaining the same level of refusal removal:
| Model | PPL | Gap vs base |
|---|---|---|
| google/gemma-4-31B-it (base) | 14874.75 | — |
| v1 abliterated | 13335.55 | -1539 |
| v2 abliterated | 13161.29 | -1713 (174 lower than v1) |
In local side-by-side testing of the NVFP4 quantized versions, v2 showed tighter instruction following (stricter JSON output, more efficient reasoning steps within the same token budget) while retaining identical abliteration effectiveness and decode throughput (~69 tok/s).
## Source And References
- Source model: huihui-ai/Huihui-gemma-4-31B-it-abliterated-v2
- Upstream base model: google/gemma-4-31B-it
- NVIDIA reference release (conservative ignore list): nvidia/Gemma-4-31B-IT-NVFP4
- RedHatAI reference release: RedHatAI/gemma-4-31B-it-NVFP4
- Quantization library: vllm-project/llm-compressor (main branch, `examples/multimodal_vision/gemma4_example.py` as the starting template)
- vLLM build used for validation: `vllm/vllm-openai:gemma4-cu130`
## Supported Modalities
This checkpoint matches the input/output capabilities of the upstream google/gemma-4-31B-it exactly — no modality was added or removed by quantization.
| Modality | Status | Notes |
|---|---|---|
| Text → Text | ✅ Supported | Verified locally across English, Chinese, reasoning, code, long-form, and JSON-constrained outputs |
| Image → Text | ✅ Supported | Vision tower (27-layer SigLIP-style, ~550M params) preserved in BF16 via the `re:.*vision.*` ignore pattern. Verified locally with multiple real images at species-level accuracy |
| Video → Text | ✅ Supported | Uses the same vision tower, with `Gemma4VideoProcessor` sampling 32 frames at 70 soft-tokens per frame. Verified locally by posting a real 5-second MP4 through the vLLM `chat/completions` endpoint with a `video_url` content part |
| Audio → Text | ❌ Not supported | The 31B dense checkpoint has no audio encoder: `config.json` has `audio_config: null` and there are zero audio-related tensors in the safetensors. Per the Gemma 4 model card, audio ASR/translation is available only on the E2B and E4B variants. This is a property of the upstream weights, not the quantization |
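The image and video rows above were verified by posting content parts to the served endpoint. The following is a minimal sketch of how such requests can be assembled for a vLLM OpenAI-compatible server; the base URL is an assumption (vLLM's default port), and the model name matches the `--served-model-name` used in the compose file in this card.

```python
# Sketch: building multimodal chat requests for a vLLM OpenAI-compatible
# endpoint. Nothing is sent here; POST the body to f"{BASE_URL}/chat/completions".
import base64
import json

BASE_URL = "http://localhost:8000/v1"    # assumption: vLLM default port
MODEL = "huihui-gemma4-31b-nvfp4"        # --served-model-name from the compose file

def image_message(prompt: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """One user turn containing text plus an inline base64-encoded image."""
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

def video_message(prompt: str, video_bytes: bytes, mime: str = "video/mp4") -> dict:
    """One user turn containing text plus an inline base64-encoded video
    (video_url is a vLLM extension of the OpenAI content-part schema)."""
    b64 = base64.b64encode(video_bytes).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "video_url", "video_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

body = {"model": MODEL, "messages": [image_message("What animal is this?", b"\xff\xd8fake")]}
print(json.dumps(body)[:80])
```

Audio content parts are intentionally absent: per the table above, the 31B checkpoint has no audio encoder.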
## A Note On The `any-to-any` Tag Seen On Some Related Repos

The huihui source repository carries an `any-to-any` tag among its Hub tags. That tag does not reflect the actual capabilities of the 31B weights: the upstream google/gemma-4-31B-it is tagged `image-text-to-text`, and its model card explicitly restricts audio support to the E2B/E4B variants. The `Gemma4Processor` in `processor_config.json` does include a `Gemma4AudioFeatureExtractor` section (because the `Gemma4ForConditionalGeneration` architecture class is a superset capable of hosting audio), but the 31B checkpoint ships without audio encoder weights. This release intentionally omits the `any-to-any` tag to avoid propagating that mislabel.
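The claim above can be checked directly from the files a checkpoint ships, without loading any weights. The sketch below uses simplified stand-ins for `config.json` and the safetensors tensor index; the helper name is ours, not a library API.

```python
# Sketch: inferring supported input modalities from config fields and tensor
# names, as shipped in a checkpoint. Simplified illustration only.
def detect_modalities(config: dict, tensor_names: list[str]) -> set[str]:
    """Infer input modalities from config sections plus matching tensors."""
    modalities = {"text"}
    if config.get("vision_config") and any("vision" in n for n in tensor_names):
        modalities.add("image")   # the video path reuses the same vision tower
    if config.get("audio_config") and any("audio" in n for n in tensor_names):
        modalities.add("audio")
    return modalities

# What this 31B checkpoint looks like: vision_config present, audio_config
# null, and no audio_* tensors anywhere in the safetensors index.
config = {"vision_config": {"num_hidden_layers": 27}, "audio_config": None}
tensors = [
    "model.vision_tower.blocks.0.attn.qkv.weight",
    "model.language_model.layers.0.mlp.gate_proj.weight",
]
print(sorted(detect_modalities(config, tensors)))  # ['image', 'text']
```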
## What Was Quantized
- Every `Linear` layer in the text stack, including `self_attn.{q,k,v,o}_proj` and all `mlp.*_proj`, was quantized to NVFP4 with `llm-compressor`.
- The vision tower, all `embed_*` tensors, the `lm_head`, and any `audio_*` modules remain in BF16, so the multimodal pipeline and vocabulary pathways are preserved.
- Calibration used `mit-han-lab/pile-val-backup`: 128 samples at 1024 tokens per sample, batch size 1.
- Quantization recipe (from `recipe.yaml`):

```yaml
QuantizationModifier:
  targets: [Linear]
  ignore: ['re:.*vision.*', 're:.*audio.*', lm_head, 're:.*embed.*']
  scheme: NVFP4
```
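To see how that ignore list maps onto concrete module names: compressed-tensors treats entries prefixed with `re:` as regexes and everything else as exact names. The matcher below is a simplified re-implementation for illustration, not the library's actual code.

```python
# Sketch: which modules the recipe leaves in BF16. Simplified stand-in for
# the compressed-tensors ignore-matching logic.
import re

IGNORE = ["re:.*vision.*", "re:.*audio.*", "lm_head", "re:.*embed.*"]

def is_ignored(module_name: str, ignore: list[str] = IGNORE) -> bool:
    for entry in ignore:
        if entry.startswith("re:"):
            if re.fullmatch(entry[3:], module_name):
                return True
        elif module_name == entry:
            return True
    return False

for name in [
    "model.language_model.layers.10.self_attn.q_proj",  # quantized in this build
    "model.language_model.layers.10.mlp.gate_proj",     # quantized
    "model.vision_tower.blocks.3.attn.qkv",             # ignored (BF16)
    "model.language_model.embed_tokens",                # ignored (BF16)
    "lm_head",                                          # ignored (BF16)
]:
    print(f"{name}: {'BF16' if is_ignored(name) else 'NVFP4'}")
```

The NVIDIA reference release effectively adds a `re:.*self_attn.*`-style entry to this list, which is exactly the difference summarized in the next table.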
## Ignore List Compared To The NVIDIA Reference

| Component | NVIDIA Gemma-4-31B-IT-NVFP4 | This build |
|---|---|---|
| MLP Linear layers | NVFP4 | NVFP4 |
| `self_attn.*` Linear layers | BF16 (excluded) | NVFP4 |
| Vision tower | BF16 (excluded) | BF16 (excluded) |
| `embed_*` | BF16 (excluded) | BF16 (excluded) |
| `lm_head` | BF16 (excluded) | BF16 (excluded) |
| On-disk size | ~32.6 GB | ~19.5 GB |
| Single-GPU 32 GB fit | No | Yes |
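The size gap follows from NVFP4's storage cost: 4-bit FP4 values with one FP8 scale per 16-element block (plus a negligible per-tensor scale), i.e. about 4.5 bits per quantized weight versus 16 bits for BF16. The parameter split below is an illustrative assumption, not an exact count from the checkpoint.

```python
# Back-of-the-envelope storage math for the table above.
def nvfp4_bytes_per_param(block_size: int = 16) -> float:
    """4-bit values + one 8-bit scale per block, converted to bytes."""
    return (4 + 8 / block_size) / 8

def checkpoint_gb(quantized_params: float, bf16_params: float) -> float:
    """Rough on-disk size: NVFP4 weights plus BF16 (2 bytes/param) remainder."""
    bytes_total = quantized_params * nvfp4_bytes_per_param() + bf16_params * 2.0
    return bytes_total / 1e9

print(f"{nvfp4_bytes_per_param():.4f} bytes/param")  # 0.5625
# e.g. ~27B quantized + ~2B kept in BF16 (assumed split):
print(f"~{checkpoint_gb(27e9, 2e9):.1f} GB")         # ~19.2 GB, near the ~19.5 GB observed
```

The same arithmetic with all `self_attn` weights kept in BF16 lands well past 32 GB, which is why the NVIDIA reference cannot be served on a single 32 GB card.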
## Inference

As tested locally, this model works with the `vllm/vllm-openai:gemma4-cu130` image on an RTX 5090 (32 GB). Representative launch via docker compose:

```yaml
services:
  vllm:
    image: vllm/vllm-openai:gemma4-cu130
    network_mode: host
    ipc: host
    devices:
      - nvidia.com/gpu=all
    volumes:
      - ./Huihui-gemma-4-31B-it-abliterated-v2-NVFP4:/models/huihui-gemma4-nvfp4:ro
    environment:
      - VLLM_NVFP4_GEMM_BACKEND=marlin
      - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
    command:
      - /models/huihui-gemma4-nvfp4
      - --served-model-name=huihui-gemma4-31b-nvfp4
      - --max-model-len=102400
      - --gpu-memory-utilization=0.95
      - --kv-cache-dtype=fp8
      - --trust-remote-code
      - --enable-prefix-caching
      - --enable-auto-tool-choice
      - --tool-call-parser=gemma4
```
Observed on RTX 5090 at `max-model-len=102400`, `gpu-memory-utilization=0.95`:

- ~29.7 GB VRAM occupied (weights + KV cache + CUDA runtime)
- ~69 tokens/second single-stream decode throughput
- Available KV cache: 8.55 GiB; maximum concurrency for 102,400 tokens: 1.15x
- Engine init (profile + KV cache + warmup): ~51 seconds
- Tool calling enabled with the native `gemma4` parser
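The 1.15x concurrency figure is vLLM's ratio of available KV cache to the cache one full-length sequence needs. From the numbers reported above we can back out the implied per-token KV footprint at fp8:

```python
# Sketch: relating vLLM's reported KV cache size, max model length, and
# max concurrency. All inputs are the figures observed above.
GIB = 2**30
kv_cache_bytes = 8.55 * GIB
max_model_len = 102_400
concurrency = 1.15

# concurrency = kv_cache_bytes / (max_model_len * bytes_per_token), rearranged:
bytes_per_token = kv_cache_bytes / (max_model_len * concurrency)
print(f"~{bytes_per_token / 1024:.0f} KiB of KV cache per token")  # ~76 KiB
```

At a shorter `--max-model-len`, the same 8.55 GiB pool supports proportionally more concurrent sequences.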
## Validation
A local quality sweep was run against the served OpenAI-compatible endpoint. All tests passed.
### Text (7 / 7)
- English coherence with sentence-count constraint
- Chinese generation with first-sentence geographic-location constraint
- Multi-step arithmetic (two trains converging, correct reasoning chain)
- Strict JSON output: parsed cleanly by `json.loads`; output was bare JSON without markdown fencing
- 400-word long-form narrative generation: 21 / 21 unique sentences, no degeneration
- Python code generation with assert-based test cases
- Abliteration smoke test: the refusal-vector intervention from the source model is preserved through the NVFP4 round-trip
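The strict-JSON criterion above can be expressed as a small check: the reply must parse with `json.loads` and must not arrive wrapped in markdown fencing. The helper name is ours, written as a sketch of the test, not the exact harness code.

```python
# Sketch: the "bare JSON" pass/fail criterion from the text sweep.
import json

def is_bare_json(reply: str) -> bool:
    """True only if the reply is valid JSON with no markdown code fence."""
    text = reply.strip()
    if text.startswith("```"):      # fenced output fails the strict test
        return False
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(is_bare_json('{"name": "test", "ok": true}'))    # True  (v2-style output)
print(is_bare_json('```json\n{"name": "test"}\n```'))  # False (v1-style output)
```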
### Image (4 / 4)

Posted real JPEGs via the `image_url` content part:
| Subject | Model output | Verdict |
|---|---|---|
| Black Labrador puppy | "The image shows a black dog." | ✅ |
| Walruses on a beach | "Several walruses are lounging and resting on a sandy beach. Overcast and cool, hazy grey sky." | ✅ species-level |
| Pug wrapped in blanket | "Dog (Pug), Blanket, Bed/Bedding" | ✅ breed-level |
| Grey-tone cat portrait | "grays... whites/off-whites... soft pinks/browns on the nose" | ✅ |
### Video (1 / 1)

Posted a real 5-second MP4 (~2.85 MB) via the `video_url` content part. `prompt_tokens = 2436`, consistent with 32 frames × 70 soft-tokens per frame plus the text wrapper.
Model output: "A white truck and several cars are parked along a road bordering a lush, green park. The scene features tall trees, a grassy area with a path, and a park bench."
Spatial and object-level description are correct. End-to-end latency was ~2 seconds.
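The token accounting behind that `prompt_tokens` figure is straightforward arithmetic:

```python
# Video prompt token accounting: 32 sampled frames at 70 soft-tokens each,
# with the remainder attributable to the chat template and text prompt.
frames, tokens_per_frame = 32, 70
prompt_tokens = 2436

video_tokens = frames * tokens_per_frame
text_wrapper = prompt_tokens - video_tokens
print(video_tokens, text_wrapper)  # 2240 196
```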
## v1 vs v2 Comparison (NVFP4)
Both versions were tested with the same quality battery on the same hardware:
| Test | v1 | v2 | Notes |
|---|---|---|---|
| English coherence | ✅ 31 tok/s | ✅ 36 tok/s | v2 slightly more articulate |
| Chinese | ✅ 69 tok/s | ✅ 69 tok/s | Equivalent quality |
| Math reasoning | Correct chain, truncated before final answer | Correct chain, reached "1h 21min" conclusion | v2 more token-efficient |
| JSON output | Pretty-printed with fencing | Bare single-line JSON | v2 stricter instruction following |
| Long-form story | 27/27 unique sentences | 21/21 unique sentences | Both zero degeneration |
| Code | Correct | Correct | Near-identical |
| Abliteration | Zero refusal | Zero refusal | Both fully preserved |
## Formal Benchmarks
Intentionally omitted for now. They will be added later once a reproducible evaluation harness run is available.
## Notes

- Architecture: `Gemma4ForConditionalGeneration`
- Model type: `gemma4`
- Pipeline tag: `image-text-to-text`
- Quantization format: `nvfp4-pack-quantized` (compressed-tensors)
- KV cache precision at inference time (recommended): `fp8`
- Requires `transformers >= 5.5.0` to load `gemma4` configurations
- The multimodal processing files (`processor_config.json`, `chat_template.jinja`) are preserved, so vision input still works; text-only usage is also unchanged