Gemma 4 31B DECKARD HERETIC Uncensored — NVFP4 SVDQuant
SVDQuant-quantized version of DavidAU/gemma-4-31B-it-The-DECKARD-HERETIC-UNCENSORED-Thinking. Quantized with NVIDIA ModelOpt 0.42.0 using SVDQuant (SVD decomposition + NVFP4) for maximum quality at 4-bit precision. Calibrated natively on NVIDIA B200 (Blackwell, SM 10.0) for hardware-accurate FP4 scale factors.
See also: AWQ_FULL variant — same model quantized with AWQ_FULL instead of SVDQuant.
What is SVDQuant?
SVDQuant uses Singular Value Decomposition to separate weight matrices into two components before quantization:
- Outlier channels — high-magnitude weight channels that cause large quantization error are extracted into a low-rank BF16 residual matrix
- Cleaned weights — the remaining weights (with outliers removed) are quantized to NVFP4 (E2M1) with dramatically reduced quantization error
This produces higher quality than standard AWQ at the cost of a slightly larger model (~20.9 GB vs ~20.5 GB), since the low-rank residual matrices are stored in BF16. A minimal code sketch of the idea follows the diagram below.
```
Original weight matrix W (BF16)
        |
        v
 [SVD decomposition]
        |
        ├── Low-rank residual R (BF16, rank = 32) — captures outlier channels
        └── Cleaned weights W' = W - R
                |
                v
        [NVFP4 quantization] — much lower error without outliers
                |
                v
        W'_quant (FP4 E2M1)

Inference: output = W'_quant @ x + R @ x
```
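To make that concrete, here is a minimal NumPy sketch of the decomposition and the two-branch inference. It is illustrative only: `fake_fp4_quantize` is a toy per-row stand-in for NVFP4's block-scaled E2M1 kernel, and the residual is taken as a plain truncated SVD rather than ModelOpt's outlier-aware variant.

```python
import numpy as np

def svdquant_decompose(W, rank=32):
    """Split W into a rank-`rank` residual R (kept in high precision)
    and cleaned weights W' = W - R."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    R = U[:, :rank] @ np.diag(S[:rank]) @ Vt[:rank, :]
    return W - R, R

def fake_fp4_quantize(W):
    """Toy stand-in for NVFP4 (E2M1): per-row scale, round to the E2M1 grid."""
    grid = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # |E2M1| values
    scale = np.abs(W).max(axis=1, keepdims=True) / 6.0
    q = W / np.maximum(scale, 1e-12)
    idx = np.abs(np.abs(q)[..., None] - grid).argmin(axis=-1)
    return np.sign(q) * grid[idx] * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
W[:, :4] *= 25.0                                   # inject outlier channels

W_clean, R = svdquant_decompose(W)
x = rng.normal(size=256)
y_ref = W @ x
y_svdq = fake_fp4_quantize(W_clean) @ x + R @ x    # two-branch inference
y_fp4 = fake_fp4_quantize(W) @ x                   # naive FP4, outliers included
print(np.abs(y_svdq - y_ref).mean(), np.abs(y_fp4 - y_ref).mean())
```

On this toy matrix the SVDQuant path typically lands far closer to the BF16 output than direct FP4, because the injected outlier columns dominate the top singular directions and stay in the residual.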
Model Details
| Property | Value |
|---|---|
| Base Model | Gemma 4 31B-it DECKARD HERETIC |
| Architecture | Gemma 4 (Dense, 31B parameters) |
| Layers | 60 |
| Max Context | 131,072 tokens |
| Hidden Size | 5,376 |
| Intermediate Size | 21,504 |
| Attention Heads | 32 (16 KV heads) |
| Vocabulary | 262,144 tokens |
| Quantization | NVFP4 SVDQuant (ModelOpt format) |
| SVD Low-Rank | 32 |
| Model Size | ~20.9 GB |
| Calibration Hardware | NVIDIA B200 (native Blackwell FP4) |
Quantization Details
```
Gemma 4 31B DECKARD HERETIC (BF16, ~62 GB)
        |
        v
 [NVFP4 SVDQuant on B200]
   - ModelOpt 0.42.0 with NVFP4_SVDQUANT_DEFAULT_CFG
   - Low-rank = 32 (BF16 residual matrices)
   - 2048 calibration samples (CNN/DailyMail)
   - Native Blackwell FP4 hardware calibration (SM 10.0)
   - Excluded: vision tower, embed_vision, multi_modal_projector
   - Quantization time: ~69 minutes on B200
        |
        v
Gemma-4-31B-DECKARD-HERETIC-NVFP4-SVDQuant (~20.9 GB)
```
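The exact quantization script is not published with the model, but a condensed sketch of how such a pass is typically driven through ModelOpt's PyTorch API looks like this (config name taken from the log above; the calibration loop, sequence length, and dataset field are simplifications):

```python
import torch
import modelopt.torch.quantization as mtq
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DavidAU/gemma-4-31B-it-The-DECKARD-HERETIC-UNCENSORED-Thinking"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# 2048 calibration samples from CNN/DailyMail, as listed above.
calib = load_dataset("cnn_dailymail", "3.0.0", split="train[:2048]")

def forward_loop(m):
    # ModelOpt observes activations during these passes to fit FP4 scale
    # factors and decide the SVD low-rank split; the vision-module
    # exclusions listed above are applied via the config's quant_cfg.
    with torch.no_grad():
        for sample in calib:
            ids = tok(sample["article"], return_tensors="pt",
                      truncation=True, max_length=512).input_ids.to(m.device)
            m(ids)

# NVFP4_SVDQUANT_DEFAULT_CFG is named in the log above; assumed to be
# exposed on the mtq namespace like ModelOpt's other default configs.
model = mtq.quantize(model, mtq.NVFP4_SVDQUANT_DEFAULT_CFG, forward_loop)
```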
AWQ_FULL vs SVDQuant Comparison
| Metric | AWQ_FULL | SVDQuant |
|---|---|---|
| Technique | Channel scaling + clipping optimization | SVD decomposition + low-rank residual |
| Model Size | ~20.5 GB | ~20.9 GB |
| Quant Time | ~75 min | ~69 min |
| Quality | Excellent | Potentially higher (preserves outliers in BF16) |
| Speed | Slightly faster (smaller) | Slightly slower (low-rank matmul overhead) |
| Best For | Maximum throughput | Maximum quality |
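The "low-rank matmul overhead" row is worth quantifying. Assuming the residual is applied through its rank-32 factors rather than as a dense matrix, the extra work is well under 1% (dimensions from the Model Details table):

```python
# Extra cost of the rank-32 residual branch for one MLP projection
# (21504 x 5376, dimensions from the Model Details table).
m, n, r = 21504, 5376, 32

dense_flops = 2 * m * n          # W'_quant @ x
lowrank_flops = 2 * r * (m + n)  # R @ x applied as U @ (V @ x)

print(f"dense: {dense_flops:,}  low-rank: {lowrank_flops:,}  "
      f"overhead: {lowrank_flops / dense_flops:.2%}")   # ~0.74%
```

In practice the measured slowdown comes as much from launching the extra kernel as from the FLOPs themselves.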
Deployment
vLLM
```bash
vllm serve /path/to/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4-SVDQuant \
  --served-model-name deckard-svdquant \
  --quantization modelopt \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --max-model-len 65536 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4
```
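Once serving, any OpenAI-compatible client works against it; for example (base URL and model name follow the command above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="deckard-svdquant",
    messages=[{"role": "user", "content": "Summarize SVDQuant in two sentences."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```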
Docker Compose
```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    ports:
      - "8000:8000"
    volumes:
      - ./models:/models
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - VLLM_ATTENTION_BACKEND=FLASHINFER
    command: >
      --model /models/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4-SVDQuant
      --served-model-name deckard-svdquant
      --quantization modelopt
      --dtype auto
      --kv-cache-dtype fp8
      --max-model-len 65536
      --max-num-seqs 8
      --gpu-memory-utilization 0.85
      --trust-remote-code
      --enable-chunked-prefill
      --enable-prefix-caching
      --enable-auto-tool-choice
      --tool-call-parser gemma4
      --reasoning-parser gemma4
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```
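Since a 31B model takes a while to load, a small readiness probe against the container is useful. This sketch polls vLLM's standard `/v1/models` endpoint (URL assumes the port mapping above):

```python
import time
import urllib.request

# Poll until the container finishes loading the model and answers /v1/models.
for _ in range(60):
    try:
        with urllib.request.urlopen("http://localhost:8000/v1/models") as r:
            print(r.read().decode())
            break
    except OSError:
        time.sleep(5)
else:
    raise SystemExit("vLLM server never became ready")
```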
DGX Spark Performance Estimates
| Configuration | Estimated tok/s |
|---|---|
| BF16 (no quantization) | ~3-5 |
| NVFP4 AWQ_FULL | ~12-14 |
| NVFP4 SVDQuant | ~10-13 |
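Memory, not just speed, is worth budgeting: with the fp8 KV cache and the 65,536-token limit from the serve command, the KV cache alone is sizable. The arithmetic below assumes a head_dim of 128, which is not listed in the Model Details table:

```python
# KV-cache budget at --max-model-len 65536 with fp8 (1 byte/value).
# head_dim = 128 is an assumption; it is not listed in the Model Details table.
layers, kv_heads, head_dim, ctx = 60, 16, 128, 65536

kv_per_token = 2 * layers * kv_heads * head_dim   # K and V, 1 byte each
print(f"{kv_per_token} B/token, "
      f"~{kv_per_token * ctx / 2**30:.1f} GiB for one max-length sequence")
# -> 245760 B/token, ~15.0 GiB on top of the ~20.9 GB weights
```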
Key Deployment Flags
| Flag | Purpose |
|---|---|
| `--quantization modelopt` | Required: tells vLLM to use the ModelOpt NVFP4 format |
| `--kv-cache-dtype fp8` | Halves KV cache memory, allowing longer contexts |
| `--reasoning-parser gemma4` | Extracts `<think>` blocks for thinking/reasoning display |
| `--tool-call-parser gemma4` | Enables native function calling |
| `--enable-chunked-prefill` | Processes long prompts in chunks to avoid OOM |
| `--enable-prefix-caching` | Caches common prompt prefixes for faster responses |
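With the reasoning parser active, the server strips the `<think>` block out of `message.content` and returns it separately; vLLM's OpenAI-compatible responses expose it as a `reasoning_content` field (read defensively here, since the field only exists when the parser fires):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="deckard-svdquant",
    messages=[{"role": "user", "content": "Is 9091 prime? Think it through."}],
)
msg = resp.choices[0].message
print("reasoning:", getattr(msg, "reasoning_content", None))  # parsed <think> block
print("answer:", msg.content)                                 # user-visible reply
```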
Related Models
- GitHub repo: AEON-7/Gemma-4-31B-DECKARD-HERETIC-Uncensored-NVFP4 — deployment docs, Docker Compose, cross-model comparison
- AWQ_FULL variant: AEON-7/Gemma-4-31B-it-DECKARD-HERETIC-Uncensored-NVFP4 — same base model, AWQ_FULL quantization
- Gemma 4 MoE NVFP4: AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4 — MoE variant, faster throughput
- Base model: DavidAU/gemma-4-31B-it-The-DECKARD-HERETIC-UNCENSORED-Thinking
- Upstream base model: google/gemma-4-31B-it
Advanced Techniques
Native B200 Calibration
Quantized on an NVIDIA B200 using its native FP4 hardware instructions (SM 10.0). SVDQuant calibration measures actual FP4 rounding behavior on real hardware rather than simulating it, producing more accurate scale factors and SVD decomposition decisions than calibration on non-FP4 hardware.
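If you re-run calibration yourself, it is worth asserting the hardware up front. A minimal guard (B200 reports CUDA compute capability (10, 0)):

```python
import torch

# Native FP4 requires Blackwell; B200 reports compute capability (10, 0).
major, minor = torch.cuda.get_device_capability(0)
if (major, minor) < (10, 0):
    raise RuntimeError(f"sm_{major}{minor} has no native FP4 path; "
                       "calibration would only be simulated")
```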
SVD Low-Rank Selection
The default low-rank of 32 was used, which balances quality preservation against size overhead well. Each quantized weight matrix carries a rank-32 BF16 residual (stored as two low-rank factors) that keeps the most important outlier directions at full precision while the remaining weights are safely quantized to FP4; the arithmetic below shows how small that overhead is.
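The sketch below prices one rank-32 residual against the FP4 weights it accompanies (dimensions from the Model Details table; per-block FP4 scale factors ignored):

```python
# Storage cost of one rank-32 BF16 residual, held as factors U (m x r) and
# V (r x n), next to the FP4 weights it corrects.
m, n, r, bf16_bytes = 21504, 5376, 32, 2

residual = (m * r + r * n) * bf16_bytes
fp4_weights = m * n // 2                 # 4 bits per weight
print(f"residual: {residual / 2**20:.1f} MiB  "
      f"fp4 weights: {fp4_weights / 2**20:.1f} MiB  "
      f"overhead: {residual / fp4_weights:.1%}")   # ~1.6 MiB, ~3%
```

Summed over every quantized projection in 60 layers, this roughly accounts for the ~0.4 GB gap between the AWQ_FULL and SVDQuant builds.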
License
This model inherits the Gemma license from the base model.
Legal Disclaimer
THIS MODEL IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND. The authors make no representations regarding accuracy, reliability, or fitness for any purpose. Use at your own risk. By downloading or using this model, you agree that the authors shall not be liable for any claims, damages, or losses arising from its use.